
Does Qwen 3.5 require GPU hardware for inference?

No, Qwen 3.5 does not strictly require a GPU: the models perform best on GPU hardware, but their compact sizes (0.8B–9B) also make CPU inference practical, with acceptable latency for cost-constrained deployments.

GPU inference is recommended: a single NVIDIA A100 (80GB) can serve Qwen3-9B embeddings at hundreds of queries per second. For smaller models like Qwen3-0.8B, even modest GPUs (e.g., an RTX 3060) work well. CPU inference (via ONNX Runtime or llama.cpp optimizations) is viable for latency-tolerant applications, but expect roughly 5–10x lower throughput.
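To translate that 5–10x slowdown into capacity planning, here is a minimal sketch. The 300 queries/sec GPU figure is a hypothetical placeholder standing in for "hundreds of queries/sec"; measure your own hardware before sizing a deployment.

```python
def cpu_qps_estimate(gpu_qps: float,
                     slowdown_low: float = 5.0,
                     slowdown_high: float = 10.0) -> tuple[float, float]:
    """Rough CPU throughput range, given a measured GPU throughput
    and the assumed 5-10x CPU slowdown from the text."""
    return gpu_qps / slowdown_high, gpu_qps / slowdown_low

# Hypothetical: an A100 serving ~300 embedding queries/sec.
lo, hi = cpu_qps_estimate(300.0)
print(lo, hi)  # 30.0 60.0
```

So if your workload needs 50 queries/sec, a single CPU box may sit at the edge of the estimated 30–60 qps range, which is exactly the case where a modest GPU pays for itself.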

With Milvus, you control hardware placement entirely. Deploy the embedding server alongside Milvus on the same GPU cluster, use a separate inference box, or run CPU embeddings and accept higher latency. Milvus scales independently of embedding infrastructure, so you can add vector storage nodes without touching the embedding pipeline. This flexibility lets you optimize cost and performance for your workload: GPUs for low-latency RAG, CPUs for batch preprocessing.
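One way to keep that flexibility in code is to hide the embedding backend behind a small interface, so the Milvus insert path never cares whether vectors come from a GPU server, a CPU process, or a batch job. The sketch below is illustrative: `make_stub_embedder` and `build_rows` are hypothetical helpers, and the stub stands in for a real model client (e.g., an HTTP call to your inference box); the resulting rows would be passed to the Milvus client's insert call.

```python
from typing import Callable, List

# Hypothetical embedder type: text -> vector. Swap in a GPU-server client,
# a local CPU model, or anything else without touching the storage code.
Embedder = Callable[[str], List[float]]

def make_stub_embedder(dim: int) -> Embedder:
    """Stand-in for a real embedding backend; returns toy vectors."""
    def embed(text: str) -> List[float]:
        # Deterministic-per-run toy vector; a real embedder returns model output.
        return [float((hash(text) >> i) & 1) for i in range(dim)]
    return embed

def build_rows(texts: List[str], embed: Embedder) -> List[dict]:
    """Prepare rows for a vector-DB insert; only this boundary sees the embedder."""
    return [{"text": t, "vector": embed(t)} for t in texts]

rows = build_rows(["hello", "world"], make_stub_embedder(8))
```

Because the embedder is injected, moving inference from a CPU box to a GPU cluster is a one-line change at the call site, and the Milvus side scales on its own.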
