vLLM Production: Conclusion, FAQ, and Next Steps

2026.02.21
Technology

Part 6 of 6. This is the final part of the series. Start from Part 1: Introduction and Architecture, or jump to any part using the cross-links below.

Conclusion

Deploying vLLM on Kubernetes means building a system that handles real traffic at real scale. This series walked through GPU-aware scheduling, tensor parallelism, quantization, custom-metric autoscaling, and production observability, all with TTFT, TPOT, and throughput SLIs tracked in Grafana.

Start small: deploy on a single GPU, validate your SLIs against the benchmark tables, then scale to multi-GPU tensor parallelism and HPA as traffic demands.

Next steps:

If this guide helped you, star the vLLM repository on GitHub and subscribe for more production AI infrastructure content.

FAQ

What is vLLM?

vLLM is an open-source inference engine built for high-throughput LLM serving. It leverages PagedAttention, continuous batching, and tensor parallelism to serve models efficiently on NVIDIA GPUs through an OpenAI-compatible API. See the official docs for the full feature list.

How much GPU memory do I need for vLLM?

Depends on model size and precision. Here’s a practical reference for production:

Model Size   FP16/BF16   AWQ 4-bit   GPTQ 4-bit   Minimum GPU
7B           ~14 GB      ~5 GB       ~5 GB        1× A10G (24 GB)
13B          ~26 GB      ~9 GB       ~9 GB        1× A100 40GB
70B          ~140 GB     ~45 GB      ~45 GB       2× A100 80GB
405B         ~810 GB     ~270 GB     ~270 GB      16× H100 80GB (FP16); 4× H100 80GB (AWQ)

Add ~20% for the KV cache at your target batch size. When in doubt, go AWQ; it delivers the best quality-per-VRAM ratio in vLLM right now.
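The FP16/BF16 column follows from simple arithmetic: roughly 2 bytes per parameter, plus the ~20% KV-cache headroom mentioned above. A minimal sketch of that estimate, where the function name and defaults are illustrative rather than part of vLLM:

```python
def estimate_vram_gb(params_billion, bytes_per_param=2.0, kv_headroom=0.20):
    """Rough VRAM estimate in GB for an unquantized model.

    params_billion  -- parameter count in billions (7 for a 7B model)
    bytes_per_param -- 2.0 for FP16/BF16 weights
    kv_headroom     -- extra fraction reserved for the KV cache (the ~20% rule)

    Returns (weights_gb, total_gb). This is a back-of-the-envelope figure,
    not a measurement; real usage also depends on context length and batch size.
    """
    weights_gb = params_billion * bytes_per_param
    total_gb = weights_gb * (1.0 + kv_headroom)
    return weights_gb, total_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        weights, total = estimate_vram_gb(size)
        print(f"{size}B: ~{weights:.0f} GB weights, ~{total:.0f} GB with KV headroom")
```

The 4-bit columns in the table sit above a naive 0.5 bytes per parameter, so for quantized checkpoints it is safer to start from the checkpoint's actual size on disk than from this formula.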

How do I quantize a model for vLLM?

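AWQ and GPTQ models must be quantized ahead of time; vLLM loads the resulting checkpoint when you pass the matching --quantization flag. FP8 is the exception: vLLM can quantize an FP16/BF16 checkpoint to FP8 as it loads. Your options:

  • AWQ: Download a pre-quantized AWQ checkpoint from Hugging Face (for example, TheBloke's repos); use --quantization awq --dtype auto
  • GPTQ: Download a GPTQ checkpoint; use --quantization gptq
  • FP8: Use --quantization fp8 --dtype auto on H100/B200 GPUs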

Quantization reduces VRAM by 40–70%, enabling larger models on fewer GPUs.
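To keep the flag combinations above in one place, a small helper can assemble the `vllm serve` command line. This is a sketch only: the helper name and the model ID in the usage line are placeholders, and the flags are exactly the ones listed above.

```python
import shlex

# Flag sets copied from the options listed above.
QUANT_FLAGS = {
    "awq": ["--quantization", "awq", "--dtype", "auto"],
    "gptq": ["--quantization", "gptq"],
    "fp8": ["--quantization", "fp8", "--dtype", "auto"],
}


def build_serve_command(model, method):
    """Return the argv list for `vllm serve` with the chosen quantization method.

    `model` is whatever checkpoint ID or path you serve; nothing here checks
    that the checkpoint actually matches the quantization method.
    """
    if method not in QUANT_FLAGS:
        raise ValueError(f"unknown quantization method: {method!r}")
    return ["vllm", "serve", model, *QUANT_FLAGS[method]]


if __name__ == "__main__":
    # "your-org/your-model-AWQ" is a placeholder, not a real repository.
    print(shlex.join(build_serve_command("your-org/your-model-AWQ", "awq")))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` or dropped into a container `args:` field without shell-quoting issues.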

Is vLLM better than Ollama for production?

Yes, for production serving. vLLM delivers 3–5× higher throughput through continuous batching, native multi-GPU tensor parallelism, and an OpenAI-compatible API. Ollama shines for local development but falls short on scheduling, scaling, and observability. See the full comparison for details.

Can I run vLLM on CPUs instead of GPUs?

vLLM has experimental CPU support, but it’s not suitable for production serving. CPU throughput runs 50–100× lower than GPU throughput. Stick with NVIDIA GPUs (A10G, A100, H100) for production.

How do I scale vLLM horizontally across multiple nodes?

Use the vLLM Production Stack with Ray, or KServe with pipeline parallelism. A plain Kubernetes Deployment can add replicas, but each replica is confined to the GPUs of a single node, so it caps out at single-node multi-GPU. For 70B+ models at scale, the Production Stack Helm chart handles service discovery and routing across replicas automatically.

What is the best batch size for vLLM?

For interactive chat, aim for 128–256 concurrent sequences (TTFT p99 < 500ms). For offline batch jobs, push 512+ for maximum throughput. Always validate with benchmark_serving.py before locking in a batch size.
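Once a benchmark run has produced per-request TTFT samples, checking them against the 500 ms target is a few lines. The sketch below uses the nearest-rank percentile definition; the function names are illustrative, and how you extract the samples from your benchmark output is left to you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_ttft_slo(ttft_ms, slo_ms=500.0):
    """True if the p99 time-to-first-token is under the SLO (500 ms by default)."""
    return percentile(ttft_ms, 99) < slo_ms


if __name__ == "__main__":
    # Hypothetical samples in milliseconds, for illustration only.
    samples = [120.0, 140.0, 180.0, 210.0, 460.0]
    print("p99 TTFT (ms):", percentile(samples, 99))
    print("within SLO:", meets_ttft_slo(samples))
```

Running the same check at each candidate concurrency level gives the highest setting that still holds the SLO.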

How do I monitor vLLM in production?

Prometheus scrapes /metrics and Grafana visualizes the data. Track four SLIs: TTFT p99, TPOT p99, queue depth (vllm_num_requests_waiting; vLLM’s /metrics endpoint exposes it as vllm:num_requests_waiting), and GPU memory utilization. Part 4 has the exact PromQL queries.
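For a quick sanity check without Prometheus in the loop, the text that /metrics returns can be parsed directly. The sketch below sums every sample of one metric; it assumes lines of the form `name{labels} value` with no trailing timestamp, and it uses the colon-prefixed name (vllm:num_requests_waiting) that vLLM writes to the endpoint. The helper name and sample payload are illustrative.

```python
def read_metric(exposition_text, metric_name):
    """Sum all samples of `metric_name` in Prometheus text-format output.

    Assumes each sample line is `name{labels} value` with no timestamp.
    Returns None if the metric does not appear.
    """
    total = 0.0
    found = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
            found = True
    return total if found else None


if __name__ == "__main__":
    # Hand-written sample payload, not captured from a real server.
    sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 8.0
"""
    print(read_metric(sample, "vllm:num_requests_waiting"))
```

In practice you would fetch the text with an HTTP client from the pod's /metrics endpoint and pass it in unchanged.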

Can I use HPA with vLLM?

Yes, but only with custom metrics. CPU-based HPA is useless for LLM inference. Install Prometheus Adapter, expose the queue-depth metric vllm_num_requests_waiting as a custom metric, and configure HPA to scale on it. Set scale-down stabilization to 5+ minutes; model cold starts are too slow to tolerate replicas being removed and re-added in quick succession.
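It helps to know what the HPA will do with a queue-depth reading before you pick a target. Kubernetes documents the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), skipping the change when the ratio is within a tolerance (10% by default). A minimal sketch of that rule, with illustrative names and target values, ignoring minReplicas/maxReplicas and stabilization windows:

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.10):
    """Approximate the HPA scaling rule for a single metric.

    current_value -- observed per-pod average of the metric (e.g. queue depth)
    target_value  -- the per-pod target configured on the HPA
    The floor of one replica is a simplification; a real HPA clamps to its
    configured minReplicas and maxReplicas.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return max(1, math.ceil(current_replicas * ratio))


if __name__ == "__main__":
    # Hypothetical numbers: 2 replicas, 30 queued requests per pod, target of 10.
    print(desired_replicas(2, 30, 10))
```

With a long scale-down stabilization window, the scale-up side of this rule still reacts immediately, while scale-downs wait out the window.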

