vLLM Production: Conclusion FAQ and Next Steps
Table of Contents
Part 6 of 6. This is the final part of the series. Start from Part 1: Introduction and Architecture, or jump to any part using the cross-links below.
Conclusion
Deploying vLLM on Kubernetes means building a system that handles real traffic at real scale. This series walked through GPU-aware scheduling, tensor parallelism, quantization, custom-metric autoscaling, and production observability; all with TTFT, TPOT, and throughput SLIs tracked in Grafana.
Start small: single-GPU deployment, validate SLIs against benchmark tables, then scale to multi-GPU tensor parallelism and HPA as traffic demands.
Next steps:
- Read Ollama vs vLLM: Choosing the Right LLM Inference Engine for a deeper comparison
- Check out LLMOps: Monitoring and Observability for AI Workloads for broader AI observability
- Dive into Kubernetes GPU Scheduling: A Complete Guide for the full GPU story
- Try How to Deploy Ollama on Kubernetes for Local LLMs for a lighter entry point
If this guide helped you, star the vLLM repository on GitHub and subscribe for more production AI infrastructure content.
FAQ
What is vLLM?
vLLM is an open-source inference engine built for high-throughput LLM serving. It leverages PagedAttention, continuous batching, and tensor parallelism to serve models efficiently on NVIDIA GPUs through an OpenAI-compatible API. See the official docs for the full feature list.
How much GPU memory do I need for vLLM?
Depends on model size and precision. Here’s a practical reference for production:
| Model Size | FP16/BF16 | AWQ 4-bit | GPTQ 4-bit | Minimum GPU |
|---|---|---|---|---|
| 7B | ~14 GB | ~5 GB | ~5 GB | 1× A10G (24 GB) |
| 13B | ~26 GB | ~9 GB | ~9 GB | 1× A100 40GB |
| 70B | ~140 GB | ~45 GB | ~45 GB | 2× A100 80GB |
| 405B | ~810 GB | ~270 GB | ~270 GB | 10× H100 80GB (FP16); 4× H100 80GB (AWQ) |
Add ~20% for the KV cache at your target batch size. When in doubt, go AWQ; it delivers the best quality-per-VRAM ratio in vLLM right now.
How do I quantize a model for vLLM?
vLLM loads pre-quantized models through the --quantization flag. You can’t quantize on-the-fly at runtime. Your options:
- AWQ: Download from TheBloke on HuggingFace; use
--quantization awq --dtype auto - GPTQ: Download a GPTQ checkpoint; use
--quantization gptq - FP8: Use
--quantization fp8 --dtype autoon H100/B200 GPUs
Quantization reduces VRAM by 40–70%, enabling larger models on fewer GPUs.
Is vLLM better than Ollama for production?
Yes, for production serving. vLLM delivers 3–5× higher throughput through continuous batching, native multi-GPU tensor parallelism, and an OpenAI-compatible API. Ollama shines for local development but falls short on scheduling, scaling, and observability. See the full comparison for details.
Can I run vLLM on CPUs instead of GPUs?
vLLM has experimental CPU support, but it’s not suitable for serving. CPU throughput runs 50–100× lower than GPU. Stick with NVIDIA GPUs; A10G, A100, H100; for production.
How do I scale vLLM horizontally across multiple nodes?
Reach for the vLLM production stack with Ray, or KServe with pipeline parallelism. Native Kubernetes Deployments cap out at single-node multi-GPU. For 70B+ models at scale, the Production Stack Helm chart handles service discovery and routing across replicas automatically.
What is the best batch size for vLLM?
For interactive chat, aim for 128–256 concurrent sequences (TTFT p99 < 500ms). For offline batch jobs, push 512+ for maximum throughput. Always validate with benchmark_serving.py before locking in a batch size.
How do I monitor vLLM in production?
Prometheus scrapes /metrics and Grafana visualizes the data. Track four SLIs: TTFT p99, TPOT p99, queue depth (vllm_num_requests_waiting), and GPU memory utilization. Part 4 has the exact PromQL queries.
Can I use HPA with vLLM?
Yes, but only with custom metrics. CPU-based HPA is useless for LLM inference. Install Prometheus Adapter, expose vllm_num_requests_running as a custom metric, and configure HPA to scale on queue depth. Set scale-down stabilization to 5+ minutes; model cold-start demands it.