vLLM Tensor Parallelism Quantization Production
Part 3 of 6. In Part 2 we deployed vLLM on Kubernetes. Now we configure tensor parallelism and quantization for production. Continue to Part 4: Performance Tuning and Monitoring.
Production Configuration: Tensor Parallelism and Quantization
Tensor parallelism and quantization are the two highest-impact levers for production vLLM inference. One spreads models across GPUs, the other shrinks VRAM footprint. Here’s exactly how to configure both.
Quantization for Memory Efficiency
Quantization drops model weight precision so larger models fit on fewer GPUs. vLLM supports several methods through the --quantization flag:
| Method | Bits | VRAM Reduction | Accuracy Loss | Speed Impact | Best For |
|---|---|---|---|---|---|
| AWQ | 4-bit | ~70% | Very low | Faster (Marlin kernel) | Production serving |
| GPTQ | 4-bit | ~70% | Low | Fast | Research, fine-tuned models |
| FP8 | 8-bit | ~40% | Minimal | Fastest (H100) | H100/B200 clusters |
| None (FP16/BF16) | 16-bit | Baseline | None | Baseline | Maximum accuracy |
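The bit widths in the table map to weight memory roughly as parameters × bits ÷ 8. Here is a minimal Python sketch of that arithmetic. The function names are illustrative and not part of vLLM, and the estimate deliberately ignores quantization scales and zero-points, KV cache, and activations, so treat the numbers as a lower bound.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Ignores quantization metadata, KV cache, and activations (assumption).

BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "awq": 4, "gptq": 4}


def weight_gb(num_params: float, method: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[method] / 8 / 1e9


def per_gpu_weight_gb(num_params: float, method: str, tp_size: int) -> float:
    """Assume weights shard evenly across tensor-parallel ranks."""
    return weight_gb(num_params, method) / tp_size


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    for method in BITS_PER_WEIGHT:
        total = weight_gb(params, method)
        shard = per_gpu_weight_gb(params, method, tp_size=2)
        print(f"{method:>5}: {total:6.1f} GB total, {shard:5.1f} GB per GPU at TP=2")
```

For a 70B model this gives about 140 GB at FP16 and 35 GB at 4-bit, or 17.5 GB per GPU with --tensor-parallel-size 2, which leaves room for KV cache on a 40 GB card.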
AWQ example (recommended for most serving workloads):
```yaml
args:
  - --model
  - TheBloke/Llama-2-70B-AWQ
  - --quantization
  - awq
  - --tensor-parallel-size
  - "2"  # 70B AWQ fits on 2× A100 40GB
  - --dtype
  - auto
```
Note on --quantization awq: This flag expects a pre-quantized AWQ checkpoint. You cannot quantize on-the-fly at load time. Grab quantized models from TheBloke on HuggingFace or run AutoAWQ yourself.
FP8 on H100 (fastest path if you have the hardware):
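```yaml
args:
  - --model
  - meta-llama/Meta-Llama-3-70B-Instruct
  - --quantization
  - fp8
  - --dtype
  - auto
  - --tensor-parallel-size
  - "2"
```
Note: Hardware-accelerated FP8 requires GPUs with native FP8 support, meaning Hopper (H100, H200) or newer, plus Ada Lovelace. A100 and older architectures have no FP8 tensor cores; at best vLLM can run weight-only FP8 there, which saves memory but does not deliver the FP8 compute speedup. Use AWQ or GPTQ on Ampere and earlier GPUs.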
For guidance on quantization trade-offs, see NVIDIA’s deep learning performance documentation.
Batch Size Tuning
--max-num-seqs caps the number of sequences vLLM schedules in a single iteration, which makes it the effective maximum batch size. It is the single most impactful throughput lever in production.
Start here: For 7B models, set --max-num-seqs 256. For 70B models, start at 128. Fire up benchmark_serving.py at your target request rate. Drop batch size if TTFT p99 exceeds your SLO; raise it if GPU utilization stays below 80%.
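The tuning rule above can be written as one step of a loop. This is a sketch only: the function name is hypothetical, and halving or doubling is an assumed step size chosen for illustration, not something vLLM prescribes.

```python
def next_max_num_seqs(current: int, ttft_p99_ms: float, slo_ms: float,
                      gpu_util: float, floor: int = 1) -> int:
    """Return the next --max-num-seqs value to try after one benchmark run.

    Halving/doubling is an illustrative step size (assumption).
    """
    if ttft_p99_ms > slo_ms:
        return max(floor, current // 2)  # latency SLO violated: shrink the batch
    if gpu_util < 0.80:
        return current * 2               # GPU underused: grow the batch
    return current                       # within SLO and well utilized: keep


if __name__ == "__main__":
    # TTFT p99 of 580 ms against a 500 ms SLO: step down from 256.
    print(next_max_num_seqs(256, ttft_p99_ms=580, slo_ms=500, gpu_util=0.89))
```

Checking latency first is deliberate: when both conditions hold, the SLO wins over utilization.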
A100 80GB benchmarks (Llama 3 8B, 1024 input / 256 output tokens):
| Batch Size | Throughput (tok/s) | TTFT p99 (ms) | TPOT p99 (ms) | GPU Utilization |
|---|---|---|---|---|
| 64 | 1,850 | 180 | 42 | 62% |
| 128 | 2,940 | 320 | 58 | 78% |
| 256 | 3,820 | 580 | 95 | 89% |
| 512 | 4,100 | 1,200 | 180 | 94% |
Source: Benchmarked on vLLM 0.8.4, NVIDIA A100 80GB SXM4, CUDA 12.4, continuous batching enabled.
Recommendation: Target 128–256 for interactive chat and 512+ for offline batch inference jobs. If TTFT p99 under 500 ms is a hard SLO, stay at 128: in the table above, 256 already reaches 580 ms.
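Picking a batch size from the benchmark table is a simple filter: keep the rows that meet the TTFT SLO, then take the one with the highest throughput. A small sketch using the numbers from the table above (the function name is illustrative):

```python
# (batch size, throughput tok/s, TTFT p99 ms) copied from the benchmark table.
BENCHMARKS = [
    (64, 1850, 180),
    (128, 2940, 320),
    (256, 3820, 580),
    (512, 4100, 1200),
]


def best_batch_size(ttft_slo_ms: float):
    """Return the highest-throughput batch size whose TTFT p99 meets the SLO.

    Returns None when no measured configuration satisfies the SLO.
    """
    ok = [row for row in BENCHMARKS if row[2] <= ttft_slo_ms]
    if not ok:
        return None
    return max(ok, key=lambda row: row[1])[0]


if __name__ == "__main__":
    for slo in (500, 600, float("inf")):
        print(f"TTFT SLO {slo} ms -> batch size {best_batch_size(slo)}")
```

These figures come from one model on one GPU, so swap in your own benchmark results before relying on the output.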
Memory Limits and GPU Utilization
Set --gpu-memory-utilization to 0.90 for dedicated inference nodes, 0.85 if you’re running sidecars. Never push past 0.95: CUDA needs headroom for scratch space. I’ve watched 0.98 crash with CUDA out of memory during peak batch windows.
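To see what a given setting leaves on the table, here is a rough per-GPU budget calculation. It is a simplification under stated assumptions: the function and key names are made up for illustration, and vLLM also needs activation memory inside its budget, so the "after weights" figure is an upper bound on KV cache space.

```python
def memory_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> dict:
    """Split one GPU's memory for a given --gpu-memory-utilization value."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    vllm_budget = total_gb * utilization
    return {
        "vllm_budget_gb": vllm_budget,                # what vLLM may claim
        "headroom_gb": total_gb - vllm_budget,        # left for CUDA scratch, sidecars
        "after_weights_gb": vllm_budget - weights_gb, # upper bound for KV cache
    }


if __name__ == "__main__":
    # Assumed example: 80 GB card, 8B model in BF16 (about 16 GB of weights).
    for util in (0.85, 0.90, 0.98):
        print(util, memory_budget_gb(80, util, weights_gb=16))
```

On an 80 GB card, 0.90 leaves about 8 GB of headroom while 0.98 leaves under 2 GB, which is consistent with the out-of-memory crashes described above.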
Graceful Shutdown Handling
vLLM ships without graceful shutdown. You have to wire it up in Kubernetes to prevent dropped requests during rolling updates:
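```yaml
# Container-level: delay SIGTERM so in-flight requests can finish
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]
# Pod-level: must be longer than the preStop sleep
terminationGracePeriodSeconds: 60
```
Use rolling updates with maxUnavailable: 0 for zero-downtime deployments: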
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
```
Continue to Part 4: Performance Tuning and Monitoring, where we tackle KV cache management, scheduling strategies, and Prometheus observability.