vLLM Tensor Parallelism and Quantization in Production

Part 3 of 6. In Part 2 we deployed vLLM on Kubernetes. Now we configure tensor parallelism and quantization for production. Continue to Part 4: Performance Tuning and Monitoring.

Production Configuration: Tensor Parallelism and Quantization

Tensor parallelism and quantization are the two highest-impact levers for production vLLM inference. Tensor parallelism shards a model's weights across GPUs; quantization shrinks the VRAM footprint of those weights. Here’s exactly how to configure both.

Quantization for Memory Efficiency

Quantization drops model weight precision so larger models fit on fewer GPUs. vLLM supports several methods through the --quantization flag:

Method	Bits	VRAM Reduction	Accuracy Loss	Speed Impact	Best For
AWQ	4-bit	~70%	Very low	Faster (Marlin kernel)	Production serving
GPTQ	4-bit	~70%	Low	Fast	Research, fine-tuned models
FP8	8-bit	~40%	Minimal	Fastest (H100)	H100/B200 clusters
None (FP16/BF16)	16-bit	Baseline	None	Baseline	Maximum accuracy
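As a rough check on the table, weight memory scales linearly with bit width. The sketch below is a back-of-the-envelope estimate, not a vLLM API: weight_memory_gb is a hypothetical helper, and it counts raw weights only. It ignores quantization scales, zero-points, and any layers left unquantized, so real checkpoints come out somewhat larger than these figures.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration: counts raw weights only and
    ignores quantization scales, zero-points, and unquantized layers.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params_70b = 70e9
    baseline = weight_memory_gb(params_70b, 16)
    for label, bits in [("FP16/BF16", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
        gb = weight_memory_gb(params_70b, bits)
        saved = 1 - gb / baseline
        print(f"{label:>15}: {gb:6.1f} GB ({saved:.0%} below 16-bit)")
```

For a 70B model this gives about 140 GB at 16-bit, 70 GB at 8-bit, and 35 GB at 4-bit. These are best-case ratios for the weights alone; measured savings are typically a little lower because of the overhead noted above.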

AWQ example (recommended for most serving workloads):

args:
  - --model
  - TheBloke/Llama-2-70B-AWQ
  - --quantization
  - awq
  - --tensor-parallel-size
  - "2"  # 70B AWQ fits on 2× A100 40GB
  - --dtype
  - auto
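The "fits on 2× A100 40GB" comment can be sanity-checked with simple arithmetic. Tensor parallelism splits the weights roughly evenly across GPUs, so each GPU holds about 1/N of them. The sketch below assumes about 35 GB of 4-bit weights for a 70B model (70e9 parameters × 0.5 bytes) and a perfectly even split; per_gpu_weights_gb is a hypothetical helper, and the real split carries some per-GPU overhead.

```python
def per_gpu_weights_gb(total_weights_gb: float, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU under an even tensor-parallel split."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return total_weights_gb / tensor_parallel_size


if __name__ == "__main__":
    awq_70b_gb = 35.0  # assumption: 70e9 params at 4 bits, raw weights only
    gpu_gb = 40.0      # A100 40GB
    for tp in (1, 2, 4):
        share = per_gpu_weights_gb(awq_70b_gb, tp)
        print(f"TP={tp}: {share:.1f} GB weights per GPU, "
              f"{gpu_gb - share:.1f} GB left for KV cache and activations")
```

With --tensor-parallel-size 2, each A100 40GB holds roughly 17.5 GB of weights, leaving about 22.5 GB per GPU before runtime overhead. A single 40 GB card would have only about 5 GB to spare under the same assumptions.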

Note on --quantization awq: This flag expects a checkpoint that has already been quantized with AWQ. vLLM will not convert a full-precision model to AWQ at load time. Download a pre-quantized model (for example from TheBloke on Hugging Face) or quantize your own with AutoAWQ.

FP8 on H100 (fastest path if you have the hardware):

args:
  - --model
  - meta-llama/Meta-Llama-3-70B-Instruct
  - --quantization
  - fp8
  - --dtype
  - auto
  - --tensor-parallel-size
  - "2"

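Note: Per vLLM's FP8 documentation, native FP8 compute (W8A8) needs GPUs with compute capability 8.9 or higher, which means Ada Lovelace (L40S, RTX 4090) and Hopper (H100, H200) or newer. On Ampere GPUs such as the A100, vLLM can still load FP8 models as weight-only W8A16 through its FP8 Marlin kernels, which saves memory but does not deliver the FP8 compute speedup. If you need the speedup on Ampere or earlier GPUs, use AWQ or GPTQ instead.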

For guidance on quantization trade-offs, see NVIDIA’s deep learning performance documentation.

Batch Size Tuning

--max-num-seqs caps how many sequences vLLM schedules in a single iteration, which is the effective maximum batch size. Once the model fits in memory, it is the single most impactful throughput lever in production.

Start here: For 7B models, set --max-num-seqs 256. For 70B models, start at 128. Run vLLM's benchmark_serving.py (in the repository's benchmarks/ directory) at your target request rate. Lower the batch size if TTFT p99 exceeds your SLO; raise it if GPU utilization stays below 80%.

A100 80GB benchmarks (Llama 3 8B, 1024 input / 256 output tokens):

Batch Size	Throughput (tok/s)	TTFT p99 (ms)	TPOT p99 (ms)	GPU Utilization
64	1,850	180	42	62%
128	2,940	320	58	78%
256	3,820	580	95	89%
512	4,100	1,200	180	94%

Source: Benchmarked on vLLM 0.8.4, NVIDIA A100 80GB SXM4, CUDA 12.4, continuous batching enabled.

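Recommendation: Target 128–256 for interactive chat and 512+ for offline batch inference jobs. Note that in the table above only 128 keeps TTFT p99 under 500 ms with room to spare; 256 measured 580 ms, so use it only if your SLO tolerates that.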
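The tuning rule can be applied mechanically to the benchmark table. The following is a minimal sketch, assuming you have your own measurements in the same shape; pick_max_num_seqs is a hypothetical helper, and the numbers are copied from the A100 table above, so they only hold for that hardware, model, and prompt length.

```python
from typing import Optional

# (batch size, throughput tok/s, TTFT p99 ms) copied from the table above.
BENCHMARKS = [
    (64, 1850, 180),
    (128, 2940, 320),
    (256, 3820, 580),
    (512, 4100, 1200),
]


def pick_max_num_seqs(ttft_slo_ms: float) -> Optional[int]:
    """Return the largest measured batch size whose TTFT p99 meets the SLO."""
    passing = [batch for batch, _tput, ttft in BENCHMARKS if ttft <= ttft_slo_ms]
    return max(passing) if passing else None


if __name__ == "__main__":
    for slo in (500, 600, 1500):
        print(f"TTFT p99 SLO {slo} ms -> --max-num-seqs {pick_max_num_seqs(slo)}")
```

With these measurements, a 500 ms SLO selects 128, a 600 ms SLO selects 256, and a 1,500 ms SLO selects 512. The helper returns None when no measured setting meets the SLO, which is a signal to benchmark smaller batch sizes.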

Memory Limits and GPU Utilization

Set --gpu-memory-utilization to 0.90 for dedicated inference nodes, 0.85 if you’re running sidecars. Never push past 0.95: CUDA needs headroom for scratch space. I’ve watched 0.98 crash with CUDA out of memory during peak batch windows.
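To see why the last few percent matter, it helps to turn the fraction into gigabytes. --gpu-memory-utilization is the share of each GPU's memory that vLLM budgets for itself; whatever remains has to cover everything outside that budget. A minimal sketch of the arithmetic for an 80 GB card follows; memory_outside_vllm_gb is a hypothetical helper, not a vLLM function.

```python
def memory_outside_vllm_gb(gpu_total_gb: float, gpu_memory_utilization: float) -> float:
    """GB of GPU memory left outside vLLM's budget, rounded to 2 decimals."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return round(gpu_total_gb * (1.0 - gpu_memory_utilization), 2)


if __name__ == "__main__":
    for util in (0.85, 0.90, 0.95, 0.98):
        spare = memory_outside_vllm_gb(80.0, util)
        print(f"--gpu-memory-utilization {util:.2f}: {spare:.1f} GB spare on an 80 GB GPU")
```

On an 80 GB GPU, 0.90 leaves about 8 GB of headroom, 0.95 leaves 4 GB, and 0.98 leaves only about 1.6 GB.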

Graceful Shutdown Handling

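vLLM ships without graceful shutdown. You have to wire it up in Kubernetes to prevent dropped requests during rolling updates. A preStop sleep keeps the container running while Kubernetes removes the pod from Service endpoints, so new traffic stops arriving before the process is signaled. The sleep counts against terminationGracePeriodSeconds, so the grace period must be longer than the sleep: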

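# Container level: spec.template.spec.containers[].lifecycle
lifecycle:
  preStop:
    exec:
      # Keep running while the pod is removed from Service endpoints
      command: ["/bin/sh", "-c", "sleep 30"]
# Pod level: spec.template.spec, not inside the container entry
terminationGracePeriodSeconds: 60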
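Kubernetes starts the grace-period countdown when the pod is marked for termination, and the preStop hook runs inside that window. The time left between SIGTERM and SIGKILL is therefore the grace period minus the sleep. A small sketch of that arithmetic follows; seconds_after_prestop is a hypothetical helper, not part of Kubernetes or vLLM.

```python
def seconds_after_prestop(termination_grace_period: int, pre_stop_sleep: int) -> int:
    """Seconds between SIGTERM and SIGKILL once the preStop sleep has finished."""
    if pre_stop_sleep >= termination_grace_period:
        raise ValueError("preStop sleep must be shorter than the grace period")
    return termination_grace_period - pre_stop_sleep


if __name__ == "__main__":
    remaining = seconds_after_prestop(termination_grace_period=60, pre_stop_sleep=30)
    print(f"{remaining} s remain after the preStop sleep before SIGKILL")
```

With the values above, in-flight requests get the 30-second sleep to finish while no new traffic arrives, plus up to 30 more seconds before the container is force-killed. If your longest generations run longer than that, raise both values together.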

Use rolling updates with maxUnavailable: 0 for zero-downtime deployments:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

Continue to Part 4: Performance Tuning and Monitoring where we tackle KV cache management, scheduling strategies, and Prometheus observability.