vLLM Performance Tuning Monitoring Production
Part 4 of 6. In Part 3 we tuned tensor parallelism and quantization. Here we optimize performance and set up monitoring. Continue to Part 5: Comparison, Checklist, and Troubleshooting.
Performance Tuning for vLLM Workloads
Optimizing vLLM comes down to three things: managing the KV cache, tuning scheduling parameters, and warming up GPU kernels before production traffic arrives.
KV Cache Management
Two parameters control KV cache behavior, and you need to understand both:
- --max-model-len: Hard cap on total sequence length (input + generated tokens). In FP16 the KV cache stores one key and one value vector per layer per token at 2 bytes per element, so with grouped-query attention Llama 3 8B needs ~128 MB per 1K tokens per sequence and Llama 3 70B needs ~320 MB. Formula: 2 × num_layers × num_kv_heads × head_dim × 2_bytes × sequence_length.
- --gpu-memory-utilization: Fraction of each GPU's VRAM that vLLM is allowed to claim (default 0.9). Whatever remains of that budget after model weights and activation overhead goes to the KV cache.
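For 32K+ contexts, throw more GPUs or aggressive quantization at the problem. A 70B model at 32K context without quantization realistically needs 4× A100 80GB: the FP16 weights alone take about 140 GB, so two cards leave almost nothing for the KV cache.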
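To sanity-check the sizing for your own model, the formula can be evaluated directly. The sketch below assumes the published Llama 3 shapes (32 layers for 8B, 80 layers for 70B, 8 KV heads and a head dimension of 128 for both) and FP16 cache entries. `kv_cache_bytes` and `MODELS` are illustrative names, not part of vLLM.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size in bytes for a single sequence."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed model shapes, taken from the published Llama 3 configurations.
MODELS = {
    "llama3-8b": dict(num_layers=32, num_kv_heads=8, head_dim=128),
    "llama3-70b": dict(num_layers=80, num_kv_heads=8, head_dim=128),
}

MIB = 1024 ** 2

for name, shape in MODELS.items():
    for seq_len in (1024, 32768):
        size = kv_cache_bytes(seq_len=seq_len, **shape)
        print(f"{name} @ {seq_len} tokens: {size / MIB:.0f} MiB per sequence")
```

Multiply the per-sequence figure by the number of concurrent sequences you expect, then compare the result with the VRAM left over once the weights are loaded.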
Request Scheduling Strategies
| Flag | Recommended Value | Effect |
|---|---|---|
| --max-num-batched-tokens | 2048 (chat), 8192+ (batch) | Tokens per forward pass. Lower for latency, higher for throughput. |
| --max-paddings | Default | Usually safe to leave at default. |
| --scheduling-policy | fcfs or priority | First-come-first-served is simplest; priority for tiered SLOs. |
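One way to keep these choices reviewable is to build the launch command from a named profile. This is a minimal sketch: `PROFILES` and `build_serve_command` are hypothetical names, the values come from the table above, and the flags are only assumed to be available in your vLLM version (check `vllm serve --help` before relying on them).

```python
import shlex

# Hypothetical profiles using the recommended values from the table.
PROFILES = {
    "chat": {"max_num_batched_tokens": 2048, "scheduling_policy": "fcfs"},
    "batch": {"max_num_batched_tokens": 8192, "scheduling_policy": "fcfs"},
}


def build_serve_command(model, profile):
    """Return the argv list for launching vLLM with the given profile."""
    opts = PROFILES[profile]
    return [
        "vllm", "serve", model,
        "--max-num-batched-tokens", str(opts["max_num_batched_tokens"]),
        "--scheduling-policy", opts["scheduling_policy"],
    ]


cmd = build_serve_command("meta-llama/Meta-Llama-3-8B-Instruct", "chat")
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(cmd))
```

Keeping the profile in code makes the latency-versus-throughput trade-off an explicit, diffable decision instead of a flag buried in a deployment manifest.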
Warm-Up Procedures
Cold instances suffer high latency from CUDA kernel JIT compilation. Always fire burner requests before production traffic hits:
```bash
python -c "
import openai
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
for _ in range(10):
    client.chat.completions.create(
        model='meta-llama/Meta-Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': 'Hello'}],
        max_tokens=50,
    )
"
```

Exclude the first 5–10 requests from SLI calculations. I tag mine with a warmup: true label in load testing scripts.
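Excluding warm-up samples is easy to get wrong when done by hand. The sketch below shows one way to filter them, assuming each load-test record is a dict with a `latency_s` value and an optional `warmup` flag. Both field names are illustrative and simply mirror the warmup: true label mentioned above; the sample data is made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def sli_latencies(records):
    """Keep only latencies from requests not tagged as warm-up."""
    return [r["latency_s"] for r in records if not r.get("warmup", False)]


# Made-up data: two slow warm-up requests, then twenty steady-state requests.
records = [
    {"latency_s": 4.2, "warmup": True},
    {"latency_s": 3.1, "warmup": True},
] + [{"latency_s": 0.2 + 0.01 * i} for i in range(20)]

steady = sli_latencies(records)
print(f"samples kept: {len(steady)}, p99: {percentile(steady, 99):.2f}s")
```

Without the filter, the two cold requests would dominate the tail and make the p99 look far worse than steady-state behavior.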
Monitoring and Observability with PromQL Queries
Track four core SLIs in production: TTFT, TPOT, throughput, and queue depth. Prometheus collects them, Grafana visualizes them.
Key Prometheus Metrics
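vLLM exposes metrics at /metrics. vLLM itself emits metric names with a colon after the prefix (for example vllm:num_requests_waiting), so check what your own /metrics endpoint reports before copying the names below. Essential alerts for serving: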
| Metric | Type | Alert Threshold | Severity |
|---|---|---|---|
| vllm_time_to_first_token_seconds | Histogram | p99 > 500ms | P1 |
| vllm_time_per_output_token_seconds | Histogram | p99 > 100ms | P1 |
| vllm_num_requests_waiting | Gauge | > 50 per pod | P2 |
| vllm_gpu_cache_usage_perc | Gauge | > 85% | P2 |
| vllm_num_requests_running | Gauge | > 100 per pod | P2 |
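Before wiring these thresholds into alert rules, it can help to test them against a raw scrape. The sketch below parses a small sample in the Prometheus text exposition format and lists the gauges that exceed the table's limits. The sample payload is invented for illustration, the parser ignores labels and assumes no timestamps, and the cache gauge is assumed to be reported as a fraction between 0 and 1 (adjust the limit if yours reports a percentage).

```python
# Gauge limits from the table; metric names follow the article's spelling.
GAUGE_LIMITS = {
    "vllm_num_requests_waiting": 50,
    "vllm_num_requests_running": 100,
    "vllm_gpu_cache_usage_perc": 0.85,  # assumption: reported as 0 to 1
}

# Invented sample scrape, for illustration only.
SAMPLE_SCRAPE = """\
# HELP vllm_num_requests_waiting Requests waiting to be scheduled.
# TYPE vllm_num_requests_waiting gauge
vllm_num_requests_waiting{model_name="demo"} 72
vllm_num_requests_running{model_name="demo"} 40
vllm_gpu_cache_usage_perc{model_name="demo"} 0.91
"""


def parse_gauges(text):
    """Map metric name to value, ignoring comments and labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def breached(values, limits=GAUGE_LIMITS):
    """Return the sorted names of gauges strictly above their limit."""
    return sorted(n for n, limit in limits.items() if values.get(n, 0) > limit)


print(breached(parse_gauges(SAMPLE_SCRAPE)))
```

The histogram metrics are left out on purpose: their p99 thresholds need a quantile computed over buckets, which is a job for PromQL, not a single scrape.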
Core SLIs: TTFT, TPOT, and Throughput
TTFT (Time to First Token): Latency from request submission to the first response token. Target p99 < 300ms for chat apps. NVIDIA benchmarking guides call TTFT the primary user-perceived latency metric.
TPOT (Time Per Output Token): Inter-token latency during streaming. Target p99 < 80ms for comfortable reading.
Throughput: Total tokens per second across all concurrent requests. This is your cost-efficiency metric for sizing.
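The three definitions above reduce to simple arithmetic on timestamps. Here is a minimal sketch, assuming you record when the request was sent and when each streamed token arrived; `stream_slis` and `throughput` are hypothetical helper names and the timings are invented.

```python
def stream_slis(request_sent, token_times):
    """Return (ttft, tpot) in seconds for one streamed response."""
    if not token_times:
        raise ValueError("response produced no tokens")
    # TTFT: delay between sending the request and the first token.
    ttft = token_times[0] - request_sent
    # TPOT: mean gap between consecutive tokens after the first.
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def throughput(total_tokens, window_seconds):
    """Tokens per second across all requests completed in the window."""
    return total_tokens / window_seconds


# Invented timings: first token at 250 ms, then one token every 50 ms.
ttft, tpot = stream_slis(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(f"TTFT={ttft * 1000:.0f} ms, TPOT={tpot * 1000:.0f} ms")

# Three such responses of 5 tokens each finishing within half a second.
print(f"throughput={throughput(3 * 5, 0.5):.0f} tokens/s")
```

Note that TTFT and TPOT are per-request figures, while throughput only makes sense summed over every request in flight.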
Essential PromQL Queries
Import the official vLLM Grafana dashboard (ID: 25043), then add these panels:
TTFT p99 (5-minute window):

```promql
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))
```

TPOT p99 (5-minute window):

```promql
histogram_quantile(0.99, rate(vllm_time_per_output_token_seconds_bucket[5m]))
```

Queue depth (requests waiting):

```promql
vllm_num_requests_waiting
```

Active requests per pod:

```promql
vllm_num_requests_running
```

GPU memory utilization (%):

```promql
(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100
```

GPU power draw (Watts):

```promql
nvidia_gpu_power_usage_milliwatts / 1000
```

Throughput (tokens generated per second):

```promql
rate(vllm_generation_tokens_total[5m])
```

HPA scale trigger (average running requests):

```promql
avg(vllm_num_requests_running) by (pod)
```

I keep a four-panel at-a-glance row: TTFT, TPOT, queue depth, and GPU utilization. If any of them turns red, something’s broken.
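The p99 values from histogram_quantile are estimates interpolated from bucket counts, so their precision depends on where the bucket boundaries sit. The sketch below is a simplified Python rendering of the linear interpolation that the Prometheus documentation describes. It assumes cumulative buckets sorted by upper bound and ending in +Inf, and the bucket data is invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented TTFT buckets in seconds: 98 of 100 requests finished within 0.5 s.
ttft_buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, ttft_buckets)
print(f"estimated p99 TTFT: {p99:.2f} s")
```

In this example the p99 can land anywhere between 0.5 s and 1.0 s depending only on interpolation, so a 500 ms alert threshold needs a bucket boundary close to 0.5 s to be meaningful.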
Continue to Part 5: Comparison, Checklist, and Troubleshooting where we pit vLLM against Ollama and run through the production readiness checklist.