vLLM Performance Tuning and Monitoring in Production

Part 4 of 6. In Part 3 we tuned tensor parallelism and quantization. Here we optimize performance and set up monitoring. Continue to Part 5: Comparison, Checklist, and Troubleshooting.

Performance Tuning for vLLM Workloads

Optimizing vLLM comes down to three things: managing the KV cache, tuning scheduling parameters, and warming up GPU kernels before production traffic arrives.

KV Cache Management

Two parameters control KV cache behavior, and you need to understand both:

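--max-model-len: Hard cap on total sequence length (input + generated tokens). KV cache memory grows linearly with this value: every token stores one key and one value vector per KV head per layer, at 2 bytes per element in FP16. Formula per sequence: 2 (K and V) × num_layers × num_kv_heads × head_dim × 2_bytes × sequence_length. Llama 3 8B (32 layers, 8 KV heads, head_dim 128) uses ~128 MB per 1K tokens per sequence; by the same formula, Llama 3 70B (80 layers, 8 KV heads, head_dim 128) uses ~320 MB.
--gpu-memory-utilization: Fraction of each GPU's VRAM that vLLM is allowed to use in total (default 0.9). Model weights and activation workspace come out of this budget first; whatever remains goes to the KV cache.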
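To make the KV cache formula concrete, here is a minimal sketch that evaluates it. The function name is hypothetical, and the layer and head counts are taken from the published Llama 3 model configs rather than from vLLM itself.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    bytes_per_elem=2 corresponds to FP16/BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed configs: Llama 3 8B has 32 layers, Llama 3 70B has 80 layers;
# both use 8 KV heads (grouped-query attention) with head_dim 128.
for name, layers in (("Llama 3 8B", 32), ("Llama 3 70B", 80)):
    per_token = kv_cache_bytes_per_token(layers, 8, 128)
    mib_per_1k = per_token * 1024 / 2**20
    print(f"{name}: {per_token} bytes/token, {mib_per_1k:.0f} MiB per 1K tokens")
```

Multiply the per-1K figure by your --max-model-len (in thousands) and by the number of concurrent sequences to get a rough cache budget.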

For 32K+ contexts, add GPUs or quantize more aggressively. An unquantized 70B model at 32K context typically needs 4× A100 80GB: the FP16 weights alone take about 140 GB, before any KV cache is allocated.
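A rough back-of-the-envelope sketch shows why two 80 GB cards are not enough for this case. Everything here is an assumption for illustration: the helper name is hypothetical, each A100 is treated as having 80 GiB usable, utilization is 0.9, and activation memory, CUDA graphs, and fragmentation are ignored, so real capacity will be lower.

```python
GIB = 2**30


def max_concurrent_sequences(num_gpus: int, gpu_mem_gib: float, utilization: float,
                             weight_bytes: float, kv_bytes_per_token: int,
                             seq_len: int) -> int:
    """Estimate how many full-length sequences fit in the leftover KV budget."""
    budget_bytes = num_gpus * gpu_mem_gib * GIB * utilization
    kv_budget_bytes = budget_bytes - weight_bytes
    if kv_budget_bytes <= 0:
        return 0  # the weights alone do not fit
    return int(kv_budget_bytes // (kv_bytes_per_token * seq_len))


WEIGHT_BYTES_70B_FP16 = 70e9 * 2           # 70B parameters at 2 bytes each
KV_BYTES_PER_TOKEN_70B = 2 * 80 * 8 * 128 * 2  # 80 layers, 8 KV heads, head_dim 128, FP16

for gpus in (1, 2, 4):
    fits = max_concurrent_sequences(gpus, 80, 0.9, WEIGHT_BYTES_70B_FP16,
                                    KV_BYTES_PER_TOKEN_70B, 32 * 1024)
    print(f"{gpus} GPU(s): about {fits} concurrent 32K-token sequences")
```

Under these assumptions two GPUs hold the weights but leave room for only a single 32K sequence, while four leave room for roughly fifteen.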

Request Scheduling Strategies

Flag	Recommended Value	Effect
`--max-num-batched-tokens`	2048 (chat), 8192+ (batch)	Tokens per forward pass. Lower for latency, higher for throughput.
`--max-paddings`	Default	Present only in older vLLM releases; where it exists, the default is usually safe.
`--scheduling-policy`	`fcfs` or `priority`	First-come-first-served is simplest; priority for tiered SLOs.
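One way to keep the chat and batch settings from the table apart is to generate the launch command from a named profile. This is only a sketch: the profile names and the helper are hypothetical, and it assumes the `vllm serve` entry point with the `--max-num-batched-tokens` flag shown above.

```python
import shlex

# Token budgets per forward pass, taken from the table above.
TOKEN_BUDGETS = {"chat": 2048, "batch": 8192}


def vllm_serve_args(model: str, profile: str) -> list[str]:
    """Build an argv list for `vllm serve` using the budget for the given profile."""
    if profile not in TOKEN_BUDGETS:
        raise ValueError(f"unknown profile {profile!r}; expected one of {sorted(TOKEN_BUDGETS)}")
    return [
        "vllm", "serve", model,
        "--max-num-batched-tokens", str(TOKEN_BUDGETS[profile]),
    ]


# Print a copy-pasteable command line for each profile.
for profile in TOKEN_BUDGETS:
    print(shlex.join(vllm_serve_args("meta-llama/Meta-Llama-3-8B-Instruct", profile)))
```

Keeping the values in one dictionary makes it harder for a latency-tuned deployment to drift toward batch settings unnoticed.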

Warm-Up Procedures

Cold instances suffer high latency from CUDA kernel JIT compilation. Always fire burner requests before production traffic hits:

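python -c "
# Send a few throwaway chat requests so the first real users do not pay the cold-start cost.
import openai
# api_key is a placeholder; the OpenAI client requires some value to be set.
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
for _ in range(10):
    client.chat.completions.create(
        model='meta-llama/Meta-Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': 'Hello'}],
        max_tokens=50
    )
"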

Exclude the first 5–10 requests from SLI calculations. I tag mine with a warmup: true label in load testing scripts.
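Here is a small sketch of how that tagging can feed into the numbers. The sample records and field names (`latency_s`, `warmup`) are made up for illustration; the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(math.ceil(q / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def steady_state_latencies(samples: list[dict]) -> list[float]:
    """Drop any sample tagged as warm-up before computing SLIs."""
    return [s["latency_s"] for s in samples if not s.get("warmup", False)]


# Hypothetical load-test records: the first two requests hit a cold instance.
samples = [
    {"latency_s": 4.2, "warmup": True},
    {"latency_s": 1.9, "warmup": True},
    {"latency_s": 0.21},
    {"latency_s": 0.25},
    {"latency_s": 0.19},
]

all_latencies = [s["latency_s"] for s in samples]
print("p99 with warm-up:   ", percentile(all_latencies, 99))
print("p99 without warm-up:", percentile(steady_state_latencies(samples), 99))
```

Even two cold requests are enough to dominate a tail percentile on a short test run, which is why they are worth filtering explicitly.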

Monitoring and Observability with PromQL Queries

Track four core SLIs in production: TTFT, TPOT, throughput, and queue depth. Prometheus collects them, Grafana visualizes them.

Key Prometheus Metrics

vLLM exposes metrics at /metrics. Essential alerts for serving:

Metric	Type	Alert Threshold	Severity
`vllm:time_to_first_token_seconds`	Histogram	p99 > 500ms	P1
`vllm:time_per_output_token_seconds`	Histogram	p99 > 100ms	P1
`vllm:num_requests_waiting`	Gauge	> 50 per pod	P2
`vllm:gpu_cache_usage_perc`	Gauge	> 0.85 (85%)	P2
`vllm:num_requests_running`	Gauge	> 100 per pod	P2
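For a quick check without a full Prometheus stack, the gauge thresholds above can be tested against the raw /metrics text. The sketch below makes several assumptions: the sample payload is invented, it uses the `vllm:` prefix that vLLM's exporter emits, it treats the cache usage gauge as a 0 to 1 fraction, and the parser ignores timestamps and keeps only the last series per metric name.

```python
# Hypothetical /metrics excerpt; real output has many more series and labels.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 62.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.91
vllm:num_requests_running{model_name="demo"} 40.0
"""

# Gauge thresholds from the alert table.
GAUGE_THRESHOLDS = {
    "vllm:num_requests_waiting": 50,
    "vllm:gpu_cache_usage_perc": 0.85,
    "vllm:num_requests_running": 100,
}


def parse_gauges(text: str) -> dict[str, float]:
    """Very small parser for 'name{labels} value' lines; skips comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def breached(values: dict[str, float]) -> list[str]:
    """Return the metric names whose value exceeds its threshold."""
    return sorted(n for n, limit in GAUGE_THRESHOLDS.items() if values.get(n, 0.0) > limit)


print(breached(parse_gauges(SAMPLE_METRICS)))
```

In production the same thresholds belong in Prometheus alerting rules; a script like this is mainly useful for smoke tests and debugging a single pod.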

Core SLIs: TTFT, TPOT, and Throughput

TTFT (Time to First Token): Latency from request submission to the first response token. Target p99 < 300ms for chat apps. NVIDIA benchmarking guides call TTFT the primary user-perceived latency metric.

TPOT (Time Per Output Token): Inter-token latency during streaming. Target p99 < 80ms for comfortable reading.

Throughput: Total tokens per second across all concurrent requests. This is your cost-efficiency metric for sizing.
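The three definitions can be pinned down with a few lines of arithmetic over token arrival timestamps. This is a sketch of the definitions only, not how vLLM computes its histograms internally; the function names and the example timestamps are hypothetical.

```python
def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: first token arrival minus request submission."""
    return token_times[0] - request_start


def tpot(token_times: list[float]) -> float:
    """Mean time per output token, measured across the gaps after the first token."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def throughput(total_tokens: int, window_s: float) -> float:
    """Tokens per second across all requests observed in the window."""
    return total_tokens / window_s


# One streamed response: submitted at t=0, four tokens arriving 62.5 ms apart.
arrivals = [0.25, 0.3125, 0.375, 0.4375]
print(f"TTFT: {ttft(0.0, arrivals) * 1000:.1f} ms")
print(f"TPOT: {tpot(arrivals) * 1000:.1f} ms")
print(f"Throughput: {throughput(len(arrivals), arrivals[-1]):.1f} tokens/s")
```

Note that TPOT deliberately excludes the first token, so a slow prefill shows up in TTFT rather than inflating the inter-token figure.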

Essential PromQL Queries

Import the official vLLM Grafana dashboard (ID: 25043), then add these panels:

TTFT p99 (5-minute window):

histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))

TPOT p99 (5-minute window):

histogram_quantile(0.99, rate(vllm:time_per_output_token_seconds_bucket[5m]))

Queue depth (requests waiting):

vllm:num_requests_waiting

Active requests per pod:

vllm:num_requests_running

GPU memory utilization (%):

(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100

GPU power draw (Watts):

nvidia_gpu_power_usage_milliwatts / 1000

Throughput (tokens generated per second):

rate(vllm:generation_tokens_total[5m])

HPA scale trigger (average running requests):

avg(vllm:num_requests_running) by (pod)
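The p99 queries above rely on histogram_quantile, which estimates a quantile by interpolating linearly inside cumulative buckets. The following is a simplified Python rendition of that idea, useful for seeing why bucket boundaries limit precision. It is a sketch, not Prometheus's implementation: the bucket values are invented, and it assumes at least one finite bucket followed by a +Inf bucket.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report its bound.
                return buckets[-2][0]
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical TTFT buckets in seconds: (le, cumulative observations).
ttft_buckets = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, ttft_buckets))
```

In the real queries the counts are per-second rates rather than raw totals, but the interpolation works the same way. If many observations land in the +Inf bucket, the reported p99 is capped at the largest finite bound, so bucket ranges need to cover your alert thresholds.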

I keep a four-panel at-a-glance row: TTFT, TPOT, queue depth, and GPU utilization. If any of them turns red, something’s broken.

Continue to Part 5: Comparison, Checklist, and Troubleshooting where we pit vLLM against Ollama and run through the production readiness checklist.