vLLM Performance Tuning Monitoring Production

2026.02.15

Part 4 of 6. In Part 3 we tuned tensor parallelism and quantization. Here we optimize performance and set up monitoring. Continue to Part 5: Comparison, Checklist, and Troubleshooting.

Performance Tuning for vLLM Workloads

Optimizing vLLM comes down to three things: managing the KV cache, tuning scheduling parameters, and warming up GPU kernels before production traffic arrives.

KV Cache Management

Two parameters control KV cache behavior, and you need to understand both:

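  • --max-model-len: Hard cap on total sequence length (input + generated tokens). In FP16 the KV cache stores 2 bytes per element, and every token adds one key vector and one value vector per layer. Formula per sequence: 2 (K and V) × num_layers × num_kv_heads × head_dim × 2 bytes × sequence_length. Llama 3 8B (32 layers, 8 KV heads, head dimension 128) uses ~128 MB per 1K tokens per sequence; Llama 3 70B (80 layers, same KV head layout) uses ~320 MB.
  • --gpu-memory-utilization: Fraction of each GPU's VRAM that vLLM is allowed to use in total (default 0.9). Model weights and activation overhead come out of that budget first; whatever remains goes to the KV cache.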
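The formula is easy to sanity-check in a few lines. This is a minimal sketch, not vLLM's own accounting: the function name is mine, and the layer and head counts are the published Llama 3 configurations as I understand them, so verify them against your model's config.json.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence.

    The leading 2 covers keys and values; bytes_per_elem=2 assumes FP16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


MIB = 1024 ** 2

# Assumed configs: Llama 3 8B has 32 layers, 70B has 80 layers,
# both with 8 KV heads (grouped-query attention) and head_dim 128.
llama3_8b = kv_cache_bytes(32, 8, 128, seq_len=1024)
llama3_70b = kv_cache_bytes(80, 8, 128, seq_len=1024)

print(f"Llama 3 8B:  {llama3_8b / MIB:.0f} MiB per 1K tokens")
print(f"Llama 3 70B: {llama3_70b / MIB:.0f} MiB per 1K tokens")
```

Multiply the per-sequence figure by the number of concurrent sequences you expect to get the total KV cache you need to budget for.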

For 32K+ contexts, add GPUs or quantize aggressively. A 70B model in FP16 needs about 140 GB for weights alone, so serving it at 32K context without quantization realistically takes 4× A100 80GB once you leave room for the KV cache.
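To see where the GPU count comes from, here is a rough capacity estimate. Treat it as a back-of-the-envelope sketch: the helper name is hypothetical, activation overhead is ignored unless you pass it in, and the inputs (about 130 GiB of FP16 weights, 10 GiB of KV cache per 32K-token sequence) are my assumptions for a 70B model with 80 layers, 8 KV heads, and head dimension 128.

```python
def max_concurrent_sequences(num_gpus, vram_gib, gpu_mem_util, weights_gib,
                             kv_gib_per_seq, overhead_gib=0.0):
    """Rough count of full-length sequences whose KV cache fits in the budget."""
    budget_gib = num_gpus * vram_gib * gpu_mem_util  # what vLLM may use in total
    kv_budget_gib = budget_gib - weights_gib - overhead_gib
    if kv_budget_gib <= 0:
        return 0  # the weights alone do not fit
    return int(kv_budget_gib // kv_gib_per_seq)


# Assumed inputs: 70e9 params x 2 bytes ~ 130 GiB of weights;
# 320 KiB of KV cache per token x 32768 tokens = 10 GiB per sequence.
for gpus in (2, 4):
    n = max_concurrent_sequences(gpus, vram_gib=80, gpu_mem_util=0.90,
                                 weights_gib=130, kv_gib_per_seq=10)
    print(f"{gpus}x 80 GiB: about {n} concurrent 32K-token sequences")
```

Under these assumptions two GPUs hold the weights but leave room for only a single full-length sequence, which is why four is the practical floor.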

Request Scheduling Strategies

| Flag | Recommended Value | Effect |
| --- | --- | --- |
| --max-num-batched-tokens | 2048 (chat), 8192+ (batch) | Tokens per forward pass. Lower for latency, higher for throughput. |
| --max-paddings | Default | Usually safe to leave at default. Only present in older vLLM releases, so check --help for your version. |
| --scheduling-policy | fcfs or priority | First-come-first-served is simplest; use priority for tiered SLOs. |
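One way to keep these settings consistent across deployments is to generate the launch command from a named profile. The sketch below is illustrative only: the profile names and the build_serve_command helper are hypothetical, and the values are simply the ones from the table.

```python
import shlex

# Values from the table above; the profile names are illustrative.
PROFILES = {
    "chat": {"--max-num-batched-tokens": 2048, "--scheduling-policy": "fcfs"},
    "batch": {"--max-num-batched-tokens": 8192, "--scheduling-policy": "fcfs"},
}


def build_serve_command(model, profile):
    """Return the argv list for `vllm serve` with the profile's scheduling flags."""
    argv = ["vllm", "serve", model]
    for flag, value in PROFILES[profile].items():
        argv += [flag, str(value)]
    return argv


cmd = build_serve_command("meta-llama/Meta-Llama-3-8B-Instruct", "chat")
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Keeping the profiles in one dictionary makes it obvious when a latency-sensitive pod has been started with batch settings.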

Warm-Up Procedures

Cold instances suffer high latency from CUDA kernel JIT compilation. Always fire burner requests before production traffic hits:

python -c "
import openai

# Any non-empty key works unless the server was started with --api-key.
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

# Send throwaway requests so first-request overhead is paid before real traffic.
for _ in range(10):
    client.chat.completions.create(
        model='meta-llama/Meta-Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': 'Hello'}],
        max_tokens=50,
    )
"

Exclude the first 5–10 requests from SLI calculations. I tag mine with a warmup: true label in load testing scripts.
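Filtering on that label when computing an SLI might look like the sketch below. The sample format (a dict with latency_s and an optional warmup key) is an assumption about how a load testing script could record results, not something vLLM produces.

```python
import math


def p99_excluding_warmup(samples):
    """Nearest-rank p99 latency over samples that are not tagged as warm-up."""
    latencies = sorted(
        s["latency_s"] for s in samples if not s.get("warmup", False)
    )
    if not latencies:
        raise ValueError("no non-warmup samples")
    rank = math.ceil(len(latencies) * 99 / 100)  # 1-based nearest rank
    return latencies[rank - 1]


# Ten slow warm-up requests followed by 100 steady-state requests.
samples = [{"latency_s": 4.0, "warmup": True} for _ in range(10)]
samples += [{"latency_s": 0.2 + 0.001 * i} for i in range(100)]

p99 = p99_excluding_warmup(samples)
print(f"p99 without warm-up: {p99:.3f}s")
```

Without the filter, the ten 4-second warm-up requests would dominate the tail and the p99 would report 4.0 seconds.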

Monitoring and Observability with PromQL Queries

Track four core SLIs in production: TTFT, TPOT, throughput, and queue depth. Prometheus collects them, Grafana visualizes them.

Key Prometheus Metrics

vLLM exposes metrics at /metrics. Essential alerts for serving:

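| Metric | Type | Alert Threshold | Severity |
| --- | --- | --- | --- |
| vllm:time_to_first_token_seconds | Histogram | p99 > 500ms | P1 |
| vllm:time_per_output_token_seconds | Histogram | p99 > 100ms | P1 |
| vllm:num_requests_waiting | Gauge | > 50 per pod | P2 |
| vllm:gpu_cache_usage_perc | Gauge | > 85% (reported as a 0 to 1 ratio, so 0.85) | P2 |
| vllm:num_requests_running | Gauge | > 100 per pod | P2 |

vLLM emits these names with a vllm: prefix. If your scrape config relabels them, adjust the names here and in the queries that follow.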
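For a quick check without a full Prometheus stack, the gauge thresholds can be tested directly against the text that /metrics returns. This is a minimal sketch with a deliberately simple parser: it assumes the vllm:-prefixed names that vLLM emits natively and a cache usage gauge reported as a 0 to 1 ratio, and the sample payload is made up.

```python
import re

# Gauge thresholds from the table above (cache usage assumed to be a 0-1 ratio).
THRESHOLDS = {
    "vllm:num_requests_waiting": 50,
    "vllm:gpu_cache_usage_perc": 0.85,
    "vllm:num_requests_running": 100,
}

# Metric name, optional {labels}, then the sample value.
LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def breached(metrics_text):
    """Return {metric_name: value} for every gauge above its threshold."""
    out = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        m = LINE.match(line)
        if m and m.group(1) in THRESHOLDS:
            value = float(m.group(2))
            if value > THRESHOLDS[m.group(1)]:
                out[m.group(1)] = value
    return out


# Made-up payload in the Prometheus text exposition format.
sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 62
vllm:gpu_cache_usage_perc{model_name="demo"} 0.41
vllm:num_requests_running{model_name="demo"} 37
"""
alerts = breached(sample)
print(alerts)
```

In production these checks belong in Prometheus alerting rules; a script like this is mainly useful for smoke tests and debugging a single pod.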

Core SLIs: TTFT, TPOT, and Throughput

TTFT (Time to First Token): Latency from request submission to the first response token. Target p99 < 300ms for chat apps. NVIDIA benchmarking guides call TTFT the primary user-perceived latency metric.

TPOT (Time Per Output Token): Inter-token latency during streaming. Target p99 < 80ms for comfortable reading.

Throughput: Total tokens per second across all concurrent requests. This is your cost-efficiency metric for sizing.
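The three definitions translate directly into arithmetic on timestamps. The sketch below assumes you record the submission time and the arrival time of each streamed token on the client side; the function names are mine, and TPOT is taken as the mean gap between consecutive tokens.

```python
def request_slis(submitted_at, token_times):
    """Return (TTFT, mean TPOT) in seconds for one streamed request."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - submitted_at
    # Gaps between consecutive tokens; a single-token response has none.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot


def throughput(total_tokens, window_s):
    """Tokens per second across all requests completed in the window."""
    return total_tokens / window_s


# One request submitted at t=0 whose tokens arrive 50 ms apart after a 250 ms wait.
ttft, tpot = request_slis(0.0, [0.25, 0.30, 0.35, 0.40])
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms")
print(f"Throughput {throughput(1200, 60):.0f} tokens/s")
```

Client-side numbers include network latency, so expect them to run slightly higher than the server-side histograms vLLM exports.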

Essential PromQL Queries

Import the official vLLM Grafana dashboard (ID: 25043), then add these panels:

TTFT p99 (5-minute window):

histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
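It helps to know that histogram_quantile does not see individual requests. It estimates the quantile by interpolating linearly inside the bucket where the target rank falls. The sketch below mirrors that idea in simplified form; it is not Prometheus's implementation, and the bucket boundaries and counts are invented.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    with the last bound being float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                return prev_bound  # cannot interpolate; report the last finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented TTFT buckets in seconds: 98 of 100 requests finished within 0.5s.
ttft_buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (float("inf"), 100)]
estimate = histogram_quantile(0.99, ttft_buckets)
print(f"estimated p99: {estimate:.2f}s")
```

The practical consequence is that the reported p99 is only as precise as the bucket boundaries around it, so a threshold that sits in the middle of a wide bucket will be noisy.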

TPOT p99 (5-minute window):

histogram_quantile(0.99, sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))

Queue depth (requests waiting):

vllm:num_requests_waiting

Active requests per pod:

vllm:num_requests_running

GPU memory utilization (%):

(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100

GPU power draw (Watts):

nvidia_gpu_power_usage_milliwatts / 1000

Throughput (tokens generated per second):

rate(vllm:generation_tokens_total[5m])
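Conceptually, rate() turns a monotonically increasing counter into a per-second figure and treats any drop as a counter reset, which is what happens when a pod restarts. The sketch below shows that core idea only: it skips the window-boundary extrapolation Prometheus performs, and the sample values are made up.

```python
def counter_rate(samples):
    """Per-second increase of a counter, tolerating resets.

    samples: list of (timestamp_s, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (e.g. pod restart).
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Made-up scrape results; the counter resets between t=60 and t=120.
scrapes = [(0, 1000), (60, 4000), (120, 500), (180, 3500)]
tokens_per_second = counter_rate(scrapes)
print(f"{tokens_per_second:.1f} tokens/s")
```

Because resets are handled, a rolling restart does not show up as a negative throughput spike on the panel.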

HPA scale trigger (average running requests):

avg(vllm:num_requests_running) by (pod)
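When this metric feeds a Horizontal Pod Autoscaler, the replica count follows the standard HPA rule: scale the current replica count by the ratio of observed to target value, round up, and do nothing inside a small tolerance band. The sketch below applies that rule; the target of 40 running requests per pod is a hypothetical value, not a recommendation.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg, tolerance=0.1):
    """Replica count per the HPA scaling rule, with its default 10% tolerance."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return max(1, math.ceil(current_replicas * ratio))


TARGET = 40  # hypothetical target of running requests per pod

print(desired_replicas(4, current_avg=60, target_avg=TARGET))  # overloaded
print(desired_replicas(4, current_avg=42, target_avg=TARGET))  # within tolerance
print(desired_replicas(4, current_avg=10, target_avg=TARGET))  # mostly idle
```

Pick the target well below the point where TTFT starts to degrade, since new GPU pods take minutes to load weights and warm up.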

I keep a four-panel at-a-glance row: TTFT, TPOT, queue depth, and GPU utilization. If any of them turns red, something’s broken.

Continue to Part 5: Comparison, Checklist, and Troubleshooting where we pit vLLM against Ollama and run through the production readiness checklist.
