vLLM Production: Checklist Comparison and Fixes

2026.02.18

Part 5 of 6. In Part 4 we covered performance tuning and monitoring. Here we compare vLLM with Ollama, review the production checklist, and troubleshoot common issues. Continue to Part 6: Conclusion and FAQ.

vLLM vs Ollama: Inference Engine Comparison

Use vLLM to serve external users and Ollama for local prototyping. Both are strong engines, but their architectures target different environments: vLLM is built for high-concurrency serving, Ollama for single-user local work.

| Factor | vLLM | Ollama |
| --- | --- | --- |
| Use case | Production inference at scale | Local development |
| Throughput | 3–5× higher (continuous batching) | Moderate |
| API | OpenAI-compatible | Native REST API (OpenAI-compatible endpoints also available) |
| Multi-GPU | Native tensor parallelism | Limited |
| Quantization | AWQ, GPTQ, FP8 | GGUF (Q4_0, Q4_K_M) |
| Kubernetes | First-class support | Manual orchestration |
| Best for | 100+ concurrent users | Laptop prototyping |

Here’s my rule: Serving external users? Run vLLM. Prototyping locally? Reach for Ollama. I run both: Ollama on my workstation, vLLM in the cluster.

For a deeper architectural comparison, read Ollama vs vLLM: Choosing the Right LLM Inference Engine.

Production Checklist: 10 Items Before Going Live

Tick every item on this checklist before exposing your vLLM deployment to real traffic:

| # | Item | Verification |
| --- | --- | --- |
| 1 | GPU nodes labeled with `nvidia.com/gpu.present=true` | `kubectl get nodes -l nvidia.com/gpu.present=true` |
| 2 | Model weights pre-downloaded to a local PVC | `kubectl get pvc -n llm-serving`, then check the mount |
| 3 | Resource limits match `--tensor-parallel-size` | `nvidia.com/gpu` == `--tensor-parallel-size` |
| 4 | Liveness and readiness probes configured | `kubectl describe pod` shows both probes |
| 5 | Graceful shutdown with preStop hook + 60s grace period | `kubectl get pod -o yaml \| grep preStop` |
| 6 | Quantization enabled if model exceeds single-GPU VRAM | Check `--quantization` flag in manifest |
| 7 | HPA custom metrics wired to Prometheus Adapter | `kubectl get hpa` shows TARGET value |
| 8 | Prometheus scraping annotations on Pod template | `prometheus.io/scrape: "true"` present |
| 9 | NGINX proxy timeouts extended to 3600s | `kubectl get ingress -o yaml \| grep timeout` |
| 10 | Warm-up script executed before traffic | First 5–10 requests excluded from SLIs |
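Item 10 is easy to script. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/completions` route. The base URL, the model name, and the helper names (`build_warmup_request`, `http_send`, `warm_up`) are placeholders of mine, not something vLLM ships.

```python
import json
import urllib.request


def build_warmup_request(base_url, model, prompt="Hello", max_tokens=8):
    """Build a small completion request for the OpenAI-compatible endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_send(req, timeout=120):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status


def warm_up(base_url, model, count=10, send=http_send):
    """Send `count` throwaway requests and return their status codes."""
    return [send(build_warmup_request(base_url, model)) for _ in range(count)]


if __name__ == "__main__":
    # Hypothetical in-cluster service address and model name; replace with yours.
    print(warm_up("http://vllm.llm-serving.svc:8000", "my-model", count=10))
```

Run it from an init step or a Job after the readiness probe passes, and tag or drop these requests so they stay out of your SLIs. The `send` parameter exists only so the loop can be exercised without a live server.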

Print this checklist. Tick every box. Only then route production traffic. Item #3 (tensor parallelism mismatch) is the single most common cause of multi-GPU deployment failures I see.
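Because item #3 bites so often, I like to check it mechanically before applying a manifest. Here is a rough sketch that works on a container spec already parsed into a Python dict (for example from `kubectl get pod -o json`). The function names are my own, and it only recognizes the long `--tensor-parallel-size` flag in `args`, so treat it as a starting point.

```python
def tensor_parallel_size(args):
    """Return the --tensor-parallel-size value from a container arg list."""
    for i, arg in enumerate(args):
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
        if arg == "--tensor-parallel-size" and i + 1 < len(args):
            return int(args[i + 1])
    return 1  # vLLM uses a single GPU when the flag is absent


def gpu_limit(container):
    """Return the nvidia.com/gpu limit of a container spec (0 if unset)."""
    limits = container.get("resources", {}).get("limits", {})
    return int(limits.get("nvidia.com/gpu", 0))


def check_tp_matches_gpus(container):
    """Return (ok, message) for checklist item #3."""
    tp = tensor_parallel_size(container.get("args", []))
    gpus = gpu_limit(container)
    ok = tp == gpus
    verdict = "OK" if ok else "MISMATCH"
    return ok, f"{verdict}: --tensor-parallel-size={tp}, nvidia.com/gpu={gpus}"


if __name__ == "__main__":
    # Example container spec with a deliberate mismatch.
    container = {
        "args": ["--model", "my-model", "--tensor-parallel-size", "4"],
        "resources": {"limits": {"nvidia.com/gpu": "2"}},
    }
    print(check_tp_matches_gpus(container)[1])
```

Kubernetes usually serializes the GPU limit as a string, which is why `gpu_limit` converts it with `int` before comparing.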

Troubleshooting Common vLLM Issues

This table maps common symptoms straight to root causes and fixes. These are the failures I hit most often in production vLLM deployments.

| Error / Symptom | Root Cause | Solution |
| --- | --- | --- |
| CUDA out of memory | `--gpu-memory-utilization` too high or batch size too large | Reduce `--gpu-memory-utilization` to 0.85, lower `--max-num-seqs`, or enable quantization |
| NCCL error during startup | `--tensor-parallel-size` mismatch with GPU count | Ensure `nvidia.com/gpu` equals `--tensor-parallel-size` |
| Model download hangs on startup | HuggingFace rate limit or no internet access | Pre-download to PVC, set `HF_HUB_OFFLINE=1` |
| Very slow first request | CUDA kernel JIT compilation | Run warm-up requests before production traffic |
| High TTFT, low GPU utilization | Batch size too small | Increase `--max-num-seqs` or request rate |
| High TPOT, high GPU utilization | KV cache full or model too large for GPU | Reduce `--max-model-len`, enable AWQ/GPTQ, or add GPUs |
| HPA not scaling | Prometheus Adapter misconfigured | Verify metric name matches adapter rule; check `kubectl describe hpa` |
| Ingress 504 Gateway Timeout | NGINX proxy timeout too low | Set `proxy-read-timeout` and `proxy-send-timeout` to 3600s |
| Pod stuck Terminating | No graceful shutdown handler | Add preStop sleep hook and `terminationGracePeriodSeconds: 60` |
| FP8 kernel failure on H100 | CUDA version mismatch | Upgrade to CUDA 12.4+; verify vLLM 0.8.4+ |
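The TTFT and TPOT rows only help if you measure both the same way every time. Below is a small sketch, assuming the usual definitions: TTFT is the delay from sending the request to the first streamed token, and TPOT is the average gap between the output tokens after the first. The function name and its inputs are illustrative; collect the timestamps however your client streams tokens.

```python
def latency_metrics(request_sent, token_times):
    """Compute (TTFT, TPOT) in seconds.

    request_sent: timestamp when the request was sent.
    token_times:  timestamps at which each output token arrived, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    if len(token_times) > 1:
        # Average inter-token gap, excluding the wait for the first token.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0  # a single token has no inter-token gap
    return ttft, tpot


if __name__ == "__main__":
    # Four tokens: the first after 0.5 s, then roughly one every 0.1 s.
    ttft, tpot = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Use a monotonic clock such as `time.monotonic()` for the timestamps so wall-clock adjustments do not skew the numbers.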

Debug mode: Set the environment variables VLLM_LOGGING_LEVEL=DEBUG and NCCL_DEBUG=INFO for verbose startup logs. For NCCL topology gremlins, add NCCL_DEBUG_SUBSYS=GRAPH.
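If you reproduce a failure outside the cluster, the same variables can be set from a launcher script. This is only a sketch: `debug_env` is a helper name I made up, and the launch command in the comment assumes the `vllm serve` CLI is on your PATH.

```python
import os


def debug_env(base=None, nccl_graph=False):
    """Return a copy of the environment with verbose vLLM and NCCL logging."""
    env = dict(os.environ if base is None else base)
    env["VLLM_LOGGING_LEVEL"] = "DEBUG"
    env["NCCL_DEBUG"] = "INFO"
    if nccl_graph:
        # Extra detail on how NCCL maps the GPU topology.
        env["NCCL_DEBUG_SUBSYS"] = "GRAPH"
    return env


if __name__ == "__main__":
    env = debug_env(nccl_graph=True)
    for key in ("VLLM_LOGGING_LEVEL", "NCCL_DEBUG", "NCCL_DEBUG_SUBSYS"):
        print(f"{key}={env[key]}")
    # To launch with these settings (model name is a placeholder):
    # subprocess.run(["vllm", "serve", "my-model"], env=env)
```

In Kubernetes, the equivalent is three entries under the container's `env` list. Remove them again once the issue is found, since debug logging is noisy.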

Continue to Part 6: Conclusion and FAQ.
