vLLM Production: Checklist Comparison and Fixes
Part 5 of 6. In Part 4 we covered performance tuning and monitoring. Here we compare vLLM with Ollama, review the production checklist, and troubleshoot common issues. Continue to Part 6: Conclusion and FAQ.
vLLM vs Ollama: Inference Engine Comparison
Use vLLM for serving external users. Use Ollama for local prototyping. Both excel, but their architectures target completely different environments.
| Factor | vLLM | Ollama |
|---|---|---|
| Use case | Production inference at scale | Local development |
| Throughput | 3–5× higher (continuous batching) | Moderate |
| API | OpenAI-compatible | Native REST API, plus an OpenAI-compatible endpoint |
| Multi-GPU | Native tensor parallelism | Limited |
| Quantization | AWQ, GPTQ, FP8 | GGUF (Q4_0, Q4_K_M) |
| Kubernetes | First-class support | Manual orchestration |
| Best for | 100+ concurrent users | Laptop prototyping |
Here’s my rule: Serving external users? Run vLLM. Prototyping locally? Reach for Ollama. I run both: Ollama on my workstation, vLLM in the cluster.
For a deeper architectural comparison, read Ollama vs vLLM: Choosing the Right LLM Inference Engine.
Production Checklist: 10 Items Before Going Live
Tick every item on this checklist before exposing your vLLM deployment to real traffic:
| # | Item | Verification |
|---|---|---|
| 1 | GPU nodes labeled with nvidia.com/gpu.present=true | kubectl get nodes -l nvidia.com/gpu.present=true |
| 2 | Model weights pre-downloaded to a local PVC | kubectl get pvc -n llm-serving + check mount |
| 3 | Resource limits match --tensor-parallel-size | nvidia.com/gpu == --tensor-parallel-size |
| 4 | Liveness and readiness probes configured | kubectl describe pod shows both probes |
| 5 | Graceful shutdown with preStop hook + 60s grace period | kubectl get pod -o yaml \| grep preStop |
| 6 | Quantization enabled if model exceeds single-GPU VRAM | Check --quantization flag in manifest |
| 7 | HPA custom metrics wired to Prometheus Adapter | kubectl get hpa shows TARGET value |
| 8 | Prometheus scraping annotations on Pod template | prometheus.io/scrape: "true" present |
| 9 | NGINX proxy timeouts extended to 3600s | kubectl get ingress -o yaml \| grep timeout |
| 10 | Warm-up script executed before traffic | First 5–10 requests excluded from SLIs |
Print this checklist. Tick every box. Only then route production traffic. Item #3 (tensor parallelism mismatch) is the single most common cause of multi-GPU deployment failures I see.
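Item #3 is easy to check mechanically before you deploy. The sketch below is one way to do it, assuming the container spec has already been loaded into a plain Python dict (for example from your manifest). The helper names `tensor_parallel_size`, `gpu_limit`, and `check_container` are hypothetical and not part of vLLM or Kubernetes, and the sample model path is a placeholder.

```python
def tensor_parallel_size(args):
    """Return the --tensor-parallel-size value from an argument list.

    Handles both "--tensor-parallel-size 2" and "--tensor-parallel-size=2".
    Falls back to 1 when the flag is absent (assumed default: one GPU).
    """
    flag = "--tensor-parallel-size"
    for i, arg in enumerate(args):
        if arg.startswith(flag + "="):
            return int(arg.split("=", 1)[1])
        if arg == flag and i + 1 < len(args):
            return int(args[i + 1])
    return 1


def gpu_limit(container):
    """Return the nvidia.com/gpu limit of a container spec, or 0 if unset."""
    limits = container.get("resources", {}).get("limits", {})
    # Kubernetes manifests may carry the quantity as an int or a string.
    return int(limits.get("nvidia.com/gpu", 0))


def check_container(container):
    """Return (ok, message) comparing the GPU limit with tensor parallelism."""
    args = list(container.get("command", [])) + list(container.get("args", []))
    tp = tensor_parallel_size(args)
    gpus = gpu_limit(container)
    status = "OK" if tp == gpus else "MISMATCH"
    return tp == gpus, f"{status}: {gpus} GPU(s) requested, tensor-parallel-size={tp}"


# Hypothetical container spec with a deliberate mismatch.
container = {
    "args": ["--model", "/models/example", "--tensor-parallel-size", "2"],
    "resources": {"limits": {"nvidia.com/gpu": 1}},
}
ok, message = check_container(container)
print(message)
```

A check like this could run in CI against rendered manifests, so a mismatch fails the pipeline instead of surfacing as an NCCL error at startup.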
Troubleshooting Common vLLM Issues
This table maps common symptoms straight to root causes and fixes. These are the failures I hit most often in production vLLM deployments.
| Error / Symptom | Root Cause | Solution |
|---|---|---|
| CUDA out of memory | --gpu-memory-utilization too high or batch size too large | Reduce --gpu-memory-utilization to 0.85, lower --max-num-seqs, or enable quantization |
| NCCL error during startup | --tensor-parallel-size mismatch with GPU count | Ensure nvidia.com/gpu equals --tensor-parallel-size |
| Model download hangs on startup | HuggingFace rate limit or no internet access | Pre-download to PVC, set HF_HUB_OFFLINE=1 |
| Very slow first request | CUDA kernel JIT compilation | Run warm-up requests before production traffic |
| High TTFT, low GPU util | Batch size too small | Increase --max-num-seqs or request rate |
| High TPOT, high GPU util | KV cache full or model too large for GPU | Reduce --max-model-len, enable AWQ/GPTQ, or add GPUs |
| HPA not scaling | Prometheus Adapter misconfigured | Verify metric name matches adapter rule; check kubectl describe hpa |
| Ingress 504 Gateway Timeout | NGINX proxy timeout too low | Set proxy-read-timeout and proxy-send-timeout to 3600s |
| Pod stuck Terminating | No graceful shutdown handler | Add preStop sleep hook and terminationGracePeriodSeconds: 60 |
| FP8 kernel failure on H100 | CUDA version mismatch | Upgrade to CUDA 12.4+; verify vLLM 0.8.4+ |
Debug mode: Set VLLM_LOGGING_LEVEL=DEBUG and NCCL_DEBUG=INFO in the container environment for verbose startup logs. For NCCL topology gremlins, add NCCL_DEBUG_SUBSYS=GRAPH.
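The "very slow first request" fix in the table comes down to sending a few throwaway requests before real traffic arrives. Here is a minimal sketch using only the Python standard library. It assumes the server exposes the OpenAI-compatible `/v1/completions` route. The function names, the default prompt, and the `send` parameter (there so the loop can be exercised without a live server) are illustrative choices, not anything vLLM provides.

```python
import json
import time
import urllib.request


def build_warmup_request(base_url, model, prompt="Hello", max_tokens=16):
    """Build a POST request for the OpenAI-compatible completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url, model, rounds=10, timeout=120, send=urllib.request.urlopen):
    """Send `rounds` small requests and return each latency in seconds."""
    latencies = []
    for _ in range(rounds):
        request = build_warmup_request(base_url, model)
        start = time.perf_counter()
        with send(request, timeout=timeout) as response:
            response.read()  # drain the body so the whole generation is timed
        latencies.append(time.perf_counter() - start)
    return latencies
```

Calling `warm_up("http://localhost:8000", "your-model-name")` against a running server (both values are placeholders) returns a list of per-request latencies. If the first few are much larger than the rest, that is the cold-start cost the checklist tells you to keep out of your SLIs.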
Continue to Part 6: Conclusion and FAQ.