vLLM Production: Comparison, Checklist, and Fixes

Part 5 of 6. In Part 4 we covered performance tuning and monitoring. Here we compare vLLM with Ollama, review the production checklist, and troubleshoot common issues. Continue to Part 6: Conclusion and FAQ.

vLLM vs Ollama: Inference Engine Comparison

Use vLLM to serve external users. Use Ollama for local prototyping. Both are strong tools, but they target different environments: vLLM is built for high-concurrency GPU serving, Ollama for single-machine use.

Factor	vLLM	Ollama
Use case	Production inference at scale	Local development
Throughput	3–5× higher under concurrent load (continuous batching)	Moderate
API	OpenAI-compatible	Native REST API (plus partial OpenAI compatibility)
Multi-GPU	Native tensor parallelism	Limited
Quantization	AWQ, GPTQ, FP8	GGUF (Q4_0, Q4_K_M)
Kubernetes	First-class support	Manual orchestration
Best for	100+ concurrent users	Laptop prototyping

In practice I run both: Ollama on my workstation, vLLM in the cluster.

For a deeper architectural comparison, read Ollama vs vLLM: Choosing the Right LLM Inference Engine.

Production Checklist: 10 Items Before Going Live

Tick every item on this checklist before exposing your vLLM deployment to real traffic:

#	Item	Verification
1	GPU nodes labeled with `nvidia.com/gpu.present=true`	`kubectl get nodes -l nvidia.com/gpu.present=true`
2	Model weights pre-downloaded to a local PVC	`kubectl get pvc -n llm-serving` + check mount
3	Resource limits match `--tensor-parallel-size`	`nvidia.com/gpu` == `--tensor-parallel-size`
4	Liveness and readiness probes configured	`kubectl describe pod` shows both probes
5	Graceful shutdown with `preStop` hook + 60s grace period	`kubectl get pod -o yaml | grep preStop`
6	Quantization enabled if model exceeds single-GPU VRAM	Check `--quantization` flag in manifest
7	HPA custom metrics wired to Prometheus Adapter	`kubectl get hpa` shows a numeric value under TARGETS, not `<unknown>`
8	Prometheus scraping annotations on Pod template	`prometheus.io/scrape: "true"` present
9	NGINX proxy timeouts extended to 3600s	`kubectl get ingress -o yaml | grep timeout`
10	Warm-up script executed before traffic	First 5–10 requests excluded from SLIs

Print this checklist. Tick every box. Only then route production traffic. Item #3 (tensor parallelism mismatch) is the single most common cause of multi-GPU deployment failures I see.
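
Item #3 is easy to check mechanically. The following is a minimal sketch, not an official tool: it assumes you have a container spec as a Python dict (for example, one entry of `.spec.template.spec.containers` from `kubectl get deployment -o json`), that the flag is passed in its long form, and that pipeline parallelism is not in use. The function names are hypothetical.

```python
from typing import Optional


def tensor_parallel_size(argv: list[str]) -> int:
    """Return the --tensor-parallel-size value found in argv, or 1 if absent."""
    for i, arg in enumerate(argv):
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
        if arg == "--tensor-parallel-size" and i + 1 < len(argv):
            return int(argv[i + 1])
    return 1  # assumption: an unset flag means a single GPU


def gpu_limit(container: dict) -> int:
    """Return the container's nvidia.com/gpu limit, or 0 if none is set."""
    limits = container.get("resources", {}).get("limits", {})
    return int(limits.get("nvidia.com/gpu", 0))


def check_container(container: dict) -> Optional[str]:
    """Return None if the GPU limit matches the flag, else a description."""
    argv = list(container.get("command", [])) + list(container.get("args", []))
    tp = tensor_parallel_size(argv)
    gpus = gpu_limit(container)
    if tp == gpus:
        return None
    name = container.get("name", "<unnamed>")
    return f"{name}: --tensor-parallel-size={tp} but nvidia.com/gpu limit={gpus}"
```

Running `check_container` over every container in the manifest as a CI step catches the mismatch before the Pod ever schedules.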

Troubleshooting Common vLLM Issues

This table maps common symptoms straight to root causes and fixes. These are the failures I hit most often in production vLLM deployments.

Error / Symptom	Root Cause	Solution
`CUDA out of memory`	`--gpu-memory-utilization` too high or batch size too large	Reduce `--gpu-memory-utilization` to 0.85, lower `--max-num-seqs`, or enable quantization
`NCCL error` during startup	`--tensor-parallel-size` mismatch with GPU count	Ensure `nvidia.com/gpu` equals `--tensor-parallel-size`
Model download hangs on startup	Hugging Face rate limit or no internet access	Pre-download to PVC, set `HF_HUB_OFFLINE=1`
Very slow first request	CUDA kernel JIT compilation	Run warm-up requests before production traffic
High TTFT, low GPU util	Batch size too small	Increase `--max-num-seqs` or request rate
High TPOT, high GPU util	KV cache full or model too large for GPU	Reduce `--max-model-len`, enable AWQ/GPTQ, or add GPUs
HPA not scaling	Prometheus Adapter misconfigured	Verify metric name matches adapter rule; check `kubectl describe hpa`
Ingress 504 Gateway Timeout	NGINX proxy timeout too low	Set the `nginx.ingress.kubernetes.io/proxy-read-timeout` and `nginx.ingress.kubernetes.io/proxy-send-timeout` annotations to `"3600"` (seconds)
Pod stuck `Terminating`	No graceful shutdown handler	Add `preStop` sleep hook and `terminationGracePeriodSeconds: 60`
`FP8` kernel failure on H100	CUDA version mismatch	Upgrade to CUDA 12.4+; verify vLLM 0.8.4+
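
Checklist item 10 and the "Very slow first request" row both call for warm-up requests. Here is one way such a script might look, as a sketch under assumptions: it uses only the Python standard library, targets the OpenAI-compatible `/v1/completions` route, and takes a base URL and model name that you must supply. The function names and the example Service URL are hypothetical.

```python
import json
import time
import urllib.request


def build_warmup_request(base_url: str, model: str, prompt: str = "Hello",
                         max_tokens: int = 8) -> urllib.request.Request:
    """Build a small completion request against the OpenAI-compatible API."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str, model: str, rounds: int = 10,
            timeout: float = 120.0) -> list[float]:
    """Send `rounds` sequential requests and return each latency in seconds."""
    latencies = []
    for _ in range(rounds):
        request = build_warmup_request(base_url, model)
        start = time.perf_counter()
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()  # drain the body so the request fully completes
        latencies.append(time.perf_counter() - start)
    return latencies


# Example call (hypothetical Service URL and model name):
# warm_up("http://vllm.llm-serving.svc:8000", "my-model")
```

The returned latencies make the effect visible: the first few values should be noticeably higher than the rest, and those are the requests to keep out of your SLIs.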

Debug mode: Set `VLLM_LOGGING_LEVEL=DEBUG` and `NCCL_DEBUG=INFO` for verbose startup logs. For NCCL topology gremlins, add `NCCL_DEBUG_SUBSYS=GRAPH`.
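
Two rows in the troubleshooting table hinge on TTFT (time to first token) and TPOT (time per output token). If you want to sanity-check them from the client side, the sketch below computes both from the timestamps at which streamed tokens arrive. It assumes the common definitions (TTFT is the delay until the first token, TPOT is the average gap between subsequent tokens), and the function name is hypothetical.

```python
def ttft_and_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from client-side timestamps.

    request_sent: time the request was sent.
    token_times:  arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0  # a single token has no inter-token gap
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

Use timestamps from a monotonic clock such as `time.perf_counter()` so that system clock adjustments do not distort the result.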

Continue to Part 6: Conclusion and FAQ.