Deploy vLLM on Kubernetes: Architecture and Setup
Part 1 of 6. This series covers production-grade vLLM deployment on Kubernetes.
Running vLLM in production takes more than a container: it needs GPU scheduling, tensor parallelism, quantization, and observability. This guide provides hardened manifests, benchmark data, and my pre-flight checklist. For engine comparisons, see Ollama vs vLLM and Benchmarking Local LLM.
At a Glance
| Attribute | Details |
|---|---|
| Best for | 100+ concurrent LLM users, <500ms latency SLOs |
| Minimum requirements | NVIDIA A10G 24GB, K8s 1.28+, CUDA 12.1+ |
| Recommended setup | 2x NVIDIA A100 80GB, K8s 1.29+, CUDA 12.4+ |
| Complexity | Advanced: GPU operator, Prometheus Adapter, NCCL debugging |
| Time to first deployment | 45–90 minutes |
| Estimated cost | $2–8/hour per GPU node |
What Is vLLM and Why Use It for Production Inference?
vLLM is an open-source inference engine for LLM serving on NVIDIA GPUs. Three innovations set it apart:
- PagedAttention: Breaks the KV cache into non-contiguous blocks, improving GPU memory utilization by 2-4×.
- Continuous batching: Adds requests to running batches as slots free up, delivering 3-5× higher throughput versus static batching.
- Tensor parallelism: Shards model layers across GPUs via NCCL, enabling models larger than single-GPU VRAM.
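To make the PagedAttention point concrete, here is a rough sketch comparing the KV-cache slots reserved under contiguous preallocation (one maximum-length region per request) with block-based allocation. The 16-token block size, the 2048-token maximum, and the sample request lengths are illustrative assumptions, and the function names are hypothetical. This is not vLLM's actual allocator.

```python
import math

def contiguous_slots(request_lengths, max_seq_len):
    """Token slots reserved when every request gets a max-length region."""
    return len(request_lengths) * max_seq_len

def paged_slots(request_lengths, block_size):
    """Token slots reserved when each request gets only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)

def utilization(request_lengths, reserved_slots):
    """Fraction of reserved slots that hold real tokens."""
    return sum(request_lengths) / reserved_slots

# Assumed token counts for four in-flight requests.
lengths = [120, 300, 75, 900]

contig = contiguous_slots(lengths, max_seq_len=2048)
paged = paged_slots(lengths, block_size=16)

print(f"contiguous: {contig} slots, {utilization(lengths, contig):.0%} used")
print(f"paged:      {paged} slots, {utilization(lengths, paged):.0%} used")
```

With block-based allocation, the only waste is the partially filled last block of each request, which is why more requests fit in the same VRAM.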
In production, vLLM generally beats general-purpose model servers on throughput per dollar. See the vLLM docs and NVIDIA’s performance docs.
Prerequisites for vLLM on Kubernetes
Your cluster needs GPU-ready drivers, operators, and networking. A missing GPU operator or mismatched CUDA version is the #1 reason production vLLM deployments fail on first boot.
Hardware and Cluster Requirements
| Requirement | Minimum | Recommended | Verify Command |
|---|---|---|---|
| Kubernetes | 1.28 | 1.29+ | kubectl version |
| GPU Nodes | 1× NVIDIA A10G (24 GB) | 2× NVIDIA A100 80GB | kubectl get nodes -l nvidia.com/gpu.present=true |
| CUDA | 12.1 | 12.4+ | nvidia-smi |
| GPU Operator | v23.9.1 | v24.x | helm list -n gpu-operator |
| Container Runtime | containerd 1.7+ | containerd 1.7+ | kubectl get nodes -o yaml \| grep containerRuntimeVersion |
| Storage | 100 GB ephemeral | 500 GB+ NVMe local SSD | df -h on GPU node |
| Network | 10 Gbps | 25 Gbps+ (InfiniBand for multi-node) | iperf3 |
Software Verification
```bash
# Verify GPU operator and device plugin are running
kubectl get pods -n gpu-operator

# Confirm GPU capacity is advertised on nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'
```

You’ll also need Prometheus Adapter for HPA custom metrics (covered in Part 2). For a lighter deployment, see Deploy Ollama on Kubernetes or Self-Hosted AI vs. OpenAI API.
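If you prefer to script the GPU capacity check, the same filter can be sketched in Python over the JSON that `kubectl get nodes -o json` returns. The node names and the sample payload below are made up for illustration, and `gpu_nodes` is a hypothetical helper, not part of any Kubernetes client.

```python
def gpu_nodes(node_list):
    """Map node name -> advertised nvidia.com/gpu count, skipping non-GPU nodes."""
    result = {}
    for item in node_list.get("items", []):
        capacity = item.get("status", {}).get("capacity", {})
        if "nvidia.com/gpu" in capacity:
            # Kubernetes reports resource quantities as strings, e.g. "4".
            result[item["metadata"]["name"]] = int(capacity["nvidia.com/gpu"])
    return result

# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "gpu-node-a"},
            "status": {"capacity": {"cpu": "48", "nvidia.com/gpu": "4"}},
        },
        {
            "metadata": {"name": "cpu-node-b"},
            "status": {"capacity": {"cpu": "16"}},
        },
    ]
}

found = gpu_nodes(sample)
print(found)
```

An empty result on a cluster with GPU hardware usually points back to the device plugin or GPU operator rather than to vLLM itself.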
Model Storage Planning
vLLM pulls models from Hugging Face by default. For production, pre-stage weights on a PersistentVolume. A 70B model in FP16 is roughly 140 GB, so re-downloading it on every restart destroys your SLOs.
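The 140 GB figure follows from parameters times bytes per parameter. A minimal sketch of that arithmetic, plus an idealized download time at the 10 Gbps minimum from the requirements table, is below. The function names are hypothetical, and the transfer estimate assumes full line rate with no protocol overhead, so real downloads will be slower.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_gb(params_billion, dtype="fp16"):
    """Approximate weight size in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]

def download_minutes(size_gb, link_gbps):
    """Idealized transfer time in minutes at full line rate."""
    return size_gb * 8 / link_gbps / 60

size = weight_size_gb(70, "fp16")
print(f"70B FP16 weights: ~{size:.0f} GB")
print(f"Best-case pull at 10 Gbps: ~{download_minutes(size, 10):.1f} min")
```

Even the best case adds minutes to every Pod restart, which is the argument for serving weights from a pre-populated volume.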
Version Compatibility Matrix
Version mismatches cause cryptic runtime errors. Reference this matrix:
| vLLM Version | CUDA | Kubernetes | GPU Operator | Python | transformers | Notes |
|---|---|---|---|---|---|---|
| 0.8.4 | 12.4 | 1.29+ | v24.3+ | 3.10–3.12 | 4.48.0 | Recommended stable |
| 0.8.3 | 12.4 | 1.28+ | v24.3+ | 3.10–3.12 | 4.47.0 | Good stable |
| 0.8.2 | 12.1 | 1.28+ | v24.0+ | 3.10–3.11 | 4.46.0 | Legacy |
| 0.7.x | 12.1 | 1.27+ | v23.9+ | 3.9–3.11 | 4.44.0 | Not recommended for new deployments |
Hard rule: Upgrade vLLM and CUDA together. vLLM 0.8.x + CUDA 12.1 causes FP8 kernel failures on H100 GPUs.
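A pre-flight script can encode the matrix so a mismatch fails fast instead of surfacing as a runtime error. The sketch below copies the exact-version rows from the table above and treats the listed CUDA version as a minimum, which is my assumption rather than something the matrix states. The 0.7.x row is left out because it is a version range, and `cuda_ok` is a hypothetical helper.

```python
# CUDA version listed per vLLM release, copied from the matrix above.
# Assumption: the listed version is treated as a minimum.
EXPECTED_CUDA = {
    "0.8.4": "12.4",
    "0.8.3": "12.4",
    "0.8.2": "12.1",
}

def parse(version):
    """Turn '12.4' into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def cuda_ok(vllm_version, cuda_version):
    """Return True if the CUDA version meets the matrix entry for this vLLM release."""
    expected = EXPECTED_CUDA.get(vllm_version)
    if expected is None:
        raise KeyError(f"vLLM {vllm_version} is not in the compatibility matrix")
    return parse(cuda_version) >= parse(expected)

for vllm_v, cuda_v in [("0.8.4", "12.4"), ("0.8.4", "12.1")]:
    status = "ok" if cuda_ok(vllm_v, cuda_v) else "MISMATCH"
    print(f"vLLM {vllm_v} + CUDA {cuda_v}: {status}")
```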
Architecture Overview
Ingress routes client requests to GPU-backed Pods. Prometheus scrapes metrics; HPA scales by inference queue depth.
```mermaid
graph LR
    A[Client] --> B[Ingress / NGINX]
    B --> C[vLLM Service]
    C --> D[vLLM Pod]
    D --> E[GPU 0]
    D --> F[GPU 1]
    D --> G[GPU 2]
    D --> H[GPU 3]
    D --> I[Model PVC]
    J[Prometheus] --> D
    K[Grafana] --> J
    L[HPA Controller] --> D
```

Data flow:
- Client sends a `chat/completions` request through Ingress.
- Service forwards to the vLLM Pod.
- PagedAttention slots the request into the batch.
- Tensor parallelism shards across GPUs.
- Prometheus polls `/metrics`.
- HPA scales when `vllm_num_requests_running` exceeds the threshold.
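The last step follows the standard Kubernetes HPA rule: desired replicas equal the current replica count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). Here is a sketch of that calculation. The target of 10 running requests per Pod is an assumed example value, and the function ignores min/max replica bounds and stabilization windows.

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count from the HPA ratio rule; unchanged if within tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Assumed target: an average of 10 running requests per Pod.
TARGET = 10

for replicas, observed in [(2, 30), (2, 10.5), (4, 5)]:
    result = desired_replicas(replicas, observed, TARGET)
    print(f"{replicas} replicas, avg {observed} requests/Pod -> {result} replicas")
```

Because each vLLM replica needs its own GPUs, the replica count this yields is only reachable if the cluster has free GPU nodes or a node autoscaler behind it.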
FAQ
What is vLLM?
vLLM is an open-source inference engine for high-throughput LLM serving on NVIDIA GPUs. See the official docs.
What GPU do I need to run vLLM in production?
For 7B models, a single A10G (24 GB) suffices. For 70B, use 2× A100 80GB for FP16 weights or 2× A100 40GB with AWQ quantization. Budget roughly 20% VRAM headroom beyond the weights for KV cache and activations.
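A quick way to sanity-check a GPU choice is to compare weight size plus headroom against total VRAM. The sketch below reads "20% headroom" as 20% on top of the weights, which is one possible interpretation, and assumes AWQ stores about 0.5 bytes per parameter (4-bit). `fits` is a hypothetical helper. Under this reading, 70B in FP16 on 2× 80 GB holds the weights but falls short of the full 20% margin, so expect a smaller KV-cache budget there.

```python
def fits(params_billion, bytes_per_param, gpu_count, gpu_vram_gb, headroom=0.20):
    """Return (ok, required_gb, total_gb) for weights plus headroom vs. total VRAM."""
    weights_gb = params_billion * bytes_per_param
    required_gb = weights_gb * (1 + headroom)
    total_gb = gpu_count * gpu_vram_gb
    return required_gb <= total_gb, required_gb, total_gb

# (label, params in billions, bytes/param, GPU count, VRAM per GPU in GB)
scenarios = [
    ("7B FP16 on 1x A10G 24GB", 7, 2.0, 1, 24),
    ("70B FP16 on 2x A100 80GB", 70, 2.0, 2, 80),
    ("70B AWQ on 2x A100 40GB", 70, 0.5, 2, 40),
]

for label, params, bpp, count, vram in scenarios:
    ok, required, total = fits(params, bpp, count, vram)
    verdict = "fits with headroom" if ok else "below 20% headroom"
    print(f"{label}: need ~{required:.0f} GB of {total} GB -> {verdict}")
```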
Can I run vLLM without Kubernetes?
Yes. vLLM runs on any Linux machine with NVIDIA drivers and CUDA 12.1+, but without Kubernetes you lose autoscaling, rolling updates, and observability. Kubernetes is recommended for multi-user production.
What’s the difference between vLLM and Ollama?
vLLM targets production inference, delivering 3–5× higher throughput than Ollama. See the full comparison.
How do vLLM and LiteLLM work together?
vLLM serves a single model backend. LiteLLM routes across multiple vLLM instances and providers with load balancing, fallbacks, and cost tracking. See the LiteLLM series.
Continue to Part 2: Step-by-Step Kubernetes Deployment, where we set up the namespace, ConfigMaps, and Secrets, then deploy vLLM on Kubernetes.