Deploy vLLM on Kubernetes: Architecture and Setup

Part 1 of 6. This series covers production-grade vLLM deployment on Kubernetes.

Running vLLM in production takes more than a container: you need GPU scheduling, tensor parallelism, quantization, and observability. This guide provides hardened manifests, benchmark data, and my pre-flight checklist. For engine comparisons, see Ollama vs vLLM and Benchmarking Local LLM.

At a Glance

Attribute	Details
Best for	100+ concurrent LLM users, <500ms latency SLOs
Minimum requirements	NVIDIA A10G 24GB, K8s 1.28+, CUDA 12.1+
Recommended setup	2x NVIDIA A100 80GB, K8s 1.29+, CUDA 12.4+
Complexity	Advanced: GPU operator, Prometheus Adapter, NCCL debugging
Time to first deployment	45–90 minutes
Estimated cost	$2–8/hour per GPU node

What Is vLLM and Why Use It for Production Inference?

vLLM is an open-source inference engine for LLM serving on NVIDIA GPUs. Three innovations set it apart:

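PagedAttention: Stores the KV cache in fixed-size blocks that need not be contiguous in GPU memory, which cuts fragmentation and improves GPU memory utilization by 2-4×.
Continuous batching: Admits new requests into the running batch as soon as slots free up instead of waiting for the whole batch to finish, delivering 3-5× higher throughput than static batching.
Tensor parallelism: Splits each layer's weights across GPUs, with NCCL handling inter-GPU communication, so you can serve models larger than a single GPU's VRAM.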
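
To make the PagedAttention idea concrete, here is a toy allocator. It is a sketch of the general paging technique, not vLLM's implementation: the class name, the block size of 16 tokens, and the pool of 4 blocks are all illustrative assumptions.

```python
# Toy model of paged KV-cache allocation; not vLLM's actual implementation.
BLOCK_SIZE = 16  # tokens per block (illustrative value)


class BlockAllocator:
    def __init__(self, num_blocks):
        # Physical block ids that are currently unused.
        self.free = list(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.tables = {}
        # Number of tokens stored so far for each sequence.
        self.lengths = {}

    def append_token(self, seq_id):
        """Reserve cache space for one more token of seq_id."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # Current block is full (or this is the first token): take any free block.
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4)
for _ in range(17):          # sequence "a" needs 2 blocks for 17 tokens
    alloc.append_token("a")
alloc.append_token("b")      # "b" takes the next free block
alloc.release("a")           # "a" finishes, so its blocks become reusable
for _ in range(33):          # "b" grows to 34 tokens, which needs 3 blocks
    alloc.append_token("b")
print(alloc.tables["b"])     # [2, 3, 0]
```

Sequence "b" ends up in physical blocks 2, 3, and 0. Because a sequence does not need one contiguous region, blocks freed by a finished request are reused immediately, which is where the memory savings come from.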

In production, vLLM typically beats general-purpose model servers on throughput per dollar. See the vLLM docs and NVIDIA’s performance docs.

Prerequisites for vLLM on Kubernetes

Your cluster needs GPU-ready drivers, operators, and networking. A missing GPU operator or mismatched CUDA version is the #1 reason production vLLM deployments fail on first boot.

Hardware and Cluster Requirements

Requirement	Minimum	Recommended	Verify Command
Kubernetes	1.28	1.29+	`kubectl version`
GPU Nodes	1× NVIDIA A10G (24 GB)	2× NVIDIA A100 80GB	`kubectl get nodes -l nvidia.com/gpu.present=true`
CUDA	12.1	12.4+	`nvidia-smi`
GPU Operator	v23.9.1	v24.x	`helm list -n gpu-operator`
Container Runtime	containerd 1.7+	containerd 1.7+	`kubectl get nodes -o yaml | grep containerRuntimeVersion`
Storage	100 GB ephemeral	500 GB+ NVMe local SSD	`df -h` on GPU node
Network	10 Gbps	25 Gbps+ (InfiniBand for multi-node)	`iperf3`

Software Verification

# Verify GPU operator and device plugin are running
kubectl get pods -n gpu-operator

# Confirm GPU capacity is advertised on nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'
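
If you would rather run this check from a pre-flight script, the same filter can be sketched in Python. The node names and the trimmed JSON below are hypothetical sample data, and `gpu_nodes` is a helper name I made up for illustration.

```python
import json


def gpu_nodes(node_list):
    """Map node name -> advertised nvidia.com/gpu count, skipping non-GPU nodes."""
    result = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        if "nvidia.com/gpu" in capacity:
            # Kubernetes serializes resource quantities as strings, e.g. "4".
            result[node["metadata"]["name"]] = int(capacity["nvidia.com/gpu"])
    return result


# Hypothetical, trimmed-down sample of `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"capacity": {"cpu": "32", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"capacity": {"cpu": "16"}}}
]}
""")
print(gpu_nodes(sample))  # {'gpu-node-1': 4}
```

Against a real cluster, replace `sample` with `json.load(sys.stdin)` and pipe in the output of `kubectl get nodes -o json`. An empty result means no node is advertising GPUs yet.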

You’ll also need Prometheus Adapter for HPA custom metrics (covered in Part 2). For a lighter deployment, see Deploy Ollama on Kubernetes or Self-Hosted AI vs. OpenAI API.

Model Storage Planning

vLLM pulls models from Hugging Face by default. For production, pre-stage weights to a PersistentVolume. A 70B model in FP16 is roughly 140 GB, so re-downloading it on every restart destroys your SLOs.
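
The 140 GB figure is just parameters times bytes per parameter. The sketch below also estimates a best-case transfer time. The 10 Gbps link is the minimum from the requirements table, and assuming the download runs at full line rate is optimistic, so treat the result as a lower bound.

```python
def weights_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def transfer_seconds(size_gb, link_gbps):
    """Best-case transfer time: size in gigabits divided by link speed in Gbps."""
    return size_gb * 8 / link_gbps


size = weights_gb(70, 2)           # 70B parameters, 2 bytes each in FP16
print(size)                        # 140
print(transfer_seconds(size, 10))  # 112.0
```

Even in the best case that is nearly two minutes before the Pod can start loading weights, and it is paid again on every restart or reschedule.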

Version Compatibility Matrix

Version mismatches cause cryptic runtime errors. Reference this matrix:

vLLM Version	CUDA	Kubernetes	GPU Operator	Python	transformers	Notes
0.8.4	12.4	1.29+	v24.3+	3.10–3.12	4.48.0	Recommended stable
0.8.3	12.4	1.28+	v24.3+	3.10–3.12	4.47.0	Good stable
0.8.2	12.1	1.28+	v24.0+	3.10–3.11	4.46.0	Legacy
0.7.x	12.1	1.27+	v23.9+	3.9–3.11	4.44.0	Not recommended for new deployments

Hard rule: Upgrade vLLM and CUDA together. vLLM 0.8.x + CUDA 12.1 causes FP8 kernel failures on H100 GPUs.
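
One way to use the matrix is as a pre-flight lookup. The sketch below encodes only the CUDA column and treats the listed version as a minimum. That "at least" reading is my assumption, and the function names are hypothetical.

```python
# CUDA version listed for each vLLM release in the matrix above.
MATRIX_CUDA = {"0.8.4": "12.4", "0.8.3": "12.4", "0.8.2": "12.1", "0.7": "12.1"}


def as_tuple(version):
    """Turn '12.4' into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def listed_cuda(vllm_version):
    """Look up the matrix row, falling back to the major.minor row (e.g. 0.7.x)."""
    if vllm_version in MATRIX_CUDA:
        return MATRIX_CUDA[vllm_version]
    major_minor = ".".join(vllm_version.split(".")[:2])
    return MATRIX_CUDA.get(major_minor)


def cuda_ok(vllm_version, cluster_cuda):
    """True if the cluster's CUDA is at least the version the matrix lists (assumption)."""
    wanted = listed_cuda(vllm_version)
    if wanted is None:
        raise ValueError(f"vLLM {vllm_version} is not in the matrix")
    return as_tuple(cluster_cuda) >= as_tuple(wanted)


print(cuda_ok("0.8.4", "12.1"))  # False
print(cuda_ok("0.8.4", "12.4"))  # True
```

Versions missing from the matrix raise an error rather than passing silently, which is the behavior you want from a deployment gate.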

Architecture Overview

Ingress routes client requests to GPU-backed Pods. Prometheus scrapes metrics; HPA scales by inference queue depth.

graph LR
    A[Client] --> B[Ingress / NGINX]
    B --> C[vLLM Service]
    C --> D[vLLM Pod]
    D --> E[GPU 0]
    D --> F[GPU 1]
    D --> G[GPU 2]
    D --> H[GPU 3]
    D --> I[Model PVC]
    J[Prometheus] --> D
    K[Grafana] --> J
    L[HPA Controller] --> D

Data flow:

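Client sends a /v1/chat/completions request through Ingress.
Service forwards it to the vLLM Pod.
The scheduler adds the request to the running batch (continuous batching), and PagedAttention allocates its KV cache blocks.
Tensor parallelism splits each forward pass across the Pod's GPUs.
Prometheus scrapes the Pod's /metrics endpoint.
HPA scales out when vllm_num_requests_running exceeds the threshold (vLLM itself exports this gauge as vllm:num_requests_running).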
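
For the last step, the HPA controller derives the replica count from the ratio of the current metric to its target. Here is a minimal sketch of that rule as the Kubernetes documentation describes it. The request counts are hypothetical, and the sketch leaves out min/max replica bounds and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Kubernetes HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 2 Pods averaging 45 running requests, target of 30 per Pod.
print(desired_replicas(2, 45, 30))  # 3
```

The tolerance of 0.1 is the controller's default, so a metric sitting a few percent over the target does not trigger a scale-up on its own.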

FAQ

What is vLLM?

vLLM is an open-source inference engine for high-throughput LLM serving on NVIDIA GPUs. See the official docs.

What GPU do I need to run vLLM in production?

For 7B models, a single A10G (24 GB) suffices. For 70B, use 2× A100 80GB (FP16) or 2× A100 40GB (AWQ). Add ~20% VRAM headroom.
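
As a rough check on these pairings, the sketch below subtracts weight size from total VRAM. It counts weights only and assumes 2 bytes per parameter for FP16 and 0.5 bytes for 4-bit AWQ (ignoring quantization metadata), so real usage will be somewhat higher.

```python
def vram_left_gb(params_billion, bytes_per_param, gpu_count, gpu_gb):
    """VRAM remaining across all GPUs after loading weights (GB, weights only)."""
    weights = params_billion * bytes_per_param
    return gpu_count * gpu_gb - weights


# (label, parameters in billions, bytes per parameter, GPU count, GB per GPU)
configs = [
    ("7B FP16 on 1x A10G 24GB", 7, 2, 1, 24),
    ("70B FP16 on 2x A100 80GB", 70, 2, 2, 80),
    ("70B AWQ 4-bit on 2x A100 40GB", 70, 0.5, 2, 40),
]
for label, params, width, count, size in configs:
    print(f"{label}: {vram_left_gb(params, width, count, size)} GB left")
```

Whatever is left has to hold the KV cache, activations, and CUDA context, so compare it against your expected concurrency and context length before committing to a node type.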

Can I run vLLM without Kubernetes?

Yes. vLLM runs on any Linux machine with NVIDIA drivers and CUDA 12.1+, but you lose the autoscaling, rolling updates, and observability integration that Kubernetes provides. Kubernetes is recommended for multi-user production.

What’s the difference between vLLM and Ollama?

vLLM targets production inference, delivering 3–5× higher throughput than Ollama. See the full comparison.

How do vLLM and LiteLLM work together?

vLLM serves a single model backend. LiteLLM routes across multiple vLLM instances and providers with load balancing, fallbacks, and cost tracking. See the LiteLLM series.

Continue to Part 2: Step-by-Step Kubernetes Deployment, where we set up the namespace, ConfigMaps, and Secrets, then deploy vLLM on Kubernetes.