Deploy vLLM on Kubernetes: Architecture and Setup

2026.02.06
Technology

Part 1 of 6. This series covers production-grade vLLM deployment on Kubernetes.

Running vLLM in production takes more than a container: it needs GPU scheduling, tensor parallelism, quantization, and observability. This guide provides hardened manifests, benchmark data, and my pre-flight checklist. For engine comparisons, see Ollama vs vLLM and Benchmarking Local LLM.

At a Glance

| Attribute | Details |
| --- | --- |
| Best for | 100+ concurrent LLM users, <500 ms latency SLOs |
| Minimum requirements | NVIDIA A10G 24 GB, K8s 1.28+, CUDA 12.1+ |
| Recommended setup | 2× NVIDIA A100 80 GB, K8s 1.29+, CUDA 12.4+ |
| Complexity | Advanced: GPU operator, Prometheus Adapter, NCCL debugging |
| Time to first deployment | 45–90 minutes |
| Estimated cost | $2–8/hour per GPU node |

What Is vLLM and Why Use It for Production Inference?

vLLM is an open-source inference engine for LLM serving on NVIDIA GPUs. Three innovations set it apart:

  • PagedAttention: Stores the KV cache in fixed-size blocks that need not be contiguous in GPU memory, improving GPU memory utilization by 2–4×.
  • Continuous batching: Adds requests to the running batch as slots free up, delivering 3–5× higher throughput than static batching.
  • Tensor parallelism: Shards each layer’s weights across GPUs and synchronizes them via NCCL, enabling models larger than a single GPU’s VRAM.
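
To make the PagedAttention point concrete, here is a toy sketch (not vLLM's actual allocator) that compares how many KV-cache slots are reserved under contiguous pre-allocation versus block-based allocation. The 16-token block size, the 2,048-token maximum, and the sequence lengths are illustrative assumptions, not measured values.

```python
import math

BLOCK_SIZE = 16      # tokens per KV-cache block (assumed for illustration)
MAX_SEQ_LEN = 2048   # slots reserved per request when pre-allocating (assumed)


def contiguous_slots(seq_lens, max_seq_len=MAX_SEQ_LEN):
    """Slots reserved when each request pre-allocates one max-length region."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each request takes only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def utilization(seq_lens, reserved):
    """Fraction of reserved slots that actually hold tokens."""
    return sum(seq_lens) / reserved


seq_lens = [100, 500, 37, 1200]  # hypothetical current lengths of four requests
print(f"contiguous: {utilization(seq_lens, contiguous_slots(seq_lens)):.1%}")
print(f"paged:      {utilization(seq_lens, paged_slots(seq_lens)):.1%}")
```

In the paged scheme, waste is bounded by at most one partly filled block per request, which is where the memory-utilization gain comes from.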

In production, vLLM beats general-purpose model servers on throughput per dollar. See the vLLM docs and NVIDIA’s performance docs.

Prerequisites for vLLM on Kubernetes

Your cluster needs GPU-ready drivers, operators, and networking. A missing GPU operator or mismatched CUDA version is the #1 reason production vLLM deployments fail on first boot.

Hardware and Cluster Requirements

| Requirement | Minimum | Recommended | Verify Command |
| --- | --- | --- | --- |
| Kubernetes | 1.28 | 1.29+ | kubectl version |
| GPU Nodes | 1× NVIDIA A10G (24 GB) | 2× NVIDIA A100 80 GB | kubectl get nodes -l nvidia.com/gpu.present=true |
| CUDA | 12.1 | 12.4+ | nvidia-smi |
| GPU Operator | v23.9.1 | v24.x | helm list -n gpu-operator |
| Container Runtime | containerd 1.7+ | containerd 1.7+ | kubectl get nodes -o yaml \| grep containerRuntimeVersion |
| Storage | 100 GB ephemeral | 500 GB+ NVMe local SSD | df -h on GPU node |
| Network | 10 Gbps | 25 Gbps+ (InfiniBand for multi-node) | iperf3 |

Software Verification

# Verify GPU operator and device plugin are running
kubectl get pods -n gpu-operator
# Confirm GPU capacity is advertised on nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'
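
If you would rather script this check than read jq output, the same filter can be sketched in Python. The sample below is a hypothetical, heavily trimmed stand-in for `kubectl get nodes -o json` output; the node names and counts are made up for illustration.

```python
import json


def gpu_capacity(node_list):
    """Map node name -> advertised nvidia.com/gpu count, skipping nodes without GPUs."""
    result = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        if "nvidia.com/gpu" in capacity:
            # Kubernetes reports capacity values as strings, so convert to int.
            result[node["metadata"]["name"]] = int(capacity["nvidia.com/gpu"])
    return result


# Hypothetical, trimmed-down sample of `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"capacity": {"cpu": "48", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"capacity": {"cpu": "16"}}}
]}
""")

print(gpu_capacity(sample))
```

Against a real cluster, replace the sample with `json.load(sys.stdin)` and pipe the kubectl output into the script. An empty result means no node is advertising GPUs yet.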

You’ll also need Prometheus Adapter for HPA custom metrics (covered in Part 2). For a lighter deployment, see Deploy Ollama on Kubernetes or Self-Hosted AI vs. OpenAI API.

Model Storage Planning

vLLM pulls models from Hugging Face by default. For production, pre-stage weights to a PersistentVolume. A 70B model in FP16 is roughly 140 GB: downloading on restart destroys your SLOs.
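
The 140 GB figure follows from simple arithmetic, sketched below along with a best-case transfer time. The bytes-per-parameter table covers weights only, and the 10 Gbps link is the minimum from the requirements table; real downloads rarely reach line rate, so treat the result as a lower bound.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}  # weights only


def weight_size_gb(params_billion, dtype="fp16"):
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring file overhead."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_download_seconds(size_gb, link_gbps):
    """Lower bound on transfer time at full line rate; real pulls are slower."""
    return size_gb * 8 / link_gbps


size = weight_size_gb(70, "fp16")
print(f"70B FP16 weights: ~{size:.0f} GB")
print(f"best case over 10 Gbps: {min_download_seconds(size, 10):.0f} s")
```

Even this best case adds minutes of cold-start time to every Pod restart, before the weights are loaded into GPU memory.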

Version Compatibility Matrix

Version mismatches cause cryptic runtime errors. Reference this matrix:

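| vLLM Version | CUDA | Kubernetes | GPU Operator | Python | transformers | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 0.8.4 | 12.4 | 1.29+ | v24.3+ | 3.10–3.12 | 4.48.0 | Recommended stable |
| 0.8.3 | 12.4 | 1.28+ | v24.3+ | 3.10–3.12 | 4.47.0 | Good stable |
| 0.8.2 | 12.1 | 1.28+ | v24.0+ | 3.10–3.11 | 4.46.0 | Legacy |
| 0.7.x | 12.1 | 1.27+ | v23.9+ | 3.9–3.11 | 4.44.0 | Not recommended for new deployments |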

Hard rule: Upgrade vLLM and CUDA together. vLLM 0.8.x + CUDA 12.1 causes FP8 kernel failures on H100 GPUs.
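
One way to enforce the matrix is a small pre-flight check. The sketch below copies only the vLLM and CUDA columns from the matrix above. The function names are hypothetical, and it treats the CUDA column as an exact match, which may be stricter than you need.

```python
# Rows copied from the compatibility matrix: vLLM version -> expected CUDA.
EXPECTED_CUDA = {
    "0.8.4": "12.4",
    "0.8.3": "12.4",
    "0.8.2": "12.1",
    "0.7.x": "12.1",
}


def matrix_key(vllm_version):
    """Map a concrete version such as '0.7.3' onto its matrix row key."""
    if vllm_version in EXPECTED_CUDA:
        return vllm_version
    parts = vllm_version.split(".")
    if len(parts) < 2:
        return None
    wildcard = f"{parts[0]}.{parts[1]}.x"
    return wildcard if wildcard in EXPECTED_CUDA else None


def check_cuda(vllm_version, cuda_version):
    """Return (ok, message) for a vLLM/CUDA pair against the matrix."""
    key = matrix_key(vllm_version)
    if key is None:
        return False, f"vLLM {vllm_version} is not in the matrix"
    expected = EXPECTED_CUDA[key]
    if cuda_version != expected:
        return False, f"vLLM {vllm_version} expects CUDA {expected}, found {cuda_version}"
    return True, "ok"


print(check_cuda("0.8.4", "12.4"))
print(check_cuda("0.8.4", "12.1"))
```

Running a check like this in CI or an init step surfaces a mismatch before the Pod reaches the GPU.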

Architecture Overview

Ingress routes client requests to GPU-backed Pods. Prometheus scrapes metrics; HPA scales by inference queue depth.

graph LR
A[Client] --> B[Ingress / NGINX]
B --> C[vLLM Service]
C --> D[vLLM Pod]
D --> E[GPU 0]
D --> F[GPU 1]
D --> G[GPU 2]
D --> H[GPU 3]
D --> I[Model PVC]
J[Prometheus] --> D
K[Grafana] --> J
L[HPA Controller] --> D

Data flow:

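  1. The client sends a chat/completions request through Ingress.
  2. The Service forwards it to the vLLM Pod.
  3. The continuous-batching scheduler slots the request into the running batch, and PagedAttention allocates its KV-cache blocks.
  4. Tensor parallelism splits each layer’s computation across the Pod’s GPUs.
  5. Prometheus scrapes the Pod’s /metrics endpoint.
  6. HPA scales out when vllm_num_requests_running (exposed by vLLM as vllm:num_requests_running) exceeds the threshold.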
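
The scaling decision in the last step follows the formula from the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (10% by default). Here is a minimal sketch of that calculation. The target of 32 running requests per Pod and the replica bounds are hypothetical values, not recommendations.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Replica count from the HPA formula, clamped to [min_replicas, max_replicas]."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: 2 Pods averaging 48 running requests, target of 32 per Pod.
print(desired_replicas(2, current_value=48, target_value=32))
```

Because each vLLM replica claims whole GPUs, the maximum replica count is effectively capped by the number of free GPUs in the cluster.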

FAQ

What is vLLM?

vLLM is an open-source inference engine for high-throughput LLM serving on NVIDIA GPUs. See the official docs.

What GPU do I need to run vLLM in production?

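For 7B models, a single A10G (24 GB) suffices. For 70B, use 2× A100 80GB (FP16) or 2× A100 40GB (AWQ). Budget about 20% VRAM beyond the weights for KV cache and activations; 70B in FP16 (about 140 GB) leaves only about 20 GB free on 2× A100 80GB, so that configuration is tight.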
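
These sizing rules can be sanity-checked with a few lines of arithmetic. The sketch below interprets the headroom as 20% on top of the weight size, which is an assumption about how the rule is meant. The weight sizes assume 2 bytes per parameter for FP16 and roughly 0.5 bytes per parameter for 4-bit AWQ.

```python
def fits(weights_gb, num_gpus, vram_per_gpu_gb, headroom=0.20):
    """True if weights plus fractional headroom fit in the combined VRAM."""
    required = weights_gb * (1 + headroom)
    return required <= num_gpus * vram_per_gpu_gb


# (label, weight size in GB, GPU count, VRAM per GPU in GB)
configs = [
    ("7B FP16 on 1x A10G 24GB", 14, 1, 24),
    ("70B FP16 on 2x A100 80GB", 140, 2, 80),
    ("70B AWQ on 2x A100 40GB", 35, 2, 40),
]
for label, weights_gb, gpus, vram in configs:
    print(f"{label}: {'fits' if fits(weights_gb, gpus, vram) else 'tight'}")
```

A "tight" result means the weights load but less than the assumed headroom remains for KV cache, which limits batch size and context length.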

Can I run vLLM without Kubernetes?

Yes. vLLM runs on any Linux machine with NVIDIA drivers and CUDA 12.1+, but you give up the autoscaling, rolling updates, and observability that Kubernetes provides. Kubernetes is recommended for multi-user production.

What’s the difference between vLLM and Ollama?

vLLM targets production inference, delivering 3–5× higher throughput than Ollama. See the full comparison.

How do vLLM and LiteLLM work together?

vLLM serves a single model backend. LiteLLM routes across multiple vLLM instances and providers with load balancing, fallbacks, and cost tracking. See the LiteLLM series.

Continue to Part 2: Step-by-Step Kubernetes Deployment, where we set up the namespace, ConfigMaps, and Secrets, then deploy vLLM on Kubernetes.
