vLLM Kubernetes Deployment: Complete Guide and Tips

2026.02.09

Part 2 of 6. In Part 1 we covered the architecture and prerequisites. Here, we deploy vLLM on Kubernetes step by step. Continue to Part 3: Tensor Parallelism and Quantization.

Step-by-Step: Run vLLM on Kubernetes

These six numbered steps take you from a single-GPU development setup to a full multi-GPU inference server on Kubernetes.

Step 1: Select the vLLM Container Image

Always pin a specific image tag; the latest tag has no place in a production vLLM deployment.

docker pull vllm/vllm-openai:v0.8.4
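
The pinning rule is easy to check automatically before a manifest is applied. The following Python sketch is one possible check, not part of vLLM or Kubernetes; the function name is_pinned is hypothetical, and the parsing only covers the common name:tag and name@sha256 forms.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the image reference names a fixed tag or a digest."""
    # A digest reference (name@sha256:...) always points at one exact image.
    if "@sha256:" in image_ref:
        return True
    # Only the last path segment can carry a tag; an earlier colon belongs
    # to a registry port such as localhost:5000.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: the runtime falls back to "latest"
    tag = last_segment.split(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ("vllm/vllm-openai:v0.8.4", "vllm/vllm-openai:latest", "vllm/vllm-openai"):
        print(ref, "->", "pinned" if is_pinned(ref) else "NOT pinned")
```

A check like this could run in CI over every image field in your manifests, so an unpinned reference fails the pipeline instead of reaching the cluster.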

For security-hardened deployments:

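FROM vllm/vllm-openai:v0.8.4
USER root
# Keep these pins compatible with the vLLM release in the base image
RUN pip install --no-cache-dir transformers==4.48.0 accelerate==1.3.0
# Assumption: the base image may not define a "vllm" user, so create it if missing
RUN id -u vllm >/dev/null 2>&1 || useradd --create-home --shell /usr/sbin/nologin vllm
# Drop root privileges for the running server
USER vllm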

Step 2: Create the Deployment Manifest with GPU Resources

This single-GPU deployment is suitable for models like Mistral 7B or Llama 3 8B:

vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm-serving
  labels:
    app: vllm
    model: llama-3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.4
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Meta-Llama-3-8B-Instruct
            - --dtype
            - bfloat16
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.9"
            - --max-num-seqs
            - "256"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "48Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "4"
          env:
            - name: HF_HOME
              value: "/models"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
          volumeMounts:
            - name: model-cache
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-pvc

Key flags explained:

| Flag | Value | Purpose |
| --- | --- | --- |
| --dtype | bfloat16 | Balanced precision and memory. Use float16 on older GPUs without bf16 support. |
| --max-model-len | 8192 | Hard limit on total sequence length (input + output tokens). |
| --gpu-memory-utilization | 0.9 | Lets vLLM use up to 90% of GPU memory for weights, activations, and KV cache. The rest is headroom for other CUDA allocations. |
| --max-num-seqs | 256 | Maximum number of sequences processed concurrently. This is your batch size ceiling. |
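
To see how --gpu-memory-utilization, --max-model-len, and --max-num-seqs interact, it helps to run the arithmetic. The sketch below is a rough estimate, not vLLM's own accounting. It assumes an 80 GiB GPU, roughly 8 billion parameters stored in bf16, and the Llama 3 8B attention layout (32 layers, 8 KV heads, head dimension 128), and it ignores activation memory. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # One key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(gpu_mem_gib: float, utilization: float,
                            weight_bytes: int, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after the weights are loaded (activations ignored)."""
    budget = gpu_mem_gib * GIB * utilization
    return max(0, int((budget - weight_bytes) // per_token_bytes))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
    weights = 8_000_000_000 * 2  # assumed parameter count x 2 bytes for bf16
    tokens = kv_cache_token_capacity(80, 0.9, weights, per_token)
    print(f"KV cache per token: {per_token // 1024} KiB")
    print(f"Approximate KV cache capacity: {tokens} tokens")
    print(f"Full-length (8192-token) sequences that fit: {tokens // 8192}")
```

Under these assumptions only a few dozen full-length sequences fit at once, so --max-num-seqs 256 acts as an upper bound that is reached only when most requests are much shorter than --max-model-len.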

Apply the manifest:

kubectl create namespace llm-serving
kubectl apply -f vllm-deployment.yaml
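
Before relying on the probe settings in the manifest, it is worth estimating how long the container has to become healthy. The sketch below approximates when failing liveness probes trigger a restart, using the Kubernetes default failureThreshold of 3 (the manifest does not set it) and ignoring probe timeouts and kubelet jitter. The function names and the example load times are hypothetical.

```python
def liveness_deadline_seconds(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate seconds after container start at which failing probes cause a restart."""
    # The first probe runs after initial_delay; the container is restarted
    # once failure_threshold consecutive probes have failed.
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(load_seconds: float, initial_delay: int, period: int,
                     failure_threshold: int = 3) -> bool:
    """True if /health can start answering before the restart deadline."""
    return load_seconds < liveness_deadline_seconds(initial_delay, period, failure_threshold)


if __name__ == "__main__":
    deadline = liveness_deadline_seconds(initial_delay=120, period=30)
    print(f"Restart deadline: about {deadline} s after start")
    for load in (90, 240):  # hypothetical model load times in seconds
        print(f"load time {load} s -> survives: {survives_startup(load, 120, 30)}")
```

If the model takes longer to load than this window, for example on a first start that still has to download weights, the pod restarts in a loop. Raising initialDelaySeconds or adding a startupProbe are two ways to widen the window.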

Step 3: Configure Multi-GPU Setup with Tensor Parallelism

Tensor parallelism is a model-parallel technique that splits individual transformer layers across multiple GPUs. Each GPU holds a shard of the attention and MLP weights, and NCCL all-reduce operations combine the partial results between them.
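
The idea behind sharding a layer and combining the results can be shown with a toy example. The sketch below is a CPU-only illustration in plain Python, not how vLLM implements it: slices of a list stand in for GPU shards, and an element-wise sum stands in for the NCCL all-reduce. It splits a weight matrix by rows, lets each "rank" compute a partial output from its slice of the input, and sums the partials.

```python
def matvec(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def row_parallel_matvec(x, w, num_shards):
    """Compute x @ w by sharding the rows of w and summing the partial outputs."""
    n = len(x)
    if n % num_shards != 0:
        raise ValueError("input dimension must be divisible by the number of shards")
    step = n // num_shards
    partials = []
    for rank in range(num_shards):
        lo, hi = rank * step, (rank + 1) * step
        # Each rank only sees its slice of the input and its rows of the weights.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # Stand-in for all-reduce: element-wise sum of every rank's partial output.
    return [sum(values) for values in zip(*partials)]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [[1, 0], [0, 1], [2, 2], [3, 1]]
    print("single device:", matvec(x, w))
    print("two shards:   ", row_parallel_matvec(x, w, num_shards=2))
```

Both calls produce the same vector, which is the point: each shard stores only part of the weights, at the cost of a communication step per sharded layer.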

When your model exceeds single-GPU memory, add --tensor-parallel-size. This value must match nvidia.com/gpu in resources.limits:

vllm-deployment-multi-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-70b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-70b
  template:
    metadata:
      labels:
        app: vllm-70b
    spec:
      nodeSelector:
        node-type: gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.4
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Meta-Llama-3-70B-Instruct
            - --tensor-parallel-size
            - "4"
            - --dtype
            - bfloat16
            # Llama 3 70B has an 8192-token context window; larger values are rejected at startup
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.92"
            - --max-num-seqs
            - "128"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "384Gi"
              cpu: "32"
            requests:
              nvidia.com/gpu: "4"
              memory: "256Gi"
              cpu: "16"
          env:
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: HF_HOME
              value: "/models"
          volumeMounts:
            # Tensor-parallel workers exchange data through shared memory
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory

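Critical: keep --tensor-parallel-size equal to nvidia.com/gpu in resources.limits. If the pod sees fewer GPUs than the tensor-parallel size, vLLM fails at startup, often with cryptic NCCL or CUDA initialization errors that waste hours of debugging. If it sees more, the extra GPUs sit idle.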
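
Since this mismatch is so costly to debug at runtime, it can be caught before the manifest is applied. The sketch below is one possible pre-flight check, assuming you already have the container's args list and GPU limit as Python values (for example after loading the manifest with a YAML parser). The function name check_tensor_parallel is hypothetical.

```python
def check_tensor_parallel(args: list, gpu_limit: str) -> None:
    """Raise ValueError if --tensor-parallel-size differs from the GPU limit."""
    tp_size = 1  # vLLM's default when the flag is not passed
    if "--tensor-parallel-size" in args:
        # The manifest style used here puts the value in the next list item.
        tp_size = int(args[args.index("--tensor-parallel-size") + 1])
    gpus = int(gpu_limit)
    if tp_size != gpus:
        raise ValueError(
            f"--tensor-parallel-size is {tp_size} but nvidia.com/gpu limit is {gpus}"
        )


if __name__ == "__main__":
    container_args = ["--model", "meta-llama/Meta-Llama-3-70B-Instruct",
                      "--tensor-parallel-size", "4"]
    check_tensor_parallel(container_args, "4")  # passes silently
    try:
        check_tensor_parallel(container_args, "2")
    except ValueError as err:
        print("rejected:", err)
```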

Step 4: Mount Models from Local Storage or HuggingFace

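Option A: HuggingFace Hub (development only)

Store your Hub token in a Secret, then expose it to the vLLM container as the HF_TOKEN environment variable through a secretKeyRef so that gated models such as Llama 3 can be downloaded at startup:

kubectl create secret generic hf-token \
  --from-literal=token="$HF_TOKEN" -n llm-serving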

Option B: Local PV (recommended for production)

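apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-pvc
  namespace: llm-serving
spec:
  # ReadWriteOnce binds the volume to a single node; replicas scheduled on
  # other nodes need storage that supports ReadOnlyMany or ReadWriteMany
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-local-nvme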

Pre-download the weights with a Kubernetes Job. The Job below writes them to /models/llama-3-8b, so to serve that copy, set --model to /models/llama-3-8b in the Deployment (and pass --served-model-name if clients should keep using the Hub name):

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: model-downloader
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
        - name: dl
          image: vllm/vllm-openai:v0.8.4
          command: [huggingface-cli, download, meta-llama/Meta-Llama-3-8B-Instruct, --local-dir, /models/llama-3-8b]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts: [{name: cache, mountPath: /models}]
      volumes: [{name: cache, persistentVolumeClaim: {claimName: vllm-model-pvc}}]
      restartPolicy: OnFailure
EOF

Step 5: Expose the OpenAI-Compatible API Server

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-serving
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [llm-api.yourdomain.com]
      secretName: vllm-tls
  rules:
    - host: llm-api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000

Note: Set proxy timeouts to 3600 seconds. Default NGINX timeouts clock in at 60s, which kills long-running LLM requests mid-generation.
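
A quick estimate shows why the 60-second default is too short. The sketch below assumes a non-streaming request, where the proxy receives no bytes until generation finishes, and a hypothetical per-request decode rate of 25 tokens per second. Measure your own throughput before picking a timeout.

```python
def generation_seconds(max_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate max_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_tokens / tokens_per_second


def exceeds_timeout(max_tokens: int, tokens_per_second: float, timeout_seconds: float) -> bool:
    """True if a non-streaming response would arrive after the proxy timeout."""
    return generation_seconds(max_tokens, tokens_per_second) > timeout_seconds


if __name__ == "__main__":
    rate = 25.0  # hypothetical decode rate per request, in tokens per second
    for max_tokens in (50, 1024, 4096):
        seconds = generation_seconds(max_tokens, rate)
        cut_off = exceeds_timeout(max_tokens, rate, timeout_seconds=60)
        print(f"{max_tokens} tokens: about {seconds:.0f} s, cut off at 60 s: {cut_off}")
```

Streaming responses are less exposed, because the NGINX read timeout applies between successive reads rather than to the whole response, but a generous timeout still protects requests that queue before their first token.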

Test connectivity:

# The port-forward blocks, so run it in one terminal and curl from another
kubectl port-forward svc/vllm-service 8000:8000 -n llm-serving
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
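
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the port-forward above is running on localhost:8000; the helper names build_chat_request and chat are hypothetical, and error handling is left out.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str,
         max_tokens: int = 50, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the port-forward active, calling chat("http://localhost:8000", "meta-llama/Meta-Llama-3-8B-Instruct", "Hello") returns the generated reply as a string.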

Step 6: Configure Horizontal Pod Autoscaler with Custom Metrics

CPU-based HPA is a poor fit for LLM inference. The GPU is the bottleneck and CPU utilization says little about GPU load, so scale on vLLM’s Prometheus metrics instead.

Install Prometheus Adapter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring --set prometheus.url=http://prometheus.monitoring.svc

Configure a custom metric rule:

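apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # vLLM exports this gauge as vllm:num_requests_running; the rule renames
      # it so the HPA can reference it as vllm_num_requests_running
      - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: namespace}
            pod: {resource: pod}
        name:
          matches: "^vllm:num_requests_running$"
          as: "vllm_num_requests_running"
        metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'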

Create the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies: [{type: Pods, value: 1, periodSeconds: 120}]
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{type: Pods, value: 1, periodSeconds: 300}]
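
To reason about what this HPA will do, it helps to reproduce the replica calculation by hand. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target), with the default 10% tolerance and the min and max from the manifest. It ignores the behavior policies, which only limit how quickly the controller moves toward the result. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int = 1, max_replicas: int = 5,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA aims for, before behavior policies are applied."""
    ratio = current_avg / target_avg
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Average running requests per pod, against the target of 50 from the manifest.
    for replicas, avg in ((2, 80), (2, 52), (3, 10), (4, 200)):
        print(f"{replicas} replicas at avg {avg}: aim for {desired_replicas(replicas, avg, 50)}")
```

With the scaleUp policy above, the controller adds at most one pod every 120 seconds, so a jump from 2 to 4 replicas takes at least two scaling periods.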

Production tip: The 5-minute scale-down stabilization window is deliberate. Cold-starting a large model such as the 70B deployment can take 3–5 minutes, so removing replicas too eagerly leaves you short of capacity, and latency suffers, when traffic spikes again.

Continue to Part 3: Tensor Parallelism and Quantization for production configuration tuning that maximizes throughput.
