vLLM Kubernetes Deployment: Complete Guide and Tips
Part 2 of 6. In Part 1 we covered the architecture and prerequisites. Here, we deploy vLLM on Kubernetes step by step. Continue to Part 3: Tensor Parallelism and Quantization.
Step-by-Step: Run vLLM on Kubernetes
These six numbered steps take you from a single-GPU development setup to a full multi-GPU inference server on Kubernetes.
Step 1: Select the vLLM Container Image
Always pin to a specific version; the latest tag has no place in a production vLLM deployment.
    docker pull vllm/vllm-openai:v0.8.4

For security-hardened deployments:

    FROM vllm/vllm-openai:v0.8.4
    USER root
    RUN pip install --no-cache-dir transformers==4.48.0 accelerate==1.3.0
    USER vllm

Step 2: Create the Deployment Manifest with GPU Resources
This single-GPU deployment is suitable for models like Mistral 7B or Llama 3 8B:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-inference
      namespace: llm-serving
      labels:
        app: vllm
        model: llama-3-8b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "8000"
            prometheus.io/path: "/metrics"
        spec:
          nodeSelector:
            nvidia.com/gpu.present: "true"
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.8.4
              command:
                - python
                - -m
                - vllm.entrypoints.openai.api_server
              args:
                - --model
                - meta-llama/Meta-Llama-3-8B-Instruct
                - --dtype
                - bfloat16
                - --max-model-len
                - "8192"
                - --gpu-memory-utilization
                - "0.9"
                - --max-num-seqs
                - "256"
                - --port
                - "8000"
              ports:
                - containerPort: 8000
                  name: http
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  memory: "48Gi"
                  cpu: "8"
                requests:
                  nvidia.com/gpu: "1"
                  memory: "32Gi"
                  cpu: "4"
              env:
                - name: HF_HOME
                  value: "/models"
                - name: VLLM_LOGGING_LEVEL
                  value: "INFO"
              volumeMounts:
                - name: model-cache
                  mountPath: /models
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 30
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 60
                periodSeconds: 10
          volumes:
            - name: model-cache
              persistentVolumeClaim:
                claimName: vllm-model-pvc

Key flags explained:
| Flag | Value | Purpose |
|---|---|---|
| --dtype | bfloat16 | Balanced precision and memory. Use float16 for older GPUs without bf16 support. |
| --max-model-len | 8192 | Hard limit on total sequence length (input + output). |
| --gpu-memory-utilization | 0.9 | Reserves 90% of GPU VRAM for vLLM. Leave headroom for CUDA scratch space. |
| --max-num-seqs | 256 | Maximum concurrent sequences. This is your batch size ceiling. |
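
To see how --gpu-memory-utilization and --max-model-len interact, here is a rough back-of-the-envelope sketch in Python. The 80 GiB GPU size and the Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) are assumptions made for illustration, and the estimate ignores activation and CUDA graph memory, so treat the result as an upper bound rather than the figure vLLM will report at startup.

```python
# Rough KV-cache budget estimate. Every hardware and model number below is an
# illustrative assumption, not a value read from vLLM.

GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_budget(gpu_mem_gib: float, gpu_memory_utilization: float,
                          num_params: float, bytes_per_token: int,
                          dtype_bytes: int = 2) -> int:
    """Upper bound on tokens that fit in the KV cache once weights are loaded."""
    usable = gpu_mem_gib * GIB * gpu_memory_utilization
    weights = num_params * dtype_bytes
    return max(0, int((usable - weights) // bytes_per_token))


# Assumed shape for an 8B-parameter model stored in bfloat16 (2 bytes/value).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = kv_cache_token_budget(gpu_mem_gib=80, gpu_memory_utilization=0.9,
                               num_params=8e9, bytes_per_token=per_token)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Approximate token budget: {budget:,}")
print(f"Full-length (8192-token) sequences that fit: {budget // 8192}")
```

Under these assumptions, far fewer than 256 full-length sequences fit at once, which is why --max-num-seqs acts as a ceiling rather than a guarantee of concurrency.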
Apply the manifest:
    kubectl create namespace llm-serving
    kubectl apply -f vllm-deployment.yaml

Step 3: Configure Multi-GPU Setup with Tensor Parallelism
Tensor parallelism: a model-parallel technique that splits individual transformer layers across multiple GPUs. Each GPU holds a shard of the attention and MLP weights, and NCCL all-reduce operations combine the partial activations between them.
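
The sharding idea can be sketched on plain Python lists. This toy example is not how vLLM implements it: the "devices" are list slices, the all-reduce is an element-wise sum, and the function names are made up for illustration. It only shows that splitting a layer's input dimension across devices and summing the partial results reproduces the unsharded output.

```python
# Toy illustration of tensor parallelism on plain Python lists.
# Real implementations shard GPU tensors and communicate over NCCL.


def matvec(matrix, vector):
    """Multiply an (out x in) matrix by a vector of length in."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def input_sharded_matvec(matrix, vector, num_devices):
    """Shard the input dimension across devices, then all-reduce (sum)."""
    cols = len(vector)
    assert cols % num_devices == 0, "input dimension must divide evenly"
    shard = cols // num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        local_w = [row[lo:hi] for row in matrix]   # this device's weight shard
        local_x = vector[lo:hi]                    # matching slice of the input
        partials.append(matvec(local_w, local_x))  # partial output, full length
    # All-reduce step: sum the partial outputs element-wise.
    return [sum(vals) for vals in zip(*partials)]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, -1, 2]

full = matvec(W, x)
sharded = input_sharded_matvec(W, x, num_devices=2)
print(full, sharded)  # both print as [6, 14]
```

Each simulated device stores only half of W, which is the memory saving tensor parallelism buys; the price is the all-reduce after every sharded layer.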
When your model exceeds single-GPU memory, add --tensor-parallel-size. This value must match nvidia.com/gpu in resources.limits:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-inference-70b
      namespace: llm-serving
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-70b
      template:
        metadata:
          labels:
            app: vllm-70b
        spec:
          nodeSelector:
            node-type: gpu
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.8.4
              command:
                - python
                - -m
                - vllm.entrypoints.openai.api_server
              args:
                - --model
                - meta-llama/Meta-Llama-3-70B-Instruct
                - --tensor-parallel-size
                - "4"
                - --dtype
                - bfloat16
                - --max-model-len
                - "32768"
                - --gpu-memory-utilization
                - "0.92"
                - --max-num-seqs
                - "128"
                - --port
                - "8000"
              ports:
                - containerPort: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  memory: "384Gi"
                  cpu: "32"
                requests:
                  nvidia.com/gpu: "4"
                  memory: "256Gi"
                  cpu: "16"
              env:
                - name: NCCL_IB_DISABLE
                  value: "1"
                - name: HF_HOME
                  value: "/models"

Critical: --tensor-parallel-size must equal nvidia.com/gpu in resources.limits exactly. Any mismatch triggers cryptic NCCL initialization errors that waste hours debugging.
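
A small pre-deploy check can catch this mismatch before the pod ever starts. The sketch below is hypothetical tooling, not part of vLLM or kubectl: it assumes the container spec has already been loaded into a Python dict shaped like the manifest above, and the function names are made up for illustration.

```python
# Hypothetical pre-deploy check. The dict layout mirrors the container spec
# in the manifest; nothing here calls vLLM or the Kubernetes API.


def tensor_parallel_size(args):
    """Return the value of --tensor-parallel-size, defaulting to 1 if absent."""
    for i, arg in enumerate(args):
        if arg == "--tensor-parallel-size":
            return int(args[i + 1])
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
    return 1


def check_gpu_match(container):
    """Raise ValueError if the TP size and the GPU limit disagree."""
    tp = tensor_parallel_size(container.get("args", []))
    limits = container.get("resources", {}).get("limits", {})
    gpus = int(limits.get("nvidia.com/gpu", 0))
    if tp != gpus:
        raise ValueError(
            f"--tensor-parallel-size={tp} but nvidia.com/gpu limit is {gpus}"
        )
    return tp


container = {
    "args": ["--model", "meta-llama/Meta-Llama-3-70B-Instruct",
             "--tensor-parallel-size", "4"],
    "resources": {"limits": {"nvidia.com/gpu": "4"}},
}
print(check_gpu_match(container))  # prints 4
```

Running a check like this in CI is cheaper than discovering the mismatch from NCCL logs after the image has been pulled and the weights loaded.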
Step 4: Mount Models from Local Storage or HuggingFace
Option A: HuggingFace Hub (development only)
    kubectl create secret generic hf-token \
      --from-literal=token=$HF_TOKEN -n llm-serving

Option B: Local PV (recommended for production)

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vllm-model-pvc
      namespace: llm-serving
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 500Gi
      storageClassName: fast-local-nvme

Pre-download via a Kubernetes Job:

    kubectl apply -f - <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: model-downloader
      namespace: llm-serving
    spec:
      template:
        spec:
          containers:
            - name: dl
              image: vllm/vllm-openai:v0.8.4
              command: [huggingface-cli, download, meta-llama/Meta-Llama-3-8B-Instruct, --local-dir, /models/llama-3-8b]
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              volumeMounts: [{name: cache, mountPath: /models}]
          volumes: [{name: cache, persistentVolumeClaim: {claimName: vllm-model-pvc}}]
          restartPolicy: OnFailure
    EOF

Step 5: Expose the OpenAI-Compatible API Server
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
      namespace: llm-serving
    spec:
      selector:
        app: vllm
      ports:
        - port: 8000
          targetPort: 8000
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: vllm-ingress
      namespace: llm-serving
      annotations:
        nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
        cert-manager.io/cluster-issuer: "letsencrypt-prod"
    spec:
      ingressClassName: nginx
      tls:
        - hosts: [llm-api.yourdomain.com]
          secretName: vllm-tls
      rules:
        - host: llm-api.yourdomain.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: vllm-service
                    port:
                      number: 8000

Note: Set proxy timeouts to 3600 seconds. Default NGINX timeouts clock in at 60s, which kills long-running LLM requests mid-generation.
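
Because the server speaks the OpenAI chat completions protocol, any HTTP client can call it once the Service is reachable. The following is a minimal sketch using only the Python standard library. The base URL http://localhost:8000 is an assumption (for example, a local port-forward to the Service), and the helper names are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Assumed base URL; adjust to wherever the Service is reachable.
req = build_chat_request(
    "http://localhost:8000",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "Hello",
)
print(req.full_url)
# With the server reachable, call: print(send(req))
```

The client-side timeout plays the same role as the proxy timeouts: long generations need a generous value, or the caller gives up before the model finishes.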
Test connectivity:
    kubectl port-forward svc/vllm-service 8000:8000 -n llm-serving

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

Step 6: Configure Horizontal Pod Autoscaler with Custom Metrics
CPU-based HPA is a poor fit for LLM inference. CPU utilization tells you little about how busy the GPU is or how many requests are in flight, so scale on vLLM’s Prometheus metrics instead.
Install Prometheus Adapter:
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
      --namespace monitoring --set prometheus.url=http://prometheus.monitoring.svc

Configure a custom metric rule:
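    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-adapter
      namespace: monitoring
    data:
      config.yaml: |
        rules:
          # vLLM exports this gauge as vllm:num_requests_running (with a colon).
          # The name rule re-exposes it under the underscore name the HPA uses.
          - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
            resources:
              overrides:
                namespace: {resource: namespace}
                pod: {resource: pod}
            name:
              matches: "^vllm:num_requests_running$"
              as: "vllm_num_requests_running"
            metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Create the HPA: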
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: vllm-hpa
      namespace: llm-serving
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: vllm-inference
      minReplicas: 1
      maxReplicas: 5
      metrics:
        - type: Pods
          pods:
            metric:
              name: vllm_num_requests_running
            target:
              type: AverageValue
              averageValue: "50"
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies: [{type: Pods, value: 1, periodSeconds: 120}]
        scaleDown:
          stabilizationWindowSeconds: 300
          policies: [{type: Pods, value: 1, periodSeconds: 300}]

Production tip: Scale-down stabilization sits at 5 minutes for good reason; cold-starting a 70B model takes 3–5 minutes. Aggressive scale-down will demolish your latency during traffic spikes.
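
To build intuition for the averageValue target of 50, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: desired replicas are the current count scaled by the ratio of observed to target metric, rounded up and clamped to the min/max bounds. It is a simplified model: the 10% tolerance is the documented default, and the stabilization windows and scaling policies are deliberately left out.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Simplified HPA calculation for an AverageValue Pods metric."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two pods averaging 90 running requests each against a target of 50.
print(desired_replicas(2, 90, 50))   # prints 4
# One pod slightly above target stays put because of the tolerance band.
print(desired_replicas(1, 52, 50))   # prints 1
# Three mostly idle pods would eventually shrink to the minimum.
print(desired_replicas(3, 10, 50))   # prints 1
```

With the behavior block in the HPA manifest, the jump from 2 to 4 replicas would not happen at once: the scale-up policy admits one additional pod per 120 seconds, and scale-down waits out the 300-second stabilization window.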
Continue to Part 3: Tensor Parallelism and Quantization for production configuration tuning that maximizes throughput.