vLLM Kubernetes Deployment: Complete Guide and Tips

Part 2 of 6. In Part 1 we covered the architecture and prerequisites. Here, we deploy vLLM on Kubernetes step by step. Continue to Part 3: Tensor Parallelism and Quantization.

Step-by-Step: Run vLLM on Kubernetes

These six numbered steps take you from a single-GPU development setup to a full multi-GPU inference server on Kubernetes.

Step 1: Select the vLLM Container Image

Always pin the image to a specific version tag. The latest tag can change underneath you, so a pod restart may silently pull a different vLLM release.

docker pull vllm/vllm-openai:v0.8.4

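For security-hardened deployments, build a derived image that pins extra dependencies and drops root privileges:

FROM vllm/vllm-openai:v0.8.4
USER root
# Confirm these pins satisfy the dependency range of the vLLM release in the base image.
RUN pip install --no-cache-dir transformers==4.48.0 accelerate==1.3.0
# Create the unprivileged user if the base image does not already provide it.
RUN id vllm >/dev/null 2>&1 || useradd --create-home vllm
USER vllm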

Step 2: Create the Deployment Manifest with GPU Resources

This single-GPU deployment is suitable for models like Mistral 7B or Llama 3 8B:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm-serving
  labels:
    app: vllm
    model: llama-3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.4
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Meta-Llama-3-8B-Instruct
            - --dtype
            - bfloat16
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.9"
            - --max-num-seqs
            - "256"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "48Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "4"
          env:
            - name: HF_HOME
              value: "/models"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
          volumeMounts:
            - name: model-cache
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-pvc

Key flags explained:

Flag	Value	Purpose
`--dtype`	bfloat16	Balanced precision and memory. Use `float16` on GPUs older than Ampere (compute capability below 8.0), which lack bf16 support.
`--max-model-len`	8192	Hard limit on total sequence length (input + output tokens).
`--gpu-memory-utilization`	0.9	Lets vLLM use up to 90% of GPU memory for weights, activations, and KV cache. The remaining 10% is headroom for the CUDA context and other allocations.
`--max-num-seqs`	256	Maximum number of sequences scheduled in one batch. This is your batch size ceiling; further requests wait in the queue.
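
To see how `--max-model-len` and `--max-num-seqs` compete for the memory granted by `--gpu-memory-utilization`, a rough KV-cache estimate can help. The sketch below is back-of-the-envelope arithmetic, not vLLM's own accounting: the function name is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128) is assumed from the published Llama 3 8B configuration with 2 bytes per bf16 value.

```shell
# Bytes of KV cache per token:
#   2 (one K and one V tensor) * layers * kv_heads * head_dim * bytes_per_value
kv_bytes_per_token() {
  # $1 = layers, $2 = KV heads, $3 = head dimension, $4 = bytes per value
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head_dim 128, bf16 = 2 bytes
per_token=$(kv_bytes_per_token 32 8 128 2)
per_seq_mib=$(( per_token * 8192 / 1024 / 1024 ))

echo "KV cache per token: ${per_token} bytes"
echo "KV cache for one 8192-token sequence: ${per_seq_mib} MiB"
```

Under these assumptions a single full-length sequence needs about 1 GiB of KV cache, so 256 sequences cannot all reach 8192 tokens on one GPU. vLLM allocates KV cache on demand, so `--max-num-seqs` acts as a ceiling rather than a reservation.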

Apply the manifest:

kubectl create namespace llm-serving
kubectl apply -f vllm-deployment.yaml
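
After applying the manifest, it is worth confirming that the pod actually became ready before moving on. A minimal sketch using standard kubectl subcommands; the helper name and the 10-minute timeout are assumptions, so adjust the timeout to your model's load time.

```shell
# Wait for the Deployment to finish rolling out, then print recent container logs.
check_rollout() {
  # $1 = namespace, $2 = Deployment name
  kubectl rollout status "deployment/$2" -n "$1" --timeout=10m &&
    kubectl logs "deployment/$2" -n "$1" --tail=20
}

# Usage: check_rollout llm-serving vllm-inference
```

If the rollout times out, `kubectl describe pod -l app=vllm -n llm-serving` usually shows whether the pod is unschedulable (no free GPU) or failing its probes.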

Step 3: Configure Multi-GPU Setup with Tensor Parallelism

Tensor parallelism is a model-parallel technique that splits each transformer layer across multiple GPUs. Each GPU holds a shard of the attention and MLP weights, and NCCL all-reduce operations combine the partial activations between GPUs.

When your model exceeds single-GPU memory, add --tensor-parallel-size. This value must match nvidia.com/gpu in resources.limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-70b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-70b
  template:
    metadata:
      labels:
        app: vllm-70b
    spec:
      nodeSelector:
        node-type: gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.4
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Meta-Llama-3-70B-Instruct
            - --tensor-parallel-size
            - "4"
            - --dtype
            - bfloat16
            - --max-model-len
            - "8192"  # Llama 3 (unlike Llama 3.1) supports at most 8192 positions
            - --gpu-memory-utilization
            - "0.92"
            - --max-num-seqs
            - "128"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "384Gi"
              cpu: "32"
            requests:
              nvidia.com/gpu: "4"
              memory: "256Gi"
              cpu: "16"
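          env:
            # Disables InfiniBand transport; fine for single-node tensor parallelism
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: HF_HOME
              value: "/models"
          volumeMounts:
            - name: model-cache
              mountPath: /models
            # Tensor-parallel workers exchange data through shared memory
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi  # assumed size; tune for your workload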

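Critical: --tensor-parallel-size must equal nvidia.com/gpu in resources.limits exactly. If it is larger than the number of allocated GPUs, vLLM fails during startup, often with hard-to-read NCCL or CUDA initialization errors that waste hours of debugging; if it is smaller, the pod reserves GPUs that vLLM never uses.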
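
Before picking a tensor-parallel size, two quick checks are useful: how much weight memory each GPU must hold, and whether the attention heads split evenly across the GPUs (vLLM rejects sizes that do not divide the head count). The sketch below uses hypothetical helper names and assumes the published Llama 3 70B shape of 64 attention heads; it counts weights only and ignores KV cache and activations.

```shell
# Approximate weight memory per GPU in GB (integer arithmetic).
weights_gb_per_gpu() {
  # $1 = parameters in billions, $2 = bytes per parameter, $3 = tensor-parallel size
  echo $(( $1 * $2 / $3 ))
}

# Succeeds only if the attention heads divide evenly across the GPUs.
tp_divides_heads() {
  # $1 = attention heads, $2 = tensor-parallel size
  [ $(( $1 % $2 )) -eq 0 ]
}

echo "70B in bf16 on 4 GPUs: $(weights_gb_per_gpu 70 2 4) GB of weights per GPU"
tp_divides_heads 64 4 && echo "64 heads split evenly across 4 GPUs"
```

With roughly 35 GB of weights per GPU, 80 GB cards leave room for KV cache under these assumptions, while 40 GB cards would leave almost none.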

Step 4: Mount Models from Local Storage or HuggingFace

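Option A: HuggingFace Hub (development only)

Store your access token in a Secret, then expose it to the container as the HF_TOKEN environment variable through a secretKeyRef, the same way the downloader Job under Option B does: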

kubectl create secret generic hf-token \
  --from-literal=token=$HF_TOKEN -n llm-serving

Option B: Local PV (recommended for production)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-pvc
  namespace: llm-serving
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-local-nvme

Note: ReadWriteOnce binds the volume to a single node, so every replica created by the autoscaler in Step 6 must be scheduled on that node. If replicas need to span nodes, use a storage class that supports ReadOnlyMany or ReadWriteMany, or provision one PVC per node.

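Pre-download via a Kubernetes Job. The Job writes the weights to /models/llama-3-8b, so the Deployment must load that path (--model /models/llama-3-8b, optionally with --served-model-name to keep the original model name in the API). If you leave the Hub ID in --model, vLLM looks in the Hub cache under HF_HOME instead and tries to download the model again: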

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: model-downloader
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
      - name: dl
        image: vllm/vllm-openai:v0.8.4
        command: [huggingface-cli, download, meta-llama/Meta-Llama-3-8B-Instruct, --local-dir, /models/llama-3-8b]
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
        volumeMounts: [{name: cache, mountPath: /models}]
      volumes: [{name: cache, persistentVolumeClaim: {claimName: vllm-model-pvc}}]
      restartPolicy: OnFailure
EOF
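
The Deployment should only start once the download has finished. A minimal sketch for blocking on the Job with standard kubectl subcommands; the helper name and the 60-minute timeout are assumptions.

```shell
# Block until the Job reports the Complete condition, then show the end of its log.
wait_for_download() {
  # $1 = namespace, $2 = Job name
  kubectl wait --for=condition=complete "job/$2" -n "$1" --timeout=60m &&
    kubectl logs "job/$2" -n "$1" --tail=5
}

# Usage: wait_for_download llm-serving model-downloader
```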

Step 5: Expose the OpenAI-Compatible API Server

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-serving
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [llm-api.yourdomain.com]
      secretName: vllm-tls
  rules:
    - host: llm-api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000

Note: Set both proxy timeouts to 3600 seconds. The ingress-nginx default is 60 seconds, which cuts off long-running LLM requests mid-generation.

Test connectivity:

kubectl port-forward svc/vllm-service 8000:8000 -n llm-serving

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
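
Because the timeouts above matter most for long generations, it can also help to test a streaming request, which returns tokens as they are produced. The sketch below assumes the port-forward from the previous command is still running; the helper names are hypothetical, and the prompt must not contain double quotes since the JSON is built with plain string formatting.

```shell
# Build a chat-completions request body with streaming enabled.
chat_payload() {
  # $1 = model name, $2 = prompt (no double quotes)
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 50, "stream": true}' "$1" "$2"
}

# Send the request; -N turns off curl's output buffering so chunks print as they arrive.
stream_chat() {
  # $1 = base URL, $2 = model name, $3 = prompt
  curl -sN "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$2" "$3")"
}

# Usage: stream_chat http://localhost:8000 meta-llama/Meta-Llama-3-8B-Instruct "Hello"
```

`curl http://localhost:8000/v1/models` is a quick way to confirm the exact model name the server expects in the request.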

Step 6: Configure Horizontal Pod Autoscaler with Custom Metrics

CPU-based HPA is a poor fit for LLM inference. The work happens on the GPU, so CPU utilization says little about how saturated a replica is. Scale on vLLM’s Prometheus metrics instead.

Install Prometheus Adapter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring --set prometheus.url=http://prometheus.monitoring.svc

Configure a custom metric rule:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
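      # vLLM exports this gauge as vllm:num_requests_running (with a colon);
      # the name block renames it so the HPA can reference it with underscores.
      - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: namespace}
            pod: {resource: pod}
        name:
          matches: "^vllm:num_requests_running$"
          as: "vllm_num_requests_running"
        metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'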
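
Before creating the HPA, check that the adapter actually serves the metric through the custom metrics API; an HPA pointed at a missing metric simply never scales. A minimal sketch, with hypothetical helper names, that queries the standard custom.metrics.k8s.io path for all pods in a namespace:

```shell
# Build the custom metrics API path for a per-pod metric.
custom_metric_path() {
  # $1 = namespace, $2 = metric name as exposed by the adapter
  echo "/apis/custom.metrics.k8s.io/v1beta1/namespaces/$1/pods/*/$2"
}

# Query the API server directly; prints a JSON MetricValueList on success.
show_custom_metric() {
  kubectl get --raw "$(custom_metric_path "$1" "$2")"
}

# Usage: show_custom_metric llm-serving vllm_num_requests_running
```

An empty item list or a NotFound error usually means the adapter rule did not match any series, for example because the pods are not being scraped yet.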

Create the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies: [{type: Pods, value: 1, periodSeconds: 120}]
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{type: Pods, value: 1, periodSeconds: 300}]

Production tip: The 5-minute scale-down stabilization window is deliberate. Cold-starting a 70B model takes 3–5 minutes, so a replica removed too eagerly cannot be brought back in time for the next traffic spike, and latency suffers until it is ready.
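
To reason about how the averageValue target of 50 translates into replica counts, the documented HPA rule is desiredReplicas = ceil(currentReplicas * currentValue / targetValue). The sketch below reproduces that arithmetic with a hypothetical helper; it ignores the controller's tolerance band and the min/max replica clamps.

```shell
# desired = ceil(current_replicas * current_avg / target_avg), in integer arithmetic
hpa_desired_replicas() {
  # $1 = current replicas, $2 = current average metric per pod, $3 = target average
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# 2 replicas, each averaging 80 running requests, against a target of 50
echo "desired replicas: $(hpa_desired_replicas 2 80 50)"
```

In this example the HPA wants 4 replicas, but the scaleUp policy above adds at most one pod per 120 seconds, so reaching that count takes two scaling periods plus model load time.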

Continue to Part 3: Tensor Parallelism and Quantization for production configuration tuning that maximizes throughput.