Ollama Kubernetes Production Manifests and Deploy

Part 3 of 4: Part 1 | Part 2 | Part 3 | Part 4

Step 3: Deploy Ollama with Production-Ready Manifests

This manifest bundles the Deployment, PVC, and ConfigMap into a single YAML file I’ve used across six production clusters.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd  # Replace with your SSD-backed StorageClass
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ollama
data:
  OLLAMA_HOST: "0.0.0.0:11434"  # Ollama reads both bind address and port from OLLAMA_HOST; it has no OLLAMA_PORT setting
  OLLAMA_NUM_PARALLEL: "4"
  OLLAMA_MAX_LOADED_MODELS: "2"
  OLLAMA_KEEP_ALIVE: "30m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama
    app.kubernetes.io/component: inference-server
spec:
  replicas: 1
  strategy:
    type: Recreate  # Required for RWO PVCs
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama
    spec:
      nodeSelector:
        gpu-type: nvidia
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      # Uncomment if your cluster requires explicit GPU runtime class
      # runtimeClassName: nvidia
      terminationGracePeriodSeconds: 60
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: ollama-config
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "64Gi"  # 32Gi for 7B-13B models, 64Gi for 70B models
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false  # Ollama writes models to /root/.ollama
            runAsNonRoot: false
            capabilities:
              drop:
                - ALL
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models

Key choices I’ve validated in production:

strategy: Recreate: Required because a ReadWriteOnce volume can attach to only one node at a time, so a replacement pod scheduled on another node can’t mount it while the old pod is still running.
Resource limits: 64Gi supports 70B parameter models with quantization. Use 32Gi for 7B-13B models.
OLLAMA_NUM_PARALLEL: "4": Lets each loaded model serve up to four requests concurrently. Every parallel slot needs its own context memory, so increase this only if you have spare VRAM.
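lifecycle.preStop + terminationGracePeriodSeconds: 60: The preStop hook delays SIGTERM by 15 seconds so the pod can drop out of the Service endpoints before Ollama starts shutting down. The hook counts against the 60-second grace period, which leaves in-flight inference requests roughly 45 seconds to finish before SIGKILL.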
OLLAMA_KEEP_ALIVE: "30m": Keeps models loaded in VRAM for 30 minutes after the last request, reducing cold-start latency.

Apply the manifest:

kubectl apply -f 02-ollama-deployment.yaml

Wait for the pod to become Ready:

kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=ollama -n ollama --timeout=300s
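
Beyond the wait command, a few read-only checks can confirm that storage, rollout, and configuration line up. This is a minimal sketch that assumes the names used in the manifest above (namespace ollama, PVC ollama-models, Deployment ollama); the function name check_ollama_deploy is made up for this example.

```shell
# Hypothetical helper: read-only checks after applying the manifest.
check_ollama_deploy() {
  ns="${1:-ollama}"

  # PVC phase; prints "Bound" once a volume is attached to the claim.
  # With a WaitForFirstConsumer StorageClass it stays "Pending" until the pod is scheduled.
  kubectl get pvc ollama-models -n "$ns" -o jsonpath='{.status.phase}'
  echo

  # Blocks until the Deployment has rolled out or the timeout expires.
  kubectl rollout status deployment/ollama -n "$ns" --timeout=300s

  # Shows which OLLAMA_* variables reached the container environment.
  kubectl exec deployment/ollama -n "$ns" -- env | grep '^OLLAMA_'
}
```

Call it as check_ollama_deploy, or pass a different namespace as the first argument.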

Step 4: Expose Ollama with Service and Ingress

Create a ClusterIP Service for internal communication and an Ingress for external access.

apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: ollama
  ports:
    - port: 11434
      targetPort: 11434
      name: http
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "512m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.<your-cluster-domain>.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434

Warning: Ollama has no built-in authentication. I expose it through an Ingress only behind an auth proxy (oauth2-proxy, Authelia) or an internal VPN.

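Tip: The proxy-body-size: "512m" annotation matters if you plan to upload custom model files through the API, because ingress-nginx rejects request bodies larger than this limit. Model blobs often run to several gigabytes, so raise the value to fit your largest file.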

Apply and verify:

kubectl apply -f 03-service-ingress.yaml
kubectl get ingress -n ollama
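
To confirm the ClusterIP Service routes to the pod before involving the Ingress, one option is to call it from a throwaway pod inside the cluster. The sketch below assumes the curlimages/curl image can be pulled in your environment (any image that ships curl would do) and that the cluster uses the default cluster.local DNS suffix; the helper and pod names are made up for this example.

```shell
# Hypothetical helper: call the Service from a temporary pod inside the cluster.
# The pod is deleted automatically when the command finishes (--rm).
check_ollama_service() {
  ns="${1:-ollama}"
  kubectl run ollama-svc-check --rm -i --restart=Never \
    --image=curlimages/curl -n "$ns" --command -- \
    curl -sS "http://ollama.${ns}.svc.cluster.local:11434/api/version"
}
```

A healthy setup should answer with a small JSON document containing the Ollama server version.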

Step 5: Verification: Pull a Model and Run Inference

With the pod running, pull a model and test local LLM inference to confirm GPU acceleration is working.

Exec into the pod to pull your first model:

kubectl exec -it deployment/ollama -n ollama -- ollama pull llama3.2

Expected output:

pulling manifest
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB
verifying sha256 digest
writing manifest
success
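
After the pull, it can be worth confirming that the model landed on the persistent volume rather than on the container's ephemeral filesystem. A minimal sketch, assuming the Deployment name and the /root/.ollama mount path from the manifest; check_model_storage is a made-up helper name.

```shell
# Hypothetical helper: confirm the pulled model is registered and stored on the PVC.
check_model_storage() {
  ns="${1:-ollama}"

  # Lists the models Ollama knows about, with their sizes.
  kubectl exec deployment/ollama -n "$ns" -- ollama list

  # Shows the filesystem backing the model directory and how full it is.
  kubectl exec deployment/ollama -n "$ns" -- df -h /root/.ollama
}
```

If the df output shows the 100Gi volume mounted at /root/.ollama, the model should survive pod restarts.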

Test inference via port-forward:

kubectl port-forward svc/ollama 11434:11434 -n ollama

In another terminal:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is Kubernetes the right platform for self-hosted AI?",
  "stream": false
}'

You should receive a JSON response with generated text. If you see {"done":true,"response":"..."}, your local LLM inference stack is live.
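
The non-streaming response also carries timing fields, including eval_count (tokens generated) and eval_duration (in nanoseconds), which give a quick throughput figure. The helper below is a small sketch of that arithmetic; tokens_per_sec is a made-up name and the sample numbers are invented, not measurements.

```shell
# Converts eval_count (tokens) and eval_duration (nanoseconds)
# from an /api/generate response into tokens per second.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", count * 1e9 / ns }'
}

# Example with invented numbers: 100 tokens generated in 2 seconds.
tokens_per_sec 100 2000000000   # prints 50.0
```

Copy the two values out of your own response, or extract them with a JSON tool such as jq if you have one installed.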

For programmatic access, verify the OpenAI-compatible endpoint:

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

FAQ

Why use `strategy: Recreate` instead of `RollingUpdate`?

A ReadWriteOnce volume can be mounted by only one node at a time. A rolling update creates the new pod before terminating the old one, so if the new pod lands on a different node it fails to mount the PVC and the rollout stalls. Recreate terminates the old pod first, then starts the new one.

What do the liveness and readiness probes check?

Both probes hit the /api/tags endpoint. The liveness probe (every 30s after a 60s delay) restarts the container after three consecutive failures, which catches an Ollama process that has stopped responding. The readiness probe (every 10s after a 10s delay) keeps the pod out of the Service endpoints until Ollama is ready to serve requests. I’ve tuned these delays to account for startup time.
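
You can reproduce what the probes see by requesting the same endpoint by hand. This sketch assumes the port-forward from Step 5 is running on localhost:11434; probe_status is a made-up helper name.

```shell
# Hypothetical helper: print the HTTP status code of the endpoint the probes use.
# The kubelet treats any status from 200 to 399 as a passing httpGet probe.
probe_status() {
  base_url="${1:-http://localhost:11434}"
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "${base_url}/api/tags"
}
```

A healthy server should print 200. If curl cannot connect at all, it prints 000.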

How do I verify GPU acceleration is actually working?

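Run kubectl exec deployment/ollama -n ollama -- nvidia-smi while a request is being processed. If the GPU is in use, memory usage rises once the model loads and utilization climbs during inference, often to 90-100%. The process list can be empty inside a container even when the GPU is busy, so cross-check with ollama ps in the same pod: its PROCESSOR column reports whether the model sits on the GPU or the CPU.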
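
For a repeatable check, the GPU queries can be wrapped in one helper. This is a sketch under the assumption that the Deployment is named ollama as in the manifest; check_gpu_usage is a made-up name. Run it while an inference request is in flight.

```shell
# Hypothetical helper: two GPU checks run inside the Ollama pod.
check_gpu_usage() {
  ns="${1:-ollama}"

  # Lists loaded models; the PROCESSOR column shows the GPU/CPU split.
  kubectl exec deployment/ollama -n "$ns" -- ollama ps

  # Prints current GPU utilization and memory use as CSV.
  kubectl exec deployment/ollama -n "$ns" -- \
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
}
```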

What does `OLLAMA_KEEP_ALIVE: "30m"` do?

It keeps the loaded model in VRAM for 30 minutes after the last request. Ollama’s default is 5 minutes; once a model is unloaded, the next request pays 5-30 seconds of cold-start latency while it reloads. For chat applications, I recommend 30m. For batch processing, set it higher or to -1 (never unload).
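
The same setting can also be passed per request through the keep_alive field of the API, which takes precedence over the environment variable for that model. The sketch below assumes the port-forward from Step 5 is active; set_keep_alive is a made-up helper name.

```shell
# Hypothetical helper: load a model and set how long it stays in memory.
# A request without a prompt loads the model without generating text.
# Pass a duration string with a unit, such as 10m or 1h.
set_keep_alive() {
  model="$1"
  keep_alive="$2"
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"${model}\", \"keep_alive\": \"${keep_alive}\"}"
}
```

For example, set_keep_alive llama3.2 1h preloads the model and keeps it resident for an hour, which can help before a known burst of traffic.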

Can I run multiple models simultaneously with this setup?

Yes, up to the limit set by OLLAMA_MAX_LOADED_MODELS (2 in this manifest). Each loaded model consumes VRAM. On a single A100 with 80 GB, you can run two 7B models simultaneously. For production multi-model serving, I recommend the per-model Deployment pattern in Part 4.
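
For a rough sense of how many models fit, a back-of-the-envelope estimate of weight memory is parameters times bits per weight divided by eight. The helper below is only a sketch of that arithmetic: approx_weights_gb is a made-up name, the bit width is nominal (real quantization formats carry some extra bits per weight), and the result ignores KV cache and runtime overhead, so treat it as a lower bound.

```shell
# Rough weight-memory estimate in GB:
# parameters (in billions) x bits per weight / 8.
# Ignores KV cache, context length, and runtime overhead.
approx_weights_gb() {
  awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

approx_weights_gb 7 4    # 7B model, 4-bit quantization: prints 3.5
approx_weights_gb 70 4   # 70B model, 4-bit quantization: prints 35.0
```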

Next Steps

Your Ollama instance is running on Kubernetes with GPU acceleration, persistent storage, and ingress exposure. In Part 4, I cover production hardening: HPA alternatives, NetworkPolicies, monitoring, troubleshooting common errors, and when NOT to use Kubernetes for Ollama.
