Ollama Kubernetes Production Manifests and Deploy
Part 3 of 4: Part 1 | Part 2 | Part 3 | Part 4
Step 3: Deploy Ollama with Production-Ready Manifests
This manifest bundles the Deployment, PVC, and ConfigMap into a single YAML file that I've used across six production clusters.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ollama-models
      namespace: ollama
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
      storageClassName: fast-ssd  # Replace with your SSD-backed StorageClass
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ollama-config
      namespace: ollama
    data:
      OLLAMA_HOST: "0.0.0.0"
      OLLAMA_PORT: "11434"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
      OLLAMA_KEEP_ALIVE: "30m"
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama
      namespace: ollama
      labels:
        app.kubernetes.io/name: ollama
        app.kubernetes.io/component: inference-server
    spec:
      replicas: 1
      strategy:
        type: Recreate  # Required for RWO PVCs
      selector:
        matchLabels:
          app.kubernetes.io/name: ollama
      template:
        metadata:
          labels:
            app.kubernetes.io/name: ollama
        spec:
          nodeSelector:
            gpu-type: nvidia
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
          # Uncomment if your cluster requires explicit GPU runtime class
          # runtimeClassName: nvidia
          terminationGracePeriodSeconds: 60
          containers:
            - name: ollama
              image: ollama/ollama:0.5.7
              ports:
                - containerPort: 11434
                  name: http
                  protocol: TCP
              envFrom:
                - configMapRef:
                    name: ollama-config
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "64Gi"  # 32Gi for 7B-13B models, 64Gi for 70B models
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: models
                  mountPath: /root/.ollama
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "sleep 15"]
              livenessProbe:
                httpGet:
                  path: /api/tags
                  port: 11434
                initialDelaySeconds: 60
                periodSeconds: 30
                timeoutSeconds: 10
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /api/tags
                  port: 11434
                initialDelaySeconds: 10
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: false  # Ollama writes models to /root/.ollama
                runAsNonRoot: false
                capabilities:
                  drop:
                    - ALL
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: ollama-models

Key choices I've validated in production:

- strategy: Recreate: Required because a ReadWriteOnce volume attaches to one node at a time, so the new pod of a rolling update can fail to mount it while the old pod is still running.
- Resource limits: 64Gi supports 70B parameter models with quantization. Use 32Gi for 7B-13B models.
- OLLAMA_NUM_PARALLEL: "4": Lets each loaded model serve up to four requests concurrently. Increase only if you have spare VRAM.
- lifecycle.preStop + terminationGracePeriodSeconds: 60: The preStop sleep delays SIGTERM by 15 seconds so in-flight inference requests can finish, and the 60-second grace period caps total shutdown time before SIGKILL.
- OLLAMA_KEEP_ALIVE: "30m": Keeps models loaded in VRAM for 30 minutes after the last request, reducing cold-start latency.
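The sizing guidance above can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below is a back-of-the-envelope estimate only. The helper name estimate_weights_gib is hypothetical, and the result covers weights alone, so the KV cache and runtime overhead come on top.

```shell
#!/bin/sh
# Rough size of model weights in GiB: parameters x bits-per-weight / 8.
# Weights only; KV cache and runtime overhead are NOT included.
# estimate_weights_gib is a hypothetical helper, not part of Ollama.
estimate_weights_gib() {
  params_billion=$1   # e.g. 70 for a 70B model
  bits=$2             # e.g. 4 for 4-bit quantization, 16 for FP16
  bytes=$(( params_billion * 1000000000 * bits / 8 ))
  echo $(( bytes / 1073741824 ))   # integer GiB, rounded down
}

estimate_weights_gib 70 4    # prints 32  (4-bit 70B)
estimate_weights_gib 70 16   # prints 130 (FP16 70B)
estimate_weights_gib 13 4    # prints 6   (4-bit 13B)
```

By this estimate a 4-bit 70B model needs about 32 GiB for weights, which is why it fits under the 64Gi limit only when quantized.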
Apply the manifest:

    kubectl apply -f 02-ollama-deployment.yaml

Wait for the pod to become Ready:

    kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=ollama -n ollama --timeout=300s

Step 4: Expose Ollama with Service and Ingress
Create a ClusterIP Service for internal communication and an Ingress for external access.

    apiVersion: v1
    kind: Service
    metadata:
      name: ollama
      namespace: ollama
      labels:
        app.kubernetes.io/name: ollama
    spec:
      type: ClusterIP
      selector:
        app.kubernetes.io/name: ollama
      ports:
        - port: 11434
          targetPort: 11434
          name: http
          protocol: TCP
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: ollama
      namespace: ollama
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "512m"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    spec:
      ingressClassName: nginx
      rules:
        - host: ollama.<your-cluster-domain>.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: ollama
                    port:
                      number: 11434

Warning: Ollama has no built-in authentication. I expose it via Ingress only behind an auth proxy (oauth2-proxy, Authelia) or an internal VPN.

Tip: The proxy-body-size: "512m" annotation is critical if you plan to upload custom models via the API. Ollama's default push endpoint can send large blobs.
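As one possible way to act on the warning, ingress-nginx supports external authentication through its auth-url and auth-signin annotations. The sketch below assumes an oauth2-proxy instance is already deployed and reachable under /oauth2 on the same host as the Ollama Ingress; that setup is not covered in this series. The function name protect_ollama_ingress is hypothetical, and $host and $escaped_request_uri are NGINX variables that must stay unexpanded.

```shell
#!/bin/sh
# Hypothetical helper: add ingress-nginx external-auth annotations to the
# "ollama" Ingress. ASSUMPTION: oauth2-proxy is served at /oauth2 on the same
# host. Single quotes keep $host and $escaped_request_uri literal for NGINX.
protect_ollama_ingress() {
  kubectl annotate ingress ollama -n ollama --overwrite \
    'nginx.ingress.kubernetes.io/auth-url=https://$host/oauth2/auth' \
    'nginx.ingress.kubernetes.io/auth-signin=https://$host/oauth2/start?rd=$escaped_request_uri'
}
```

Run protect_ollama_ingress once after applying the Ingress, then confirm that unauthenticated requests are redirected to the sign-in page before sharing the hostname.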
Apply and verify:

    kubectl apply -f 03-service-ingress.yaml
    kubectl get ingress -n ollama

Step 5: Verification: Pull a Model and Run Inference
With the pod running, pull a model and test local LLM inference to confirm GPU acceleration is working.
Exec into the pod to pull your first model:

    kubectl exec -it deployment/ollama -n ollama -- ollama pull llama3.2

Expected output:

    pulling manifest
    pulling dde5aa3fc5ff... 100% ██████████████████ 2.0 GB
    verifying sha256 digest
    writing manifest
    success

Test inference via port-forward:

    kubectl port-forward svc/ollama 11434:11434 -n ollama

In another terminal:

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2",
      "prompt": "Why is Kubernetes the right platform for self-hosted AI?",
      "stream": false
    }'

You should receive a JSON response with generated text. If you see {"done":true,"response":"..."}, your local LLM inference stack is live.
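The manual check above can be wrapped in a small script for CI or post-deploy hooks. This is a minimal sketch, assuming the port-forward is still running on localhost:11434. The function name ollama_smoke_test and the test prompt are hypothetical.

```shell
#!/bin/sh
# Hypothetical smoke test: POST a short prompt and succeed only if the reply
# reports "done": true. ASSUMES the port-forward on localhost:11434 is active.
ollama_smoke_test() {
  model=${1:-llama3.2}
  reply=$(curl -sS --max-time 120 http://localhost:11434/api/generate \
    -d "{\"model\": \"${model}\", \"prompt\": \"Say hello.\", \"stream\": false}") || return 1
  # Exit status of grep becomes the function's exit status.
  printf '%s\n' "$reply" | grep -Eq '"done":[[:space:]]*true'
}
```

Usage: `ollama_smoke_test llama3.2 && echo "inference OK"`. An error reply such as a missing model fails the check because it carries no done field.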
For programmatic access, verify the OpenAI-compatible endpoint:

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

FAQ
Why use strategy: Recreate instead of RollingUpdate?
A ReadWriteOnce volume can be mounted by only one node at a time. A rolling update creates the new pod before terminating the old one, so a new pod scheduled onto a different node fails to mount the PVC. Recreate terminates the old pod first, then starts the new one.
What do the liveness and readiness probes check?
Both probes hit the /api/tags endpoint. The liveness probe (every 30s after a 60s delay) restarts the container if Ollama stops responding. The readiness probe (every 10s after a 10s delay) keeps the pod out of the Service endpoints until Ollama is ready to serve requests. I've tuned these delays to account for model loading time.
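To see what those probe settings mean in practice, it helps to estimate how long a hung server can go unnoticed. The sketch below uses a rough upper bound: failureThreshold probes spaced periodSeconds apart, plus one timeout. The helper name probe_detection_window is hypothetical, and real kubelet timing varies slightly.

```shell
#!/bin/sh
# Rough upper bound, in seconds, on how long a continuously failing probe
# takes to trip: failureThreshold probes spaced periodSeconds apart, plus one
# timeout. An approximation, not an exact kubelet guarantee.
probe_detection_window() {
  period=$1
  timeout=$2
  threshold=$3
  echo $(( period * threshold + timeout ))
}

probe_detection_window 30 10 3   # liveness values from the manifest: prints 100
probe_detection_window 10 5 3    # readiness values from the manifest: prints 35
```

With the manifest's values, a hung Ollama process is pulled from the Service in roughly half a minute and restarted within about 100 seconds.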
How do I verify GPU acceleration is actually working?
Run kubectl exec -it deployment/ollama -n ollama -- nvidia-smi. If you see GPU processes listed with Ollama's PID, the GPU is being used. During inference, GPU utilization should spike to 90-100%.
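A complementary check is `ollama ps`, which lists loaded models with a PROCESSOR column showing how each one is split between CPU and GPU. The sketch below assumes a model is currently loaded (run an inference request first). The function name model_on_gpu is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed only if a loaded model is reported as fully on
# the GPU. ASSUMES a model is loaded; `ollama ps` lists nothing otherwise.
model_on_gpu() {
  kubectl exec deployment/ollama -n ollama -- ollama ps | grep -q '100% GPU'
}
```

Usage: `model_on_gpu && echo "GPU in use" || echo "model is on CPU or split"`. A split such as a CPU/GPU percentage pair in the PROCESSOR column usually means the model does not fit in VRAM.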
What does OLLAMA_KEEP_ALIVE: "30m" do?
It keeps the loaded model in VRAM for 30 minutes after the last request. Ollama's default is 5 minutes, so on a lightly used endpoint any request that arrives after the model has been unloaded pays 5-30 seconds of cold-start latency. For chat applications, I recommend 30m. For batch processing, set it higher or to -1 (never unload).
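The same setting can also be passed per request through the keep_alive field of the generate API, and a request without a prompt simply loads the model. The sketch below builds such a payload. The helper name keep_alive_payload is hypothetical, and it assumes the model name contains no characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: build a prompt-less /api/generate payload that loads a
# model and sets its keep_alive. Bare integers such as -1 are emitted as JSON
# numbers; duration strings such as 10m are quoted.
keep_alive_payload() {
  model=$1
  keep_alive=$2
  case $keep_alive in
    ''|*[!0-9-]*) keep_alive="\"${keep_alive}\"" ;;  # not a bare integer: quote it
  esac
  printf '{"model": "%s", "keep_alive": %s}\n' "$model" "$keep_alive"
}

keep_alive_payload llama3.2 -1    # prints {"model": "llama3.2", "keep_alive": -1}
keep_alive_payload llama3.2 10m   # prints {"model": "llama3.2", "keep_alive": "10m"}
```

Usage with the port-forward from Step 5: `curl http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 -1)"` preloads the model and pins it in memory.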
Can I run multiple models simultaneously with this setup?
Yes, but only up to OLLAMA_MAX_LOADED_MODELS: "2". Each loaded model consumes VRAM. On a single A100 with 80 GB, you can run two 7B models simultaneously. For production multi-model serving, I recommend the per-model Deployment pattern in Part 4.
Next Steps
Your Ollama instance is running on Kubernetes with GPU acceleration, persistent storage, and ingress exposure. In Part 4, I cover production hardening: HPA alternatives, NetworkPolicies, monitoring, troubleshooting common errors, and when NOT to use Kubernetes for Ollama.
Parts in this series: ← Part 2 | Part 4 →