Ollama Kubernetes Security Hardening Best Practices
Part 4 of 4: Part 1 | Part 2 | Part 3 | Part 4
Production Considerations
Running Ollama in production requires hardening beyond the basic deployment. Here's what I've learned from running Ollama across six production clusters.
Resource Limits and OOM Prevention
Ollama's memory footprint spikes when loading models. A 70B-parameter model quantized to Q4 needs ~40 GB of system RAM to load, plus VRAM for active layers. I set limits.memory to at least 1.5x the largest model's RAM requirement to avoid OOMKilled interruptions.
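As a rough sketch of that sizing rule, using the ~40 GB figure above: 40 GB x 1.5 = 60 GB, which rounds up to a 64Gi limit. Setting the request equal to the limit is my assumption here, not a requirement; it makes the scheduler reserve the full amount on the node.

```yaml
# Container resources fragment for the Ollama Deployment (illustrative values).
resources:
  requests:
    memory: "64Gi"        # assumption: request == limit so the node reserves it all
  limits:
    memory: "64Gi"        # ~40 GB model load x 1.5 = 60 GB, rounded up
    nvidia.com/gpu: "1"
```

Compare the limit against live usage with kubectl top before settling on a number.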

    kubectl top pod -n ollama

Horizontal Pod Autoscaling (HPA)
HPA is incompatible with this Deployment architecture. The ReadWriteOnce PVC and Recreate strategy prevent horizontal scaling to multiple replicas. HPA creates additional pods that fail to mount the volume.
For multi-replica scaling, use one of these alternatives:
- RWX storage: Switch to ReadWriteMany (NFS/EFS) and remove strategy: Recreate
- One model per Deployment: Deploy separate Deployments per model, each with 1 replica
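For the RWX option, a minimal sketch of the claim might look like the following. The PVC name and the efs-sc StorageClass are hypothetical; substitute whatever RWX-capable class your cluster actually provides.

```yaml
# Hypothetical shared model volume for a multi-replica Ollama Deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-shared     # hypothetical name
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany              # pods on different nodes can mount the same volume
  storageClassName: efs-sc       # assumption: an RWX-capable class (NFS/EFS) with this name
  resources:
    requests:
      storage: 100Gi             # illustrative size
```

With a claim like this mounted, the Deployment can drop strategy: Recreate and raise replicas, at the cost of slower model loads from network storage.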
Scaled Deployment pattern (one model per Deployment):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama-llama3
      namespace: ollama
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: ollama-llama3
      template:
        metadata:
          labels:
            app: ollama-llama3
        spec:
          nodeSelector:
            gpu-type: nvidia
          containers:
            - name: ollama
              image: ollama/ollama:0.5.7
              env:
                - name: OLLAMA_KEEP_ALIVE
                  value: "30m"
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  memory: "64Gi"
              volumeMounts:
                - name: models
                  mountPath: /root/.ollama
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "sleep 15"]
          terminationGracePeriodSeconds: 60
          volumes:
            - name: models
              emptyDir: {}  # Or use a dedicated PVC per model
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ollama-llama3
      namespace: ollama
    spec:
      selector:
        app: ollama-llama3
      ports:
        - port: 11434
          targetPort: 11434

This pattern isolates models, avoids PVC conflicts, and lets you scale each model independently.
Network Policies
Restrict ingress to Ollama from trusted namespaces only:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: ollama-netpol
      namespace: ollama
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ollama
      policyTypes:
        - Ingress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: api-gateway
          ports:
            - protocol: TCP
              port: 11434

Monitoring Basics
Monitor these key signals:
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| container_memory_working_set_bytes | > 85% of limit | Prevents OOMKilled |
| container_gpu_memory_usage | > 90% of GPU VRAM | Prevents model load failures |
| kube_pod_status_ready | 0 for > 2 min | Pod health |
Add GPU utilization dashboards with the NVIDIA DCGM exporter.
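The memory and readiness thresholds from the table could be expressed as alert rules along these lines. This is a sketch that assumes the Prometheus Operator (for the PrometheusRule resource), kube-state-metrics, and a container named ollama. The rule and alert names, severities, and the 5m window are placeholders I chose. I left the GPU memory alert out because the metric name depends on which exporter you run.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts            # hypothetical name
  namespace: ollama
spec:
  groups:
    - name: ollama.rules
      rules:
        - alert: OllamaMemoryNearLimit
          # Working set divided by the container's memory limit.
          expr: |
            (
              container_memory_working_set_bytes{namespace="ollama", container="ollama"}
              / on (namespace, pod, container)
              kube_pod_container_resource_limits{namespace="ollama", container="ollama", resource="memory"}
            ) > 0.85
          for: 5m                # assumption: tune to taste
          labels:
            severity: warning
          annotations:
            summary: "Ollama memory working set is above 85% of its limit"
        - alert: OllamaPodNotReady
          expr: kube_pod_status_ready{namespace="ollama", condition="true"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Ollama pod has not been Ready for more than 2 minutes"
```

Depending on how your Prometheus instance selects rules, the resource may also need labels matching its ruleSelector.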
Ollama K8s vs. Bare Metal: Why Use Kubernetes?
| Feature | Bare Metal / VM | Kubernetes (This Guide) |
|---|---|---|
| Scheduling | Manual | Automatic GPU node affinity |
| Resource isolation | cgroups/systemd | Container limits + quotas |
| Storage | Local disk | PVC with snapshot/backup support |
| Scaling | Manual VM resize | Per-model Deployments + cluster autoscaler (HPA only with RWX storage) |
| Service discovery | Hardcoded IPs | DNS + Ingress |
| Network security | Firewall rules | NetworkPolicy + TLS termination |
| Monitoring | Custom scripts | Prometheus + Grafana ecosystem |
| Multi-model serving | Single instance | One Deployment per model |
For teams already on Kubernetes, this approach reduces toil by fitting AI inference into existing platform patterns. Self-hosted AI becomes just another workload in your cluster, governed by the same RBAC, monitoring, and backup policies.
When NOT to Deploy Ollama on Kubernetes
Kubernetes adds operational complexity that isn't always justified.
Reconsider this approach if:
- You only need a single model on one GPU: A standalone VM or Docker Compose setup is simpler.
- Your team lacks Kubernetes expertise: Debugging GPU scheduling failures, PVC binding issues, and ingress misconfigurations requires platform engineering skills.
- Latency is your absolute top priority: The container runtime and Kubernetes networking layers add microseconds of overhead.
- You need multi-GPU tensor parallelism: Ollama can spread a model's layers across several GPUs when the model doesn't fit on one, but it doesn't offer tensor-parallel serving. For that, use vLLM for production inference or TensorRT-LLM.
- Your cluster doesn't have GPU nodes: CPU inference is roughly 10-50x slower than GPU inference, and running it on Kubernetes doesn't change that.
Troubleshooting Common Issues
Error: "GPU not detected" or CPU fallback
Symptom: Ollama falls back to CPU. Inference is extremely slow.
Diagnosis:

    kubectl exec -it deployment/ollama -n ollama -- nvidia-smi

Solution: Confirm the device plugin is running (kubectl get daemonset -n gpu-operator), check your runtime class, and verify the pod requests nvidia.com/gpu: "1". For a complete diagnostic walkthrough, see how to fix GPU not detected in Kubernetes.
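The pod-side half of that checklist can be sketched as the fragment below. The RuntimeClass name nvidia is an assumption (it is a common default, but check kubectl get runtimeclass on your cluster); if the NVIDIA runtime is already the node's default container runtime, the runtimeClassName line can be omitted.

```yaml
# Pod template fragment: what the scheduler and runtime need to expose a GPU.
spec:
  runtimeClassName: nvidia       # assumption: a RuntimeClass with this name exists
  containers:
    - name: ollama
      image: ollama/ollama:0.5.7
      resources:
        limits:
          nvidia.com/gpu: "1"    # without this, no GPU device is attached to the container
```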
Error: "connection reset" or model download hangs at 0%
Symptom: ollama pull stalls at 0% or errors with connection reset.
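Solution: Ensure the PVC has enough free space (~40 GB for a 70B model) and check whether a proxy blocks ollama.com. Note that OLLAMA_MAX_LOADED_MODELS only limits how many models stay loaded in memory; it does not affect download timeouts, so changing it won't fix a stalled pull. Verify your StorageClass supports dynamic provisioning and that the PVC is bound: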

    kubectl get storageclass
    kubectl get pvc ollama-models -n ollama

Error: "OOMKilled" during model load
Symptom: Pod status shows OOMKilled after pulling a large model.
Solution: Increase limits.memory in the Deployment to at least 1.5x the model's RAM footprint, use a smaller quantization (Q4 instead of Q8), or reduce OLLAMA_NUM_PARALLEL, since the context memory each loaded model reserves grows with the number of parallel requests.
Error: "ReadWriteOnce volume already mounted"
Symptom: New pod stays in ContainerCreating with a PVC mount error.
Solution: The old pod hasn't fully terminated. Check kubectl get pods -n ollama and wait for the previous pod to finish terminating. If it stays stuck in Terminating, force-delete it: kubectl delete pod <pod-name> -n ollama --force --grace-period=0.
Error: "Ingress returns 502 Bad Gateway"
Symptom: Requests to the Ollama ingress return 502.
Solution: Verify the Service selector matches the pod labels, check that the pod is Ready (kubectl get pods -n ollama), and confirm the ingress controller is running.
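Whether the pod counts as Ready depends on a readiness probe, so a missing or wrong probe is worth ruling out. A minimal sketch, assuming Ollama listens on its default port 11434 and answers GET / with HTTP 200 once the server is up; the timing values are placeholders, not tuned numbers.

```yaml
# Container fragment: keep the pod out of Service endpoints until Ollama answers.
containers:
  - name: ollama
    image: ollama/ollama:0.5.7
    ports:
      - containerPort: 11434
    readinessProbe:
      httpGet:
        path: /                  # Ollama's root endpoint returns 200 when the server is running
        port: 11434
      initialDelaySeconds: 5     # assumption: adjust for slow node startup
      periodSeconds: 10
      failureThreshold: 3
```

Note that this only confirms the API server is up; it doesn't prove a model is loaded.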
For deeper troubleshooting, see how to fix OOMKilled errors in GPU AI workloads.
FAQ
Can I deploy Ollama on Kubernetes without a GPU?
Yes, but I don't recommend it for production workloads. Omit the nvidia.com/gpu resource request for CPU inference. A 7B model runs 10-20x slower on CPU. CPU-only deployment is viable for development or running 1-3B parameter models. For production self-hosted AI, GPU acceleration is essential.
How much storage do I need for Ollama models in Kubernetes?
Start with a 100 GB PVC on SSD-backed storage. Model sizes vary by parameters and quantization:
| Model | Quantization | Disk Size |
|---|---|---|
| Llama 3.2 (3B) | Q4_0 | ~2.0 GB |
| Llama 3.1 (8B) | Q4_0 | ~4.7 GB |
| Llama 3.1 (70B) | Q4_0 | ~39 GB |
| Mixtral 8x7B | Q4_K_M | ~31 GB |
Model loading from spinning disk is painfully slow. Always use SSD-backed StorageClasses for Ollama K8s deployments.
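A claim matching that starting point might look like this. The name ollama-models mirrors the PVC used in the troubleshooting commands above; the fast-ssd StorageClass name is an assumption and should be replaced with an SSD-backed class that exists in your cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce              # single-node mount, matching the 1-replica Deployment
  storageClassName: fast-ssd     # assumption: SSD-backed class available under this name
  resources:
    requests:
      storage: 100Gi             # room for a 70B Q4 model (~39 GB) plus several smaller ones
```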
What is the difference between Ollama and vLLM for Kubernetes deployments?
Ollama optimizes for developer experience and multi-model management, while vLLM optimizes for throughput with PagedAttention and continuous batching. Choose Ollama for flexibility and ease of use; choose vLLM for high-QPS single-model serving. See my Ollama vs vLLM comparison for a detailed breakdown.
How do I update Ollama without losing downloaded models?
Models persist on the PVC at /root/.ollama. Change the image tag and reapply the Deployment. Kubernetes recreates the pod, mounts the existing PVC, and all models remain intact. Use kubectl rollout restart deployment/ollama -n ollama to restart without editing manifests.
Is Ollama suitable for production multi-tenant AI platforms?
Ollama works well for internal teams and moderate-traffic APIs, but lacks native multi-tenancy features. For a true multi-tenant self-hosted AI platform with per-user rate limiting, quota enforcement, and request isolation, add an API gateway like Kong or Envoy in front of Ollama, or evaluate vLLM for production inference.
What Kubernetes storage class works best for Ollama model caching?
Use a locally-attached SSD StorageClass (e.g., local-ssd, fast-ssd, or a topologically-bound CSI driver). Network-attached storage (NFS, EFS) introduces latency during model loading that negates the benefits of GPU inference. NVMe local volumes are ideal for Ollama on Kubernetes workloads.
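One way to define such a class for statically provisioned local NVMe volumes is sketched below. The class name local-ssd is only an example, and this variant does not provision dynamically: the local PersistentVolumes have to be created separately, by hand or by a local-volume provisioner.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd                            # example name
provisioner: kubernetes.io/no-provisioner    # static local volumes; PVs are created separately
volumeBindingMode: WaitForFirstConsumer      # bind only once the pod is scheduled, so the PV is on the pod's node
```

The trade-off is that the pod becomes pinned to the node that holds the volume, so a node failure means re-pulling models elsewhere.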
Conclusion
You now know how to deploy Ollama on Kubernetes with GPU scheduling, persistent model storage, and ingress exposure. This architecture serves as the foundation for a broader self-hosted AI platform that keeps your data private, reduces API costs, and integrates with your existing Kubernetes tooling.
Natural next steps:
- Add a chat interface: Deploy OpenWebUI as your AI chat interface to give your team a ChatGPT-like experience backed by your own infrastructure.
- Scale inference performance: For high-throughput production APIs, consider deploying vLLM for production inference.
- Harden the stack: Read my securing self-hosted LLM infrastructure guide to add authentication, TLS, and audit logging.
- Automate deployments: Explore GitOps patterns for AI infrastructure to manage Ollama manifests with ArgoCD or Flux.
Questions or edge cases I didn't cover? Drop them in the comments. I test every suggestion on a live cluster before updating the guide.