Ollama Kubernetes Security Hardening Best Practices

Part 4 of 4

Production Considerations

Running Ollama in production requires hardening beyond the basic deployment. Here’s what I’ve learned from running Ollama across six production clusters.

Resource Limits and OOM Prevention

Ollama’s memory footprint spikes when loading models. A 70B parameter model quantized to Q4 needs ~40 GB of system RAM to load, plus VRAM for active layers. Set limits.memory to at least 1.5x your largest model’s RAM requirement to avoid OOMKilled interruptions, then check actual usage against that limit:

kubectl top pod -n ollama
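As a worked example of the 1.5x rule, here is a minimal sketch of the container resources for the 70B Q4 case. The request value is my assumption, not a measured figure; tune it to what kubectl top reports on your cluster.

```yaml
# Fragment of the container spec. Worked example of the 1.5x rule:
#   ~40 GB to load a 70B Q4 model x 1.5 = 60 GB, written here as 60Gi.
resources:
  requests:
    memory: "48Gi"        # assumption: request sits above the ~40 GB load footprint
  limits:
    memory: "60Gi"        # 1.5 x 40
    nvidia.com/gpu: "1"
```

Since 60Gi is slightly more than 60 GB, the limit errs on the safe side of the rule.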

Horizontal Pod Autoscaling (HPA)

HPA is incompatible with this Deployment architecture. A ReadWriteOnce PVC can be attached to only one node at a time, and the Recreate strategy assumes a single pod owns the volume. Any extra pods that HPA schedules onto other nodes stay stuck in ContainerCreating because they cannot mount the volume.

For multi-replica scaling, use one of these alternatives:

RWX storage: Switch to ReadWriteMany (NFS/EFS) and remove strategy: Recreate
One model per Deployment: Deploy separate Deployments per model, each with 1 replica
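For the RWX option, a minimal sketch of the claim might look like this. The PVC name and the storageClassName are hypothetical; substitute whichever RWX-capable class (NFS, EFS, or similar) your cluster provides.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-rwx        # hypothetical name
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany              # lets pods on different nodes mount the same volume
  storageClassName: nfs-client   # hypothetical; must be an RWX-capable class
  resources:
    requests:
      storage: 100Gi
```

Every replica then shares one model cache, so a model pulled by one pod is visible to the others.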

Scaled Deployment pattern (one model per Deployment):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama3
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-llama3
  template:
    metadata:
      labels:
        app: ollama-llama3
    spec:
      nodeSelector:
        gpu-type: nvidia
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          env:
            - name: OLLAMA_KEEP_ALIVE
              value: "30m"
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "64Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 60
      volumes:
        - name: models
          emptyDir: {}  # Or use a dedicated PVC per model
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-llama3
  namespace: ollama
spec:
  selector:
    app: ollama-llama3
  ports:
    - port: 11434
      targetPort: 11434

This pattern isolates models, avoids PVC conflicts, and lets you scale each model independently.

Network Policies

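Restrict ingress to Ollama from trusted namespaces only. The namespaceSelector below matches a name: api-gateway label, which you must add to the gateway namespace yourself; Kubernetes sets only kubernetes.io/metadata.name automatically. The podSelector must also match your pod labels, so adjust it if you use the per-model Deployments above, which are labeled app: ollama-llama3: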

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-netpol
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway
      ports:
        - protocol: TCP
          port: 11434
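If you’d rather not maintain a custom namespace label, a variant of the same policy can select on kubernetes.io/metadata.name, which Kubernetes 1.22 and later set on every namespace. This is a sketch; the policy name is hypothetical, and it assumes your gateway namespace is literally called api-gateway.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-netpol-by-ns-name   # hypothetical name
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Set automatically by Kubernetes; the value is the namespace name.
              kubernetes.io/metadata.name: api-gateway
      ports:
        - protocol: TCP
          port: 11434
```

Either variant only takes effect if your CNI plugin enforces NetworkPolicy.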

Monitoring Basics

Monitor these key signals:

Metric	Alert Threshold	Why It Matters
`container_memory_working_set_bytes`	> 85% of limit	Prevents OOMKilled
`container_gpu_memory_usage`	> 90% of GPU VRAM	Prevents model load failures
`kube_pod_status_ready`	`0` for > 2 min	Pod health

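Add GPU utilization dashboards with the NVIDIA DCGM exporter, which reports framebuffer (VRAM) usage as DCGM_FI_DEV_FB_USED and utilization as DCGM_FI_DEV_GPU_UTIL. GPU metric names depend on the exporter you run, so confirm the name in the table against your own Prometheus.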
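The thresholds in the table could be expressed as alert rules along these lines. This is a sketch, not a drop-in config: it assumes the Prometheus Operator (for the PrometheusRule kind), cAdvisor and kube-state-metrics for the container and pod metrics, and the DCGM exporter’s DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE for VRAM. The rule name, alert names, severities, and the 5m durations are my own placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts            # hypothetical name
  namespace: ollama
spec:
  groups:
    - name: ollama
      rules:
        # Working set above 85% of the container memory limit.
        - alert: OllamaMemoryNearLimit
          expr: |
            container_memory_working_set_bytes{namespace="ollama", container="ollama"}
              / container_spec_memory_limit_bytes{namespace="ollama", container="ollama"}
              > 0.85
          for: 5m                # assumption
          labels:
            severity: warning
        # VRAM above 90%. Not scoped to Ollama pods: it fires for any GPU.
        - alert: OllamaGpuMemoryHigh
          expr: |
            DCGM_FI_DEV_FB_USED
              / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
              > 0.90
          for: 5m                # assumption
          labels:
            severity: warning
        # Pod not Ready for more than 2 minutes.
        - alert: OllamaPodNotReady
          expr: kube_pod_status_ready{namespace="ollama", condition="true"} == 0
          for: 2m
          labels:
            severity: critical
```

Many Prometheus Operator installs only load rules that carry a specific label, so check your Prometheus ruleSelector if the alerts never appear.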

Ollama K8s vs. Bare Metal: Why Use Kubernetes?

Feature	Bare Metal / VM	Kubernetes (This Guide)
Scheduling	Manual	Automatic GPU node affinity
Resource isolation	cgroups/systemd	Container limits + quotas
Storage	Local disk	PVC with snapshot/backup support
Scaling	Manual VM resize	HPA (needs RWX or per-pod storage) + cluster autoscaler
Service discovery	Hardcoded IPs	DNS + Ingress
Network security	Firewall rules	NetworkPolicy + TLS termination
Monitoring	Custom scripts	Prometheus + Grafana ecosystem
Multi-model serving	Single instance	Multiple Deployments per model

For teams already on Kubernetes, this approach reduces toil by fitting AI inference into existing platform patterns. Self-hosted AI becomes just another workload in your cluster, governed by the same RBAC, monitoring, and backup policies.

When NOT to Deploy Ollama on Kubernetes

Kubernetes adds operational complexity that isn’t always justified.

Reconsider this approach if:

You only need a single model on one GPU: A standalone VM or Docker Compose setup is simpler.
Your team lacks Kubernetes expertise: Debugging GPU scheduling failures, PVC binding issues, and ingress misconfigurations requires platform engineering skills.
Latency is your absolute top priority: The container runtime and Kubernetes networking layers add microseconds of overhead.
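You need multi-GPU tensor parallelism: Ollama can spread a model’s layers across several GPUs when it doesn’t fit on one, but it doesn’t run tensor-parallel inference, so extra GPUs add capacity rather than speed. For that, use vLLM for production inference or TensorRT-LLM.
Your cluster doesn’t have GPU nodes: CPU inference is 10-50x slower than GPU inference, and running it on Kubernetes does nothing to close that gap.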

Troubleshooting Common Issues

Error: “GPU not detected” or CPU fallback

Symptom: Ollama falls back to CPU. Inference is extremely slow.

Diagnosis:

kubectl exec -it deployment/ollama -n ollama -- nvidia-smi

Solution: Confirm the device plugin is running (kubectl get daemonset -n gpu-operator), check your runtime class, and verify the pod requests nvidia.com/gpu: "1". For a complete diagnostic walkthrough, see how to fix GPU not detected in Kubernetes.
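The two pod-level settings from that checklist sit in the Deployment roughly like this. It’s a fragment, and the runtime class name nvidia is an assumption: it’s a common default with the GPU Operator, but run kubectl get runtimeclass to see what your cluster actually defines.

```yaml
# Fragment of the Deployment: only the GPU-related fields are shown.
spec:
  template:
    spec:
      runtimeClassName: nvidia       # assumption; must match an existing RuntimeClass
      containers:
        - name: ollama
          resources:
            limits:
              nvidia.com/gpu: "1"    # without this, no GPU is exposed to the pod
```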

Error: “connection reset” or model download hangs at 0%

Symptom: ollama pull stalls at 0% or errors with connection reset.

Solution: Ensure the PVC has enough free space (~40 GB for a 70B model). Check whether a proxy blocks ollama.com. OLLAMA_MAX_LOADED_MODELS only limits how many models stay loaded in memory, so changing it will not fix a stalled download. Verify your StorageClass supports dynamic provisioning and that the PVC is Bound:

kubectl get storageclass
kubectl get pvc ollama-models -n ollama
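If outbound traffic has to go through a corporate proxy, Ollama honors the standard HTTPS_PROXY environment variable for pulls. A minimal sketch of the container env follows; the proxy URL is hypothetical.

```yaml
# Fragment of the container spec.
env:
  - name: HTTPS_PROXY
    value: "https://proxy.example.com:3128"   # hypothetical proxy address
```

If the proxy intercepts TLS, the container also needs to trust the proxy’s CA certificate, which this fragment does not cover.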

Error: “OOMKilled” during model load

Symptom: Pod status shows OOMKilled after pulling a large model.

Solution: Increase limits.memory in the Deployment to at least 1.5x the model’s RAM footprint, use a smaller quantization (Q4 instead of Q8), or reduce OLLAMA_NUM_PARALLEL, which shrinks the memory reserved for concurrent request contexts.
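Put together, a conservative starting point for a memory-constrained pod might look like the fragment below. The specific values are assumptions for a single 70B Q4 model; raise them once the pod is stable.

```yaml
# Fragment of the container spec.
env:
  - name: OLLAMA_NUM_PARALLEL
    value: "1"          # one request at a time keeps context memory small
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "1"          # assumption: only one model resident in memory
resources:
  limits:
    memory: "60Gi"      # ~40 GB load footprint x 1.5
    nvidia.com/gpu: "1"
```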

Error: “ReadWriteOnce volume already mounted”

Symptom: New pod stays in ContainerCreating with a PVC mount error.

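Solution: The old pod hasn’t fully terminated. Check kubectl get pods -n ollama and wait for the previous pod, shown as Terminating, to disappear from the list. If it stays stuck, force-delete it: kubectl delete pod <pod-name> -n ollama --force --grace-period=0. Force deletion removes the pod object without waiting for the kubelet to confirm the container stopped, so use it only when the pod is clearly hung.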

Error: “Ingress returns 502 Bad Gateway”

Symptom: Requests to the Ollama ingress return 502.

Solution: Verify the Service selector matches the pod labels, check that the pod is Ready (kubectl get pods -n ollama), and confirm the ingress controller is running.
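A readiness probe helps here, because the Service only routes to pods that report Ready. If your Deployment from the earlier parts doesn’t already define one, a minimal sketch could look like this. Ollama answers a plain GET on / once the server is up; the timing values are assumptions.

```yaml
# Fragment of the container spec.
readinessProbe:
  httpGet:
    path: /             # returns 200 when the Ollama server is listening
    port: 11434
  initialDelaySeconds: 5    # assumption
  periodSeconds: 10         # assumption
```

Note that this only confirms the server is listening, not that a model has finished loading.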

For deeper troubleshooting, see how to fix OOMKilled errors in GPU AI workloads.

FAQ

Can I deploy Ollama on Kubernetes without a GPU?

Yes, but I don’t recommend it for production workloads. Omit the nvidia.com/gpu resource request for CPU inference. A 7B model runs 10-20x slower on CPU. CPU-only deployment is viable for development or running 1-3B parameter models. For production self-hosted AI, GPU acceleration is essential.

How much storage do I need for Ollama models in Kubernetes?

Start with a 100 GB PVC on SSD-backed storage. Model sizes vary by parameters and quantization:

Model	Quantization	Disk Size
Llama 3.2 (3B)	Q4_0	~2.0 GB
Llama 3.1 (8B)	Q4_0	~4.7 GB
Llama 3.1 (70B)	Q4_0	~39 GB
Mixtral 8x7B	Q4_K_M	~31 GB

Model loading from spinning disk is painfully slow. Always use SSD-backed StorageClasses for Ollama K8s deployments.
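As a sizing sketch: the four models in the table add up to about 76.7 GB (2.0 + 4.7 + 39 + 31), so they fit in the 100 GB starting point with some headroom. The claim below uses the ollama-models PVC name from the troubleshooting section; the storageClassName is hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # hypothetical; use your SSD-backed class
  resources:
    requests:
      storage: 100Gi           # ~76.7 GB for the four models above, plus headroom
```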

What is the difference between Ollama and vLLM for Kubernetes deployments?

Ollama optimizes for developer experience and multi-model management, while vLLM optimizes for throughput with PagedAttention and continuous batching. Choose Ollama for flexibility and ease of use; choose vLLM for high-QPS single-model serving. See my Ollama vs vLLM comparison for a detailed breakdown.

How do I update Ollama without losing downloaded models?

Models persist on the PVC at /root/.ollama. Change the image tag and reapply the Deployment. Kubernetes recreates the pod, mounts the existing PVC, and all models remain intact. Use kubectl rollout restart deployment/ollama -n ollama to restart without editing manifests.

Is Ollama suitable for production multi-tenant AI platforms?

Ollama works well for internal teams and moderate-traffic APIs, but lacks native multi-tenancy features. For a true multi-tenant self-hosted AI platform with per-user rate limiting, quota enforcement, and request isolation, add an API gateway like Kong or Envoy in front of Ollama, or evaluate vLLM for production inference.

What Kubernetes storage class works best for Ollama model caching?

Use a locally attached SSD StorageClass (e.g., local-ssd, fast-ssd, or a topology-aware CSI driver). Network-attached storage (NFS, EFS) slows model loading and lengthens cold starts, which is the price of the ReadWriteMany option described earlier. NVMe local volumes are ideal for Ollama on Kubernetes workloads.
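For statically provisioned local disks, a StorageClass along these lines is the usual starting point. It’s a sketch that assumes you create the local PersistentVolumes yourself, since the no-provisioner class does not create volumes dynamically; the class name is just one of the examples above.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner   # local volumes are pre-created, not dynamic
# Delay binding until a pod is scheduled, so the PVC binds to a volume
# on the same node as the GPU the pod landed on.
volumeBindingMode: WaitForFirstConsumer
```

The trade-off is that the pod becomes pinned to the node holding its volume.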

Conclusion

You now know how to deploy Ollama on Kubernetes with GPU scheduling, persistent model storage, and ingress exposure. This architecture serves as the foundation for a broader self-hosted AI platform that keeps your data private, reduces API costs, and integrates with your existing Kubernetes tooling.

Natural next steps:

Add a chat interface: Deploy OpenWebUI as your AI chat interface to give your team a ChatGPT-like experience backed by your own infrastructure.
Scale inference performance: For high-throughput production APIs, consider deploying vLLM for production inference.
Harden the stack: Read my securing self-hosted LLM infrastructure guide to add authentication, TLS, and audit logging.
Automate deployments: Explore GitOps patterns for AI infrastructure to manage Ollama manifests with ArgoCD or Flux.

Questions or edge cases I didn’t cover? Drop them in the comments. I test every suggestion on a live cluster before updating the guide.
