Ollama Kubernetes Security Hardening Best Practices

2026.04.04

Part 4 of 4: Part 1 | Part 2 | Part 3 | Part 4

Production Considerations

Running Ollama in production requires hardening beyond the basic deployment. Here’s what I’ve learned from running Ollama across six production clusters.

Resource Limits and OOM Prevention

Ollama’s memory footprint spikes when loading models. A 70B parameter model quantized to Q4 needs ~40 GB of system RAM to load, plus VRAM for active layers. Set limits.memory to at least 1.5x your largest model’s RAM requirement to avoid OOMKilled interruptions; for the 70B example that means at least 60 GB. Compare the limit against actual usage:

kubectl top pod -n ollama
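
As a rough sketch of the 1.5x rule, the fragment below sizes the container for the 70B Q4 example (40 GB x 1.5 = 60 GB, rounded up to 64Gi). Setting the memory request equal to the limit is my assumption rather than something the earlier manifests require; it keeps the scheduler from placing the pod on a node that cannot actually hold the model.

```yaml
# Fragment of the Deployment's container spec (spec.template.spec.containers[0]).
# Values are illustrative for a 70B Q4 model; adjust to your largest model.
resources:
  requests:
    memory: "64Gi"        # assumption: request == limit so the node really has the RAM
    nvidia.com/gpu: "1"
  limits:
    memory: "64Gi"        # ~40 GB model footprint x 1.5, rounded up
    nvidia.com/gpu: "1"   # extended resources must have request == limit
```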

Horizontal Pod Autoscaling (HPA)

HPA is incompatible with this Deployment architecture. The ReadWriteOnce PVC can only be attached to one node at a time, and the Recreate strategy assumes a single pod owns the volume. Any extra pods that HPA creates on other nodes fail to mount the volume and stay stuck in ContainerCreating.

For multi-replica scaling, use one of these alternatives:

  1. RWX storage: Switch to ReadWriteMany (NFS/EFS) and remove strategy: Recreate
  2. One model per Deployment: Deploy separate Deployments per model, each with 1 replica
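
For option 1, a shared claim might look like the sketch below. The claim name ollama-models matches the PVC used elsewhere in this guide; the storageClassName efs-sc is a hypothetical placeholder for whatever RWX-capable class your cluster offers. Access modes cannot be changed on an existing PVC, so this means creating a new claim and re-pulling or copying the models.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany          # lets pods on different nodes mount the same volume
  storageClassName: efs-sc   # hypothetical name; use an RWX-capable class in your cluster
  resources:
    requests:
      storage: 100Gi
```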

Scaled Deployment pattern (one model per Deployment):

04-ollama-llama3-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama3
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-llama3
  template:
    metadata:
      labels:
        app: ollama-llama3
    spec:
      nodeSelector:
        gpu-type: nvidia
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          env:
            - name: OLLAMA_KEEP_ALIVE
              value: "30m"
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "64Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 60
      volumes:
        - name: models
          emptyDir: {} # Or use a dedicated PVC per model
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-llama3
  namespace: ollama
spec:
  selector:
    app: ollama-llama3
  ports:
    - port: 11434
      targetPort: 11434

This pattern isolates models, avoids PVC conflicts, and lets you scale each model independently.

Network Policies

Restrict ingress to Ollama from trusted namespaces only:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-netpol
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway
      ports:
        - protocol: TCP
          port: 11434
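
Two caveats on the policy above. The namespaceSelector only matches if the gateway namespace actually carries a name: api-gateway label; Kubernetes sets kubernetes.io/metadata.name automatically, but not name. The policy also only covers pods labeled app.kubernetes.io/name: ollama, so per-model Deployments with other labels are not protected by it. One way to close that gap, sketched here under the assumption that nothing else in the namespace needs open ingress, is a namespace-wide default deny that the allow rule then punches through:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: ollama
spec:
  podSelector: {}              # empty selector = every pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rules listed, so all ingress is denied
```

NetworkPolicies are additive, so pods matched by both policies still accept traffic from the gateway namespace on port 11434. Enforcement depends on a CNI plugin that implements NetworkPolicy.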

Monitoring Basics

Monitor these key signals:

| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| container_memory_working_set_bytes | > 85% of limit | Prevents OOMKilled |
| container_gpu_memory_usage | > 90% of GPU VRAM | Prevents model load failures |
| kube_pod_status_ready | 0 for > 2 min | Pod health |

Add GPU utilization dashboards with the NVIDIA DCGM exporter.
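
The memory and readiness signals could be expressed as alert rules along these lines. This is a sketch that assumes the Prometheus Operator (for the PrometheusRule resource) and kube-state-metrics (for kube_pod_container_resource_limits and kube_pod_status_ready) are installed; the rule names, the 5m hold on the memory alert, and the severity labels are my own placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts          # hypothetical name
  namespace: ollama
spec:
  groups:
    - name: ollama
      rules:
        - alert: OllamaMemoryNearLimit
          # Working set divided by the container memory limit, per pod.
          # max by (...) collapses duplicate series so both sides share the same labels.
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{namespace="ollama", container="ollama"}
            )
            /
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{namespace="ollama", container="ollama", resource="memory"}
            )
            > 0.85
          for: 5m              # assumption: ignore short spikes during model load
          labels:
            severity: warning
        - alert: OllamaPodNotReady
          expr: kube_pod_status_ready{namespace="ollama", condition="true"} == 0
          for: 2m              # matches the "0 for > 2 min" threshold above
          labels:
            severity: critical
```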

Ollama K8s vs. Bare Metal: Why Use Kubernetes?

| Feature | Bare Metal / VM | Kubernetes (This Guide) |
|---|---|---|
| Scheduling | Manual | Automatic GPU node affinity |
| Resource isolation | cgroups/systemd | Container limits + quotas |
| Storage | Local disk | PVC with snapshot/backup support |
| Scaling | Manual VM resize | HPA + cluster autoscaler (with RWX storage or per-model Deployments) |
| Service discovery | Hardcoded IPs | DNS + Ingress |
| Network security | Firewall rules | NetworkPolicy + TLS termination |
| Monitoring | Custom scripts | Prometheus + Grafana ecosystem |
| Multi-model serving | Single instance | Multiple Deployments per model |

For teams already on Kubernetes, this approach reduces toil by fitting AI inference into existing platform patterns. Self-hosted AI becomes just another workload in your cluster, governed by the same RBAC, monitoring, and backup policies.

When NOT to Deploy Ollama on Kubernetes

Kubernetes adds operational complexity that isn’t always justified.

Reconsider this approach if:

  • You only need a single model on one GPU: A standalone VM or Docker Compose setup is simpler.
  • Your team lacks Kubernetes expertise: Debugging GPU scheduling failures, PVC binding issues, and ingress misconfigurations requires platform engineering skills.
  • Latency is your absolute top priority: The container runtime and Kubernetes networking layers add microseconds of overhead.
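  • You need tensor-parallel multi-GPU inference: Ollama can spread a model’s layers across several GPUs when it doesn’t fit on one, but it doesn’t do tensor parallelism, so extra GPUs add capacity rather than speed. For that, use vLLM for production inference or TensorRT-LLM.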
  • Your cluster doesn’t have GPU nodes: CPU inference is 10-50x slower than GPU inference, and Kubernetes does nothing to close that gap.

Troubleshooting Common Issues

Error: “GPU not detected” or CPU fallback

Symptom: Ollama falls back to CPU. Inference is extremely slow.

Diagnosis:

kubectl exec -it deployment/ollama -n ollama -- nvidia-smi

Solution: Confirm the device plugin is running (kubectl get daemonset -n gpu-operator), check your runtime class, and verify the pod requests nvidia.com/gpu: "1". For a complete diagnostic walkthrough, see how to fix GPU not detected in Kubernetes.

Error: “connection reset” or model download hangs at 0%

Symptom: ollama pull stalls at 0% or errors with connection reset.

Solution: Ensure the PVC has enough free space (~40 GB for a 70B model) and check whether a proxy blocks ollama.com. Note that OLLAMA_MAX_LOADED_MODELS only limits how many models stay loaded in memory at once, so changing it does not affect downloads. Verify that your StorageClass supports dynamic provisioning and that the PVC is Bound:

kubectl get storageclass
kubectl get pvc ollama-models -n ollama

Error: “OOMKilled” during model load

Symptom: Pod status shows OOMKilled after pulling a large model.

Solution: Increase limits.memory in the Deployment to at least 1.5x the model’s RAM footprint, use a smaller quantization (Q4 instead of Q8), or reduce OLLAMA_NUM_PARALLEL, which shrinks the context memory Ollama reserves for concurrent requests.
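
A minimal sketch of the environment-variable side of that fix, assuming the container is the ollama container from the Deployment in this guide. Both variables are standard Ollama settings; the values of "1" are a conservative starting point I chose for illustration, not a recommendation from the earlier parts.

```yaml
# Fragment of the container spec (spec.template.spec.containers[0]).
env:
  - name: OLLAMA_NUM_PARALLEL
    value: "1"    # one request per model at a time; context memory scales with this value
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "1"    # keep a single model resident so two large models never load together
```

Raise either value only after kubectl top pod shows comfortable headroom under real traffic.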

Error: “ReadWriteOnce volume already mounted”

Symptom: New pod stays in ContainerCreating with a PVC mount error.

Solution: The old pod hasn’t fully terminated. Run kubectl get pods -n ollama and wait for the previous pod to leave the Terminating state and disappear from the list. If it stays stuck, force-delete it as a last resort: kubectl delete pod <pod-name> -n ollama --force.

Error: “Ingress returns 502 Bad Gateway”

Symptom: Requests to the Ollama ingress return 502.

Solution: Verify the Service selector matches the pod labels, check that the pod is Ready (kubectl get pods -n ollama), and confirm the ingress controller is running.
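
A pod that reports Ready before Ollama is listening is a common source of 502s. If your Deployment does not already define one, a readiness probe along these lines keeps the pod out of the Service endpoints until the API responds. It relies on Ollama answering GET / on port 11434 with HTTP 200; the timing values are assumptions to tune for your model load times.

```yaml
# Fragment of the container spec (spec.template.spec.containers[0]).
ports:
  - containerPort: 11434
readinessProbe:
  httpGet:
    path: /                # Ollama's root endpoint returns 200 once the server is up
    port: 11434
  initialDelaySeconds: 5   # assumption: the server starts within a few seconds
  periodSeconds: 10
  failureThreshold: 3      # marked NotReady after ~30s of failed checks
```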

For deeper troubleshooting, see how to fix OOMKilled errors in GPU AI workloads.

FAQ

Can I deploy Ollama on Kubernetes without a GPU?

Yes, but I don’t recommend it for production workloads. Omit the nvidia.com/gpu resource request for CPU inference. A 7B model runs 10-20x slower on CPU. CPU-only deployment is viable for development or running 1-3B parameter models. For production self-hosted AI, GPU acceleration is essential.

How much storage do I need for Ollama models in Kubernetes?

Start with a 100 GB PVC on SSD-backed storage. Model sizes vary by parameters and quantization:

| Model | Quantization | Disk Size |
|---|---|---|
| Llama 3.2 (3B) | Q4_0 | ~2.0 GB |
| Llama 3.1 (8B) | Q4_0 | ~4.7 GB |
| Llama 3.1 (70B) | Q4_0 | ~39 GB |
| Mixtral 8x7B | Q4_K_M | ~31 GB |

Model loading from spinning disk is painfully slow. Always use SSD-backed StorageClasses for Ollama K8s deployments.

What is the difference between Ollama and vLLM for Kubernetes deployments?

Ollama optimizes for developer experience and multi-model management, while vLLM optimizes for throughput with PagedAttention and continuous batching. Choose Ollama for flexibility and ease of use; choose vLLM for high-QPS single-model serving. See my Ollama vs vLLM comparison for a detailed breakdown.

How do I update Ollama without losing downloaded models?

Models persist on the PVC at /root/.ollama. Change the image tag and reapply the Deployment. Kubernetes recreates the pod, mounts the existing PVC, and all models remain intact. Use kubectl rollout restart deployment/ollama -n ollama to restart without editing manifests.

Is Ollama suitable for production multi-tenant AI platforms?

Ollama works well for internal teams and moderate-traffic APIs, but lacks native multi-tenancy features. For a true multi-tenant self-hosted AI platform with per-user rate limiting, quota enforcement, and request isolation, add an API gateway like Kong or Envoy in front of Ollama, or evaluate vLLM for production inference.

What Kubernetes storage class works best for Ollama model caching?

Use a locally attached SSD StorageClass (e.g., local-ssd, fast-ssd, or a topology-aware CSI driver). Network-attached storage (NFS, EFS) adds latency to every model load, which lengthens cold starts and model switches even though inference speed itself is unaffected once the model is in memory. NVMe local volumes are ideal for Ollama on Kubernetes workloads.
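
For statically provisioned local NVMe volumes, a StorageClass could be sketched as follows. The name local-ssd is taken from the example above; kubernetes.io/no-provisioner means the PersistentVolumes must be created by you or by a local volume provisioner, so this variant does not give you dynamic provisioning.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner   # local volumes are provisioned statically
volumeBindingMode: WaitForFirstConsumer     # bind only once the pod is scheduled to a node
```

WaitForFirstConsumer matters here because a local volume pins the pod to one node; delaying the binding lets the scheduler pick a node that has both the GPU and the disk.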

Conclusion

You now know how to deploy Ollama on Kubernetes with GPU scheduling, persistent model storage, and ingress exposure. This architecture serves as the foundation for a broader self-hosted AI platform that keeps your data private, reduces API costs, and integrates with your existing Kubernetes tooling.

Natural next steps:

  1. Add a chat interface: Deploy OpenWebUI as your AI chat interface to give your team a ChatGPT-like experience backed by your own infrastructure.
  2. Scale inference performance: For high-throughput production APIs, consider deploying vLLM for production inference.
  3. Harden the stack: Read my securing self-hosted LLM infrastructure guide to add authentication, TLS, and audit logging.
  4. Automate deployments: Explore GitOps patterns for AI infrastructure to manage Ollama manifests with ArgoCD or Flux.

Questions or edge cases I didn’t cover? Drop them in the comments. I test every suggestion on a live cluster before updating the guide.
