How I Actually Deploy Ollama on Kubernetes (And the GPU Headaches I Fixed)
Running Ollama on Kubernetes is not hard. Keeping it useful at scale is where the opinions start. Getting it to see your GPU is where the weekend disappears.
I run Ollama on my homelab Kubernetes cluster. It is a three-node K3s setup. One node has a GTX 1080 for GPU workloads. Getting Ollama to actually use that GPU took longer than I expected.
This post is what I actually deployed and what actually broke.
Why Kubernetes for Ollama?
For a single user, a laptop or a small VM is enough. I use Kubernetes because I already use Kubernetes for everything else. My homelab runs my blog, my monitoring, my automation, and now my inference. Keeping it all in one place means I only have one place to break.
If you do not already run Kubernetes, this is probably overkill. A Docker container on a Linux box with a GPU is faster to set up and easier to debug.
The GPU Problem That Ate My Saturday
The most common failure I see is the pod landing on a GPU node but not seeing the GPU. This happened to me. The pod started, Ollama loaded, but it ran on CPU. A simple prompt took 30 seconds instead of 2. I thought Ollama was broken. It was not. The NVIDIA device plugin was not installed.
The fix is the NVIDIA GPU Operator. But installing it takes more than helm install. You need to make sure your nodes have the right kernel headers, that the container toolkit is configured, and that the NVIDIA runtime is registered with your container runtime and exposed through a runtime class. I missed the runtime class step. The pod scheduled to the GPU node but used the default runtime instead of the NVIDIA runtime. The GPU showed up in nvidia-smi on the host but not inside the container.
I fixed it by applying this RuntimeClass manifest to my K3s cluster:
```yaml
# /etc/rancher/k3s/registries.yaml does not help here
# You need the runtime class
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

Then I added runtimeClassName: nvidia to my Ollama pod spec. That was the missing piece. The GPU Operator docs mention this, but they assume you know what a runtime class is. I did not. I learned.
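A quick way to confirm the runtime class works before touching Ollama is a throwaway pod that only runs nvidia-smi. This is a minimal sketch, not something from my cluster: the pod name and the CUDA image tag are assumptions, and the image has to be compatible with the driver version on your host.

```yaml
# Hypothetical smoke-test pod. The name and image tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never          # run once, then stop
  runtimeClassName: nvidia      # the RuntimeClass defined above
  containers:
    - name: nvidia-smi
      # Assumed tag; pick a CUDA base image that matches your host driver.
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"   # forces scheduling onto a node that advertises a GPU
```

If the pod logs show the usual nvidia-smi table with the GPU listed, the runtime class and device plugin are both doing their job. If the pod stays Pending, the node is probably not advertising nvidia.com/gpu yet.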
The Deployment I Actually Run
Here is my actual Ollama Deployment. It is not theoretical. It is running right now on my homelab.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama
    spec:
      runtimeClassName: nvidia
      nodeSelector:
        gpu-type: nvidia
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_KEEP_ALIVE
              value: "30m"
            - name: OLLAMA_NUM_PARALLEL
              value: "4"
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "2"
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
```

Key choices that matter:
- Recreate strategy because the PVC is ReadWriteOnce. Two pods cannot mount it.
- runtimeClassName: nvidia because without this, the GPU is invisible.
- OLLAMA_KEEP_ALIVE: 30m because I got tired of waiting for model reload on every request.
- OLLAMA_MAX_LOADED_MODELS: 2 because my GTX 1080 can fit one 8B model comfortably, or two smaller ones.
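The Deployment mounts a claim named ollama-models and listens on port 11434, so it needs a PVC and, for anything else in the cluster to reach it, a Service. The sketch below shows what those two objects could look like. The storage class, the size, and the Service name are assumptions for illustration: local-path is the default K3s provisioner, but your cluster may use something else.

```yaml
# Sketch of the supporting objects. Storage class, size, and Service name are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models           # must match claimName in the Deployment
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce             # the reason the Deployment uses the Recreate strategy
  storageClassName: local-path  # assumption: the K3s default provisioner
  resources:
    requests:
      storage: 100Gi            # assumption: size it for the models you plan to keep
---
apiVersion: v1
kind: Service
metadata:
  name: ollama                  # assumed name
  namespace: ollama
spec:
  selector:
    app.kubernetes.io/name: ollama   # same label the Deployment puts on its pod
  ports:
    - name: http
      port: 11434
      targetPort: 11434
```

With local-path, the volume lives on one node's disk, which lines up with pinning the pod to the GPU node through the nodeSelector.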
What I Actually Watch
I track three things in Grafana:
- GPU utilization during inference. It should spike to near 100% when a request is active. If it stays at 0%, the model is running on CPU. I check this every time I deploy a new model.
- Memory working set against the limit. I had one OOMKill when I tried to load a 13B model. The pod restarted, the model was gone, and I had to pull it again. That took 20 minutes.
- API latency from the ingress. I expose Ollama through an internal ingress with basic auth. Latency should be under 2 seconds for a short prompt. If it is higher, I check if the model is still loaded or if the GPU is busy.
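The memory check in that list can also be turned into an alert so the OOMKill does not come as a surprise. This is a rough sketch that assumes the Prometheus Operator, cAdvisor metrics, and kube-state-metrics are all installed; the rule name, threshold, and duration are made up for illustration, and your Prometheus may need extra labels on the rule before it picks it up.

```yaml
# Hypothetical alert rule. Assumes Prometheus Operator, cAdvisor, and kube-state-metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-memory           # assumed name
  namespace: ollama
spec:
  groups:
    - name: ollama
      rules:
        - alert: OllamaMemoryNearLimit
          # Working set divided by the container memory limit, per pod.
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{namespace="ollama", container="ollama"}
            )
            /
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{namespace="ollama", container="ollama", resource="memory"}
            ) > 0.9
          for: 5m               # assumed: ignore short spikes during model load
          labels:
            severity: warning
          annotations:
            summary: Ollama working set is above 90% of its memory limit
```

The same ratio works as a Grafana panel query if you would rather look at it than be paged by it.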
What I Would Not Do (Because I Tried)
- Do not run Ollama on CPU and expect it to be useful. I tried this as a fallback. It works for tiny models. For anything useful, it is painfully slow.
- Do not expose Ollama directly to the internet. I use an internal ingress with basic auth. No external access. I do not trust the auth layer enough for public exposure.
- Do not expect horizontal scaling. The RWO PVC limits you to one pod. If you need multiple replicas, you need multiple PVCs or a different inference engine. I tried a StatefulSet with per-pod PVCs. It worked but was not worth the complexity for my use case.
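For reference, the per-pod PVC approach looks roughly like this. It is a stripped-down sketch rather than the exact manifest I tried: the replica count, storage size, and headless Service name are assumptions, and probes, env, and the nodeSelector are left out to keep it short.

```yaml
# Sketch of a StatefulSet with one PVC per pod. Replica count, size, and serviceName are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: ollama           # assumes a headless Service with this name exists
  replicas: 2                   # each replica needs its own GPU
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama
    spec:
      runtimeClassName: nvidia
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
  volumeClaimTemplates:         # one ReadWriteOnce claim is created per pod
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi      # assumption
```

The catch is that every replica keeps its own copy of every model, so each pod downloads and stores them separately. That is most of the complexity I did not want.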
The Model Loading Problem
Ollama downloads a model the first time you pull or run it. For a 70B model, this takes 20-30 minutes on my connection. If the pod restarts, the model is still there because of the PVC. But if the PVC gets deleted or corrupted, you wait again.
I learned to set OLLAMA_KEEP_ALIVE high enough that the model stays resident. I also keep a local copy of my most-used models on my NAS as a backup. If the PVC dies, I can restore from the NAS instead of re-downloading.
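One way to take the first download out of the request path is a one-off Job that asks Ollama to pull the model ahead of time through its /api/pull endpoint. Treat this as a sketch under assumptions: the Service URL, the curl image tag, and the model name are placeholders, not values from my setup.

```yaml
# Hypothetical pre-pull Job. Service URL, image tag, and model name are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-prepull
  namespace: ollama
spec:
  backoffLimit: 2               # retry a couple of times if the pull fails
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pull
          image: curlimages/curl:8.11.1   # assumed tag
          command:
            - curl
            - --fail            # non-zero exit on an HTTP error so the Job is marked failed
            - --silent
            - --show-error
            # Assumes a Service named ollama in the ollama namespace.
            - http://ollama.ollama.svc.cluster.local:11434/api/pull
            - -d
            - '{"model": "llama3.1:8b", "stream": false}'   # placeholder model name
```

The Job only completes once the pull finishes, so a large model keeps it running for as long as the download takes.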
When I Actually Use This
I use Ollama on Kubernetes for:
- Quick experiments with new models. Pull, test, decide if it is worth keeping.
- My internal automation that needs local inference. No API keys, no rate limits.
- My Continue.dev setup, which points to the Ollama endpoint for local models.
I do not use it for:
- Production serving with SLAs. Ollama is not built for that.
- High concurrency. One user at a time is fine. With two users, it starts to get slow.
- Models larger than 8B on my GTX 1080. A larger model might fit, but generation is slow and it risks an OOM kill.
Conclusion
Ollama on Kubernetes is a solid pattern for personal or small-team use. The main risks are GPU visibility, PVC strategy, and model loading time. Get those right and the rest is mostly standard Kubernetes operations.
The GPU visibility issue cost me a Saturday. I hope this post saves you that time. The key is runtimeClassName: nvidia. Do not skip it.
Use Ollama to prove the use case. Move to vLLM when the use case proves itself and you need more than one concurrent user.