Deploy Ollama on Kubernetes: Self-Hosted AI Guide
Table of Contents
Part 1 of 4: Part 1 | Part 2 | Part 3 | Part 4*
If you’re running AI workloads in production, you’ve probably asked yourself: should I really be sending proprietary data to a third-party API? For most platform teams, the answer is no. When you deploy Ollama on Kubernetes, you get a fully self-hosted AI stack that keeps your data on-premise, cuts inference costs, and integrates cleanly with your existing platform tooling.
Ollama is an open-source tool that simplifies running large language models (LLMs) locally by wrapping model management, inference serving, and a REST API into a single binary. Kubernetes is the industry-standard container orchestration platform for automating deployment, scaling, and management of containerized applications. Self-hosted AI means running inference on hardware you control: on-premise, in a private cloud, or on dedicated GPU nodes; rather than relying on third-party APIs like OpenAI or Anthropic. GPU inference uses NVIDIA graphics processing units to accelerate LLM workloads, delivering 10-50x faster response times than CPU-only execution.
I’m a Certified Kubernetes Administrator with experience running this exact Ollama K8s setup on clusters ranging from homelab RTX rigs to multi-node A100 pools at companies like EduarD3V. By the end, you’ll have a namespace-isolated Ollama instance with GPU scheduling, persistent model storage, and ingress exposure: a working local LLM inference endpoint inside your cluster.
Version note: These instructions apply to Ollama 0.5.x and Kubernetes 1.28+. YAML manifests use stable API versions.
Quick Reference
| Command / Value | What It Does |
|---|---|
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace | Installs NVIDIA GPU Operator for GPU node scheduling |
kubectl label nodes <node> gpu-type=nvidia accelerator=nvidia-gpu | Labels GPU nodes for Ollama pod affinity |
ollama/ollama:0.5.7 | Recommended Ollama image tag for this guide |
100Gi PVC on fast-ssd StorageClass | Minimum persistent storage for model files |
nvidia.com/gpu: "1" | GPU resource request for the Ollama container |
OLLAMA_KEEP_ALIVE: "30m" | Keeps models loaded in VRAM for 30 minutes |
strategy: Recreate | Required Deployment strategy for ReadWriteOnce PVCs |
Port 11434 | Default Ollama HTTP API endpoint |
What Is Ollama and Why Deploy It on Kubernetes?
Ollama is an open-source tool for running large language models locally. It wraps model management, inference serving, and a REST API into a single binary: think Docker for LLMs. Run ollama pull llama3.2 and seconds later you’re hitting a local endpoint compatible with OpenAI’s API format.
Running Ollama on a single VM works for development. In production, you need scheduling, resource isolation, and failover. When you deploy Ollama on Kubernetes, you get GPU workload scheduling, resource quotas to prevent model downloads from starving your cluster, and ingress exposure without managing another load balancer. The result is a self-hosted AI platform that behaves like any other microservice in your cluster.
For teams already invested in Kubernetes, an Ollama K8s deployment eliminates the need to manage separate VM infrastructure for AI inference. You reuse existing monitoring, logging, and security patterns while keeping sensitive data within your network perimeter.
Prerequisites and Cluster Requirements
Before you deploy Ollama on Kubernetes, verify your environment:
| Requirement | Minimum | Recommended | Verify Command |
|---|---|---|---|
| Kubernetes | v1.28 | v1.30+ | kubectl version --short |
| kubectl | v1.28 | v1.30+ | kubectl version --client |
| NVIDIA GPU | 1x RTX 3090 (24 GB) | 1x A100 (40/80 GB) | nvidia-smi on node |
| Node OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS | lsb_release -a |
| Helm | v3.12+ | v3.14+ | helm version |
| Container runtime | containerd 1.7+ | containerd 1.7+ | kubectl get nodes -o wide |
Critical: NVIDIA Device Plugin
Ollama requires NVIDIA GPUs for acceptable inference performance. Your cluster nodes must have:
- NVIDIA drivers installed on the host
- NVIDIA Device Plugin for Kubernetes running as a DaemonSet
nvidia-container-toolkitconfigured for your container runtime
The easiest path is the NVIDIA GPU Operator, which automates all three:
helm install gpu-operator nvidia/gpu-operator \ --namespace gpu-operator \ --create-namespace \ --waitVerify GPU nodes are schedulable:
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'# Expected output shows nvidia.com/gpu: "1" (or higher)Warning: Without the device plugin, Kubernetes schedules your Ollama pod onto a GPU node but the container won’t see the GPU. Ollama falls back to CPU inference, which is 10-50x slower for most models.
FAQ
What is Ollama on Kubernetes?
Ollama on Kubernetes is a deployment pattern where the Ollama inference server runs as a containerized workload inside a Kubernetes cluster. This gives you GPU scheduling, persistent model storage via PVCs, ingress-based API exposure, and integration with existing Kubernetes monitoring and security tooling.
Do I need a GPU to run Ollama on Kubernetes?
Technically no, but practically yes for any real workload. A 7B parameter model runs 10-20x slower on CPU. Without GPU acceleration, inference times jump from milliseconds to multiple seconds per token. For development and testing of 1-3B models, CPU-only can work; just don’t expect production-grade response times.
What’s the difference between running Ollama on a VM versus Kubernetes?
A VM setup is simpler to start but harder to operate at scale. Kubernetes gives you automated GPU scheduling, resource quotas, health checks with liveness and readiness probes, and ingress integration. The trade-off is operational complexity: Kubernetes requires platform engineering skills to debug GPU scheduling or PVC binding issues. I cover these trade-offs in depth in Part 4: Production Considerations.
Which NVIDIA GPU Operator version do I need?
Use the latest stable release from NVIDIA’s Helm repository. As of March 2026, operator version 24.9+ works with Kubernetes 1.28+. The operator automatically deploys the device plugin, container toolkit, and driver installer as a DaemonSet; no manual configuration needed.
How much storage do I need for Ollama models?
Start with 100 GB on an SSD-backed StorageClass. A quantized Llama 3.1 70B model takes ~39 GB. Loading models from spinning disk is painfully slow. Always use SSD or NVMe-backed storage for Ollama PVCs. See Part 4 for detailed model size tables.
Next Steps
Your cluster now has the GPU Operator installed and nodes labeled for GPU workloads. In Part 2, I cover the full architecture design; namespace isolation, resource quotas, and GPU node affinity configuration.
Parts in this series: Part 2 →