Ollama on Kubernetes: Architecture and Components
Part 2 of 4: Part 1 | Part 2 | Part 3 | Part 4
Architecture Overview
Here's the architecture you'll deploy when you run Ollama on Kubernetes:
    ┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐
    │   Client    │────▶│     Ingress     │────▶│  Ollama Service  │
    │ (curl/API)  │     │ (nginx/haproxy) │     │   (ClusterIP)    │
    └─────────────┘     └─────────────────┘     └────────┬─────────┘
                                                         │
                                                         ▼
                                                ┌──────────────────┐
                                                │    Ollama Pod    │
                                                │  (GPU-enabled)   │
                                                └────────┬─────────┘
                                                         │
                                                ┌────────┴────────┐
                                                ▼                 ▼
                                         ┌─────────────┐   ┌──────────────┐
                                         │ NVIDIA GPU  │   │ PVC (models) │
                                         │ (Inference) │   │ (Local SSD)  │
                                         └─────────────┘   └──────────────┘

Key design decisions:
- Single-replica Deployment with persistent storage. Ollama keeps models on disk, so a PVC prevents re-downloads on pod restarts.
- Node affinity targets GPU nodes via a custom label (gpu-type: nvidia).
- Resource limits cap the pod's memory so that a large model load cannot exhaust the node and trigger OOM kills of other workloads.
- ConfigMap externalizes Ollama settings without rebuilding the image.
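To make the last point concrete, here is a rough sketch of what such a ConfigMap might look like. The name ollama-config and all values are assumptions for illustration only; the ConfigMap this series actually uses is part of the Part 3 manifest. The keys are environment variables that Ollama reads at startup.

```yaml
# Illustrative sketch only: name and values are assumptions, not the Part 3 manifest.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config          # hypothetical name
  namespace: ollama
data:
  OLLAMA_HOST: "0.0.0.0:11434"            # listen on all interfaces, default Ollama port
  OLLAMA_KEEP_ALIVE: "5m"                 # how long a model stays loaded after a request
  OLLAMA_MODELS: "/root/.ollama/models"   # model directory; should sit on the PVC mount
```

A Deployment would typically load these with envFrom and a configMapRef, so changing a setting means editing the ConfigMap and restarting the pod rather than rebuilding the image.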
Step 1: Create the Namespace and Resource Quota
Namespace isolation with resource guards prevents runaway model downloads from consuming cluster memory.
Isolate Ollama into its own namespace:
    apiVersion: v1
    kind: Namespace
    metadata:
      name: ollama
      labels:
        app.kubernetes.io/name: ollama
        app.kubernetes.io/part-of: self-hosted-ai
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: ollama-quota
      namespace: ollama
    spec:
      hard:
        requests.cpu: "8"
        requests.memory: 64Gi
        limits.cpu: "16"
        limits.memory: 128Gi
        # Extended resources such as GPUs cannot be overcommitted, so a quota
        # accepts only the requests. prefix for them (no limits.nvidia.com/gpu).
        requests.nvidia.com/gpu: "2"
        persistentvolumeclaims: "5"

Apply it:
    kubectl apply -f 01-namespace.yaml

Step 2: Configure GPU Node Selection
Use node labels with affinity or taints with tolerations to ensure Ollama pods always land on GPU-enabled nodes.
Label your GPU nodes:
    kubectl label nodes <gpu-node-name> gpu-type=nvidia accelerator=nvidia-gpu

I reference these labels in the Deployment manifest in Part 3. If your cluster uses taints to repel non-GPU workloads, add the matching toleration shown in the production Deployment manifest.
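As a preview of how those labels get consumed, the fragment below sketches the scheduling fields of a pod template. It is a minimal illustration, not the Part 3 manifest, and the taint key nvidia.com/gpu is an assumption: substitute whatever taint your GPU nodes actually carry.

```yaml
# Fragment of a Deployment (spec.template.spec only); not a complete manifest.
spec:
  template:
    spec:
      # Schedule only onto nodes carrying both labels applied above.
      nodeSelector:
        gpu-type: nvidia
        accelerator: nvidia-gpu
      tolerations:
        # Assumed taint key; adjust to match the taint on your GPU nodes.
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```

The nodeSelector pulls the pod toward GPU nodes, while the toleration only permits it to land on tainted nodes; neither one alone covers both cases.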
For a deeper dive into GPU scheduling patterns, see my guide on Kubernetes AI infrastructure setup.
FAQ
What is namespace isolation in Ollama Kubernetes deployments?
Namespace isolation means running the Ollama pod and its resources in a dedicated Kubernetes namespace with its own ResourceQuota. This prevents model downloads or inference workloads from consuming cluster resources allocated to other teams or services.
Why use ResourceQuota for Ollama?
ResourceQuotas prevent a single model download or inference spike from starving other workloads in the cluster. I've seen a 70B model download consume 40 GB of RAM during loading; without a quota, that can trigger OOMKilled events on the node and crash other running pods.
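One side effect worth knowing: once a quota covers CPU or memory, Kubernetes rejects pods in that namespace that omit the corresponding requests or limits. A LimitRange can supply defaults for such containers. The sketch below is an optional addition with a hypothetical name and illustrative values; it is not part of the manifests in this series.

```yaml
# Optional sketch: default requests/limits for containers that declare none.
# Name and values are illustrative assumptions; size them for your models.
apiVersion: v1
kind: LimitRange
metadata:
  name: ollama-defaults      # hypothetical name
  namespace: ollama
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: "1"
        memory: 4Gi
      default:               # applied when a container sets no limits
        cpu: "2"
        memory: 8Gi
```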
How does GPU node affinity work with Ollama?
GPU node affinity uses Kubernetes node labels and the Deployment's nodeSelector field to ensure Ollama pods only schedule on nodes with NVIDIA GPUs. Without this, Kubernetes might place your Ollama pod on a CPU-only node, causing 10-50x slower inference or complete failure to load models.
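nodeSelector supports only exact label matches. If you need set-based matching (for example, several accepted GPU types), the same constraint can be written as a required node affinity rule. The fragment below is a sketch of that equivalent form using the gpu-type label from Step 2; it is an alternative shown for illustration, not the form used in Part 3.

```yaml
# Fragment of a pod spec; equivalent to nodeSelector gpu-type: nvidia.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu-type
              operator: In
              values:
                - nvidia     # add further values here to accept other GPU types
```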
Should I use node affinity or taints/tolerations for GPU scheduling?
I use both. Node affinity via nodeSelector ensures the pod targets GPU nodes. Tolerations handle nodes with NoSchedule taints, which is common in GPU clusters where non-GPU workloads should be repelled. The manifest in Part 3 includes both patterns.
What happens if my GPU node goes down?
Since Ollama uses a single-replica Deployment with a ReadWriteOnce PVC, the pod stays in Pending until the node recovers or you manually intervene. For multi-node GPU clusters, switch to ReadWriteMany storage (NFS, EFS) and remove the Recreate strategy. I cover this in Part 4: Production Considerations.
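For orientation, a ReadWriteMany claim might look roughly like the sketch below. The claim name, the storage class nfs-client, and the size are all assumptions; the class must be backed by a provisioner that actually supports shared access, such as NFS or EFS.

```yaml
# Sketch of a shared model volume; name, class, and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models            # hypothetical name
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany              # lets pods on different nodes mount the same volume
  storageClassName: nfs-client   # assumed class; must support RWX
  resources:
    requests:
      storage: 100Gi             # illustrative size
```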
Next Steps
Your namespace and GPU node selection are configured. In Part 3, I walk through the full production-ready Deployment manifest with PVC, ConfigMap, liveness probes, and security context.