Ollama on Kubernetes: Architecture and Components

2026.03.29

Part 2 of 4: Part 1 | Part 2 | Part 3 | Part 4

Architecture Overview

Here’s the architecture you’ll deploy when you run Ollama on Kubernetes:

┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client    │────▶│     Ingress     │────▶│ Ollama Service  │
│ (curl/API)  │     │ (nginx/haproxy) │     │   (ClusterIP)   │
└─────────────┘     └─────────────────┘     └─────────────────┘
                                                     │
                             ┌───────────────────────┘
                             ▼
                    ┌─────────────────┐
                    │   Ollama Pod    │
                    │  (GPU-enabled)  │
                    └─────────────────┘
                             │
                  ┌──────────┴──────────┐
                  ▼                     ▼
           ┌─────────────┐      ┌──────────────┐
           │ NVIDIA GPU  │      │ PVC (models) │
           │ (Inference) │      │ (Local SSD)  │
           └─────────────┘      └──────────────┘

Key design decisions:

  • Single-replica Deployment with persistent storage. Ollama keeps models on disk, so a PVC prevents re-downloads on pod restarts.
  • Node affinity targets GPU nodes via a custom label (gpu-type: nvidia).
  • Resource requests and limits sized for the largest model guard against OOMKilled events during large model loads.
  • ConfigMap externalizes Ollama settings without rebuilding the image.
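
To make the storage and configuration decisions concrete, here is a minimal sketch of the two supporting objects. It is not the manifest from Part 3: the object names (ollama-models, ollama-config), the 100Gi size, and the setting values are assumptions chosen for illustration. OLLAMA_HOST and OLLAMA_KEEP_ALIVE are standard Ollama environment variables.

```yaml
# Sketch only: names, sizes, and values below are assumptions, not the series' manifest.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models            # hypothetical name
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce              # one node mounts the volume, which fits a single replica
  resources:
    requests:
      storage: 100Gi             # assumed; size it for the models you plan to pull
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config            # hypothetical name
  namespace: ollama
data:
  OLLAMA_HOST: "0.0.0.0:11434"   # listen on all interfaces on Ollama's default port
  OLLAMA_KEEP_ALIVE: "10m"       # assumed value; how long a model stays loaded after a request
```

A container would typically pick up the ConfigMap through envFrom with a configMapRef and mount the claim at /root/.ollama, the directory where the official ollama/ollama image stores models by default.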

Step 1: Create the Namespace and Resource Quota

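A dedicated namespace with a ResourceQuota caps the CPU, memory, GPUs, and PersistentVolumeClaims that Ollama workloads can claim, so a runaway model load cannot starve the rest of the cluster.

The manifest below creates the namespace and its quota: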

01-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
  labels:
    app.kubernetes.io/name: ollama
    app.kubernetes.io/part-of: self-hosted-ai
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    # Totals across all pods in the namespace
    requests.cpu: "8"
    requests.memory: 64Gi
    limits.cpu: "16"
    limits.memory: 128Gi
    # Extended resources such as GPUs support only the requests. prefix in a quota
    requests.nvidia.com/gpu: "2"
    persistentvolumeclaims: "5"

Apply it:

kubectl apply -f 01-namespace.yaml
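
Once a quota covers CPU and memory requests and limits, every pod in the namespace has to declare those values or it is rejected at admission. As a rough sketch, a container spec that fits inside the quota above might look like the fragment below. The sizes are assumptions for illustration, not the values used in Part 3.

```yaml
# Fragment of a pod template's container list; sizes are assumptions.
containers:
  - name: ollama
    image: ollama/ollama
    resources:
      requests:
        cpu: "4"                # counted against requests.cpu (8)
        memory: 16Gi            # counted against requests.memory (64Gi)
      limits:
        cpu: "8"                # counted against limits.cpu (16)
        memory: 32Gi            # counted against limits.memory (128Gi)
        nvidia.com/gpu: "1"     # the GPU request defaults to this limit
```

GPUs are an extended resource, so they cannot be overcommitted: setting the limit alone is enough, and if a request is also given it must equal the limit.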

Step 2: Configure GPU Node Selection

Pin Ollama pods to GPU nodes with node labels and a nodeSelector (or node affinity), and add tolerations where GPU nodes carry taints, so the pods always land on GPU-enabled hardware.

Label your GPU nodes:

kubectl label nodes <gpu-node-name> gpu-type=nvidia accelerator=nvidia-gpu

I reference these labels in the Deployment manifest in Part 3. If your cluster uses taints to repel non-GPU workloads, add the matching toleration shown in the production Deployment manifest.
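
Until you reach that manifest, the scheduling fields can be sketched as the pod-spec fragment below. The gpu-type label comes from the command above; the taint key nvidia.com/gpu is an assumption (a common convention), so match it to whatever taint your GPU nodes actually carry.

```yaml
# Fragment of a Deployment's spec.template.spec; a sketch, not the Part 3 manifest.
nodeSelector:
  gpu-type: nvidia              # label applied with kubectl label above
tolerations:
  - key: nvidia.com/gpu         # assumed taint key; adjust to your cluster
    operator: Exists            # tolerate the taint regardless of its value
    effect: NoSchedule
```

Running kubectl describe node <gpu-node-name> lists the node's Taints field, which shows the key and effect the toleration needs to match.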

For a deeper dive into GPU scheduling patterns, see my guide on Kubernetes AI infrastructure setup.

FAQ

What is namespace isolation in Ollama Kubernetes deployments?

Namespace isolation means running the Ollama pod and its resources in a dedicated Kubernetes namespace with its own ResourceQuota. This prevents model downloads or inference workloads from consuming cluster resources allocated to other teams or services.

Why use ResourceQuota for Ollama?

ResourceQuotas prevent a single model load or inference spike from starving other workloads in the cluster. I’ve seen a 70B model consume 40 GB of RAM while loading; without a quota, that can trigger OOMKilled events on the node and crash other running pods.

How does GPU node affinity work with Ollama?

GPU node affinity uses Kubernetes node labels and the Deployment’s nodeSelector field to ensure Ollama pods only schedule on nodes with NVIDIA GPUs. Without this, Kubernetes might place your Ollama pod on a CPU-only node, causing 10-50x slower inference or complete failure to load models.
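
nodeSelector is the short form. If you later need richer matching, such as several acceptable GPU types, the same constraint can be written with the affinity field. The fragment below is a sketch of that equivalent form using the gpu-type label from this article; it is offered as an alternative, not as what the Part 3 manifest uses.

```yaml
# Fragment of a pod spec; equivalent to nodeSelector {gpu-type: nvidia}.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement at scheduling time
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu-type
              operator: In
              values:
                - nvidia        # add more values here to accept other labeled GPU types
```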

Should I use node affinity or taints/tolerations for GPU scheduling?

I use both. Node affinity via nodeSelector ensures the pod targets GPU nodes. Tolerations handle nodes with NoSchedule taints, which is common in GPU clusters where non-GPU workloads should be repelled. The manifest in Part 3 includes both patterns.

What happens if my GPU node goes down?

Since Ollama uses a single-replica Deployment with a ReadWriteOnce PVC, the pod stays in Pending until the node recovers or you manually intervene. For multi-node GPU clusters, switch to ReadWriteMany storage (NFS, EFS) and remove the Recreate strategy. I cover this in Part 4: Production Considerations.
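
As a rough idea of what the storage change involves, the sketch below shows a claim that requests ReadWriteMany access. The claim name and the nfs-client storage class are hypothetical; the class must be backed by a provisioner that actually supports ReadWriteMany, and the size is an assumption.

```yaml
# Sketch only: name, storage class, and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-shared    # hypothetical name
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany             # several nodes can mount the volume at once
  storageClassName: nfs-client  # hypothetical; use an RWX-capable class in your cluster
  resources:
    requests:
      storage: 100Gi            # assumed size
```

With shared storage in place, a replacement pod can start on another GPU node without waiting for the failed node to release the volume.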

Next Steps

Your namespace and GPU node selection are configured. In Part 3, I walk through the full production-ready Deployment manifest with PVC, ConfigMap, liveness probes, and security context.

Parts in this series: ← Part 1 | Part 3 →
