Deploy Ollama on Kubernetes: Self-Hosted AI Guide

2026.03.26
Technology
940 Words
Deploy Ollama on Kubernetes: Self-Hosted AI Guide

Part 1 of 4: Part 1 | Part 2 | Part 3 | Part 4*

If you’re running AI workloads in production, you’ve probably asked yourself: should I really be sending proprietary data to a third-party API? For most platform teams, the answer is no. When you deploy Ollama on Kubernetes, you get a fully self-hosted AI stack that keeps your data on-premise, cuts inference costs, and integrates cleanly with your existing platform tooling.

Ollama is an open-source tool that simplifies running large language models (LLMs) locally by wrapping model management, inference serving, and a REST API into a single binary. Kubernetes is the industry-standard container orchestration platform for automating deployment, scaling, and management of containerized applications. Self-hosted AI means running inference on hardware you control: on-premise, in a private cloud, or on dedicated GPU nodes; rather than relying on third-party APIs like OpenAI or Anthropic. GPU inference uses NVIDIA graphics processing units to accelerate LLM workloads, delivering 10-50x faster response times than CPU-only execution.

I’m a Certified Kubernetes Administrator with experience running this exact Ollama K8s setup on clusters ranging from homelab RTX rigs to multi-node A100 pools at companies like EduarD3V. By the end, you’ll have a namespace-isolated Ollama instance with GPU scheduling, persistent model storage, and ingress exposure: a working local LLM inference endpoint inside your cluster.

Version note: These instructions apply to Ollama 0.5.x and Kubernetes 1.28+. YAML manifests use stable API versions.

Quick Reference

Command / ValueWhat It Does
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespaceInstalls NVIDIA GPU Operator for GPU node scheduling
kubectl label nodes <node> gpu-type=nvidia accelerator=nvidia-gpuLabels GPU nodes for Ollama pod affinity
ollama/ollama:0.5.7Recommended Ollama image tag for this guide
100Gi PVC on fast-ssd StorageClassMinimum persistent storage for model files
nvidia.com/gpu: "1"GPU resource request for the Ollama container
OLLAMA_KEEP_ALIVE: "30m"Keeps models loaded in VRAM for 30 minutes
strategy: RecreateRequired Deployment strategy for ReadWriteOnce PVCs
Port 11434Default Ollama HTTP API endpoint

What Is Ollama and Why Deploy It on Kubernetes?

Ollama is an open-source tool for running large language models locally. It wraps model management, inference serving, and a REST API into a single binary: think Docker for LLMs. Run ollama pull llama3.2 and seconds later you’re hitting a local endpoint compatible with OpenAI’s API format.

Running Ollama on a single VM works for development. In production, you need scheduling, resource isolation, and failover. When you deploy Ollama on Kubernetes, you get GPU workload scheduling, resource quotas to prevent model downloads from starving your cluster, and ingress exposure without managing another load balancer. The result is a self-hosted AI platform that behaves like any other microservice in your cluster.

For teams already invested in Kubernetes, an Ollama K8s deployment eliminates the need to manage separate VM infrastructure for AI inference. You reuse existing monitoring, logging, and security patterns while keeping sensitive data within your network perimeter.

Prerequisites and Cluster Requirements

Before you deploy Ollama on Kubernetes, verify your environment:

RequirementMinimumRecommendedVerify Command
Kubernetesv1.28v1.30+kubectl version --short
kubectlv1.28v1.30+kubectl version --client
NVIDIA GPU1x RTX 3090 (24 GB)1x A100 (40/80 GB)nvidia-smi on node
Node OSUbuntu 22.04 LTSUbuntu 24.04 LTSlsb_release -a
Helmv3.12+v3.14+helm version
Container runtimecontainerd 1.7+containerd 1.7+kubectl get nodes -o wide

Critical: NVIDIA Device Plugin

Ollama requires NVIDIA GPUs for acceptable inference performance. Your cluster nodes must have:

  1. NVIDIA drivers installed on the host
  2. NVIDIA Device Plugin for Kubernetes running as a DaemonSet
  3. nvidia-container-toolkit configured for your container runtime

The easiest path is the NVIDIA GPU Operator, which automates all three:

Terminal window
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--wait

Verify GPU nodes are schedulable:

Terminal window
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'
# Expected output shows nvidia.com/gpu: "1" (or higher)

Warning: Without the device plugin, Kubernetes schedules your Ollama pod onto a GPU node but the container won’t see the GPU. Ollama falls back to CPU inference, which is 10-50x slower for most models.

FAQ

What is Ollama on Kubernetes?

Ollama on Kubernetes is a deployment pattern where the Ollama inference server runs as a containerized workload inside a Kubernetes cluster. This gives you GPU scheduling, persistent model storage via PVCs, ingress-based API exposure, and integration with existing Kubernetes monitoring and security tooling.

Do I need a GPU to run Ollama on Kubernetes?

Technically no, but practically yes for any real workload. A 7B parameter model runs 10-20x slower on CPU. Without GPU acceleration, inference times jump from milliseconds to multiple seconds per token. For development and testing of 1-3B models, CPU-only can work; just don’t expect production-grade response times.

What’s the difference between running Ollama on a VM versus Kubernetes?

A VM setup is simpler to start but harder to operate at scale. Kubernetes gives you automated GPU scheduling, resource quotas, health checks with liveness and readiness probes, and ingress integration. The trade-off is operational complexity: Kubernetes requires platform engineering skills to debug GPU scheduling or PVC binding issues. I cover these trade-offs in depth in Part 4: Production Considerations.

Which NVIDIA GPU Operator version do I need?

Use the latest stable release from NVIDIA’s Helm repository. As of March 2026, operator version 24.9+ works with Kubernetes 1.28+. The operator automatically deploys the device plugin, container toolkit, and driver installer as a DaemonSet; no manual configuration needed.

How much storage do I need for Ollama models?

Start with 100 GB on an SSD-backed StorageClass. A quantized Llama 3.1 70B model takes ~39 GB. Loading models from spinning disk is painfully slow. Always use SSD or NVMe-backed storage for Ollama PVCs. See Part 4 for detailed model size tables.

Next Steps

Your cluster now has the GPU Operator installed and nodes labeled for GPU workloads. In Part 2, I cover the full architecture design; namespace isolation, resource quotas, and GPU node affinity configuration.

Parts in this series: Part 2 →

# Ollama # Kubernetes # self-hosted-ai # gpu-inference # nvidia # DevOps # Llm