Ollama Kubernetes Production Manifests and Deploy

2026.04.01
Technology

Part 3 of 4: Part 1 | Part 2 | Part 3 | Part 4

Step 3: Deploy Ollama with Production-Ready Manifests

This manifest bundles the Deployment, PVC, and ConfigMap into a single YAML file I've used across six production clusters.

02-ollama-deployment.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd # Replace with your SSD-backed StorageClass
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ollama
data:
  OLLAMA_HOST: "0.0.0.0"
  OLLAMA_PORT: "11434"
  OLLAMA_NUM_PARALLEL: "4"
  OLLAMA_MAX_LOADED_MODELS: "2"
  OLLAMA_KEEP_ALIVE: "30m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama
    app.kubernetes.io/component: inference-server
spec:
  replicas: 1
  strategy:
    type: Recreate # Required for RWO PVCs
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama
    spec:
      nodeSelector:
        gpu-type: nvidia
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      # Uncomment if your cluster requires explicit GPU runtime class
      # runtimeClassName: nvidia
      terminationGracePeriodSeconds: 60
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: ollama-config
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "64Gi" # 32Gi for 7B-13B models, 64Gi for 70B models
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false # Ollama writes models to /root/.ollama
            runAsNonRoot: false
            capabilities:
              drop:
                - ALL
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models

Key choices I've validated in production:

  • strategy: Recreate: Required because the ReadWriteOnce PVC can't mount to two pods simultaneously.
  • Resource limits: 64Gi supports 70B parameter models with quantization. Use 32Gi for 7B-13B models.
  • OLLAMA_NUM_PARALLEL: "4": Lets each loaded model serve up to four requests in parallel. Increase only if you have spare VRAM, since each parallel slot needs its own context memory.
  • lifecycle.preStop + terminationGracePeriodSeconds: 60: The preStop hook delays SIGTERM by 15 seconds so the pod can drop out of the Service endpoints while in-flight requests finish. Ollama then has the remaining 45 seconds of the 60-second grace period to exit before SIGKILL.
  • OLLAMA_KEEP_ALIVE: "30m": Keeps models loaded in VRAM for 30 minutes after the last request, reducing cold-start latency.
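
To make the shutdown timing concrete: Kubernetes runs the preStop hook first, sends SIGTERM once the hook returns, and sends SIGKILL when terminationGracePeriodSeconds has elapsed since termination began. The sketch below just does that arithmetic with the manifest's values. The helper name grace_remaining is my own, not part of kubectl or Ollama.

```shell
# Sketch: how the termination grace period is split between the preStop hook
# and the time the container has to exit after SIGTERM.
# grace_remaining is a hypothetical helper name used only for illustration.
grace_remaining() {
  grace="$1"    # terminationGracePeriodSeconds
  prestop="$2"  # seconds spent in the preStop hook (sleep 15 in the manifest)
  echo $(( grace - prestop ))
}

# Values from the manifest: 60s grace period, 15s preStop sleep.
echo "SIGTERM at t=15s, SIGKILL at t=60s, $(grace_remaining 60 15)s in between"
```

If your longest inference requests routinely take more than the preStop window, raising both values together keeps the same split.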

Apply the manifest:

Terminal window
kubectl apply -f 02-ollama-deployment.yaml

Wait for the pod to become Ready:

Terminal window
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=ollama -n ollama --timeout=300s
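
If the wait times out, it helps to look at the pod events and the PVC right away. Here is a minimal sketch of a wrapper that does both. The function name wait_for_ollama is hypothetical; the namespace, label, and PVC name are the ones from the manifest above.

```shell
# Sketch: wait for the rollout and print diagnostics if it does not finish.
# wait_for_ollama is a hypothetical helper, not a kubectl subcommand.
wait_for_ollama() {
  ns="${1:-ollama}"
  if kubectl rollout status deployment/ollama -n "$ns" --timeout=300s; then
    echo "ollama rollout complete"
  else
    # A pod stuck in Pending usually points at GPU scheduling or an unbound PVC.
    kubectl describe pod -l app.kubernetes.io/name=ollama -n "$ns"
    kubectl get pvc ollama-models -n "$ns"
    return 1
  fi
}

# Usage:
#   wait_for_ollama          # defaults to the ollama namespace
#   wait_for_ollama staging  # any other namespace you deployed into
```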

Step 4: Expose Ollama with Service and Ingress

Create a ClusterIP Service for internal communication and an Ingress for external access.

03-service-ingress.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: ollama
  ports:
    - port: 11434
      targetPort: 11434
      name: http
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "512m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.<your-cluster-domain>.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434

Warning: Ollama has no built-in authentication. I expose via Ingress only with an auth proxy (oauth2-proxy, Authelia) or internal VPN.

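Tip: The proxy-body-size: "512m" annotation matters if you plan to upload custom model files through the API. Creating a model from a local file sends it to Ollama's blob endpoint (POST /api/blobs/:digest), and ingress-nginx rejects request bodies larger than its 1m default. Raise the limit further if your files exceed 512 MB.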

Apply and verify:

Terminal window
kubectl apply -f 03-service-ingress.yaml
kubectl get ingress -n ollama
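
Workloads inside the cluster do not need the Ingress at all; they can reach the ClusterIP Service through its DNS name. A small sketch of how that name is put together, assuming the default cluster domain cluster.local (yours may differ). The helper service_url is hypothetical.

```shell
# Sketch: build the in-cluster URL for a Service.
# service_url is a hypothetical helper; cluster.local is the default cluster
# domain and is an assumption about your cluster.
service_url() {
  name="$1"; ns="$2"; port="$3"
  echo "http://$name.$ns.svc.cluster.local:$port"
}

# The Service defined above:
service_url ollama ollama 11434

# From any pod that has curl installed:
#   curl -s "$(service_url ollama ollama 11434)/api/tags"
```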

Step 5: Verification: Pull a Model and Run Inference

With the pod running, pull a model and test local LLM inference to confirm GPU acceleration is working.

Exec into the pod to pull your first model:

Terminal window
kubectl exec -it deployment/ollama -n ollama -- ollama pull llama3.2

Expected output:

pulling manifest
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB
verifying sha256 digest
writing manifest
success

Test inference via port-forward:

Terminal window
kubectl port-forward svc/ollama 11434:11434 -n ollama

In another terminal:

Terminal window
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is Kubernetes the right platform for self-hosted AI?",
  "stream": false
}'

You should receive a JSON response with generated text. If you see {"done":true,"response":"..."}, your local LLM inference stack is live.
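
For a scripted smoke test, the same check can be automated. This is a rough sketch that only pattern-matches the JSON text for the done marker instead of parsing it; the helper name is_generation_done is my own.

```shell
# Sketch: succeed if a non-streaming /api/generate response contains "done":true.
# is_generation_done is a hypothetical helper; it reads the response on stdin.
is_generation_done() {
  grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

# Usage (assumes the port-forward from the previous step is still running):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama3.2", "prompt": "ping", "stream": false}' \
#     | is_generation_done && echo "inference OK"
```

An error response such as a missing model has no done field, so the function returns non-zero and the script can fail fast.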

For programmatic access, verify the OpenAI-compatible endpoint:

Terminal window
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

FAQ

Why use strategy: Recreate instead of RollingUpdate?

ReadWriteOnce PVCs can only mount to one pod at a time. A rolling update would create a new pod before terminating the old one, and the new pod would fail to mount the PVC. Recreate terminates the old pod first, then starts the new one.

What do the liveness and readiness probes check?

Both probes hit the /api/tags endpoint. The liveness probe (every 30s after a 60s delay) restarts the container if Ollama fails three consecutive checks. The readiness probe (every 10s after a 10s delay) keeps the pod out of the Service endpoints until Ollama is ready to serve requests. I've tuned these delays to account for model loading time.
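
As a rough guide to how long a hung server goes unnoticed, multiply periodSeconds by failureThreshold. The sketch below does that for the values in the manifest; it ignores timeoutSeconds and where in the probe cycle the failure starts, so treat the numbers as approximations. The helper probe_failure_window is hypothetical.

```shell
# Sketch: approximate seconds of consecutive probe failures before the
# kubelet acts. probe_failure_window is a hypothetical helper.
probe_failure_window() {
  period="$1"     # periodSeconds
  threshold="$2"  # failureThreshold
  echo $(( period * threshold ))
}

echo "liveness: container restarted after about $(probe_failure_window 30 3)s of failed checks"
echo "readiness: pod removed from endpoints after about $(probe_failure_window 10 3)s of failed checks"
```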

How do I verify GPU acceleration is actually working?

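Run kubectl exec deployment/ollama -n ollama -- nvidia-smi. Once a model is loaded, GPU memory usage should rise by roughly the model's size, and during inference GPU utilization should spike to 90-100%. The process table may appear empty inside a container because of PID namespace isolation, so treat memory and utilization as the primary signals. Running ollama ps in the pod also reports whether a loaded model sits on GPU or CPU.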
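
For a more compact reading than the full nvidia-smi table, its query mode prints just the fields you ask for. A minimal sketch, assuming the Deployment name and namespace used throughout this series; gpu_stats is a hypothetical helper name.

```shell
# Sketch: print GPU name, utilization, and memory as one CSV line per GPU.
# gpu_stats is a hypothetical helper wrapping kubectl exec.
gpu_stats() {
  kubectl exec deployment/ollama -n "${1:-ollama}" -- \
    nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader
}

# Usage: run once while idle and again during a long generation request.
#   gpu_stats
```

Comparing the two readings shows whether memory.used grew after the model loaded and whether utilization.gpu climbs while tokens are being generated.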

What does OLLAMA_KEEP_ALIVE: "30m" do?

It keeps the loaded model in VRAM for 30 minutes after the last request. Without this setting, Ollama falls back to its default of 5 minutes, so any request that arrives after a longer idle gap pays 5-30 seconds of cold-start latency while the model reloads. For chat applications, I recommend 30m. For batch processing, set it higher or to -1 (never unload).
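
The same knob is available per request: the keep_alive field on /api/generate overrides the server-wide value for that model, a request with no prompt just loads the model, and keep_alive of 0 unloads it. A sketch of both, assuming the port-forward from Step 5. OLLAMA_URL and the two function names are my own, not Ollama settings.

```shell
# Sketch: warm up or evict a model through the API.
# OLLAMA_URL, preload_model, and unload_model are hypothetical names.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

preload_model() {
  # No prompt: Ollama loads the model and keeps it for the given duration.
  curl -s "$OLLAMA_URL/api/generate" \
    -d "{\"model\": \"$1\", \"keep_alive\": \"${2:-30m}\"}"
}

unload_model() {
  # keep_alive of 0 asks Ollama to unload the model immediately.
  curl -s "$OLLAMA_URL/api/generate" \
    -d "{\"model\": \"$1\", \"keep_alive\": 0}"
}

# Usage:
#   preload_model llama3.2 1h
#   unload_model llama3.2
```

Preloading right after a deploy removes the cold start from the first user request.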

Can I run multiple models simultaneously with this setup?

Yes, but only up to OLLAMA_MAX_LOADED_MODELS: "2". Each loaded model consumes VRAM. On a single A100 with 80 GB, you can run two 7B models simultaneously. For production multi-model serving, I recommend the per-model Deployment pattern in Part 4.

Next Steps

Your Ollama instance is running on Kubernetes with GPU acceleration, persistent storage, and ingress exposure. In Part 4, I cover production hardening: HPA alternatives, NetworkPolicies, monitoring, troubleshooting common errors, and when NOT to use Kubernetes for Ollama.

Parts in this series: ← Part 2 | Part 4 →
