LiteLLM on Kubernetes: Deploy an AI Gateway
Table of Contents
Part 1 of 4. This series covers building a centralized AI gateway with LiteLLM on Kubernetes. Part 2 β Deploying to Production Β· Part 3 β Configuring AI Tools Β· Part 4 β Production Hardening
Managing multiple AI providers turns into a maintenance nightmare fast. One month your go-to model delivers cheap, blazing-fast responses. The next, that same provider hikes prices, restructures tiers, or sunsets the API without warning. Suddenly youβre hunting down API keys in five different tools: your IDE, CLI assistant, automation scripts, chatbot, and that side project you forgot about.
I got tired of that circus.
So I dropped LiteLLM onto my Kubernetes cluster. Now everything routes through a single endpoint with one master key. Switching providers means updating one config file; no client changes required.
This series walks through the exact architecture I run daily: real configs, real providers, zero fluff. Youβll learn how to deploy a centralized AI gateway that eliminates API key sprawl permanently.
What Is LiteLLM?
LiteLLM is an open-source AI gateway that exposes a single OpenAI-compatible API for 100+ providers. Stop wrestling with each providerβs auth format, endpoint quirks, and request schemas. Send standard OpenAI requests to LiteLLM; it handles the translation.
LiteLLM processes millions of API calls daily across enterprise and self-hosted deployments. It preserves provider-specific features: function calling, streaming, tool use; while abstracting complexity behind a unified interface.
βThe future of AI infrastructure isnβt about picking one model; itβs about routing to the right model at the right time. LiteLLM makes that possible without rewriting your application code.β Ishaan Jaffer, creator of LiteLLM
That philosophy drove my deployment. My coding agents donβt worry about Kimiβs headers, OpenRouterβs referer rules, or NVIDIAβs naming quirks. They fire standard requests; LiteLLM handles the rest.
What Youβll Build
A LiteLLM Proxy on Kubernetes serving as your unified gateway to multiple AI providers:
- Kimi Code: high-quality code generation
- OpenRouter: free and trending models
- NVIDIA NIM: Llama models via NVIDIAβs inference stack
Your coding agents, IDEs, and scripts all target a single URL. You decide which model they hit, and you can flip that decision instantly. No client updates required.
A 2025 Retool survey found 62% of engineering teams juggle three or more AI providers simultaneously. Managing API keys across that many tools creates real operational drag; precisely the problem LiteLLM eliminates.
Time to complete: 20β30 minutes
Difficulty: Intermediate
Cost: Free (uses free tiers and your existing K8s cluster)
Prerequisites
Before diving in, confirm you have these in place. If we havenβt met, Iβm a very technical monkey writing about production infrastructure and AI systems.
| Requirement | Minimum | Recommended | Verify Command |
|---|---|---|---|
| Kubernetes cluster | 1 node, 2 vCPU | 2+ nodes, 4 vCPU | kubectl version |
| kubectl | v1.28+ | v1.30+ | kubectl version --client |
| Storage | 1 GB for config | 5 GB+ for logs | df -h |
| Postgres | External or in-cluster | Dedicated instance | psql --version |
| Tailscale (optional) | For secure remote access | tailscale status |
Youβll also need API keys for the providers you want to route to:
Architecture Overview
Hereβs the architecture at a glance:
βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββββββ OpenCode IDE ββββββΆβ LiteLLM Proxy ββββββΆβ Kimi Code ββ (Your Agent) β β (K8s NodePort) β β (Coding LLM) ββββββββββββββββββββ β β βββββββββββββββββββ β <your-node> ββββββΆβββββββββββββββββββ β :<node-port> β β OpenRouter β β β β (Free Models) β β Single API Key β βββββββββββββββββββ β Multiple ModelsββββββΆβββββββββββββββββββ βββββββββββββββββββ β NVIDIA NIM β β (Llama 4) β βββββββββββββββββββData flow:
- Your client fires a standard OpenAI request to LiteLLM
- LiteLLM matches the model name in its config and routes to the right provider
- The provider responds, LiteLLM relays the result back
- All API keys stay server-side; clients only hold the master key
Frequently Asked Questions
What is LiteLLM?
An open-source AI gateway that exposes a unified OpenAI-compatible API for 100+ providers. It handles authentication, request translation, and routing behind a single endpoint.
Why deploy LiteLLM on Kubernetes?
Centralized management, persistent configuration, shared access across tools, and the ability to scale without client changes; no API key sprawl.
Is LiteLLM free and open source?
Yes. LiteLLM is MIT-licensed and freely available on GitHub. You can self-host it anywhere.
What providers does it support?
Over 100 including OpenAI, Anthropic, Kimi, OpenRouter, NVIDIA NIM, Groq, Together AI, and more. Add any of them via proxy_config.yaml.
Does LiteLLM work without Kubernetes?
Absolutely. You can run it locally with Docker or as a Python package. Kubernetes adds centralized management for production use.
Ready to deploy? Continue to Part 2: Deploying LiteLLM to Kubernetes.