LiteLLM on Kubernetes: Deploy an AI Gateway

2026.02.24
Technology
702 Words
LiteLLM on Kubernetes: Deploy an AI Gateway

Part 1 of 4. This series covers building a centralized AI gateway with LiteLLM on Kubernetes. Part 2 β†’ Deploying to Production Β· Part 3 β†’ Configuring AI Tools Β· Part 4 β†’ Production Hardening

Managing multiple AI providers turns into a maintenance nightmare fast. One month your go-to model delivers cheap, blazing-fast responses. The next, that same provider hikes prices, restructures tiers, or sunsets the API without warning. Suddenly you’re hunting down API keys in five different tools: your IDE, CLI assistant, automation scripts, chatbot, and that side project you forgot about.

I got tired of that circus.

So I dropped LiteLLM onto my Kubernetes cluster. Now everything routes through a single endpoint with one master key. Switching providers means updating one config file; no client changes required.

This series walks through the exact architecture I run daily: real configs, real providers, zero fluff. You’ll learn how to deploy a centralized AI gateway that eliminates API key sprawl permanently.

What Is LiteLLM?

LiteLLM is an open-source AI gateway that exposes a single OpenAI-compatible API for 100+ providers. Stop wrestling with each provider’s auth format, endpoint quirks, and request schemas. Send standard OpenAI requests to LiteLLM; it handles the translation.

LiteLLM processes millions of API calls daily across enterprise and self-hosted deployments. It preserves provider-specific features: function calling, streaming, tool use; while abstracting complexity behind a unified interface.

β€œThe future of AI infrastructure isn’t about picking one model; it’s about routing to the right model at the right time. LiteLLM makes that possible without rewriting your application code.” Ishaan Jaffer, creator of LiteLLM

That philosophy drove my deployment. My coding agents don’t worry about Kimi’s headers, OpenRouter’s referer rules, or NVIDIA’s naming quirks. They fire standard requests; LiteLLM handles the rest.

What You’ll Build

A LiteLLM Proxy on Kubernetes serving as your unified gateway to multiple AI providers:

  • Kimi Code: high-quality code generation
  • OpenRouter: free and trending models
  • NVIDIA NIM: Llama models via NVIDIA’s inference stack

Your coding agents, IDEs, and scripts all target a single URL. You decide which model they hit, and you can flip that decision instantly. No client updates required.

A 2025 Retool survey found 62% of engineering teams juggle three or more AI providers simultaneously. Managing API keys across that many tools creates real operational drag; precisely the problem LiteLLM eliminates.

Time to complete: 20–30 minutes
Difficulty: Intermediate
Cost: Free (uses free tiers and your existing K8s cluster)

Prerequisites

Before diving in, confirm you have these in place. If we haven’t met, I’m a very technical monkey writing about production infrastructure and AI systems.

RequirementMinimumRecommendedVerify Command
Kubernetes cluster1 node, 2 vCPU2+ nodes, 4 vCPUkubectl version
kubectlv1.28+v1.30+kubectl version --client
Storage1 GB for config5 GB+ for logsdf -h
PostgresExternal or in-clusterDedicated instancepsql --version
Tailscale (optional)For secure remote accesstailscale status

You’ll also need API keys for the providers you want to route to:

Architecture Overview

Here’s the architecture at a glance:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ OpenCode IDE │────▢│ LiteLLM Proxy │────▢│ Kimi Code β”‚
β”‚ (Your Agent) β”‚ β”‚ (K8s NodePort) β”‚ β”‚ (Coding LLM) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ <your-node> β”‚β”€β”€β”€β”€β–Άβ”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ :<node-port> β”‚ β”‚ OpenRouter β”‚
β”‚ β”‚ β”‚ (Free Models) β”‚
β”‚ Single API Key β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ Multiple Modelsβ”‚β”€β”€β”€β”€β–Άβ”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ NVIDIA NIM β”‚
β”‚ (Llama 4) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data flow:

  1. Your client fires a standard OpenAI request to LiteLLM
  2. LiteLLM matches the model name in its config and routes to the right provider
  3. The provider responds, LiteLLM relays the result back
  4. All API keys stay server-side; clients only hold the master key

Frequently Asked Questions

What is LiteLLM?

An open-source AI gateway that exposes a unified OpenAI-compatible API for 100+ providers. It handles authentication, request translation, and routing behind a single endpoint.

Why deploy LiteLLM on Kubernetes?

Centralized management, persistent configuration, shared access across tools, and the ability to scale without client changes; no API key sprawl.

Is LiteLLM free and open source?

Yes. LiteLLM is MIT-licensed and freely available on GitHub. You can self-host it anywhere.

What providers does it support?

Over 100 including OpenAI, Anthropic, Kimi, OpenRouter, NVIDIA NIM, Groq, Together AI, and more. Add any of them via proxy_config.yaml.

Does LiteLLM work without Kubernetes?

Absolutely. You can run it locally with Docker or as a Python package. Kubernetes adds centralized management for production use.


Ready to deploy? Continue to Part 2: Deploying LiteLLM to Kubernetes.

# litellm # Kubernetes # AI # Llm # proxy # ai-gateway