Ollama vs vLLM: Benchmarks and Kubernetes Ready
Table of Contents
Ollama vs vLLM: Benchmarks and Kubernetes Readiness
This is Part 2 of a 4-part series comparing Ollama and vLLM. Part 1 covered architecture; Part 3 covers the decision framework.
Performance & Scalability
Performance is where these two tools diverge most dramatically. I ran benchmarks on an NVIDIA A100 80GB with Llama 3 8B Instruct.
Benchmark Environment: NVIDIA A100 80GB | Ubuntu 22.04 | CUDA 12.4 | Ollama 0.3.0 | vLLM 0.5.1. See broader LLM inference benchmarks.
Throughput Test (tokens/second)
| Metric | Ollama | vLLM | Delta |
|---|---|---|---|
| Single request (batch=1) | 85 t/s | 92 t/s | +8% vLLM |
| 4 concurrent requests | 110 t/s aggregate | 340 t/s aggregate | +209% vLLM |
| 16 concurrent requests | 125 t/s aggregate | 1,180 t/s aggregate | +844% vLLM |
| 32 concurrent requests | 130 t/s aggregate | 2,050 t/s aggregate | +1,477% vLLM |
Latency Test (time to first token, TTFT)
| Metric | Ollama | vLLM | Delta |
|---|---|---|---|
| Single request | 45 ms | 38 ms | -16% vLLM |
| 4 concurrent requests | 180 ms | 42 ms | -77% vLLM |
| 16 concurrent requests | 720 ms | 48 ms | -93% vLLM |
Resource Usage (16 concurrent requests)
| Metric | Ollama | vLLM | Delta |
|---|---|---|---|
| GPU VRAM (GB) | ~18 GB | ~14 GB | -22% vLLM |
| GPU Utilization | 65% | 98% | +51% vLLM |
| CPU (cores) | 2.1 | 3.8 | +81% vLLM |
For a single user these differences barely register. Scale beyond a handful of concurrent requests and vLLM’s PagedAttention leaves Ollama in the dust, consuming less VRAM per request thanks to efficient KV cache management.
Ease of Use & Developer Experience
Ollama wins on simplicity. The CLI is intuitive:
ollama pull llama3.1ollama run llama3.1The REST API is minimal:
curl http://localhost:11434/api/generate -d '{ "model": "llama3.1", "prompt": "Why is the sky blue?"}'Model management runs on autopilot. Ollama downloads the right quantization, handles updates, and supports Modelfiles.
vLLM requires more setup but the serve command is straightforward:
python -m vllm.entrypoints.openai.api_server \ --model meta-llama/Meta-Llama-3-8B-Instruct \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.9But production tuning requires understanding batch sizes, KV cache quantization, and scheduling policies.
Model Support & Quantization
Ollama uses the GGUF format with broad quantization support: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and FP16. The ollama.com/library hosts hundreds of models.
vLLM supports HuggingFace Transformers format (safetensors) with AWQ, GPTQ, and FP8 quantization, but not GGUF. You cannot directly use Ollama’s model library, you need to download from HuggingFace or convert them.
vLLM adds support faster via the Transformers ecosystem.
API Compatibility & Integrations
Ollama exposes a native REST API with a partial OpenAI compatibility layer. /v1/chat/completions exists, but function calling stays limited and tool use remains experimental.
vLLM exposes a full OpenAI-compatible API with streaming, function calling, tool use, and embeddings. For OpenAI-standard behavior, vLLM is the closer match.
Kubernetes Readiness
Both run well on Kubernetes, though their operational models differ.
Ollama on Kubernetes is straightforward: one Deployment, one Service, one PVC for model cache. Scaling stays mostly vertical since Ollama does not distribute inference across GPUs. You can run multiple replicas with sticky sessions, but each replica operates independently.
vLLM on Kubernetes supports tensor and pipeline parallelism, so a single model can span multiple GPUs or nodes. See my guide to deploy vLLM in production for a walkthrough. This makes vLLM more complex but far more capable for large models, with official Helm charts and KubeRay integration.
| Aspect | Ollama | vLLM |
|---|---|---|
| Multi-GPU single model | No | Yes (tensor/pipeline parallel) |
| Horizontal scaling | Replicas with session stickiness | Replicas + load balancing |
| Model cache | Local PVC | Local PVC or shared filesystem |
| Official Helm chart | Community | Official |
| HPA-friendly | Moderate | High (stateless workers) |
Part 3 covers the decision framework. Part 4 covers cost analysis.
FAQ
Why does vLLM perform better under concurrency? vLLM uses continuous batching, dynamically grouping incoming requests into batches on the GPU. Combined with PagedAttention’s efficient memory management, this saturates GPU compute units far better than Ollama’s simpler scheduling.
Can I use Ollama’s models with vLLM? Not directly. Ollama uses GGUF, while vLLM requires HuggingFace safetensors. Download from HuggingFace or convert the model.
Does vLLM work on Kubernetes without a GPU? Not practically, it requires NVIDIA GPUs with CUDA. For CPU-only inference, Ollama via llama.cpp is the right choice.
Does Ollama support tensor parallelism? No, each replica operates independently. For models larger than 70B, you need vLLM with tensor parallelism.
What quantization should I use for production? For Ollama, Q4_K_M gives the best balance for most 7B-13B models. For vLLM, AWQ-4bit provides an excellent tradeoff, and FP8 on H100 GPUs is ideal for large-scale deployments.