Ollama vs vLLM: Benchmarks and Kubernetes Ready

2026.01.16
Technology
702 Words
Ollama vs vLLM: Benchmarks and Kubernetes Ready

Ollama vs vLLM: Benchmarks and Kubernetes Readiness

This is Part 2 of a 4-part series comparing Ollama and vLLM. Part 1 covered architecture; Part 3 covers the decision framework.

Performance & Scalability

Performance is where these two tools diverge most dramatically. I ran benchmarks on an NVIDIA A100 80GB with Llama 3 8B Instruct.

Benchmark Environment: NVIDIA A100 80GB | Ubuntu 22.04 | CUDA 12.4 | Ollama 0.3.0 | vLLM 0.5.1. See broader LLM inference benchmarks.

Throughput Test (tokens/second)

MetricOllamavLLMDelta
Single request (batch=1)85 t/s92 t/s+8% vLLM
4 concurrent requests110 t/s aggregate340 t/s aggregate+209% vLLM
16 concurrent requests125 t/s aggregate1,180 t/s aggregate+844% vLLM
32 concurrent requests130 t/s aggregate2,050 t/s aggregate+1,477% vLLM

Latency Test (time to first token, TTFT)

MetricOllamavLLMDelta
Single request45 ms38 ms-16% vLLM
4 concurrent requests180 ms42 ms-77% vLLM
16 concurrent requests720 ms48 ms-93% vLLM

Resource Usage (16 concurrent requests)

MetricOllamavLLMDelta
GPU VRAM (GB)~18 GB~14 GB-22% vLLM
GPU Utilization65%98%+51% vLLM
CPU (cores)2.13.8+81% vLLM

For a single user these differences barely register. Scale beyond a handful of concurrent requests and vLLM’s PagedAttention leaves Ollama in the dust, consuming less VRAM per request thanks to efficient KV cache management.

Ease of Use & Developer Experience

Ollama wins on simplicity. The CLI is intuitive:

Terminal window
ollama pull llama3.1
ollama run llama3.1

The REST API is minimal:

Terminal window
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is the sky blue?"
}'

Model management runs on autopilot. Ollama downloads the right quantization, handles updates, and supports Modelfiles.

vLLM requires more setup but the serve command is straightforward:

Terminal window
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9

But production tuning requires understanding batch sizes, KV cache quantization, and scheduling policies.

Model Support & Quantization

Ollama uses the GGUF format with broad quantization support: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and FP16. The ollama.com/library hosts hundreds of models.

vLLM supports HuggingFace Transformers format (safetensors) with AWQ, GPTQ, and FP8 quantization, but not GGUF. You cannot directly use Ollama’s model library, you need to download from HuggingFace or convert them.

vLLM adds support faster via the Transformers ecosystem.

API Compatibility & Integrations

Ollama exposes a native REST API with a partial OpenAI compatibility layer. /v1/chat/completions exists, but function calling stays limited and tool use remains experimental.

vLLM exposes a full OpenAI-compatible API with streaming, function calling, tool use, and embeddings. For OpenAI-standard behavior, vLLM is the closer match.

Kubernetes Readiness

Both run well on Kubernetes, though their operational models differ.

Ollama on Kubernetes is straightforward: one Deployment, one Service, one PVC for model cache. Scaling stays mostly vertical since Ollama does not distribute inference across GPUs. You can run multiple replicas with sticky sessions, but each replica operates independently.

vLLM on Kubernetes supports tensor and pipeline parallelism, so a single model can span multiple GPUs or nodes. See my guide to deploy vLLM in production for a walkthrough. This makes vLLM more complex but far more capable for large models, with official Helm charts and KubeRay integration.

AspectOllamavLLM
Multi-GPU single modelNoYes (tensor/pipeline parallel)
Horizontal scalingReplicas with session stickinessReplicas + load balancing
Model cacheLocal PVCLocal PVC or shared filesystem
Official Helm chartCommunityOfficial
HPA-friendlyModerateHigh (stateless workers)

Part 3 covers the decision framework. Part 4 covers cost analysis.

FAQ

Why does vLLM perform better under concurrency? vLLM uses continuous batching, dynamically grouping incoming requests into batches on the GPU. Combined with PagedAttention’s efficient memory management, this saturates GPU compute units far better than Ollama’s simpler scheduling.

Can I use Ollama’s models with vLLM? Not directly. Ollama uses GGUF, while vLLM requires HuggingFace safetensors. Download from HuggingFace or convert the model.

Does vLLM work on Kubernetes without a GPU? Not practically, it requires NVIDIA GPUs with CUDA. For CPU-only inference, Ollama via llama.cpp is the right choice.

Does Ollama support tensor parallelism? No, each replica operates independently. For models larger than 70B, you need vLLM with tensor parallelism.

What quantization should I use for production? For Ollama, Q4_K_M gives the best balance for most 7B-13B models. For vLLM, AWQ-4bit provides an excellent tradeoff, and FP8 on H100 GPUs is ideal for large-scale deployments.

# Ollama # Vllm # llm-inference # self-hosted-ai # Gpu # benchmark # Kubernetes