Ollama vs vLLM: Benchmarks and Kubernetes Readiness

This is Part 2 of a 4-part series comparing Ollama and vLLM. Part 1 covered architecture; Part 3 covers the decision framework.

Performance & Scalability

Performance is where these two tools diverge most dramatically. I ran benchmarks on an NVIDIA A100 80GB with Llama 3 8B Instruct.

Benchmark Environment: NVIDIA A100 80GB | Ubuntu 22.04 | CUDA 12.4 | Ollama 0.3.0 | vLLM 0.5.1. See broader LLM inference benchmarks.

Throughput Test (tokens/second)

Metric	Ollama	vLLM	Delta
Single request (batch=1)	85 t/s	92 t/s	+8% vLLM
4 concurrent requests	110 t/s aggregate	340 t/s aggregate	+209% vLLM
16 concurrent requests	125 t/s aggregate	1,180 t/s aggregate	+844% vLLM
32 concurrent requests	130 t/s aggregate	2,050 t/s aggregate	+1,477% vLLM

Latency Test (time to first token, TTFT)

Metric	Ollama	vLLM	Delta
Single request	45 ms	38 ms	-16% vLLM
4 concurrent requests	180 ms	42 ms	-77% vLLM
16 concurrent requests	720 ms	48 ms	-93% vLLM

Resource Usage (16 concurrent requests)

Metric	Ollama	vLLM	Delta
GPU VRAM (GB)	~18 GB	~14 GB	-22% vLLM
GPU Utilization	65%	98%	+51% vLLM
CPU (cores)	2.1	3.8	+81% vLLM

For a single user these differences barely register. Scale beyond a handful of concurrent requests and vLLM’s PagedAttention leaves Ollama in the dust, consuming less VRAM per request thanks to efficient KV cache management.

Ease of Use & Developer Experience

Ollama wins on simplicity. The CLI is intuitive:

ollama pull llama3.1
ollama run llama3.1

The REST API is minimal:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'

Model management runs on autopilot. Ollama downloads the right quantization, handles updates, and supports Modelfiles.

vLLM requires more setup but the serve command is straightforward:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9

But production tuning requires understanding batch sizes, KV cache quantization, and scheduling policies.

Model Support & Quantization

Ollama uses the GGUF format with broad quantization support: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and FP16. The ollama.com/library hosts hundreds of models.

vLLM supports HuggingFace Transformers format (safetensors) with AWQ, GPTQ, and FP8 quantization, but not GGUF. You cannot directly use Ollama’s model library, you need to download from HuggingFace or convert them.

vLLM adds support faster via the Transformers ecosystem.

API Compatibility & Integrations

Ollama exposes a native REST API with a partial OpenAI compatibility layer. /v1/chat/completions exists, but function calling stays limited and tool use remains experimental.

vLLM exposes a full OpenAI-compatible API with streaming, function calling, tool use, and embeddings. For OpenAI-standard behavior, vLLM is the closer match.

Kubernetes Readiness

Both run well on Kubernetes, though their operational models differ.

Ollama on Kubernetes is straightforward: one Deployment, one Service, one PVC for model cache. Scaling stays mostly vertical since Ollama does not distribute inference across GPUs. You can run multiple replicas with sticky sessions, but each replica operates independently.

vLLM on Kubernetes supports tensor and pipeline parallelism, so a single model can span multiple GPUs or nodes. See my guide to deploy vLLM in production for a walkthrough. This makes vLLM more complex but far more capable for large models, with official Helm charts and KubeRay integration.

Aspect	Ollama	vLLM
Multi-GPU single model	No	Yes (tensor/pipeline parallel)
Horizontal scaling	Replicas with session stickiness	Replicas + load balancing
Model cache	Local PVC	Local PVC or shared filesystem
Official Helm chart	Community	Official
HPA-friendly	Moderate	High (stateless workers)

Part 3 covers the decision framework. Part 4 covers cost analysis.

FAQ

Why does vLLM perform better under concurrency? vLLM uses continuous batching, dynamically grouping incoming requests into batches on the GPU. Combined with PagedAttention’s efficient memory management, this saturates GPU compute units far better than Ollama’s simpler scheduling.

Can I use Ollama’s models with vLLM? Not directly. Ollama uses GGUF, while vLLM requires HuggingFace safetensors. Download from HuggingFace or convert the model.

Does vLLM work on Kubernetes without a GPU? Not practically, it requires NVIDIA GPUs with CUDA. For CPU-only inference, Ollama via llama.cpp is the right choice.

Does Ollama support tensor parallelism? No, each replica operates independently. For models larger than 70B, you need vLLM with tensor parallelism.

What quantization should I use for production? For Ollama, Q4_K_M gives the best balance for most 7B-13B models. For vLLM, AWQ-4bit provides an excellent tradeoff, and FP8 on H100 GPUs is ideal for large-scale deployments.