Ollama vs vLLM vs llama.cpp Benchmarks

2026.05.22
Technology
655 Words
Ollama vs vLLM vs llama.cpp Benchmarks

Part 3 of 4. Part 1: Methodology Ā· Part 2: Results Ā· Part 4: FAQ and Next Steps

Numbers alone don’t tell you what to deploy. Here’s my practical framework for choosing between these three engines based on real workload patterns.

Decision Framework: When to Use Each Engine

After running these benchmarks on my clusters, here’s my practical decision framework:

Use Ollama When:

  • You need zero-config setup. I’ve deployed Ollama on Kubernetes with a single Deployment manifest. It just works.
  • Interactive development. The low TTFT makes it feel snappy for chat applications.
  • Single-user scenarios. If you’re building a personal AI assistant or development tool, Ollama’s simplicity wins.
  • Resource-constrained environments. At 5.8 GB VRAM, it’s the lightest option.

Use vLLM When:

  • Serving multiple users. The continuous batching scales beautifully for API endpoints.
  • Maximum throughput matters. Nothing else comes close for high-concurrency scenarios.
  • Production APIs. I’ve run vLLM in production with 99.9% uptime. The Prometheus metrics integration is solid.
  • You need OpenAI API compatibility. vLLM’s endpoint is drop-in compatible with minimal changes.

Use llama.cpp When:

  • You need CPU-only inference. If you don’t have a GPU, llama.cpp is your best bet.
  • Custom integration required. The C++ core lets you embed inference directly in your application.
  • Maximum flexibility. You can fine-tune every aspect of the inference pipeline.
  • Cross-platform support. It runs everywhere: Linux, macOS, Windows, even iOS/Android.

Reproduction Instructions

Want to verify these results on your own hardware? Here’s how:

Prerequisites

  • Linux system with NVIDIA GPU (24GB+ VRAM recommended)
  • CUDA 12.x installed
  • Docker and Docker Compose
  • Python 3.10+ with openai, psutil packages

Step-by-Step

Terminal window
# 1. Clone the benchmark repository
git clone https://github.com/eduard3v/llm-benchmark-harness.git
cd llm-benchmark-harness
# 2. Install dependencies
pip install -r requirements.txt
# 3. Start Ollama
docker run -d --gpus all -p 11434:11434 -v ollama:/root/.ollama ollama/ollama:0.3.12
docker exec <container_id> ollama pull llama3:8b
# 4. Start vLLM
docker run -d --gpus all -p 8000:8000 \
--ipc=host vllm/vllm-openai:v0.5.4 \
--model meta-llama/Meta-Llama-3-8B-Instruct
# 5. Start llama.cpp server
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make GGML_CUDA=1
./server -m models/llama3-8b-q4_k_m.gguf -c 2048 --port 8080
# 6. Run the benchmark
python benchmark_harness.py --engine ollama --concurrency 8 --requests 100 --output ollama_results.json
python benchmark_harness.py --engine vllm --concurrency 8 --requests 100 --output vllm_results.json
python benchmark_harness.py --engine llama-cpp --concurrency 8 --requests 100 --output llamacpp_results.json
# 7. Generate comparison report
python generate_report.py --results ollama_results.json vllm_results.json llamacpp_results.json

Expected Runtime

A full benchmark run (all concurrency levels, 3 iterations each) takes approximately 45 minutes on the hardware specified above.

FAQ

How do I choose between Ollama and vLLM for a new project?

Start with Ollama. It takes 1 hour to set up and you’ll know within a day whether its single-request throughput meets your needs. If you hit the concurrency ceiling (around 16 concurrent requests), migrate to vLLM. I’ve done this migration path twice and the OpenAI-compatible API means you change one environment variable, not your application code.

Can I run Ollama and vLLM side by side on the same GPU?

Not efficiently. Each engine wants full control of the GPU memory. Ollama’s 5.8 GB and vLLM’s 9.2 GB don’t fit in 24 GB simultaneously with useful headroom. The practical approach runs them on separate GPUs or schedules them at different times. I use GPU time-slicing on Kubernetes with node selectors.

Does llama.cpp support any form of continuous batching?

No, not in its current server mode (b3324 at time of testing). Each request blocks the GPU until completion. The llama.cpp team has discussed parallel decoding in their roadmap, but production multitenant workloads should use vLLM. llama.cpp’s strength is single-stream throughput and embeddability.

What is the fastest way to benchmark my own model with this harness?

Download the harness, edit the model names in the engine configs, and run with --concurrency 1 --requests 20. You’ll get baseline metrics in under 5 minutes. For production tuning, I recommend the full suite but start with concurrency 1, 4, and 8 to identify the throughput curve.

How often should I re-run these benchmarks for production systems?

I run benchmarks quarterly or after any infrastructure change (CUDA driver update, engine version bump, GPU upgrade). vLLM releases performance improvements every few months. I saw a 15% throughput gain going from v0.4.0 to v0.5.4. Track your baseline numbers or you won’t notice improvements or regressions.

Is the OpenAI-compatible API truly identical across all three engines?

Close but not exact. Ollama and llama.cpp implement the core chat completions endpoint with streaming. vLLM adds function calling, structured output (JSON mode), and tool use. If your application uses these advanced features, you’re locked into vLLM or need to write abstraction layers.


Parts in this series: ← Part 2 | Part 4 →

# local-llm # benchmark # performance # Ollama # Vllm # llama-cpp # inference-speed