Local LLM Benchmark: Ollama, vLLM, and llama.cpp Compared
Part 1 of 4. Part 2: Results · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps
If you’ve ever tried to serve a local LLM in production, you’ve probably asked the same question I did: “Which inference engine actually delivers the performance I need?” I ran Llama 3 8B on three major inference stacks (Ollama, vLLM, and llama.cpp) on the same hardware, with the same model, under identical test conditions. The results surprised me, and they’ll probably change how you think about local inference architecture.
Executive Summary
I tested Meta’s Llama 3 8B (Q4_K_M quantization) across three inference engines using a standardized Python test harness with OpenAI-compatible client calls. The benchmark shows that vLLM dominates throughput-oriented workloads thanks to continuous batching, delivering up to 3.2x the peak tokens/second of Ollama. Ollama wins on simplicity and single-request latency, which makes it ideal for development workflows. llama.cpp remains the only viable option for CPU-only deployments, and GPU offload with cuBLAS lifts it from 18.7 to 142.3 tokens/second at peak.
| Engine | Best Use Case | Tokens/Sec (Peak) | VRAM Usage | Grade |
|---|---|---|---|---|
| Ollama | Dev/Quick start | 68.4 | 5.8 GB | B+ |
| vLLM | Production/HQ | 217.6 | 9.2 GB | A |
| llama.cpp (GPU) | Custom/CPU fallback | 142.3 | 6.1 GB | A- |
| llama.cpp (CPU) | No GPU scenarios | 18.7 | 5.9 GB | C+ |
What Is Local LLM Inference Benchmarking?
Local LLM inference benchmarking is the practice of measuring tokens-per-second, latency, VRAM consumption, and concurrent request handling across inference engines on local hardware. I designed this benchmark to answer three specific questions that matter to platform engineers:
- Which engine delivers the lowest latency for interactive use? When you’re iterating on prompts or building chat applications, Time to First Token (TTFT) matters more than throughput.
- How does each engine handle concurrent requests? Production API endpoints face multiple simultaneous requests. Continuous batching in vLLM promises superior scaling, but does it deliver?
- What’s the real VRAM overhead? I’ve seen too many deployments fail because the inference engine consumed more memory than expected. Accurate VRAM profiling prevents OOMKilled pods.
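Peak VRAM is easy to miss when memory is read only before and after a run. The following is a minimal sketch of a background sampler, not the profiling code used for the published numbers. The `PeakSampler` class and its `read_mb` parameter are illustrative names; `read_mb` is assumed to be any zero-argument function that returns used memory in MB, for example a wrapper around `nvidia-smi --query-gpu=memory.used`.

```python
import threading


class PeakSampler:
    """Poll a memory reader in a background thread and keep the maximum seen.

    `read_mb` is a hypothetical zero-argument callable returning used MB.
    """

    def __init__(self, read_mb, interval_s=0.5):
        self.read_mb = read_mb
        self.interval_s = interval_s
        self.peak_mb = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample until asked to stop; wait() doubles as an interruptible sleep.
        while not self._stop.is_set():
            self.peak_mb = max(self.peak_mb, self.read_mb())
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # One final read so very short runs still record a sample.
        self.peak_mb = max(self.peak_mb, self.read_mb())
        return False
```

Wrapping a benchmark run in `with PeakSampler(reader) as sampler:` and reading `sampler.peak_mb` afterwards gives the highest value observed during the run, at the resolution of the sampling interval.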
Test Methodology
Test Environment
I ran all tests on a dedicated bare-metal server to eliminate cloud instance variability. Here are the exact specifications:
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5 @ 5200 MHz |
| GPU | NVIDIA RTX 4090 (24 GB VRAM) |
| GPU Count | 1 |
| Storage | NVMe SSD (2 TB, 7,400 MB/s sequential) |
| Motherboard | ASUS ROG Crosshair X670E |
| Cooling | 360mm AIO liquid cooler |
| Power Supply | 1000W 80+ Gold |
Software Environment:
| Component | Version |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Kernel | 6.8.0-31-generic |
| CUDA | 12.4 |
| NVIDIA Driver | 550.90.07 |
| Docker | 27.1.1 |
| Ollama | 0.3.12 |
| vLLM | 0.5.4 |
| llama.cpp | b3324 (built from source) |
| Python | 3.12.3 |
| Test Harness | Custom (openai 1.30.1) |
Workload Specification
Model: Meta Llama 3 8B Instruct (Q4_K_M quantization)
I chose Q4_K_M because it’s the sweet spot for production deployments: reasonable quality with manageable resource requirements. The GGUF file weighed in at 4.58 GB, while the safetensors version (for vLLM) was 15.2 GB.
Test Parameters:
| Parameter | Value |
|---|---|
| Average Prompt Length | 128 tokens |
| Output Length | 256 tokens (fixed) |
| Input Variance | ±30% (90-166 tokens) |
| Request Pattern | Poisson distribution (λ=target concurrency) |
| Warm-up Requests | 20 (discarded from results) |
| Test Duration | 5 minutes per configuration |
| Iterations | 3 (median reported) |
| Temperature | 0.7 (held constant across engines) |
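The request pattern and input variance in the table can be sketched as follows. This is an illustration under two assumptions that the table does not spell out: λ is treated as the mean number of arrivals per second, and prompt lengths are drawn uniformly within ±30% of the mean. The function names are hypothetical.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Return request send times (in seconds) for a Poisson arrival process.

    Gaps between consecutive arrivals in a Poisson process are
    exponentially distributed with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


def sample_prompt_length(mean_tokens=128, variance=0.30, rng=None):
    """Draw a prompt length within mean ± variance (uniform is an assumption)."""
    rng = rng or random.Random(0)
    low = round(mean_tokens * (1 - variance))   # 90 for the defaults
    high = round(mean_tokens * (1 + variance))  # 166 for the defaults
    return rng.randint(low, high)
```

For a 5-minute configuration, `poisson_arrival_times(8, 300)` yields a schedule of send times averaging eight requests per second; a client would sleep until each timestamp before dispatching the next request.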
Metrics Definitions
Before diving into results, let’s define what we’re measuring:
| Metric | Definition | Measurement Method |
|---|---|---|
| TTFT (Time to First Token) | Duration from request to first output token | Client-side timestamp difference |
| TPOT (Time Per Output Token) | Average duration between consecutive output tokens | (Total time - TTFT) / (Token count - 1) |
| Throughput | Total tokens generated per second | Output tokens / generation time |
| VRAM Usage | Peak GPU memory allocated | nvidia-smi --query-gpu=memory.used |
| Request Success Rate | Percentage of requests completing without error | Client-side tracking |
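The definitions in the table translate into a few lines of arithmetic. The sketch below assumes three client-side timestamps in seconds and takes "generation time" to mean total wall-clock time for the request; the function name and the example timestamps are made up for illustration.

```python
def request_metrics(start_s, first_token_s, end_s, output_tokens):
    """Derive TTFT, TPOT, and throughput from client-side timestamps."""
    ttft_ms = (first_token_s - start_s) * 1000
    total_ms = (end_s - start_s) * 1000
    # TPOT excludes the first token, so divide by (tokens - 1);
    # the max() guards against single-token replies.
    tpot_ms = (total_ms - ttft_ms) / max(output_tokens - 1, 1)
    # Throughput here uses total request time, including TTFT.
    throughput_tps = output_tokens / (end_s - start_s)
    return {
        "ttft_ms": ttft_ms,
        "tpot_ms": tpot_ms,
        "throughput_tps": throughput_tps,
    }


# Hypothetical request: first token after 0.1 s, 256 tokens done at 2.65 s.
example = request_metrics(0.0, 0.1, 2.65, 256)
```

With these example values, TTFT is about 100 ms, TPOT about 10 ms (2,550 ms spread over 255 gaps), and throughput about 96.6 tokens/second.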
Test Harness: Reproducible Benchmarking
I built a Python test harness using the OpenAI client library, which works with all three engines because Ollama, vLLM, and llama.cpp’s server each expose an OpenAI-compatible API.
```python
#!/usr/bin/env python3
"""
LLM Inference Benchmark Harness
Reproducible testing for Ollama, vLLM, and llama.cpp

Usage:
    python benchmark_harness.py --engine ollama --concurrency 8 --requests 100
"""

import argparse
import json
import statistics
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI


class LLMBenchmark:
    def __init__(self, engine, base_url, model):
        self.engine = engine
        self.client = OpenAI(
            base_url=base_url,
            api_key="dummy",  # Not needed for local engines
        )
        self.model = model

    def get_vram_usage(self):
        """Query nvidia-smi for current VRAM usage in MB (0 if unavailable)."""
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.used",
                 "--format=csv,noheader,nounits"],
                capture_output=True,
                text=True,
            )
            return int(result.stdout.strip().split()[0])
        except (OSError, ValueError, IndexError):
            # nvidia-smi missing or output not parseable
            return 0

    def single_request(self, prompt, max_tokens=256):
        """Execute a single streaming request and return timing metrics."""
        start_time = time.perf_counter()
        first_token_time = None
        tokens_received = 0

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                stream=True,
                temperature=0.7,
            )

            for chunk in response:
                # Skip chunks that carry no choices or no text.
                if chunk.choices and chunk.choices[0].delta.content:
                    if first_token_time is None:
                        first_token_time = time.perf_counter()
                    # Counts streamed chunks, which approximates one token each.
                    tokens_received += 1

            end_time = time.perf_counter()

            if first_token_time is None:
                return {"success": False, "error": "No content received"}

            ttft = (first_token_time - start_time) * 1000  # ms
            total_time = (end_time - start_time) * 1000  # ms
            tpot = (total_time - ttft) / max(tokens_received - 1, 1)
            throughput = tokens_received / (total_time / 1000)

            return {
                "ttft_ms": ttft,
                "tpot_ms": tpot,
                "total_time_ms": total_time,
                "tokens": tokens_received,
                "throughput_tps": throughput,
                "success": True,
            }

        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_concurrent_benchmark(self, prompts, concurrency, num_requests):
        """Run concurrent requests with specified parallelism."""
        results = []
        vram_start = self.get_vram_usage()

        with ThreadPoolExecutor(max_workers=concurrency) as executor:
            futures = []
            for i in range(num_requests):
                prompt = prompts[i % len(prompts)]
                futures.append(executor.submit(self.single_request, prompt))

            for future in as_completed(futures):
                results.append(future.result())

        # Read once after all requests finish; not a continuously sampled peak.
        vram_peak = self.get_vram_usage()

        # Calculate aggregate metrics
        successful = [r for r in results if r["success"]]
        failed = [r for r in results if not r["success"]]

        if not successful:
            return {"error": "All requests failed", "results": results}

        return {
            "total_requests": num_requests,
            "successful": len(successful),
            "failed": len(failed),
            "success_rate": len(successful) / num_requests * 100,
            "ttft_ms": statistics.median([r["ttft_ms"] for r in successful]),
            "tpot_ms": statistics.median([r["tpot_ms"] for r in successful]),
            # Mean of per-request throughput, not aggregate server throughput.
            "throughput_tps": statistics.mean([r["throughput_tps"] for r in successful]),
            "tokens_per_request": statistics.mean([r["tokens"] for r in successful]),
            "vram_start_mb": vram_start,
            "vram_peak_mb": vram_peak,
        }


def load_prompts(filepath="prompts.json"):
    """Load diverse prompts for testing."""
    with open(filepath, "r") as f:
        return json.load(f)


def main():
    parser = argparse.ArgumentParser(description="LLM Inference Benchmark")
    parser.add_argument("--engine", choices=["ollama", "vllm", "llama-cpp"], required=True)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--requests", type=int, default=50)
    parser.add_argument("--output", default="results.json")

    args = parser.parse_args()

    # Engine configurations
    configs = {
        "ollama": {"url": "http://localhost:11434/v1", "model": "llama3:8b"},
        "vllm": {"url": "http://localhost:8000/v1", "model": "meta-llama/Meta-Llama-3-8B-Instruct"},
        "llama-cpp": {"url": "http://localhost:8080/v1", "model": "llama3-8b-q4_k_m"},
    }

    config = configs[args.engine]
    benchmark = LLMBenchmark(args.engine, config["url"], config["model"])

    print(f"Running benchmark: {args.engine} | Concurrency: {args.concurrency} | Requests: {args.requests}")

    prompts = load_prompts()
    results = benchmark.run_concurrent_benchmark(prompts, args.concurrency, args.requests)

    # Save results
    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)

    print(f"\nResults saved to {args.output}")
    print(f"Throughput: {results.get('throughput_tps', 0):.1f} tokens/sec")
    print(f"TTFT: {results.get('ttft_ms', 0):.1f} ms")
    print(f"Success Rate: {results.get('success_rate', 0):.1f}%")


if __name__ == "__main__":
    main()
```
Controlled Variables
To ensure fair comparison, I held these variables constant:
- Model: Llama 3 8B Instruct (same weights, quantized appropriately for each engine)
- Output length: Fixed at 256 tokens per request
- Temperature: 0.7 across all tests
- GPU: Single RTX 4090 (no multi-GPU testing)
- System load: No other GPU workloads during tests
Independent Variables
I varied these parameters to understand scaling behavior:
- Concurrency: 1, 2, 4, 8, 16, 32 requests
- Batch size: Default vs. tuned (vLLM only)
- Quantization: Q4_K_M (all engines)
- GPU layers: Full GPU offload (llama.cpp)
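A sweep over the concurrency levels listed above could be scripted roughly like this. It is a sketch that assumes the harness is saved as `benchmark_harness.py` and accepts the `--engine`, `--concurrency`, `--requests`, and `--output` flags; the output file naming scheme and the function names are hypothetical.

```python
import subprocess
import sys

CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32]


def sweep_commands(engine, requests=100, levels=CONCURRENCY_LEVELS):
    """Build one harness invocation per concurrency level."""
    return [
        [
            sys.executable, "benchmark_harness.py",
            "--engine", engine,
            "--concurrency", str(level),
            "--requests", str(requests),
            # Hypothetical naming scheme: one result file per level.
            "--output", f"results_{engine}_c{level}.json",
        ]
        for level in levels
    ]


def run_sweep(engine, requests=100):
    """Run the levels one after another so runs never compete for the GPU."""
    for cmd in sweep_commands(engine, requests):
        subprocess.run(cmd, check=True)
```

Running the levels sequentially rather than in parallel keeps the "no other GPU workloads" condition intact for every configuration.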
FAQ
How is Time to First Token (TTFT) measured differently across engines?
TTFT measures the delay between sending a request and receiving the first output token. Ollama’s server mode returns the first token in ~89ms on my RTX 4090, while vLLM takes ~156ms due to its scheduling overhead. The difference matters for real-time chat applications but becomes irrelevant under concurrent load.
Why did I choose Q4_K_M over other quantization levels?
Q4_K_M gives the best quality-to-performance ratio for Llama 3 8B. At 4.58 GB for the GGUF file, it fits comfortably in 24 GB VRAM with room for KV cache. Q8_0 would use 8.5 GB and deliver marginally better quality, while Q2_K at 3.3 GB has noticeable quality degradation that I wouldn’t put in production.
How do I adapt this benchmark for models larger than 8B parameters?
The same test harness works with any OpenAI-compatible model. For 70B models, you’ll need multi-GPU setups. vLLM supports tensor parallelism across GPUs with the --tensor-parallel-size flag. I’ve tested Mixtral 8x7B with this harness using 2x A100s. The methodology is identical.
Can I run these benchmarks without an NVIDIA GPU?
Yes, but you’re limited to llama.cpp for CPU-only inference. At 18.7 tokens/second on 16 CPU threads, it’s usable for offline processing but not real-time applications. AMD ROCm support exists for vLLM, but I haven’t tested it.
What is the minimum VRAM required for each engine to run Llama 3 8B?
Ollama needs 5.8 GB with Q4_K_M, llama.cpp GPU mode uses 6.1 GB, and vLLM requires 9.2 GB minimum (including PagedAttention’s KV cache overhead). For vLLM, lowering gpu_memory_utilization below 0.85 can reduce VRAM at the cost of throughput.
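To get a feel for how much of that footprint is KV cache, the arithmetic can be sketched as below. It assumes an fp16 cache (2 bytes per value) and the published Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128). Engines may preallocate far more than the live cache needs, so this is a lower bound on cache demand and not a prediction of what `nvidia-smi` reports.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache needed for `tokens` cached positions.

    The leading factor of 2 covers the separate key and value tensors
    stored in every layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


MIB = 2 ** 20

per_token = kv_cache_bytes(1)               # 131072 bytes = 128 KiB
per_request = kv_cache_bytes(128 + 256)     # prompt + output from the test parameters
at_32_concurrent = 32 * per_request

print(per_token, per_request / MIB, at_32_concurrent / MIB)
```

Under these assumptions, one 384-token request needs 48 MiB of cache, and 32 such requests in flight need about 1.5 GiB.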
Why does my benchmark produce different numbers than yours?
Hardware variation, CUDA driver versions, and system load all affect results. The RTX 4090’s Ada Lovelace architecture gives specific performance characteristics that don’t translate directly to A100s or H100s. Run the test harness on your own hardware for numbers that match your deployment.
Next in the series: Part 2: Results →