Local LLM Benchmark: Ollama, vLLM, and llama.cpp Compared

2026.05.16
Technology

Part 1 of 4. Part 2: Results · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps

If you’ve ever tried to serve a local LLM in production, you’ve probably asked the same question I did: “Which inference engine actually delivers the performance I need?” I ran Llama 3 8B on three major inference stacks (Ollama, vLLM, and llama.cpp) on the same hardware, with the same model, under identical test conditions. The results surprised me, and they’ll probably change how you think about local inference architecture.

Executive Summary

I tested Meta’s Llama 3 8B (Q4_K_M quantization) across three inference engines using a standardized Python test harness with OpenAI-compatible client calls. The benchmark reveals that vLLM dominates throughput scenarios with continuous batching, delivering up to 3.2x higher tokens/second than Ollama at scale. However, Ollama wins on simplicity and single-request latency, making it ideal for development workflows. llama.cpp remains the only viable option for CPU-only deployments, though GPU acceleration with cuBLAS dramatically changes the equation.

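| Engine | Best Use Case | Tokens/Sec (Peak) | VRAM Usage | Grade |
| --- | --- | --- | --- | --- |
| Ollama | Dev/Quick start | 68.4 | 5.8 GB | B+ |
| vLLM | Production/HQ | 217.6 | 9.2 GB | A |
| llama.cpp (GPU) | Custom/CPU fallback | 142.3 | 6.1 GB | A- |
| llama.cpp (CPU) | No GPU scenarios | 18.7 | 5.9 GB | C+ |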
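
To make the headline ratios concrete, here is a small sketch that recomputes them from the peak figures in the summary table. The names `PEAK`, `speedup_vs`, and `tps_per_gb` are illustrative, and tokens per second per GB is a derived figure I am adding for comparison, not a metric measured in the benchmark.

```python
# Peak figures copied from the summary table: (tokens/sec, memory in GB).
PEAK = {
    "Ollama": (68.4, 5.8),
    "vLLM": (217.6, 9.2),
    "llama.cpp (GPU)": (142.3, 6.1),
    "llama.cpp (CPU)": (18.7, 5.9),
}


def speedup_vs(baseline: str, engine: str) -> float:
    """Ratio of an engine's peak throughput to the baseline's peak throughput."""
    return PEAK[engine][0] / PEAK[baseline][0]


def tps_per_gb(engine: str) -> float:
    """Peak tokens/sec divided by the memory figure from the table."""
    tps, gb = PEAK[engine]
    return tps / gb


if __name__ == "__main__":
    for name in PEAK:
        print(f"{name:16s} {speedup_vs('Ollama', name):.2f}x vs Ollama, "
              f"{tps_per_gb(name):.1f} tok/s per GB")
```

The vLLM-to-Ollama ratio rounds to the 3.2x quoted in the summary.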

What Is Local LLM Inference Benchmarking?

Local LLM inference benchmarking is the practice of measuring tokens-per-second, latency, VRAM consumption, and concurrent request handling across inference engines on local hardware. I designed this benchmark to answer three specific questions that matter to platform engineers:

  1. Which engine delivers the lowest latency for interactive use? When you’re iterating on prompts or building chat applications, Time to First Token (TTFT) matters more than throughput.

  2. How does each engine handle concurrent requests? Production API endpoints face multiple simultaneous requests. Continuous batching in vLLM promises superior scaling, but does it deliver?

  3. What’s the real VRAM overhead? I’ve seen too many deployments fail because the inference engine consumed more memory than expected. Accurate VRAM profiling prevents OOMKilled pods.

Test Methodology

Test Environment

I ran all tests on a dedicated bare-metal server to eliminate cloud instance variability. Here are the exact specifications:

| Component | Specification |
| --- | --- |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5 @ 5200 MHz |
| GPU | NVIDIA RTX 4090 (24 GB VRAM) |
| GPU Count | 1 |
| Storage | NVMe SSD (2 TB, 7,400 MB/s sequential) |
| Motherboard | ASUS ROG Crosshair X670E |
| Cooling | 360mm AIO liquid cooler |
| Power Supply | 1000W 80+ Gold |

Software Environment:

| Component | Version |
| --- | --- |
| OS | Ubuntu 24.04 LTS |
| Kernel | 6.8.0-31-generic |
| CUDA | 12.4 |
| NVIDIA Driver | 550.90.07 |
| Docker | 27.1.1 |
| Ollama | 0.3.12 |
| vLLM | 0.5.4 |
| llama.cpp | b3324 (built from source) |
| Python | 3.12.3 |
| Test Harness | Custom (openai 1.30.1) |

Workload Specification

Model: Meta Llama 3 8B Instruct (Q4_K_M quantization)

I chose Q4_K_M because it’s the sweet spot for production deployments: reasonable quality with manageable resource requirements. The GGUF file weighed in at 4.58 GB, while the safetensors version (for vLLM) was 15.2 GB.

Test Parameters:

| Parameter | Value |
| --- | --- |
| Average Prompt Length | 128 tokens |
| Output Length | 256 tokens (fixed) |
| Input Variance | ±30% (90-166 tokens) |
| Request Pattern | Poisson distribution (λ=target concurrency) |
| Warm-up Requests | 20 (discarded from results) |
| Test Duration | 5 minutes per configuration |
| Iterations | 3 (median reported) |
| Temperature | 0.7 (identical for all engines; outputs are sampled, so runs are not deterministic) |
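
The request pattern and input variance can be sketched as a small workload generator. This is a minimal illustration, not the code used for the published runs: it assumes λ is an arrival rate in requests per second and that prompt lengths are drawn uniformly from the ±30% range, neither of which the table spells out. The function names are hypothetical, and the sketch is separate from the thread-pool harness in this post.

```python
import random


def poisson_arrival_times(rate_per_sec: float, duration_sec: float,
                          rng: random.Random) -> list[float]:
    """Arrival timestamps (seconds from start) for a Poisson process.

    Inter-arrival gaps of a Poisson process are exponentially distributed,
    so we accumulate exponential gaps until the test window is full.
    """
    times = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        times.append(t)
        t += rng.expovariate(rate_per_sec)
    return times


def sample_prompt_length(rng: random.Random, mean_tokens: int = 128,
                         variance: float = 0.30) -> int:
    """Draw a prompt length uniformly within +/- variance of the mean (assumed shape)."""
    low = round(mean_tokens * (1 - variance))   # 90 for the defaults
    high = round(mean_tokens * (1 + variance))  # 166 for the defaults
    return rng.randint(low, high)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the schedule is repeatable
    arrivals = poisson_arrival_times(rate_per_sec=8.0, duration_sec=300.0, rng=rng)
    lengths = [sample_prompt_length(rng) for _ in arrivals]
    print(f"{len(arrivals)} requests scheduled over 300 s")
    print(f"prompt lengths range from {min(lengths)} to {max(lengths)} tokens")
```

Seeding the generator keeps the arrival schedule identical across engines, so each one sees the same load.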

Metrics Definitions

Before diving into results, let’s define what we’re measuring:

| Metric | Definition | Measurement Method |
| --- | --- | --- |
| TTFT (Time to First Token) | Duration from request to first output token | Client-side timestamp difference |
| TPOT (Time Per Output Token) | Average duration between consecutive output tokens | (Total time - TTFT) / (Token count - 1) |
| Throughput | Total tokens generated per second | Output tokens / generation time |
| VRAM Usage | Peak GPU memory allocated | nvidia-smi --query-gpu=memory.used |
| Request Success Rate | Percentage of requests completing without error | Client-side tracking |
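
As a worked example of the timing formulas in the table, the sketch below derives TTFT, TPOT, and throughput from three client-side timestamps. The `RequestTiming` class and function names are hypothetical, and I am assuming "generation time" means the full request duration from send to last token.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Client-side timestamps for one streamed request, in seconds."""
    start: float          # request sent
    first_token: float    # first output token received
    end: float            # last output token received
    output_tokens: int    # number of output tokens received


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token, in milliseconds."""
    return (t.first_token - t.start) * 1000.0


def tpot_ms(t: RequestTiming) -> float:
    """Average gap between consecutive output tokens, in milliseconds."""
    total_ms = (t.end - t.start) * 1000.0
    # max(..., 1) avoids division by zero for single-token responses.
    return (total_ms - ttft_ms(t)) / max(t.output_tokens - 1, 1)


def throughput_tps(t: RequestTiming) -> float:
    """Output tokens per second over the whole request (assumed definition)."""
    return t.output_tokens / (t.end - t.start)


if __name__ == "__main__":
    sample = RequestTiming(start=0.0, first_token=0.1, end=2.65, output_tokens=256)
    print(f"TTFT {ttft_ms(sample):.1f} ms, TPOT {tpot_ms(sample):.2f} ms, "
          f"throughput {throughput_tps(sample):.1f} tok/s")
```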

Test Harness: Reproducible Benchmarking

I built a Python test harness using the OpenAI client library. It works with all three engines because Ollama, vLLM, and llama.cpp’s server each expose an OpenAI-compatible API.

#!/usr/bin/env python3
"""
LLM Inference Benchmark Harness
Reproducible testing for Ollama, vLLM, and llama.cpp

Usage:
    python benchmark_harness.py --engine ollama --concurrency 8 --requests 100
"""
import argparse
import json
import statistics
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI


class LLMBenchmark:
    def __init__(self, engine, base_url, model):
        self.engine = engine
        self.client = OpenAI(
            base_url=base_url,
            api_key="dummy",  # Not needed for local engines
        )
        self.model = model

    def get_vram_usage(self):
        """Query nvidia-smi for current VRAM usage in MB (0 if unavailable)."""
        try:
            result = subprocess.run(
                [
                    "nvidia-smi",
                    "--query-gpu=memory.used",
                    "--format=csv,noheader,nounits",
                ],
                capture_output=True,
                text=True,
                check=True,
            )
            # The first value belongs to GPU 0; the test machine has one GPU.
            return int(result.stdout.strip().split()[0])
        except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
            return 0

    def single_request(self, prompt, max_tokens=256):
        """Execute a single streaming request and return timing metrics."""
        start_time = time.perf_counter()
        first_token_time = None
        tokens_received = 0
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                stream=True,
                temperature=0.7,
            )
            for chunk in response:
                # Skip chunks that carry no choices or no text content.
                if not chunk.choices or not chunk.choices[0].delta.content:
                    continue
                if first_token_time is None:
                    first_token_time = time.perf_counter()
                # Each content chunk is counted as one token (an approximation).
                tokens_received += 1
            end_time = time.perf_counter()

            if first_token_time is None:
                return {"success": False, "error": "No tokens received"}

            ttft = (first_token_time - start_time) * 1000  # ms
            total_time = (end_time - start_time) * 1000  # ms
            tpot = (total_time - ttft) / max(tokens_received - 1, 1)
            throughput = tokens_received / (total_time / 1000)
            return {
                "ttft_ms": ttft,
                "tpot_ms": tpot,
                "total_time_ms": total_time,
                "tokens": tokens_received,
                "throughput_tps": throughput,
                "success": True,
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
            }

    def run_concurrent_benchmark(self, prompts, concurrency, num_requests):
        """Run concurrent requests with specified parallelism."""
        results = []
        vram_start = self.get_vram_usage()
        vram_peak = vram_start
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as executor:
            futures = [
                executor.submit(self.single_request, prompts[i % len(prompts)])
                for i in range(num_requests)
            ]
            for future in as_completed(futures):
                results.append(future.result())
                # Sample VRAM while other requests are still in flight and
                # keep the highest reading, rather than one sample at the end.
                vram_peak = max(vram_peak, self.get_vram_usage())
        wall_time = time.perf_counter() - wall_start

        # Calculate aggregate metrics
        successful = [r for r in results if r["success"]]
        failed = [r for r in results if not r["success"]]
        if not successful:
            return {"error": "All requests failed", "results": results}
        return {
            "total_requests": num_requests,
            "successful": len(successful),
            "failed": len(failed),
            "success_rate": len(successful) / num_requests * 100,
            "ttft_ms": statistics.median([r["ttft_ms"] for r in successful]),
            "tpot_ms": statistics.median([r["tpot_ms"] for r in successful]),
            # Mean of per-request throughput (what a single client sees).
            "throughput_tps": statistics.mean([r["throughput_tps"] for r in successful]),
            # Tokens from all requests divided by wall-clock time (system-wide).
            "aggregate_throughput_tps": sum(r["tokens"] for r in successful) / wall_time,
            "tokens_per_request": statistics.mean([r["tokens"] for r in successful]),
            "vram_start_mb": vram_start,
            "vram_peak_mb": vram_peak,
        }


def load_prompts(filepath="prompts.json"):
    """Load diverse prompts for testing (a JSON list of strings)."""
    with open(filepath, "r") as f:
        return json.load(f)


def main():
    parser = argparse.ArgumentParser(description="LLM Inference Benchmark")
    parser.add_argument("--engine", choices=["ollama", "vllm", "llama-cpp"], required=True)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--requests", type=int, default=50)
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args()

    # Engine configurations
    configs = {
        "ollama": {"url": "http://localhost:11434/v1", "model": "llama3:8b"},
        "vllm": {"url": "http://localhost:8000/v1", "model": "meta-llama/Meta-Llama-3-8B-Instruct"},
        "llama-cpp": {"url": "http://localhost:8080/v1", "model": "llama3-8b-q4_k_m"},
    }
    config = configs[args.engine]
    benchmark = LLMBenchmark(args.engine, config["url"], config["model"])
    print(f"Running benchmark: {args.engine} | Concurrency: {args.concurrency} | Requests: {args.requests}")
    prompts = load_prompts()
    results = benchmark.run_concurrent_benchmark(prompts, args.concurrency, args.requests)

    # Save results
    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {args.output}")
    print(f"Throughput: {results.get('throughput_tps', 0):.1f} tokens/sec")
    print(f"TTFT: {results.get('ttft_ms', 0):.1f} ms")
    print(f"Success Rate: {results.get('success_rate', 0):.1f}%")


if __name__ == "__main__":
    main()
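
The harness reads its prompts from prompts.json and indexes the result like a list, so the file needs to be a JSON array of strings. Here is one way such a file could be generated. The topics, templates, and function names are placeholders of my own, not the prompt set used for the published numbers, and these short prompts do not reproduce the 128-token average.

```python
import json
import random
from pathlib import Path

# Placeholder content: swap in prompts that match your real workload.
TOPICS = ["container networking", "B-tree indexes", "TLS handshakes", "garbage collection"]
TEMPLATES = [
    "Explain {topic} to a junior engineer.",
    "List three common mistakes people make with {topic}.",
    "Write a short troubleshooting guide for {topic}.",
]


def build_prompts(count: int, seed: int = 0) -> list[str]:
    """Return `count` prompts built from the templates, repeatable via the seed."""
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(topic=rng.choice(TOPICS)) for _ in range(count)]


def write_prompts(path: str = "prompts.json", count: int = 50, seed: int = 0) -> list[str]:
    """Write the prompts as a JSON array of strings and return them."""
    prompts = build_prompts(count, seed)
    Path(path).write_text(json.dumps(prompts, indent=2), encoding="utf-8")
    return prompts


if __name__ == "__main__":
    written = write_prompts()
    print(f"Wrote {len(written)} prompts to prompts.json")
```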

Controlled Variables

To ensure fair comparison, I held these variables constant:

  • Model: Llama 3 8B Instruct (same weights, quantized appropriately for each engine)
  • Output length: Fixed at 256 tokens per request
  • Temperature: 0.7 across all tests
  • GPU: Single RTX 4090 (no multi-GPU testing)
  • System load: No other GPU workloads during tests

Independent Variables

I varied these parameters to understand scaling behavior:

  • Concurrency: 1, 2, 4, 8, 16, 32 requests
  • Batch size: Default vs. tuned (vLLM only)
  • Quantization: Q4_K_M (all engines)
  • GPU layers: Full GPU offload vs. CPU-only (llama.cpp)
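
The concurrency sweep can be scripted as a run matrix. The sketch below only builds the command lines, using the harness flags from this post. The output filename pattern and the `build_commands` helper are my own illustrative choices.

```python
import itertools
import shlex

ENGINES = ["ollama", "vllm", "llama-cpp"]
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32]


def build_commands(requests_per_run: int = 100,
                   script: str = "benchmark_harness.py") -> list[list[str]]:
    """One harness invocation per (engine, concurrency) pair."""
    commands = []
    for engine, concurrency in itertools.product(ENGINES, CONCURRENCY_LEVELS):
        output = f"results_{engine}_c{concurrency}.json"  # hypothetical naming scheme
        commands.append([
            "python", script,
            "--engine", engine,
            "--concurrency", str(concurrency),
            "--requests", str(requests_per_run),
            "--output", output,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Each command can be passed to `subprocess.run` once the matching engine is serving. Run the engines one at a time so they do not compete for the same GPU.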

FAQ

How does Time to First Token (TTFT) differ across engines?

TTFT measures the delay between sending a request and receiving the first output token. Ollama’s server mode returns the first token in ~89ms on my RTX 4090, while vLLM takes ~156ms due to its scheduling overhead. The difference matters for real-time chat applications but matters far less under concurrent load, where queueing and batching tend to dominate response time.

Why did I choose Q4_K_M over other quantization levels?

Q4_K_M gives the best quality-to-performance ratio for Llama 3 8B. At 4.58 GB for the GGUF file, it fits comfortably in 24 GB VRAM with room for KV cache. Q8_0 would use 8.5 GB and deliver marginally better quality, while Q2_K at 3.3 GB has noticeable quality degradation that I wouldn’t put in production.
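
To put a rough number on "room for KV cache", here is a back-of-the-envelope estimate. It assumes Llama 3 8B's published architecture (32 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an FP16 cache. Engines may allocate the cache differently, so treat the result as an order-of-magnitude guide.

```python
# Assumed model architecture and cache precision (see lead-in).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(tokens_per_request: int, concurrent_requests: int) -> float:
    """Total KV cache in GiB for a number of equally long concurrent sequences."""
    total_bytes = kv_bytes_per_token() * tokens_per_request * concurrent_requests
    return total_bytes / 2**30


if __name__ == "__main__":
    tokens = 128 + 256  # average prompt plus fixed output length from the test parameters
    for concurrency in (1, 8, 32):
        print(f"concurrency {concurrency:2d}: {kv_cache_gib(tokens, concurrency):.3f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so 32 concurrent requests at this benchmark's sequence length need about 1.5 GiB on top of the weights.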

How do I adapt this benchmark for models larger than 8B parameters?

The same test harness works with any OpenAI-compatible model. For 70B models, you’ll need multi-GPU setups. vLLM supports tensor parallelism across GPUs with the --tensor-parallel-size flag. I’ve tested Mixtral 8x7B with this harness using 2x A100s. The methodology is identical.

Can I run these benchmarks without an NVIDIA GPU?

Yes, but you’re limited to llama.cpp for CPU-only inference. At 18.7 tokens/second on 16 CPU threads, it’s usable for offline processing but not real-time applications. AMD ROCm support exists for vLLM, but I haven’t tested it.

What is the minimum VRAM required for each engine to run Llama 3 8B?

Ollama needs 5.8 GB with Q4_K_M, llama.cpp GPU mode uses 6.1 GB, and vLLM requires 9.2 GB minimum (including PagedAttention’s KV cache overhead). For vLLM, lowering gpu_memory_utilization below 0.85 can reduce VRAM at the cost of throughput.

Why does my benchmark produce different numbers than yours?

Hardware variation, CUDA driver versions, and system load all affect results. The RTX 4090’s Ada Lovelace architecture gives specific performance characteristics that don’t translate directly to A100s or H100s. Run the test harness on your own hardware for numbers that match your deployment.


Next in the series: Part 2: Results →
