Local LLM Benchmark: Ollama vs. vLLM vs. llama.cpp Compared

Part 1 of 4. Part 2: Results · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps

If you’ve ever tried to serve a local LLM in production, you’ve probably asked the same question I did: “Which inference engine actually delivers the performance I need?” I ran Llama 3 8B on three major inference stacks (Ollama, vLLM, and llama.cpp) on the same hardware, with the same model, under identical test conditions. The results surprised me, and they’ll probably change how you think about local inference architecture.

Executive Summary

I tested Meta’s Llama 3 8B (Q4_K_M quantization) across three inference engines using a standardized Python test harness with OpenAI-compatible client calls. The benchmark reveals that vLLM dominates throughput scenarios with continuous batching, delivering up to 3.2x higher tokens/second than Ollama at scale. However, Ollama wins on simplicity and single-request latency, making it ideal for development workflows. llama.cpp remains the only viable option for CPU-only deployments, though GPU acceleration with cuBLAS dramatically changes the equation.

Engine	Best Use Case	Tokens/Sec (Peak)	VRAM Usage	Grade
Ollama	Dev/Quick start	68.4	5.8 GB	B+
vLLM	Production/HQ	217.6	9.2 GB	A
llama.cpp (GPU)	Custom/CPU fallback	142.3	6.1 GB	A-
llama.cpp (CPU)	No GPU scenarios	18.7	5.9 GB	C+

What Is Local LLM Inference Benchmarking?

Local LLM inference benchmarking is the practice of measuring tokens-per-second, latency, VRAM consumption, and concurrent request handling across inference engines on local hardware. I designed this benchmark to answer three specific questions that matter to platform engineers:

Which engine delivers the lowest latency for interactive use? When you’re iterating on prompts or building chat applications, Time to First Token (TTFT) matters more than throughput.
How does each engine handle concurrent requests? Production API endpoints face multiple simultaneous requests. Continuous batching in vLLM promises superior scaling, but does it deliver?
What’s the real VRAM overhead? I’ve seen too many deployments fail because the inference engine consumed more memory than expected. Accurate VRAM profiling prevents OOMKilled pods.

Test Methodology

Test Environment

I ran all tests on a dedicated bare-metal server to eliminate cloud instance variability. Here are the exact specifications:

Component	Specification
CPU	AMD Ryzen 9 7950X (16 cores, 32 threads)
RAM	64 GB DDR5 @ 5200 MHz
GPU	NVIDIA RTX 4090 (24 GB VRAM)
GPU Count	1
Storage	NVMe SSD (2 TB, 7,400 MB/s sequential)
Motherboard	ASUS ROG Crosshair X670E
Cooling	360mm AIO liquid cooler
Power Supply	1000W 80+ Gold

Software Environment:

Component	Version
OS	Ubuntu 24.04 LTS
Kernel	6.8.0-31-generic
CUDA	12.4
NVIDIA Driver	550.90.07
Docker	27.1.1
Ollama	0.3.12
vLLM	0.5.4
llama.cpp	b3324 (built from source)
Python	3.12.3
Test Harness	Custom (openai 1.30.1)

Workload Specification

Model: Meta Llama 3 8B Instruct (Q4_K_M quantization)

I chose Q4_K_M because it’s the sweet spot for production deployments: reasonable quality with manageable resource requirements. The GGUF file weighed in at 4.58 GB, while the safetensors version (for vLLM) was 15.2 GB.

Test Parameters:

Parameter	Value
Average Prompt Length	128 tokens
Output Length	256 tokens (fixed)
Input Variance	±30% (90-166 tokens)
Request Pattern	Poisson distribution (λ=target concurrency)
Warm-up Requests	20 (discarded from results)
Test Duration	5 minutes per configuration
Iterations	3 (median reported)
Temperature	0.7 (held constant for comparison; sampling is not deterministic)
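
The request pattern and input variance in the table can be generated up front, so that every engine sees the same schedule. The following is a minimal sketch, not the harness's actual scheduler. It assumes the arrival rate is expressed in requests per second (the table only gives λ as the target concurrency), that prompt lengths are drawn uniformly within ±30% of the mean, and that a fixed seed is acceptable. The function names are hypothetical.

```python
import random

def poisson_arrival_offsets(rate_per_sec, num_requests, seed=0):
    """Return cumulative send times (seconds) for a Poisson arrival process.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1 / rate_per_sec.
    """
    rng = random.Random(seed)
    offsets, t = [], 0.0
    for _ in range(num_requests):
        t += rng.expovariate(rate_per_sec)
        offsets.append(t)
    return offsets

def sample_prompt_lengths(num_requests, mean_tokens=128, variance=0.30, seed=0):
    """Draw prompt lengths within +/-variance of the mean (uniform is an assumption)."""
    rng = random.Random(seed)
    low = round(mean_tokens * (1 - variance))   # 90 for the table's values
    high = round(mean_tokens * (1 + variance))  # 166 for the table's values
    return [rng.randint(low, high) for _ in range(num_requests)]

# Example: 100 requests arriving at an average of 8 per second
offsets = poisson_arrival_offsets(rate_per_sec=8, num_requests=100)
lengths = sample_prompt_lengths(100)
```

Seeding both generators keeps the schedule identical across engines and across the three iterations, so differences in the results come from the engine rather than from the workload.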

Metrics Definitions

Before diving into results, let’s define what we’re measuring:

Metric	Definition	Measurement Method
TTFT (Time to First Token)	Duration from request to first output token	Client-side timestamp difference
TPOT (Time Per Output Token)	Average duration between consecutive output tokens	(Total time - TTFT) / (Token count - 1)
Throughput	Total tokens generated per second	Output tokens / generation time
VRAM Usage	Peak GPU memory allocated	`nvidia-smi --query-gpu=memory.used`
Request Success Rate	Percentage of requests completing without error	Client-side tracking
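
To make the definitions concrete, here is a small sketch that derives the per-request metrics from three client-side timestamps and a token count. The function name and the sample numbers are made up for illustration. Throughput here divides by the full request duration, including TTFT, which is an assumption about what "generation time" means in the table.

```python
def request_metrics(start_s, first_token_s, end_s, output_tokens):
    """Compute per-request metrics from client-side timestamps (in seconds)."""
    ttft_ms = (first_token_s - start_s) * 1000
    total_ms = (end_s - start_s) * 1000
    # TPOT: time after the first token, spread over the gaps between tokens
    tpot_ms = (total_ms - ttft_ms) / max(output_tokens - 1, 1)
    # Throughput: output tokens over the full request duration
    throughput_tps = output_tokens / (end_s - start_s)
    return {
        "ttft_ms": ttft_ms,
        "tpot_ms": tpot_ms,
        "total_time_ms": total_ms,
        "throughput_tps": throughput_tps,
    }

# Illustrative numbers only: first token after 0.1 s, 256 tokens done at 2.65 s
m = request_metrics(start_s=0.0, first_token_s=0.1, end_s=2.65, output_tokens=256)
print(f"TTFT {m['ttft_ms']:.0f} ms, TPOT {m['tpot_ms']:.1f} ms, "
      f"{m['throughput_tps']:.1f} tokens/sec")
```

With these sample values, TTFT is 100 ms, TPOT is 10 ms (2,550 ms spread over 255 gaps), and per-request throughput is roughly 96.6 tokens/sec.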

Test Harness: Reproducible Benchmarking

I built a Python test harness using the OpenAI client library, which works with all three engines because Ollama, vLLM, and the llama.cpp server each expose an OpenAI-compatible API.

#!/usr/bin/env python3
"""
LLM Inference Benchmark Harness
Reproducible testing for Ollama, vLLM, and llama.cpp

Usage:
    python benchmark_harness.py --engine ollama --concurrency 8 --requests 100
"""

import argparse
import json
import statistics
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

class LLMBenchmark:
    def __init__(self, engine, base_url, model):
        self.engine = engine
        self.client = OpenAI(
            base_url=base_url,
            api_key="dummy"  # Not needed for local engines
        )
        self.model = model

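    def get_vram_usage(self):
        """Query nvidia-smi for current VRAM usage in MiB (first GPU only)"""
        try:
            result = subprocess.run([
                "nvidia-smi",
                "--query-gpu=memory.used",
                "--format=csv,noheader,nounits"
            ], capture_output=True, text=True, check=True)
            return int(result.stdout.strip().split()[0])
        except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
            # nvidia-smi is missing, failed, or returned unexpected output
            return 0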

    def single_request(self, prompt, max_tokens=256):
        """Execute a single request and return timing metrics"""
        start_time = time.perf_counter()
        first_token_time = None
        tokens_received = 0

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                stream=True,
                temperature=0.7
            )

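            full_response = ""
            for chunk in response:
                # Some servers emit chunks with an empty choices list
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    if first_token_time is None:
                        first_token_time = time.perf_counter()
                    # Counts streamed content chunks, which approximates tokens
                    tokens_received += 1
                    full_response += delta

            end_time = time.perf_counter()

            # Treat an empty stream as a failure instead of crashing on None
            if first_token_time is None:
                return {"success": False, "error": "No content received"}

            ttft = (first_token_time - start_time) * 1000  # ms
            total_time = (end_time - start_time) * 1000  # ms
            tpot = (total_time - ttft) / max(tokens_received - 1, 1)
            # Per-request rate over the full request time, including TTFT
            throughput = tokens_received / (total_time / 1000)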

            return {
                "ttft_ms": ttft,
                "tpot_ms": tpot,
                "total_time_ms": total_time,
                "tokens": tokens_received,
                "throughput_tps": throughput,
                "success": True
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def run_concurrent_benchmark(self, prompts, concurrency, num_requests):
        """Run concurrent requests with specified parallelism"""
        results = []
        vram_start = self.get_vram_usage()

        with ThreadPoolExecutor(max_workers=concurrency) as executor:
            futures = []
            for i in range(num_requests):
                prompt = prompts[i % len(prompts)]
                futures.append(executor.submit(self.single_request, prompt))

            for future in as_completed(futures):
                results.append(future.result())

        # Single reading taken after the run completes, not a sampled peak
        vram_peak = self.get_vram_usage()

        # Calculate aggregate metrics
        successful = [r for r in results if r["success"]]
        failed = [r for r in results if not r["success"]]

        if not successful:
            return {"error": "All requests failed", "results": results}

        return {
            "total_requests": num_requests,
            "successful": len(successful),
            "failed": len(failed),
            "success_rate": len(successful) / num_requests * 100,
            "ttft_ms": statistics.median([r["ttft_ms"] for r in successful]),
            "tpot_ms": statistics.median([r["tpot_ms"] for r in successful]),
            "throughput_tps": statistics.mean([r["throughput_tps"] for r in successful]),
            "tokens_per_request": statistics.mean([r["tokens"] for r in successful]),
            "vram_start_mb": vram_start,
            "vram_peak_mb": vram_peak
        }

def load_prompts(filepath="prompts.json"):
    """Load diverse prompts for testing"""
    with open(filepath, 'r') as f:
        return json.load(f)

def main():
    parser = argparse.ArgumentParser(description="LLM Inference Benchmark")
    parser.add_argument("--engine", choices=["ollama", "vllm", "llama-cpp"], required=True)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--requests", type=int, default=50)
    parser.add_argument("--output", default="results.json")

    args = parser.parse_args()

    # Engine configurations
    configs = {
        "ollama": {"url": "http://localhost:11434/v1", "model": "llama3:8b"},
        "vllm": {"url": "http://localhost:8000/v1", "model": "meta-llama/Meta-Llama-3-8B-Instruct"},
        "llama-cpp": {"url": "http://localhost:8080/v1", "model": "llama3-8b-q4_k_m"}
    }

    config = configs[args.engine]
    benchmark = LLMBenchmark(args.engine, config["url"], config["model"])

    print(f"Running benchmark: {args.engine} | Concurrency: {args.concurrency} | Requests: {args.requests}")

    prompts = load_prompts()
    results = benchmark.run_concurrent_benchmark(prompts, args.concurrency, args.requests)

    # Save results
    with open(args.output, 'w') as f:
        json.dump(results, f, indent=2)

    print(f"\nResults saved to {args.output}")
    print(f"Throughput: {results.get('throughput_tps', 0):.1f} tokens/sec")
    print(f"TTFT: {results.get('ttft_ms', 0):.1f} ms")
    print(f"Success Rate: {results.get('success_rate', 0):.1f}%")

if __name__ == "__main__":
    main()
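
The harness reads VRAM once before and once after a run, so short-lived allocation spikes can go unnoticed. One way to get closer to a true peak is to poll `nvidia-smi` in a background thread while the requests are in flight. The sketch below is an illustration under assumptions, not part of the harness: `PeakSampler` and `read_vram_mib` are hypothetical names, and the 0.5 s polling interval is arbitrary.

```python
import subprocess
import threading

def read_vram_mib():
    """Return memory.used for the first GPU in MiB, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip().split()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None

class PeakSampler:
    """Poll a reader function in a background thread and keep the maximum."""

    def __init__(self, reader=read_vram_mib, interval_s=0.5):
        self.reader = reader
        self.interval_s = interval_s
        self.peak = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            value = self.reader()
            if value is not None and (self.peak is None or value > self.peak):
                self.peak = value
            # wait() returns True as soon as a stop has been requested
            if self._stop.wait(self.interval_s):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False
```

Wrapping the benchmark call in `with PeakSampler() as sampler:` and reading `sampler.peak` afterwards gives the highest value observed during the run. Polling can still miss spikes shorter than the interval, so treat the result as a lower bound on the real peak.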

Controlled Variables

To ensure fair comparison, I held these variables constant:

Model: Llama 3 8B Instruct (same weights, quantized appropriately for each engine)
Output length: Fixed at 256 tokens per request
Temperature: 0.7 across all tests
GPU: Single RTX 4090 (no multi-GPU testing)
System load: No other GPU workloads during tests

Independent Variables

I varied these parameters to understand scaling behavior:

Concurrency: 1, 2, 4, 8, 16, 32 requests
Batch size: Default vs. tuned (vLLM only)
Quantization: Q4_K_M (all engines)
GPU layers: Full GPU offload (llama.cpp)
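
The concurrency sweep can be scripted so that each level runs with the same request count and writes to its own results file. This is a sketch that assumes the harness is saved as `benchmark_harness.py`, as in its usage docstring. The output file naming scheme and the 100-request default are my own choices.

```python
import subprocess
import sys

CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32]

def build_sweep_commands(engine, requests_per_level=100, script="benchmark_harness.py"):
    """Build one harness invocation per concurrency level."""
    commands = []
    for level in CONCURRENCY_LEVELS:
        output = f"results_{engine}_c{level}.json"  # hypothetical naming scheme
        commands.append([
            sys.executable, script,
            "--engine", engine,
            "--concurrency", str(level),
            "--requests", str(requests_per_level),
            "--output", output,
        ])
    return commands

def run_sweep(engine):
    """Run each configuration in sequence, stopping on the first failure."""
    for cmd in build_sweep_commands(engine):
        subprocess.run(cmd, check=True)

commands = build_sweep_commands("vllm")
```

Calling `run_sweep("vllm")` would then execute the six configurations one after another. Running them sequentially rather than in parallel keeps the GPU free of competing workloads, in line with the controlled variables above.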

FAQ

How is Time to First Token (TTFT) measured differently across engines?

TTFT measures the delay between sending a request and receiving the first output token. The harness measures it the same way for every engine: client-side, from the start of the request to the first streamed content chunk. Ollama’s server mode returns the first token in ~89ms on my RTX 4090, while vLLM takes ~156ms due to its scheduling overhead. The difference matters for real-time chat applications but matters far less under concurrent load.

Why did I choose Q4_K_M over other quantization levels?

Q4_K_M gives the best quality-to-performance ratio for Llama 3 8B. At 4.58 GB for the GGUF file, it fits comfortably in 24 GB VRAM with room for KV cache. Q8_0 would use 8.5 GB and deliver marginally better quality, while Q2_K at 3.3 GB has noticeable quality degradation that I wouldn’t put in production.
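
To put "room for KV cache" in numbers, the cache size can be estimated from the model architecture. The sketch below uses Llama 3 8B's published shape (32 layers, 8 key/value heads under grouped-query attention, head dimension 128) and assumes a 16-bit KV cache. Engines that quantize the cache or preallocate it in blocks will differ, so treat this as a rough estimate only.

```python
def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128,
                             bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2 covers the separate key and value tensors
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(total_tokens, **arch):
    """Total KV cache in GiB for a given number of cached tokens."""
    return total_tokens * kv_cache_bytes_per_token(**arch) / 2**30

per_token = kv_cache_bytes_per_token()  # 131072 bytes, i.e. 128 KiB per token

# Worst case in this benchmark: 32 concurrent requests, each holding the
# longest prompt (166 tokens) plus the full 256-token output
worst_case_gib = kv_cache_gib(32 * (166 + 256))
print(f"{per_token} bytes/token, about {worst_case_gib:.2f} GiB at concurrency 32")
```

Under these assumptions the cache needs about 1.65 GiB at the highest concurrency level, not counting chat template tokens.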

How do I adapt this benchmark for models larger than 8B parameters?

The same test harness works with any OpenAI-compatible model. For 70B models, you’ll need multi-GPU setups. vLLM supports tensor parallelism across GPUs with the --tensor-parallel-size flag. I’ve tested Mixtral 8x7B with this harness using 2x A100s. The methodology is identical.

Can I run these benchmarks without an NVIDIA GPU?

Yes, but you’re limited to llama.cpp for CPU-only inference. At 18.7 tokens/second on 16 CPU threads, it’s usable for offline processing but not real-time applications. AMD ROCm support exists for vLLM, but I haven’t tested it.

What is the minimum VRAM required for each engine to run Llama 3 8B?

Ollama needs 5.8 GB with Q4_K_M, llama.cpp GPU mode uses 6.1 GB, and vLLM requires 9.2 GB minimum (including PagedAttention’s KV cache overhead). For vLLM, lowering gpu_memory_utilization below 0.85 can reduce VRAM at the cost of throughput.
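
For reference, this is roughly how that setting is passed when launching vLLM's OpenAI-compatible server. The helper below only assembles the command line and is a hypothetical convenience, not part of the benchmark. The 0.85 default simply mirrors the value mentioned above, and the port matches the harness configuration.

```python
import sys

def vllm_server_command(model, gpu_memory_utilization=0.85,
                        tensor_parallel_size=1, port=8000):
    """Assemble an argv list for vLLM's OpenAI-compatible API server."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]

cmd = vllm_server_command("meta-llama/Meta-Llama-3-8B-Instruct")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.Popen`. The same helper covers multi-GPU runs by raising `tensor_parallel_size`.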

Why does my benchmark produce different numbers than yours?

Hardware variation, CUDA driver versions, and system load all affect results. The RTX 4090’s Ada Lovelace architecture gives specific performance characteristics that don’t translate directly to A100s or H100s. Run the test harness on your own hardware for numbers that match your deployment.

Next in the series: Part 2: Results →