Local LLM Benchmark Results and Analysis Guide

2026.05.19

Part 2 of 4. Part 1: Methodology · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps

Standardized benchmarks don’t tell the whole story. Each engine has a different performance profile depending on concurrency, quantization, and whether you care about latency or throughput. Here’s exactly what I measured on my RTX 4090 test rig.

Results

Ollama Performance

Ollama impressed me with its simplicity, but the performance story is more complex than it looks. Using default settings with ollama serve, I measured:

Single-Request Latency (Concurrency = 1):

| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (Q4_K_M) | 89 | 14.6 | 68.4 | 5.8 |
| With num_ctx=512 | 92 | 15.1 | 66.2 | 5.9 |
| With num_ctx=2048 | 88 | 14.8 | 67.5 | 6.2 |

Ollama’s TTFT is excellent for interactive use. At under 100ms, users perceive the response as instantaneous. However, the throughput plateaus quickly.
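The TPOT and throughput columns are two views of the same decode speed: at concurrency 1, throughput is roughly 1000 / TPOT. A minimal sketch that checks this against the Ollama numbers above (the function name is mine, not part of any benchmark tool):

```python
def throughput_from_tpot(tpot_ms: float) -> float:
    """Approximate single-stream decode throughput (tokens/s) from TPOT in ms."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


# (configuration, TPOT in ms, measured throughput in t/s) from the Ollama table
OLLAMA_ROWS = [
    ("Default (Q4_K_M)", 14.6, 68.4),
    ("num_ctx=512", 15.1, 66.2),
    ("num_ctx=2048", 14.8, 67.5),
]

for name, tpot_ms, measured in OLLAMA_ROWS:
    estimate = throughput_from_tpot(tpot_ms)
    print(f"{name}: estimated {estimate:.1f} t/s, measured {measured} t/s")
```

The estimates land within a fraction of a token per second of the measured values, so either column can be used to cross-check the other.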

Concurrent Request Scaling:

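| Concurrency | Throughput (t/s) | P50 Latency (ms) | P99 Latency (ms) | Success Rate |
|---|---|---|---|---|
| 1 | 68.4 | 892 | 915 | 100% |
| 2 | 71.2 | 1785 | 1842 | 100% |
| 4 | 73.8 | 3521 | 3684 | 100% |
| 8 | 74.1 | 7012 | 7453 | 98% |
| 16 | 72.3 | 14201 | 15892 | 94% |
| 32 | 68.9 | 29123 | 35241 | 87% |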

The key insight: Ollama doesn’t benefit from concurrency. Throughput stays between 68 and 74 t/s no matter how many requests you add, while failures start at 8 concurrent requests and reach 13% at 32. I’ve seen this bottleneck in production: in its default configuration, Ollama processes requests serially.
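One way to sanity-check the serial-processing explanation: if N clients each keep one request in flight and the server works through them one at a time, median latency should be roughly N times the single-request latency. A rough sketch comparing that naive model (my simplification, not something Ollama documents) against the measured P50 values:

```python
SINGLE_REQUEST_P50_MS = 892  # measured P50 at concurrency 1

# concurrency -> measured P50 latency (ms), from the Ollama scaling table
MEASURED_P50_MS = {1: 892, 2: 1785, 4: 3521, 8: 7012, 16: 14201, 32: 29123}


def serial_model_p50(concurrency: int, single_ms: float = SINGLE_REQUEST_P50_MS) -> float:
    """Naive closed-loop model: each request waits behind the other N-1."""
    return concurrency * single_ms


def relative_error(predicted: float, measured: float) -> float:
    """Absolute prediction error as a fraction of the measured value."""
    return abs(predicted - measured) / measured


for n, measured in MEASURED_P50_MS.items():
    predicted = serial_model_p50(n)
    err = relative_error(predicted, measured)
    print(f"N={n:>2}: predicted {predicted:>6.0f} ms, measured {measured:>6} ms, error {err:.1%}")
```

Every measured point lands within about 2% of the model, which is what you would expect from a queue rather than from real parallelism.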

vLLM Performance

vLLM’s PagedAttention and continuous batching delivered what I expected: dramatically better scaling.

Single-Request Latency:

| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (tensor-parallel=1) | 156 | 7.8 | 128.2 | 9.2 |
| With max_num_seqs=32 | 162 | 4.6 | 217.6 | 10.8 |
| With gpu_memory_util=0.95 | 158 | 4.5 | 221.3 | 12.1 |

Notice the TTFT penalty compared to Ollama: 156 ms versus 89 ms at default settings. vLLM’s scheduling introduces overhead. But look at the throughput jump with batching enabled, from 128.2 to 217.6 t/s!

Concurrent Request Scaling:

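| Concurrency | Throughput (t/s) | P50 Latency (ms) | P99 Latency (ms) | VRAM (GB) | Success Rate |
|---|---|---|---|---|---|
| 1 | 128.2 | 976 | 1002 | 9.2 | 100% |
| 2 | 198.4 | 1281 | 1342 | 9.3 | 100% |
| 4 | 287.6 | 1423 | 1521 | 9.5 | 100% |
| 8 | 412.3 | 1589 | 1723 | 10.1 | 100% |
| 16 | 523.8 | 1847 | 2034 | 10.8 | 100% |
| 32 | 587.2 | 2134 | 2456 | 12.1 | 100% |
| 64 | 612.4 | 2641 | 3128 | 14.3 | 99% |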

vLLM shines at scale. At 32 concurrent requests, it delivers 587 tokens/second, 8.5x better than Ollama at the same concurrency. The continuous batching works exactly as advertised.
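The 8.5x figure and the shape of the scaling curve can be reproduced directly from the two concurrency tables. A small sketch (helper names are mine); scaling efficiency here means measured throughput divided by ideal linear scaling from the concurrency-1 result:

```python
# concurrency -> throughput (t/s), copied from the scaling tables
VLLM_TPS = {1: 128.2, 2: 198.4, 4: 287.6, 8: 412.3, 16: 523.8, 32: 587.2, 64: 612.4}
OLLAMA_TPS = {1: 68.4, 2: 71.2, 4: 73.8, 8: 74.1, 16: 72.3, 32: 68.9}


def speedup(a_tps: dict, b_tps: dict, concurrency: int) -> float:
    """How many times faster engine A is than engine B at one concurrency level."""
    return a_tps[concurrency] / b_tps[concurrency]


def scaling_efficiency(tps: dict, concurrency: int) -> float:
    """Fraction of ideal linear scaling achieved relative to concurrency 1."""
    return tps[concurrency] / (concurrency * tps[1])


print(f"vLLM vs Ollama at 32: {speedup(VLLM_TPS, OLLAMA_TPS, 32):.1f}x")
for n in sorted(OLLAMA_TPS):
    print(f"N={n:>2}: vLLM {scaling_efficiency(VLLM_TPS, n):.0%}, "
          f"Ollama {scaling_efficiency(OLLAMA_TPS, n):.0%} of linear")
```

Neither engine scales linearly, but vLLM keeps adding throughput as concurrency grows while Ollama’s efficiency falls off almost as 1/N.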

llama.cpp Performance

llama.cpp offers the most flexibility, so I tested both CPU-only and GPU-accelerated modes.

GPU Mode (cuBLAS with n_gpu_layers=-1):

| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (n_gpu_layers=99) | 112 | 9.2 | 108.7 | 6.1 |
| Optimized (n_batch=512) | 108 | 7.0 | 142.3 | 6.1 |
| Server mode (--port 8080) | 115 | 9.5 | 105.2 | 6.0 |

CPU-Only Mode (no GPU layers):

| Threads | TTFT (ms) | Throughput (t/s) | RAM (GB) |
|---|---|---|---|
| 8 | 1205 | 12.4 | 6.2 |
| 16 | 612 | 18.7 | 6.2 |
| 32 | 398 | 18.5 | 6.3 |
| 64 | 356 | 17.2 | 6.5 |

Key takeaway: llama.cpp with GPU acceleration is more than competitive with Ollama on single-request throughput (105 to 142 t/s versus about 68 t/s), but the server requires manual tuning. The CPU fallback works surprisingly well for an 8B model: 18.7 tokens/second is usable for offline processing.

Concurrent Request Scaling (GPU Mode):

| Concurrency | Throughput (t/s) | P50 Latency (ms) | Success Rate |
|---|---|---|---|
| 1 | 142.3 | 1021 | 100% |
| 4 | 198.7 | 5123 | 100% |
| 8 | 201.2 | 10234 | 98% |
| 16 | 198.4 | 20512 | 95% |

In the configuration I tested, llama.cpp’s server mode showed no continuous-batching benefit: throughput saturates around 200 tokens/second from concurrency 4 upward.
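For reference, the P50, P99, and success-rate columns in the concurrency tables can be derived from raw per-request results along these lines. This is a sketch, not the harness used for these numbers: the `RequestResult` record, the choice to compute percentiles over successful requests only, and the nearest-rank percentile method are all my assumptions.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestResult:
    latency_ms: float  # end-to-end latency of one request
    ok: bool           # False for timeouts or error responses


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct must be in (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(results: Sequence[RequestResult]) -> dict:
    """Latency percentiles over successful requests, plus overall success rate."""
    ok_latencies = [r.latency_ms for r in results if r.ok]
    return {
        "p50_ms": percentile(ok_latencies, 50),
        "p99_ms": percentile(ok_latencies, 99),
        "success_rate": len(ok_latencies) / len(results),
    }
```

Whether failed requests count toward the latency percentiles changes the tail numbers noticeably at high concurrency, so it is worth stating which convention a benchmark uses.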

Benchmark Matrix: Full Comparison

Here’s the complete benchmark matrix across all tested configurations:

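| Engine | Concurrency | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) | Batch Impact |
|---|---|---|---|---|---|---|
| Ollama | 1 | 89 | 14.6 | 68.4 | 5.8 | None |
| Ollama | 8 | 91 | 14.2 | 74.1 | 5.9 | None |
| Ollama | 32 | 95 | 15.1 | 68.9 | 6.2 | None |
| vLLM | 1 | 156 | 7.8 | 128.2 | 9.2 | Default |
| vLLM | 8 | 162 | 4.6 | 412.3 | 10.1 | Optimal |
| vLLM | 32 | 158 | 4.5 | 587.2 | 12.1 | Optimal |
| vLLM | 32 | 298 | 8.2 | 321.4 | 9.3 | No batching |
| llama.cpp GPU | 1 | 108 | 7.0 | 142.3 | 6.1 | n_batch=512 |
| llama.cpp GPU | 8 | 112 | 7.3 | 201.2 | 6.1 | n_batch=512 |
| llama.cpp CPU | 1 | 1205 | 80.6 | 12.4 | 0 | 8 threads |
| llama.cpp CPU | 1 | 612 | 53.5 | 18.7 | 0 | 16 threads |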

FAQ

Why does Ollama’s throughput not improve with concurrent requests?

Ollama processes requests serially by default. Each request blocks the next until it finishes. I confirmed this by checking the process-level thread count during benchmarks: Ollama doesn’t parallelize inference within a single model instance. vLLM takes the opposite approach. The vLLM team at UC Berkeley Sky Computing Lab designed PagedAttention to manage KV-cache memory in small blocks, which is what makes batching many concurrent requests practical.

How much VRAM does vLLM’s PagedAttention overhead actually cost?

On my RTX 4090, vLLM used 9.2 GB at rest versus Ollama’s 5.8 GB. That 3.4 GB delta comes from the KV cache pool pre-allocation. The trade-off: you pay 3-4 GB in VRAM for the ability to batch up to 32 concurrent requests. For A100s with 80 GB, this overhead is negligible. For consumer GPUs, it’s a real constraint.
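To put that 3.4 GB in perspective, here is a back-of-the-envelope estimate of how many tokens of KV cache it could hold. The assumptions are mine: the model is Llama 3 8B (32 layers, 8 KV heads, head dimension 128), the KV cache is stored in 16-bit precision, GB is treated as GiB, and the entire delta is KV cache. The real pool size depends on vLLM’s memory settings.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(pool_gib: float, per_token_bytes: int) -> int:
    """Whole tokens that fit in a pool of the given size."""
    return int(pool_gib * 1024**3 // per_token_bytes)


# Assumed Llama 3 8B shape with a 16-bit KV cache
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
total_tokens = tokens_that_fit(3.4, per_token)

print(f"{per_token // 1024} KiB per token")
print(f"about {total_tokens} tokens in 3.4 GiB")
print(f"about {total_tokens // 32} tokens per sequence across 32 sequences")
```

Under these assumptions the pool holds a little under 28,000 tokens, or roughly 870 tokens per sequence at 32 concurrent requests, which is why long contexts and high concurrency compete for the same memory on a 24 GB card.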

What happens when I disable continuous batching in vLLM?

With batching disabled (max_num_seqs=1), vLLM drops to 321.4 tokens/second at concurrency 32, roughly half (55%) of its batched 587.2. TTFT also climbs from 158 ms to 298 ms. Continuous batching is vLLM’s killer feature. Turn it off and you lose the main reason to choose vLLM over Ollama.

What is the best configuration for CPU-only inference with llama.cpp?

For Llama 3 8B on CPU, use 16 threads with n_batch=512. Beyond 16 threads, memory bandwidth becomes the bottleneck and throughput stops improving. My test showed 18.7 tokens/second at 16 threads: usable for batch processing but too slow for interactive applications.
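Picking the thread count from a sweep like this can be automated: take the smallest thread count whose throughput is within a small tolerance of the best result. A minimal sketch using the CPU-only numbers above (the helper name and the 2% tolerance are my choices):

```python
# threads -> throughput (t/s), from the CPU-only table
CPU_SWEEP = {8: 12.4, 16: 18.7, 32: 18.5, 64: 17.2}


def smallest_near_peak(sweep: dict, tolerance: float = 0.02) -> int:
    """Smallest thread count whose throughput is within `tolerance` of the best."""
    if not sweep:
        raise ValueError("empty sweep")
    best = max(sweep.values())
    for threads in sorted(sweep):
        if sweep[threads] >= best * (1 - tolerance):
            return threads
    raise AssertionError("unreachable: the best entry always qualifies")


print(smallest_near_peak(CPU_SWEEP))
```

Preferring the smallest qualifying count leaves the remaining cores free for tokenization, I/O, or a second model instance.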

How do these benchmarks translate to larger models like Llama 3 70B?

The relative ordering stays the same, but VRAM requirements multiply. A Llama 3 70B at Q4_K_M needs roughly 40 GB. With vLLM’s tensor parallelism, you’d split this across 2x A100s. I’ve tested 70B models and found vLLM’s throughput advantage grows wider at larger parameter counts due to better memory management.

How long did the full benchmark suite take to run?

Each engine required 5 minutes per concurrency level. With 6 concurrency levels, 3 iterations each, plus warmup and data export, the full suite ran about 90 minutes. I automated this with a shell wrapper that cycled through engine configs overnight.

