Local LLM Benchmark Results and Analysis Guide

Part 2 of 4. Part 1: Methodology · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps

Standardized benchmarks don’t tell the whole story. Each engine has a different performance profile depending on concurrency, quantization, and whether you care about latency or throughput. Here’s exactly what I measured on my RTX 4090 test rig.

Results

Ollama Performance

Ollama impressed me with its simplicity, but the performance story is more complex than it looks. Using default settings with ollama serve, I measured:

Single-Request Latency (Concurrency = 1):

Configuration	TTFT (ms)	TPOT (ms)	Throughput (t/s)	VRAM (GB)
Default (Q4_K_M)	89	14.6	68.4	5.8
With num_ctx=512	92	15.1	66.2	5.9
With num_ctx=2048	88	14.8	67.5	6.2

Ollama’s TTFT is excellent for interactive use. At under 100ms, users perceive the response as instantaneous. However, the throughput plateaus quickly.
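The TPOT and throughput columns in the single-request table are two views of the same measurement: at concurrency 1, TPOT in milliseconds is roughly 1000 divided by throughput in tokens/second. Here is a minimal Python sketch of that conversion using the Ollama rows above. The function name is mine, for illustration only.

```python
def tpot_ms_from_throughput(tokens_per_second: float) -> float:
    """Convert single-stream throughput (t/s) to time per output token (ms)."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tokens_per_second


# Throughput values from the Ollama single-request table.
ollama_throughput = {
    "Default (Q4_K_M)": 68.4,
    "num_ctx=512": 66.2,
    "num_ctx=2048": 67.5,
}

ollama_tpot = {
    name: round(tpot_ms_from_throughput(tps), 1)
    for name, tps in ollama_throughput.items()
}

for name, tpot in ollama_tpot.items():
    print(f"{name}: {tpot} ms per token")
```

This only holds for a single stream. Once requests overlap, aggregate throughput and per-request TPOT diverge.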

Concurrent Request Scaling:

Concurrency	Throughput (t/s)	P50 Latency (ms)	P99 Latency (ms)	Success Rate
1	68.4	892	915	100%
2	71.2	1785	1842	100%
4	73.8	3521	3684	100%
8	74.1	7012	7453	98%
16	72.3	14201	15892	94%
32	68.9	29123	35241	87%

The key insight: Ollama doesn’t benefit from concurrency. Throughput stays between 68 and 74 t/s from 1 to 32 concurrent requests, while P50 latency roughly doubles each time concurrency doubles. Failures first appear at 8 concurrent requests (98% success) and reach 13% at 32. I’ve seen this bottleneck in production: in its default configuration, Ollama processes requests serially.
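One way to sanity-check the serial-processing explanation is to compare the measured P50 latencies against a simple queue model. The sketch below assumes each request waits for every other in-flight request to finish, so latency scales linearly with concurrency. That model is my simplification, not a description of Ollama’s internals.

```python
# P50 latencies (ms) from the Ollama concurrency table above.
measured_p50_ms = {1: 892, 2: 1785, 4: 3521, 8: 7012, 16: 14201, 32: 29123}


def serial_queue_latency_ms(concurrency: int, single_request_ms: float) -> float:
    """Predicted latency if requests are served strictly one at a time.

    With N clients each keeping one request in flight, a new request waits
    for the other N - 1 to finish, so latency is about N times the
    single-request latency.
    """
    return concurrency * single_request_ms


baseline = measured_p50_ms[1]
relative_error = {}
for n, measured in measured_p50_ms.items():
    predicted = serial_queue_latency_ms(n, baseline)
    relative_error[n] = (measured - predicted) / predicted
    print(f"concurrency={n:>2}  predicted={predicted:>7.0f} ms  "
          f"measured={measured:>6} ms  error={relative_error[n]:+.1%}")
```

Every measured P50 lands within 3% of the linear prediction, which is what you would expect from a server that handles one request at a time.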

vLLM Performance

vLLM’s PagedAttention and continuous batching delivered what I expected: dramatically better scaling.

Single-Request Latency:

Configuration	TTFT (ms)	TPOT (ms)	Throughput (t/s)	VRAM (GB)
Default (tensor-parallel=1)	156	7.8	128.2	9.2
With max_num_seqs=32	162	4.6	217.6	10.8
With gpu_memory_utilization=0.95	158	4.5	221.3	12.1

Notice the TTFT penalty compared to Ollama: 156 ms versus 89 ms at default settings. vLLM’s scheduling introduces overhead. In exchange, throughput climbs from 128.2 to 217.6 t/s once max_num_seqs=32 is set.
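For reference, here is a hedged sketch of how the tuned settings from the table map onto a launch command. It assumes a recent vLLM release that ships the `vllm serve` entry point. The model ID is a placeholder I picked for illustration and the helper function is hypothetical.

```python
import shlex


def vllm_serve_command(model: str, max_num_seqs: int = 32,
                       gpu_memory_utilization: float = 0.95,
                       tensor_parallel_size: int = 1) -> list:
    """Build an argument list for `vllm serve` with the tuned settings."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# Placeholder model ID: the exact checkpoint is an assumption, not taken
# from the benchmark runs above.
command = vllm_serve_command("meta-llama/Meta-Llama-3-8B-Instruct")
print(shlex.join(command))
```

Building the argument list in code keeps each benchmark configuration reproducible, and the list can be handed to `subprocess.Popen` directly.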

Concurrent Request Scaling:

Concurrency	Throughput (t/s)	P50 Latency (ms)	P99 Latency (ms)	VRAM (GB)	Success Rate
1	128.2	976	1002	9.2	100%
2	198.4	1281	1342	9.3	100%
4	287.6	1423	1521	9.5	100%
8	412.3	1589	1723	10.1	100%
16	523.8	1847	2034	10.8	100%
32	587.2	2134	2456	12.1	100%
64	612.4	2641	3128	14.3	99%

vLLM shines at scale. At 32 concurrent requests, it delivers 587 tokens/second, 8.5x better than Ollama at the same concurrency. The continuous batching works exactly as advertised.
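The 8.5x figure comes straight from dividing the two throughput columns. A short Python sketch that repeats the calculation at every concurrency level both engines were tested at:

```python
# Aggregate throughput (t/s) at each concurrency level, from the tables above.
ollama_tps = {1: 68.4, 2: 71.2, 4: 73.8, 8: 74.1, 16: 72.3, 32: 68.9}
vllm_tps = {1: 128.2, 2: 198.4, 4: 287.6, 8: 412.3, 16: 523.8, 32: 587.2, 64: 612.4}

# Compare only the concurrency levels measured for both engines.
shared_levels = sorted(ollama_tps.keys() & vllm_tps.keys())
speedup = {n: vllm_tps[n] / ollama_tps[n] for n in shared_levels}

for n in shared_levels:
    print(f"concurrency={n:>2}  vLLM/Ollama = {speedup[n]:.1f}x")
```

The gap starts at under 2x for a single request and widens at every step, so the case for vLLM depends heavily on how many requests you expect to serve at once.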

llama.cpp Performance

llama.cpp offers the most flexibility, so I tested both CPU-only and GPU-accelerated modes.

GPU Mode (cuBLAS with n_gpu_layers=-1):

Configuration	TTFT (ms)	TPOT (ms)	Throughput (t/s)	VRAM (GB)
Default (n_gpu_layers=99)	112	9.2	108.7	6.1
Optimized (n_batch=512)	108	7.0	142.3	6.1
Server mode (--port 8080)	115	9.5	105.2	6.0

CPU-Only Mode (no GPU layers):

Threads	TTFT (ms)	Throughput (t/s)	RAM (GB)
8	1205	12.4	6.2
16	612	18.7	6.2
32	398	18.5	6.3
64	356	17.2	6.5

Key takeaway: with GPU acceleration, llama.cpp beats Ollama on single-request throughput (108.7 vs. 68.4 t/s at defaults), but it takes manual tuning to reach 142.3 t/s. The CPU fallback works surprisingly well for an 8B model: 18.7 tokens/second is usable for offline processing.

Concurrent Request Scaling (GPU Mode):

Concurrency	Throughput (t/s)	P50 Latency (ms)	Success Rate
1	142.3	1021	100%
4	198.7	5123	100%
8	201.2	10234	98%
16	198.4	20512	95%

With the settings I used, llama.cpp’s server showed no continuous-batching benefit: throughput saturates around 200 tokens/second regardless of concurrency, while P50 latency keeps climbing.
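To pin down where each engine stops scaling, you can look for the first concurrency level whose throughput gain over the previous level falls below some threshold. The sketch below uses a 5% cutoff, which is an arbitrary choice of mine rather than part of the benchmark methodology.

```python
from typing import Dict, Optional


def saturation_level(throughput_by_concurrency: Dict[int, float],
                     min_gain: float = 0.05) -> Optional[int]:
    """Return the first concurrency level whose gain over the previous
    measured level is below min_gain, or None if throughput keeps
    improving at every step."""
    levels = sorted(throughput_by_concurrency)
    for prev, curr in zip(levels, levels[1:]):
        gain = throughput_by_concurrency[curr] / throughput_by_concurrency[prev] - 1
        if gain < min_gain:
            return curr
    return None


# Throughput (t/s) from the three concurrency tables above.
llama_cpp_gpu_tps = {1: 142.3, 4: 198.7, 8: 201.2, 16: 198.4}
ollama_tps = {1: 68.4, 2: 71.2, 4: 73.8, 8: 74.1, 16: 72.3, 32: 68.9}
vllm_tps = {1: 128.2, 2: 198.4, 4: 287.6, 8: 412.3, 16: 523.8, 32: 587.2, 64: 612.4}

for name, data in [("llama.cpp GPU", llama_cpp_gpu_tps),
                   ("Ollama", ollama_tps),
                   ("vLLM", vllm_tps)]:
    print(f"{name}: gains drop below 5% at concurrency {saturation_level(data)}")
```

With this cutoff, llama.cpp flattens at 8 concurrent requests, Ollama at 2, and vLLM only at 64.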

Benchmark Matrix: Full Comparison

Here’s the complete benchmark matrix across all tested configurations:

Engine	Concurrency	TTFT (ms)	TPOT (ms)	Throughput (t/s)	VRAM (GB)	Batching / Settings
Ollama	1	89	14.6	68.4	5.8	None
Ollama	8	91	14.2	74.1	5.9	None
Ollama	32	95	15.1	68.9	6.2	None
vLLM	1	156	7.8	128.2	9.2	Default
vLLM	8	162	4.6	412.3	10.1	Optimal
vLLM	32	158	4.5	587.2	12.1	Optimal
vLLM	32	298	8.2	321.4	9.3	No batching
llama.cpp GPU	1	108	7.0	142.3	6.1	n_batch=512
llama.cpp GPU	8	112	7.3	201.2	6.1	n_batch=512
llama.cpp CPU	1	1205	80.6	12.4	0	8 threads
llama.cpp CPU	1	612	53.5	18.7	0	16 threads
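A matrix like this is easiest to use as a lookup: fix your expected concurrency and VRAM budget, then take the fastest row that fits. The Python sketch below does that for the GPU rows with batching left on. The `Row` type and `best_engine` helper are hypothetical names for illustration.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    engine: str
    concurrency: int
    throughput_tps: float
    vram_gb: float


# GPU rows copied from the matrix above (the vLLM no-batching row and the
# CPU rows are left out).
MATRIX = [
    Row("Ollama", 1, 68.4, 5.8),
    Row("Ollama", 8, 74.1, 5.9),
    Row("Ollama", 32, 68.9, 6.2),
    Row("vLLM", 1, 128.2, 9.2),
    Row("vLLM", 8, 412.3, 10.1),
    Row("vLLM", 32, 587.2, 12.1),
    Row("llama.cpp GPU", 1, 142.3, 6.1),
    Row("llama.cpp GPU", 8, 201.2, 6.1),
]


def best_engine(concurrency: int, vram_budget_gb: float) -> Optional[Row]:
    """Highest-throughput row at this concurrency that fits the VRAM budget."""
    candidates = [
        row for row in MATRIX
        if row.concurrency == concurrency and row.vram_gb <= vram_budget_gb
    ]
    return max(candidates, key=lambda row: row.throughput_tps, default=None)


print(best_engine(8, vram_budget_gb=8.0))
print(best_engine(8, vram_budget_gb=24.0))
```

On an 8 GB budget at concurrency 8, llama.cpp comes out ahead. Raise the budget to 24 GB and vLLM takes over.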

FAQ

Why does Ollama’s throughput not improve with concurrent requests?

Ollama processes requests serially by default. Each request blocks the next until it finishes. I confirmed this by checking the process-level thread count during benchmarks: Ollama doesn’t parallelize inference within a single model instance. vLLM, which came out of UC Berkeley’s Sky Computing Lab, takes the opposite approach: PagedAttention manages KV-cache memory in pages so that many requests can share the GPU in one continuously updated batch.

How much VRAM does vLLM’s PagedAttention overhead actually cost?

On my RTX 4090, vLLM used 9.2 GB at concurrency 1 versus Ollama’s 5.8 GB. That 3.4 GB delta comes from the KV cache pool pre-allocation, and the gap widens to 5.9 GB at 32 concurrent requests (12.1 GB versus 6.2 GB). The trade-off: you pay roughly 3 to 6 GB in VRAM for the ability to batch up to 32 concurrent requests. For A100s with 80 GB, this overhead is negligible. For consumer GPUs, it’s a real constraint.
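To see how that overhead plays out on smaller cards, here is a rough Python sketch that reads the vLLM VRAM column as a budget table. It only looks at the concurrency levels I measured and does not interpolate between them. The helper name is hypothetical.

```python
from typing import Optional

RTX_4090_VRAM_GB = 24.0  # total memory on the test card

# vLLM VRAM use (GB) by concurrency, from the vLLM scaling table.
vllm_vram_gb = {1: 9.2, 2: 9.3, 4: 9.5, 8: 10.1, 16: 10.8, 32: 12.1, 64: 14.3}
ollama_baseline_gb = 5.8  # Ollama at concurrency 1


def max_measured_concurrency(budget_gb: float) -> Optional[int]:
    """Largest measured concurrency level whose VRAM use fits the budget."""
    fitting = [n for n, used in vllm_vram_gb.items() if used <= budget_gb]
    return max(fitting, default=None)


overhead_at_1 = round(vllm_vram_gb[1] - ollama_baseline_gb, 1)
print(f"Overhead vs Ollama at concurrency 1: {overhead_at_1} GB")

for budget in (8.0, 12.0, 16.0, RTX_4090_VRAM_GB):
    print(f"{budget:>4.0f} GB card -> up to concurrency "
          f"{max_measured_concurrency(budget)}")
```

An 8 GB card cannot hold even the single-request footprint I measured, while a 12 GB card tops out at 16 concurrent requests.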

What happens when I disable continuous batching in vLLM?

With batching disabled (max_num_seqs=1), vLLM drops to 321.4 tokens/second at concurrency 32, roughly half its batched performance. TTFT also climbs to 298ms. Continuous batching is vLLM’s killer feature. Turn it off and you lose the main reason to choose vLLM over Ollama.

What is the best configuration for CPU-only inference with llama.cpp?

For Llama 3 8B on CPU, use 16 threads with n_batch=512. Beyond 16 threads, memory bandwidth becomes the bottleneck and throughput stops improving: I measured 18.5 t/s at 32 threads and 17.2 t/s at 64. At 16 threads the result was 18.7 tokens/second, usable for batch processing but too slow for interactive applications.
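If you rerun the thread sweep on your own hardware, the same selection rule can be automated: pick the smallest thread count that gets within a small margin of peak throughput. The Python sketch below uses a 2% margin, which is my own assumption.

```python
# CPU-only throughput (t/s) by thread count, from the llama.cpp CPU table.
cpu_tps_by_threads = {8: 12.4, 16: 18.7, 32: 18.5, 64: 17.2}


def smallest_near_peak(tps_by_threads: dict, tolerance: float = 0.02) -> int:
    """Smallest thread count whose throughput is within `tolerance`
    (as a fraction) of the best measured throughput."""
    peak = max(tps_by_threads.values())
    return min(
        threads for threads, tps in tps_by_threads.items()
        if tps >= peak * (1 - tolerance)
    )


best_threads = smallest_near_peak(cpu_tps_by_threads)
print(best_threads, cpu_tps_by_threads[best_threads])
```

Preferring the smallest qualifying count leaves the remaining cores free for other work, since the extra threads buy nothing here.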

How do these benchmarks translate to larger models like Llama 3 70B?

The relative ordering stays the same, but VRAM requirements multiply. A Llama 3 70B at Q4_K_M needs roughly 40 GB. With vLLM’s tensor parallelism, you’d split this across 2x A100s. I’ve tested 70B models and found vLLM’s throughput advantage grows wider at larger parameter counts due to better memory management.
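The 40 GB figure can be approximated with back-of-the-envelope arithmetic: parameter count times bits per weight, divided by 8. The Python sketch below assumes about 4.5 bits per weight as a round number for a 4-bit K-quant. Real Q4_K_M files mix quantization types, so the effective figure varies, and the estimate covers weights only, not the KV cache.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone (no KV cache, no activations)."""
    return params_billions * bits_per_weight / 8


# Assumption: ~4.5 bits per weight as a round figure for a 4-bit K-quant.
ASSUMED_BITS_PER_WEIGHT = 4.5
RTX_4090_VRAM_GB = 24.0

weights_8b = weight_memory_gb(8, ASSUMED_BITS_PER_WEIGHT)
weights_70b = weight_memory_gb(70, ASSUMED_BITS_PER_WEIGHT)

print(f"8B weights:  ~{weights_8b:.1f} GB")
print(f"70B weights: ~{weights_70b:.1f} GB")
print("70B fits on one RTX 4090:", weights_70b <= RTX_4090_VRAM_GB)
```

The 8B estimate sits below the 5.8 GB Ollama used in my runs, which leaves room for the KV cache and runtime buffers. The 70B estimate lands near 40 GB, well beyond a single consumer card.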

How long did the full benchmark suite take to run?

Each engine required 5 minutes per concurrency level. With 6 concurrency levels, 3 iterations each, plus warmup and data export, the full suite ran about 90 minutes. I automated this with a shell wrapper that cycled through engine configs overnight.
