Local LLM Benchmark Results and Analysis Guide
Part 2 of 4. Part 1: Methodology · Part 3: When to Use Each Engine · Part 4: FAQ and Next Steps
Standardized benchmarks don't tell the whole story. Each engine has a different performance profile depending on concurrency, quantization, and whether you care about latency or throughput. Here's exactly what I measured on my RTX 4090 test rig.
Results
Ollama Performance
Ollama impressed me with its simplicity, but the performance story is more complex than it looks. Using default settings with ollama serve, I measured:
Single-Request Latency (Concurrency = 1):
| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (Q4_K_M) | 89 | 14.6 | 68.4 | 5.8 |
| With num_ctx=512 | 92 | 15.1 | 66.2 | 5.9 |
| With num_ctx=2048 | 88 | 14.8 | 67.5 | 6.2 |
Ollama's TTFT is excellent for interactive use. At under 100 ms, users perceive the response as instantaneous. However, the throughput plateaus quickly.
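To make the columns concrete, here is a minimal sketch of how TTFT, TPOT, and throughput relate to each other. It assumes TPOT is the mean gap between consecutive output tokens and that single-request throughput is simply 1000 / TPOT; the measurement method itself is described in Part 1, so treat `summarize_stream` and `StreamTiming` as hypothetical helpers, not the harness used for these tables.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft_ms: float     # time to first token
    tpot_ms: float     # mean time per output token after the first
    decode_tps: float  # decode throughput implied by TPOT


def summarize_stream(request_start: float, token_times: list[float]) -> StreamTiming:
    """Derive TTFT, TPOT, and decode throughput from token arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # TPOT averages the gaps between consecutive tokens, excluding the first token.
    decode_span = token_times[-1] - token_times[0]
    tpot_ms = decode_span * 1000.0 / (len(token_times) - 1)
    return StreamTiming(ttft_ms, tpot_ms, 1000.0 / tpot_ms)


# Synthetic stream shaped like the default Ollama row:
# first token at 89 ms, then one token every 14.6 ms.
times = [0.089 + i * 0.0146 for i in range(50)]
timing = summarize_stream(0.0, times)
print(f"TTFT={timing.ttft_ms:.0f} ms  TPOT={timing.tpot_ms:.1f} ms  "
      f"throughput={timing.decode_tps:.1f} t/s")
```

Under that assumption, a TPOT of 14.6 ms implies about 68.5 t/s, which lines up with the 68.4 t/s in the table. The same relationship holds for the other two rows.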
Concurrent Request Scaling:
| Concurrency | Throughput (t/s) | P50 Latency (ms) | P99 Latency (ms) | Success Rate |
|---|---|---|---|---|
| 1 | 68.4 | 892 | 915 | 100% |
| 2 | 71.2 | 1785 | 1842 | 100% |
| 4 | 73.8 | 3521 | 3684 | 100% |
| 8 | 74.1 | 7012 | 7453 | 98% |
| 16 | 72.3 | 14201 | 15892 | 94% |
| 32 | 68.9 | 29123 | 35241 | 87% |
The key insight: Ollama doesn't benefit from concurrency. Throughput stays between 68 and 74 t/s no matter how many requests you add, P50 latency roughly doubles each time concurrency doubles, and the success rate starts slipping at 8 concurrent requests (98%) and falls to 87% at 32. I've seen this bottleneck in production: Ollama processes requests serially in its default configuration.
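The latency column is what serial processing would predict. As a rough sketch, assume a closed-loop load generator (each client sends its next request as soon as the previous one returns) and a server that handles exactly one request at a time; then each request waits behind the others and latency is about concurrency times the single-request latency. The function name `serial_latency_ms` is illustrative, and the closed-loop assumption is mine, not something stated in the benchmark setup.

```python
SINGLE_REQUEST_MS = 892  # measured P50 at concurrency 1

# (concurrency, measured P50 in ms) copied from the Ollama table above
measured = [(1, 892), (2, 1785), (4, 3521), (8, 7012), (16, 14201), (32, 29123)]


def serial_latency_ms(concurrency: int, service_ms: float) -> float:
    """Closed-loop latency if the server handles exactly one request at a time."""
    return concurrency * service_ms


for n, p50 in measured:
    predicted = serial_latency_ms(n, SINGLE_REQUEST_MS)
    diff_pct = (p50 - predicted) / predicted * 100
    print(f"concurrency={n:>2}  predicted={predicted:>6.0f} ms  "
          f"measured={p50:>6} ms  diff={diff_pct:+.1f}%")
```

Every measured P50 lands within a few percent of the serial prediction, which is consistent with requests queuing one behind another rather than being batched.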
vLLM Performance
vLLM's PagedAttention and continuous batching delivered what I expected: dramatically better scaling.
Single-Request Latency:
| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (tensor-parallel=1) | 156 | 7.8 | 128.2 | 9.2 |
| With max_num_seqs=32 | 162 | 4.6 | 217.6 | 10.8 |
| With gpu_memory_util=0.95 | 158 | 4.5 | 221.3 | 12.1 |
Notice the TTFT penalty compared to Ollama: 156 ms versus 89 ms. vLLM's scheduling introduces overhead. But look at the throughput jump with batching enabled: 128.2 t/s at the defaults versus 217.6 t/s with max_num_seqs=32.
Concurrent Request Scaling:
| Concurrency | Throughput (t/s) | P50 Latency (ms) | P99 Latency (ms) | VRAM (GB) | Success Rate |
|---|---|---|---|---|---|
| 1 | 128.2 | 976 | 1002 | 9.2 | 100% |
| 2 | 198.4 | 1281 | 1342 | 9.3 | 100% |
| 4 | 287.6 | 1423 | 1521 | 9.5 | 100% |
| 8 | 412.3 | 1589 | 1723 | 10.1 | 100% |
| 16 | 523.8 | 1847 | 2034 | 10.8 | 100% |
| 32 | 587.2 | 2134 | 2456 | 12.1 | 100% |
| 64 | 612.4 | 2641 | 3128 | 14.3 | 99% |
vLLM shines at scale. At 32 concurrent requests, it delivers 587 tokens/second, 8.5x better than Ollama at the same concurrency. The continuous batching works exactly as advertised.
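The 8.5x figure, and how far each engine falls from ideal linear scaling, can be reproduced directly from the two scaling tables. This is a small sketch; `scaling_efficiency` and `speedup` are names chosen here for illustration.

```python
# Throughput (tokens/s) by concurrency, copied from the tables above.
ollama = {1: 68.4, 2: 71.2, 4: 73.8, 8: 74.1, 16: 72.3, 32: 68.9}
vllm = {1: 128.2, 2: 198.4, 4: 287.6, 8: 412.3, 16: 523.8, 32: 587.2, 64: 612.4}


def scaling_efficiency(throughput: dict[int, float], concurrency: int) -> float:
    """Fraction of ideal linear scaling achieved relative to concurrency 1."""
    return throughput[concurrency] / (concurrency * throughput[1])


def speedup(a: dict[int, float], b: dict[int, float], concurrency: int) -> float:
    """How many times faster engine `a` is than engine `b` at one concurrency."""
    return a[concurrency] / b[concurrency]


# Compare only the concurrency levels both engines were tested at.
for n in sorted(set(ollama) & set(vllm)):
    print(f"concurrency={n:>2}  vLLM/Ollama={speedup(vllm, ollama, n):.1f}x  "
          f"vLLM efficiency={scaling_efficiency(vllm, n):.0%}  "
          f"Ollama efficiency={scaling_efficiency(ollama, n):.0%}")
```

Neither engine scales linearly, but vLLM keeps adding throughput at every step while Ollama's efficiency drops roughly in proportion to the concurrency level.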
llama.cpp Performance
llama.cpp offers the most flexibility, so I tested both CPU-only and GPU-accelerated modes.
GPU Mode (cuBLAS with n_gpu_layers=-1):
| Configuration | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) |
|---|---|---|---|---|
| Default (n_gpu_layers=99) | 112 | 9.2 | 108.7 | 6.1 |
| Optimized (n_batch=512) | 108 | 7.0 | 142.3 | 6.1 |
| Server mode (--port 8080) | 115 | 9.5 | 105.2 | 6.0 |
CPU-Only Mode (no GPU layers):
| Threads | TTFT (ms) | Throughput (t/s) | RAM (GB) |
|---|---|---|---|
| 8 | 1205 | 12.4 | 6.2 |
| 16 | 612 | 18.7 | 6.2 |
| 32 | 398 | 18.5 | 6.3 |
| 64 | 356 | 17.2 | 6.5 |
Key takeaway: llama.cpp with GPU acceleration is competitive with Ollama, and in these single-request runs it was faster (108.7 to 142.3 t/s versus 68.4 t/s), but the server requires manual tuning. The CPU fallback works surprisingly well for an 8B model: 18.7 tokens/second is usable for offline processing.
Concurrent Request Scaling (GPU Mode):
| Concurrency | Throughput (t/s) | P50 Latency (ms) | Success Rate |
|---|---|---|---|
| 1 | 142.3 | 1021 | 100% |
| 4 | 198.7 | 5123 | 100% |
| 8 | 201.2 | 10234 | 98% |
| 16 | 198.4 | 20512 | 95% |
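In these runs, llama.cpp's server did not batch concurrent requests the way vLLM does, so throughput saturates around 200 tokens/second regardless of concurrency. The server does expose parallel slots and continuous batching options (--parallel and --cont-batching); the numbers here reflect the configurations listed above, so treat them as a baseline rather than a ceiling.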
Benchmark Matrix: Full Comparison
Here's the complete benchmark matrix across all tested configurations:
| Engine | Concurrency | TTFT (ms) | TPOT (ms) | Throughput (t/s) | VRAM (GB) | Batch Impact |
|---|---|---|---|---|---|---|
| Ollama | 1 | 89 | 14.6 | 68.4 | 5.8 | None |
| Ollama | 8 | 91 | 14.2 | 74.1 | 5.9 | None |
| Ollama | 32 | 95 | 15.1 | 68.9 | 6.2 | None |
| vLLM | 1 | 156 | 7.8 | 128.2 | 9.2 | Default |
| vLLM | 8 | 162 | 4.6 | 412.3 | 10.1 | Optimal |
| vLLM | 32 | 158 | 4.5 | 587.2 | 12.1 | Optimal |
| vLLM | 32 | 298 | 8.2 | 321.4 | 9.3 | No batching |
| llama.cpp GPU | 1 | 108 | 7.0 | 142.3 | 6.1 | n_batch=512 |
| llama.cpp GPU | 8 | 112 | 7.3 | 201.2 | 6.1 | n_batch=512 |
| llama.cpp CPU | 1 | 1205 | 80.6 | 12.4 | 0 | 8 threads |
| llama.cpp CPU | 1 | 612 | 53.5 | 18.7 | 0 | 16 threads |
FAQ
Why does Ollama's throughput not improve with concurrent requests?
Ollama processes requests serially by default. Each request blocks the next until it finishes. I confirmed this by checking the process-level thread count during benchmarks. In this configuration, Ollama doesn't parallelize inference within a single model instance. vLLM takes the opposite approach: PagedAttention, developed by the vLLM team at UC Berkeley's Sky Computing Lab, manages the KV cache in fixed-size blocks so that many requests can share GPU memory and be batched together.
How much VRAM does vLLM's PagedAttention overhead actually cost?
On my RTX 4090, vLLM used 9.2 GB at rest versus Ollama's 5.8 GB. That 3.4 GB delta comes from the KV cache pool pre-allocation. The trade-off: you pay 3-4 GB in VRAM for the ability to batch up to 32 concurrent requests. For A100s with 80 GB, this overhead is negligible. For consumer GPUs, it's a real constraint.
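For a sense of what 3.4 GB of KV cache buys, here is a back-of-the-envelope sketch. It assumes the model is Llama 3 8B with its published shape (32 layers, 8 KV heads, head dimension 128), an FP16 cache, and that the whole 3.4 GB gap is KV cache. Those are assumptions for illustration; the real split between cache, weights, and runtime buffers was not measured here.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, FP16 cache entries (2 bytes each).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

# VRAM gap between vLLM and Ollama, taken at face value from the answer above.
delta_bytes = (9.2 - 5.8) * 1e9
token_budget = int(delta_bytes / per_token)

print(f"{per_token / 1024:.0f} KiB per token, about {token_budget:,} cached tokens in 3.4 GB")
print(f"about {token_budget // 32:,} tokens per request across 32 concurrent requests")
```

Under these assumptions the pool holds roughly 26,000 tokens, or around 800 tokens of context per request at concurrency 32. Longer prompts or outputs would need a larger pool.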
What happens when I disable continuous batching in vLLM?
With batching disabled (max_num_seqs=1), vLLM drops to 321.4 tokens/second at concurrency 32, roughly half its batched performance. TTFT also climbs to 298 ms. Continuous batching is vLLM's killer feature. Turn it off and you lose the main reason to choose vLLM over Ollama.
What is the best configuration for CPU-only inference with llama.cpp?
For Llama 3 8B on CPU, use 16 threads with n_batch=512. Beyond 16 threads, memory bandwidth becomes the bottleneck and throughput stops improving. My test showed 18.7 tokens/second at 16 threads: usable for batch processing but too slow for interactive applications.
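One way to pick the thread count from a sweep like this is to take the smallest setting that gets within a small margin of the best throughput, since extra threads past that point only add contention. A minimal sketch, with the 2% tolerance and the helper name `smallest_near_peak` chosen purely for illustration:

```python
# (threads, throughput in tokens/s) copied from the CPU-only table
cpu_runs = [(8, 12.4), (16, 18.7), (32, 18.5), (64, 17.2)]


def smallest_near_peak(runs: list[tuple[int, float]], tolerance: float = 0.02) -> int:
    """Return the lowest thread count whose throughput is within `tolerance` of the best."""
    peak = max(tps for _, tps in runs)  # raises ValueError on an empty sweep
    for threads, tps in sorted(runs):
        if tps >= peak * (1 - tolerance):
            return threads
    raise AssertionError("unreachable: the peak run always satisfies the threshold")


best = smallest_near_peak(cpu_runs)
print(f"recommended threads: {best}")

# Throughput per thread shows how quickly the returns diminish.
for threads, tps in cpu_runs:
    print(f"{threads:>2} threads: {tps / threads:.2f} t/s per thread")
```

On this data the rule selects 16 threads, matching the recommendation above. On a different CPU the knee may sit elsewhere, so it is worth re-running the sweep rather than copying the number.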
How do these benchmarks translate to larger models like Llama 3 70B?
The relative ordering stays the same, but VRAM requirements multiply. A Llama 3 70B at Q4_K_M needs roughly 40 GB. With vLLM's tensor parallelism, you'd split this across 2x A100s. I've tested 70B models and found vLLM's throughput advantage grows wider at larger parameter counts due to better memory management.
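The "roughly 40 GB" figure follows from parameter count times average bits per weight. The sketch below assumes Q4_K_M averages somewhere between 4.5 and 4.85 bits per weight (it mixes quantization types across tensors, so the exact value varies by model) and counts weights only, with no KV cache or activations. `weight_gb` is a hypothetical helper.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), excluding KV cache."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed range for Q4_K_M's average bits per weight.
for bpw in (4.5, 4.85):
    total = weight_gb(70, bpw)
    per_gpu = total / 2  # tensor parallelism across 2 GPUs splits the weights evenly
    print(f"{bpw} bits/weight: {total:.1f} GB total, {per_gpu:.1f} GB per GPU")
```

That gives roughly 39 to 42 GB of weights, or about 20 GB per GPU when split two ways, before any KV cache is allocated.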
How long did the full benchmark suite take to run?
Each engine required 5 minutes per concurrency level. With 6 concurrency levels, 3 iterations each, plus warmup and data export, the full suite ran about 90 minutes. I automated this with a shell wrapper that cycled through engine configs overnight.