Local LLM Benchmark FAQ and Next Steps Guide

Part 4 of 4. Part 1: Methodology · Part 2: Results · Part 3: When to Use Each Engine

This series covered a lot of data. Here are the most common questions I get about running these benchmarks and deploying the engines in production.

FAQ

What quantization level should I use for production?

Q4_K_M offers the best balance of quality and performance for most use cases. If you need maximum quality and have VRAM to spare, consider Q5_K_M or Q8_0. Avoid Q2_K, as the quality degradation is noticeable.
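To get a feel for what each level costs in memory, here is a rough back-of-the-envelope sketch. The bits-per-weight table is an assumption: Q8_0 is 8.5 bits by construction, but the K-quant figures are approximate averages that vary by model, and `estimate_weights_gib` is a hypothetical helper, not part of any engine.

```python
# Approximate bits per weight for common GGUF quantization types.
# Q8_0 is 8.5 by construction (32 int8 weights plus one fp16 scale per block).
# The K-quant values are rough averages and vary by model (assumption).
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weights_gib(n_params: float, quant: str) -> float:
    """Return the approximate size of the quantized weights in GiB."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 8.03e9  # Llama 3 8B has roughly 8 billion parameters
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant:7s} ~{estimate_weights_gib(n_params, quant):.1f} GiB")
```

This covers the weights only. The KV cache and runtime buffers come on top, so treat the output as a lower bound on VRAM use.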

Why does vLLM use more VRAM than Ollama?

vLLM preallocates a pool of KV-cache blocks, managed by PagedAttention, so that it can batch requests continuously. That pool is what enables the superior throughput scaling, but it costs roughly 3-4 GB of additional VRAM.
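For a sense of scale, the arithmetic below estimates how much KV cache a single token occupies for Llama 3 8B (32 layers, 8 KV heads, head dimension 128) at fp16. The helper names are hypothetical, and treating a whole 3 GiB as cache is a simplifying assumption. The real pool size depends on how vLLM is configured.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_in_pool(pool_gib: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in a pool of the given size."""
    return int(pool_gib * 2**30 // bytes_per_token)


# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                     # 131072 bytes, i.e. 128 KiB per token
print(tokens_in_pool(3, per_token))  # 24576 tokens in a 3 GiB pool
```

Those tokens are shared across every in-flight request, which is why a larger pool translates directly into more concurrent sequences.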

Can I run these benchmarks on consumer GPUs with less VRAM?

Yes, but you’ll need to use smaller models or more aggressive quantization. An RTX 3060 (12 GB) can run Llama 3 8B with Q4_K_M, but batch sizes will be limited.
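A rough way to see why batch sizes shrink on a smaller card is to subtract the weights and a fixed overhead from total VRAM, then divide what is left by the KV cache one full-length sequence needs. This is a sketch under stated assumptions: the 4.6 GiB weight size, the 1 GiB overhead, and the function name are all illustrative, and the 128 KiB per token figure assumes Llama 3 8B with an fp16 cache.

```python
def max_concurrent_sequences(vram_gib: float, weights_gib: float,
                             context_len: int, overhead_gib: float = 1.0,
                             kv_bytes_per_token: int = 131072) -> int:
    """Estimate how many full-length sequences fit in the leftover VRAM.

    kv_bytes_per_token defaults to 128 KiB, the fp16 figure for
    Llama 3 8B (2 * 32 layers * 8 KV heads * 128 head_dim * 2 bytes).
    """
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 2**30
    if free_bytes <= 0:
        return 0  # the model does not fit at all
    bytes_per_sequence = context_len * kv_bytes_per_token
    return int(free_bytes // bytes_per_sequence)


# 12 GiB card, ~4.6 GiB of Q4_K_M weights (assumed), 8192-token context
print(max_concurrent_sequences(12, 4.6, 8192))  # 6
```

Shorter contexts raise the count proportionally, so capping the context length is often the cheapest way to recover batch size on a 12 GB card.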

How does Ollama’s performance compare to the official Ollama benchmarks?

My results align with community reports. Ollama prioritizes simplicity over peak performance. If you need maximum throughput, vLLM is the better choice.

Is llama.cpp still relevant with Ollama available?

Absolutely. llama.cpp’s CPU mode is unmatched, and the C++ core allows embedding in resource-constrained environments where a separate server process isn’t feasible.

What’s the best engine for Kubernetes deployments?

I recommend vLLM for production APIs and Ollama for development. I’ve covered the Ollama vs vLLM decision in detail elsewhere, including Kubernetes manifests.

How often should I re-run these benchmarks?

I suggest re-running the benchmarks quarterly, or whenever any component changes (CUDA, drivers, engine version). vLLM’s rapid development means performance improvements arrive frequently.
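That rule is simple enough to automate. The sketch below is one possible way to encode it, assuming you record the environment alongside each benchmark run. The function name, the dictionary keys, and the version strings are hypothetical placeholders, not values from this series.

```python
from datetime import date, timedelta


def should_rerun(last_run: date, last_env: dict, current_env: dict,
                 today: date, max_age_days: int = 90) -> bool:
    """Return True if the benchmarks are stale or the environment changed."""
    # Any difference in CUDA, driver, or engine version invalidates old numbers.
    if current_env != last_env:
        return True
    # Otherwise fall back to the quarterly schedule.
    return today - last_run >= timedelta(days=max_age_days)


# Placeholder version strings for illustration only
recorded = {"cuda": "12.4", "driver": "550.54", "engine": "0.5.0"}
upgraded = {**recorded, "engine": "0.5.1"}

print(should_rerun(date(2024, 1, 1), recorded, recorded, date(2024, 2, 1)))  # False
print(should_rerun(date(2024, 1, 1), recorded, upgraded, date(2024, 2, 1)))  # True
```

Storing the environment dictionary next to the results also makes it easier to explain a regression later.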

Conclusion

After running Llama 3 8B across all three engines, the path forward is clear: match the engine to your use case, not the other way around.

For development and interactive use, Ollama’s simplicity and low latency win. For production APIs serving multiple users, vLLM’s continuous batching delivers unmatched throughput. And for CPU-only or embedded scenarios, llama.cpp remains the gold standard.

I’ve deployed all three in production at different times. Currently, my production API uses vLLM behind a Kubernetes ingress, while my development cluster runs Ollama for rapid prototyping. The llama.cpp engine sits ready for edge deployments where GPU resources aren’t guaranteed.

Next Steps:

Read the Ollama vs vLLM comparison for a deeper architectural analysis
Check out the Cost Analysis: Self-Hosted AI vs. OpenAI API to understand the financial implications
Deploy your own benchmark harness and share your results with the community
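If you want a starting point for that last item, here is a minimal timing loop. It is a sketch, not the harness from Part 1: `benchmark` and `fake_generate` are hypothetical names, and `generate` stands in for whatever client call you use. The only assumption is that it returns the number of tokens produced.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def benchmark(generate: Callable[[str], int], prompts: Sequence[str],
              warmup: int = 1) -> dict:
    """Time each prompt and report tokens-per-second statistics."""
    # Warm-up runs are discarded so model loading does not skew the numbers.
    for prompt in prompts[:warmup]:
        generate(prompt)

    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)

    return {
        "runs": len(rates),
        "mean_tok_s": mean(rates),
        "min_tok_s": min(rates),
        "max_tok_s": max(rates),
    }


def fake_generate(prompt: str) -> int:
    """Stand-in for a real engine call: sleeps briefly, 'produces' 10 tokens."""
    time.sleep(0.01)
    return 10


if __name__ == "__main__":
    print(benchmark(fake_generate, ["a", "b", "c"]))
```

Swap `fake_generate` for a function that calls your engine and counts the returned tokens, and keep the prompt set fixed between runs so the results stay comparable.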

The local LLM field moves fast. These numbers represent a snapshot in time. The methodology I’ve shared lets you track performance as the tools evolve.

Eduardo is an AI & DevOps engineer who has deployed AI infrastructure across bare-metal and cloud Kubernetes clusters. He believes in reproducible benchmarks and honest performance reporting.
