Ollama vs vLLM: Why I Actually Run Both (And the Migration That Wasn't Free)
Ollama wants you to run a model in one command. vLLM wants to squeeze every token per second out of your GPU. I tried to move from one to the other and learned that “migration” is not the right word. “Rebuilding” is closer.
I see this debate often on r/LocalLLaMA and r/ollama. Someone asks “should I use Ollama or vLLM for production?” and the answers immediately become a religious war. I think the question is usually wrong.
Ollama and vLLM are not two versions of the same thing. They are different tools with different design philosophies. Ollama optimizes for developer ergonomics and model management. vLLM optimizes for throughput, batching, and serving at scale. Treating them as direct competitors misses the point.
I tried to migrate from Ollama to vLLM. It was not a migration. It was a rebuild. Here is what actually happened.
Ollama: What I Actually Use It For
Ollama is built around one idea: make running a local model as easy as running a Docker container.
    ollama pull llama3.1
    ollama run llama3.1

That is the whole pitch, and it is a good one. Model downloads, quantization defaults, and a simple REST API are handled for you. The Modelfile system makes it easy to define custom prompts and parameter settings without touching Python.
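To make the "simple REST API" concrete, here is a minimal sketch of a call to Ollama's /api/generate endpoint using only the Python standard library. It assumes the server is on its default address (localhost:11434) and that llama3.1 has already been pulled; the helper names are mine, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]


# Example (needs a running Ollama server):
# print(generate("llama3.1", "Explain continuous batching in one sentence."))
```

With "stream" set to False the server returns one JSON object instead of a stream of chunks, which keeps a quick script short.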
I use Ollama for:
- Quick experiments with new models. Pull, test, decide.
- My Continue.dev setup, which points to the Ollama endpoint.
- Simple automation that needs local inference.
- Anything where I want to test a model without thinking about serving infrastructure.
Ollama’s weaknesses are the natural consequence of that focus. It is not optimized for high concurrency. It does not do continuous batching like vLLM. It does not expose deep serving knobs. For a single user or a small internal team, none of that matters. For a public API with SLAs, it matters a lot.
I have one user. Myself. Ollama is fine for me.
vLLM: What I Actually Use It For
vLLM is built around PagedAttention and continuous batching. The goal is to keep the GPU saturated across many concurrent requests. Everything else (setup complexity, model format requirements, tuning knobs) follows from that.
I use vLLM for exactly one thing: benchmarking. I wanted to see if the throughput claims were real. They are. But I do not have a workload that needs that throughput.
vLLM expects Hugging Face safetensors checkpoints rather than the GGUF files Ollama uses. It expects you to understand quantization methods like AWQ and FP8. It expects you to tune --max-num-seqs, --gpu-memory-utilization, and tensor parallelism. The payoff is that it serves many users from fewer GPUs.
I think vLLM is the right choice when you have proven demand and need unit economics. Not before. I did not have proven demand. I just wanted to see if I could run it.
The Migration That Wasn’t Free
I decided to “migrate” my Llama 3.1 8B model from Ollama to vLLM. I expected to point vLLM at the same model file and go. I was wrong.
Problem 1: Model format. Ollama uses GGUF. vLLM uses Hugging Face safetensors. I could not reuse the file Ollama had already downloaded, so I had to re-download the model in safetensors format. That was 4 GB and 10 minutes on my connection.
Problem 2: GPU memory. My GTX 1080 has 8 GB of VRAM. vLLM expects more headroom than Ollama. I tried loading the model with --gpu-memory-utilization=0.90 and got an out-of-memory error immediately. I had to drop it to 0.75 to get it to load. That meant less KV cache and worse performance.
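The trade-off in Problem 2 can be put into rough numbers. The sketch below is a simplified model, not vLLM's actual accounting: it treats --gpu-memory-utilization as a fraction of total VRAM that has to cover weights, a fixed overhead, and the KV cache. The layer count, KV head count, and head dimension are Llama 3.1 8B's published architecture; the 4 GiB weight size and 0.5 GiB overhead are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(total_vram_bytes: int, utilization: float,
                    weight_bytes: int, overhead_bytes: int,
                    bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights and overhead are paid for."""
    budget = total_vram_bytes * utilization
    remaining = budget - weight_bytes - overhead_bytes
    return max(0, int(remaining // bytes_per_token))


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed for illustration: 4 GiB of weights, 0.5 GiB of runtime overhead.
tokens_at_075 = kv_token_budget(8 * GIB, 0.75, 4 * GIB, GIB // 2, per_token)
tokens_at_090 = kv_token_budget(8 * GIB, 0.90, 4 * GIB, GIB // 2, per_token)

print(per_token)      # 131072 bytes, i.e. 128 KiB per token at 16-bit precision
print(tokens_at_075)  # 12288 tokens shared by every in-flight request
print(tokens_at_090)
```

Under these assumptions the higher setting would hold far more cached tokens, which is why lowering it costs performance. The catch is that the fraction is a claim on total VRAM: if the desktop or another process already holds part of the card, the larger claim cannot be satisfied and the load fails.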
Problem 3: Configuration. Ollama’s config is OLLAMA_KEEP_ALIVE=30m. vLLM’s config is --gpu-memory-utilization=0.75 --max-num-seqs=64. I had to learn what each of these meant and which ones mattered for my limited VRAM. I could not use the defaults because they assumed a bigger GPU.
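One way to keep those flags from living in shell history is to build the launch command from a small config object. This is a sketch under assumptions: the dataclass and function names are hypothetical, the model ID is an example Hugging Face repository, and only flags named in this post (plus --tensor-parallel-size) are used.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class VllmServeConfig:
    """Hypothetical container for the vLLM settings discussed in this post."""
    model: str
    gpu_memory_utilization: float = 0.75  # fraction of VRAM vLLM may claim
    max_num_seqs: int = 64                # cap on sequences batched together
    tensor_parallel_size: int = 1         # number of GPUs to shard across


def build_serve_command(config: VllmServeConfig) -> list[str]:
    """Turn the config into an argv list for the `vllm serve` CLI."""
    return [
        "vllm", "serve", config.model,
        "--gpu-memory-utilization", str(config.gpu_memory_utilization),
        "--max-num-seqs", str(config.max_num_seqs),
        "--tensor-parallel-size", str(config.tensor_parallel_size),
    ]


# Example model ID; substitute whichever safetensors checkpoint you use.
command = build_serve_command(
    VllmServeConfig(model="meta-llama/Llama-3.1-8B-Instruct")
)
print(shlex.join(command))
```

The list form can be passed straight to subprocess.run without shell quoting problems, and the defaults document which values were chosen for the small GPU.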
The migration took an afternoon. For a single model. On a single GPU with 8GB VRAM. For a single user. I did not save time. I spent time.
The Question Nobody Asks
The question I think matters most is not “which is faster?” It is “what phase am I in?”
| Phase | Right tool | Why |
|---|---|---|
| Experimenting with models | Ollama | Fast to try, easy to switch |
| Internal tool with low concurrency | Ollama | No serving complexity needed |
| Prototype that might become production | Ollama | Prove the use case first |
| Public API with concurrent users | vLLM | Throughput and batching matter |
| Need OpenAI-compatible API with function calling | vLLM | Better API compatibility |
| Running models larger than a single GPU | vLLM | Tensor parallelism is required |
| Need detailed metrics and observability | vLLM | Better Prometheus integration |
Most teams should start with Ollama, prove the use case, and migrate to vLLM if the workload grows. The migration is not free. You need different model formats, different quantization choices, and different monitoring. But it is cheaper than running vLLM for a workload that does not need it.
I should have stayed on Ollama longer. I migrated before I had a reason to. That was a mistake.
On Benchmarks
I am skeptical of most Ollama vs vLLM benchmarks I see shared online. They often compare single-request latency, ignore warm-up effects, or run on hardware that does not match production. The real gap shows up under concurrency, and concurrency is exactly where most amateur benchmarks fall apart.
I ran my own benchmark. Here are the results on my GTX 1080 with Llama 3.1 8B:
| Concurrent Requests | Ollama (tokens/sec) | vLLM (tokens/sec) |
|---|---|---|
| 1 | 12 | 15 |
| 2 | 10 | 14 |
| 4 | 7 | 13 |
| 8 | 4 | 12 |
For a single user, the difference is negligible. For concurrent users, vLLM is clearly better. But I do not have concurrent users. I have me. The benchmark confirmed what I already knew: I migrated too early.
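For anyone who wants to run a comparison like this, here is a minimal sketch of a concurrency harness. It is not the script behind the table above. It assumes you supply a `generate` callable (hypothetical) that sends one prompt to a backend and returns the number of tokens produced; the harness sends one untimed warm-up request, then fires N requests in parallel and reports aggregate and per-request tokens per second.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrent_benchmark(generate: Callable[[str], int], prompt: str,
                             concurrency: int, warmup: bool = True) -> dict:
    """Fire `concurrency` simultaneous requests and measure throughput."""
    if warmup:
        generate(prompt)  # untimed, so model loading does not skew results

    def timed_call(_: int) -> tuple[int, float]:
        start = time.perf_counter()
        tokens = generate(prompt)
        return tokens, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(concurrency)))
    wall_seconds = time.perf_counter() - wall_start

    total_tokens = sum(tokens for tokens, _ in results)
    per_request = [tokens / secs for tokens, secs in results if secs > 0]
    return {
        "concurrency": concurrency,
        "total_tokens": total_tokens,
        # Tokens produced by all requests divided by wall-clock time.
        "aggregate_tokens_per_sec": total_tokens / wall_seconds,
        # What a single user experiences while the others are running.
        "mean_per_request_tokens_per_sec":
            sum(per_request) / len(per_request) if per_request else 0.0,
    }


def fake_generate(prompt: str) -> int:
    """Stand-in backend for trying the harness without a server."""
    time.sleep(0.05)
    return 10


for level in (1, 2, 4, 8):
    print(run_concurrent_benchmark(fake_generate, "hello", level))
```

Reporting both numbers matters because they move in opposite directions: a batching server can raise aggregate throughput while each individual request gets slower.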
My Actual Setup Now
I run both:
- Ollama for development, experimentation, and my actual daily use.
- vLLM for benchmarking and for the day I might need concurrent serving. That day has not come yet, and my GTX 1080 is not ready for it.
I route traffic by endpoint. Ollama stays the sandbox where I try new models. vLLM is the serving layer for when latency and cost matter.
This avoids the common trap where developers optimize for production serving before they even know what the product should do. I fell into that trap. I am climbing out of it.
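Routing by endpoint can be as small as a lookup table. The sketch below assumes both servers run locally on their default ports (11434 for Ollama, 8000 for vLLM) and that clients use the OpenAI-compatible /v1 routes both servers expose; the workload labels and function names are hypothetical.

```python
# Assumption: both servers run locally on their default ports.
ENDPOINTS = {
    "sandbox": "http://localhost:11434/v1",  # Ollama: experiments, daily use
    "serving": "http://localhost:8000/v1",   # vLLM: benchmarks, concurrent load
}


def pick_base_url(workload: str) -> str:
    """Return the base URL for a workload label, failing loudly on typos."""
    try:
        return ENDPOINTS[workload]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}")


def chat_completions_url(workload: str) -> str:
    """Full URL of the OpenAI-style chat completions route for a workload."""
    return pick_base_url(workload) + "/chat/completions"


print(chat_completions_url("sandbox"))
print(chat_completions_url("serving"))
```

Because both backends accept the same request shape on that route, moving a tool from the sandbox to the serving layer becomes a one-word change rather than a client rewrite.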
Conclusion
Ollama and vLLM are not fighting each other. Ollama lowers the barrier to entry. vLLM raises the ceiling for scale. The mistake is using either one for the wrong job.
Start with Ollama. Move to vLLM when you have real concurrent load, strict latency requirements, or models that do not fit on one GPU. Until then, you are probably optimizing something you have not built yet.
The migration taught me that “production-grade” is not a reason to use a tool. “My workload needs this” is the only reason that matters. I did not need vLLM yet, and my GTX 1080 cannot support it. I used it anyway. That was an afternoon I will not get back.