Self-Hosted AI vs API: What I Actually Pay (And What I Actually Get)
I spent three months self-hosting LLMs before I realized I was paying more in time than I would have paid OpenAI in dollars. The math is not as simple as the spreadsheets make it look.
The self-hosted vs API debate is full of bad math. Someone posts a spreadsheet showing that an A100 pays for itself at 10 million tokens per day. Someone else replies that the real number is 50 million. Both are probably wrong for their own workload, because the inputs they are using do not match yours.
I have been through this enough times, both in my own setups and in conversations on r/LocalLLaMA, to believe that the question is not “what is the break-even point?” The question is “what do I actually pay, and what do I actually get?”
This post is what I actually pay. It is opinionated, approximate, and based on my own bills.
What I Actually Run
My homelab setup:
- One machine with a GTX 1080 (8GB VRAM, bought used for €300)
- Three-node K3s cluster on older hardware
- Ollama for quick local inference
- vLLM for heavier workloads (rarely used)
- LiteLLM as a gateway to route between local and API models
My API usage:
- OpenRouter for experiments and fallback
- Kimi Code for coding tasks
- Occasionally ZAI or MiniMax when I want to test something new
- Sometimes GPU hosting for tests when I need more VRAM than my 1080 can provide
What I Actually Pay (Monthly)
| Cost | Amount | Notes |
|---|---|---|
| Electricity (GTX 1080 at ~80% load, ~4 hrs/day) | ~€15/month | Measured with a smart plug. Electricity in Spain is expensive. |
| Internet (portion for homelab) | ~€10/month | I would pay this anyway, but I allocate a portion |
| API bills (OpenRouter + Kimi + others) | ~€35/month | Variable. Spikes when I am experimenting. |
| GPU hosting for tests | ~€20/month | Only when I need more than 8GB VRAM |
| Hardware amortization (€300 over 3 years) | ~€8/month | Assuming I keep it that long |
| Total self-hosted loaded cost | ~€48/month | Plus my time |
| Total API-only equivalent | ~€60-80/month | Estimated for my token volume |
The self-hosted number looks cheaper. It is not, because it leaves out the time I spend on it.
What My Time Actually Costs
Here is what I tracked for one month:
| Activity | Time | What I Was Doing |
|---|---|---|
| Driver updates and CUDA debugging | 2 hours | NVIDIA driver 535 broke Ollama. Had to roll back. |
| Model downloads and conversions | 1.5 hours | Converting GGUF to safetensors for vLLM. |
| Quantization tuning | 2 hours | Finding the right Q4_K_M vs Q5_K_M tradeoff for my use case. |
| Monitoring and alerting setup | 1 hour | Prometheus scraping for GPU metrics. |
| Random debugging | 2 hours | Pod restarts, OOMs, mysterious slowdowns. |
| Total | 8.5 hours | In one month |
At my consulting rate, 8.5 hours is €850. At my salary rate, it is still €425. Even at “hobby time” valuation, it is real time I could have spent on something else.
The API does not require any of this. I pay the bill and it works. That is worth something.
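To make the time arithmetic concrete, here is a minimal sketch that adds the tracked hours to the monthly cash cost. The hourly rates are not stated directly in the post; they are the ones implied by the figures above (€850 / 8.5 h and €425 / 8.5 h), and the names are illustrative only.

```python
HOURS_PER_MONTH = 8.5          # tracked total from the time table
SELF_HOSTED_CASH_EUR = 48.0    # stated monthly self-hosted total, excluding time
API_ONLY_EUR = (60.0, 80.0)    # estimated API-only range for the same volume

# Assumption: rates implied by the text (850 / 8.5 and 425 / 8.5), plus "hobby time" at zero.
RATES_EUR_PER_HOUR = {"consulting": 100.0, "salary": 50.0, "hobby": 0.0}


def loaded_cost(rate_eur_per_hour: float) -> float:
    """Monthly self-hosted cost once maintenance time is priced in."""
    return SELF_HOSTED_CASH_EUR + HOURS_PER_MONTH * rate_eur_per_hour


for label, rate in RATES_EUR_PER_HOUR.items():
    print(
        f"{label:>10}: €{loaded_cost(rate):.0f}/month self-hosted "
        f"vs €{API_ONLY_EUR[0]:.0f}-{API_ONLY_EUR[1]:.0f}/month API-only"
    )
```

Only the zero-rate row comes out below the API-only estimate, which is the point of the section.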
What I Actually Get
Self-hosted advantages I actually use:
- No rate limits. I can run as much local inference as my hardware allows.
- No data leaves my network. Important for some experiments.
- I can run models that are not on any API. Useful for testing new releases.
- It is fun. I enjoy tinkering. This is not a business reason, but it is a real reason.
Self-hosted disadvantages I actually hit:
- Model quality is lower than GPT-4o for most tasks. Llama 3.1 8B is good, but it is not GPT-4o.
- Context window is small. My GTX 1080 cannot run 128K context models. It struggles with 32K.
- Setup time for new models is real. Download, convert, test, tune. Every time.
- Downtime is my problem. When the power goes out or the driver breaks, I fix it.
- VRAM is the bottleneck. 8GB is not enough for modern models. I often need to use APIs for anything serious.
API advantages I actually use:
- Better model quality for complex tasks.
- Zero maintenance. I do not think about drivers, CUDA, or quantization.
- Scales to zero. When I do not use it, I do not pay.
- Faster for small tasks. No cold start, no model loading.
- Access to models that do not fit on my hardware.
API disadvantages I actually hit:
- Rate limits during heavy use. I have hit OpenRouter limits during experiments.
- Costs can spike. A bad loop or a runaway script can burn €10 in an hour.
- Data leaves my network. Not ideal for sensitive workloads.
- Vendor lock-in. Pricing changes, models get deprecated.
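The cost-spike risk in particular can be limited with a hard spending cap in the calling script. The following is a rough sketch, not a feature of any provider's SDK: `SpendGuard`, `BudgetExceeded`, and the per-1K-token prices are all hypothetical, and real prices would have to come from your provider's pricing page.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recording a call would push spending past the cap."""


class SpendGuard:
    """Tracks estimated spend and refuses to go past a fixed limit."""

    def __init__(self, limit_eur: float) -> None:
        self.limit_eur = limit_eur
        self.spent_eur = 0.0

    def charge(
        self,
        prompt_tokens: int,
        completion_tokens: int,
        eur_per_1k_prompt: float,      # hypothetical price, per 1,000 prompt tokens
        eur_per_1k_completion: float,  # hypothetical price, per 1,000 completion tokens
    ) -> float:
        cost = (
            prompt_tokens / 1000 * eur_per_1k_prompt
            + completion_tokens / 1000 * eur_per_1k_completion
        )
        if self.spent_eur + cost > self.limit_eur:
            raise BudgetExceeded(
                f"cap of €{self.limit_eur:.2f} would be exceeded "
                f"(spent €{self.spent_eur:.2f}, next call €{cost:.2f})"
            )
        self.spent_eur += cost
        return cost
```

Calling `charge` with the token counts from each response turns a runaway loop into an exception instead of a surprise bill.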
The Real Break-Even
For my workload, the break-even is not about token count. It is about what I value.
If I value my time at zero and I enjoy the tinkering, self-hosted is cheaper.
If I value my time at anything reasonable, the API is cheaper for everything except high-volume, latency-sensitive, privacy-critical workloads.
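How low "zero" has to be can be estimated from the numbers already given. This sketch assumes the €48 self-hosted total, the €60-80 API-only estimate, and the 8.5 tracked hours from earlier in the post; the function name is illustrative.

```python
def break_even_rate(api_eur: float, self_hosted_eur: float, hours: float) -> float:
    """Hourly value of time at which both options cost the same per month."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (api_eur - self_hosted_eur) / hours


low = break_even_rate(60.0, 48.0, 8.5)   # against the low API estimate
high = break_even_rate(80.0, 48.0, 8.5)  # against the high API estimate
print(f"Break-even: €{low:.2f} to €{high:.2f} per hour")
```

Under those assumptions, self-hosting only wins if an hour of my time is worth less than roughly €1.40 to €3.80.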
My actual split now:
- Self-hosted: Local experiments, privacy-sensitive code, small models that fit on the GTX 1080.
- API: Production tasks, complex reasoning, coding assistance, models larger than 8B, anything where downtime is annoying.
This is not a purity test. I use both. The question is which one for which task.
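The split above can be written down as a small routing rule. This is a sketch of the decision only, not my actual LiteLLM configuration: the `Task` fields, the 8B threshold constant, and the rule that privacy wins over everything else are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: models up to 8B parameters are treated as fitting on the GTX 1080.
LOCAL_MAX_PARAMS_B = 8.0


@dataclass
class Task:
    privacy_sensitive: bool = False
    production: bool = False
    complex_reasoning: bool = False
    model_params_b: float = 8.0  # size of the model the task needs, in billions


def choose_backend(task: Task) -> str:
    """Return "local" or "api" following the split described above."""
    if task.privacy_sensitive:
        return "local"  # data should not leave the network
    if task.model_params_b > LOCAL_MAX_PARAMS_B:
        return "api"    # does not fit in 8GB of VRAM
    if task.production or task.complex_reasoning:
        return "api"    # quality and uptime matter more than control
    return "local"      # experiments and small models


print(choose_backend(Task(model_params_b=70.0)))
print(choose_backend(Task(privacy_sensitive=True)))
```

Putting the privacy check first is a design choice: a privacy-sensitive task that needs a bigger model gets a smaller local model rather than a trip to the API.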
What I Would Do Differently
Three things I learned from doing this wrong:
- Start with the API. Prove the workload exists and the model quality is acceptable before you buy anything. I bought the GTX 1080 before I knew what I would use it for. I would not do that again.
- Rent before buying. A month on a cloud GPU costs a fraction of what an A100 costs to buy, and it tells you whether your assumptions about throughput and latency are real. I did not do this. I should have.
- Track your time honestly. The hardware cost is not the real cost. The time you spend debugging CUDA is the real cost. I tracked it for one month and was surprised.
What I Tell People Now
When someone on r/LocalLLaMA asks “should I self-host?” I ask three questions:
- Do you enjoy system administration? If no, self-hosting will make you miserable.
- Is your workload steady and high-volume? If no, the API is probably cheaper.
- Do you have a specific privacy or latency requirement? If yes, self-hosting might be worth it. If no, the API is probably fine.
Unless the answer to all three is a clear yes, I tell them to start with the API. They can always self-host later. The reverse is harder.
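Written as code, the rule is short. The function and parameter names here are made up for illustration; the logic is just the three questions combined.

```python
def recommend(
    enjoys_sysadmin: bool,
    steady_high_volume: bool,
    privacy_or_latency_need: bool,
) -> str:
    """Self-hosting is only suggested when all three answers are a clear yes."""
    if enjoys_sysadmin and steady_high_volume and privacy_or_latency_need:
        return "consider self-hosting"
    return "start with the API"


print(recommend(True, False, True))
```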
The Honest Conclusion
Self-hosted AI can be cheaper than API access, but only if you value your time at zero and you enjoy the work. For most people, the API is the better deal. It is not just about the token cost. It is about the total cost of ownership, including the cost of your attention.
I still self-host because I enjoy it and because some of my work is privacy-sensitive. But I am honest about the cost. It is not saving me money. It is costing me time in exchange for control and fun.
If you are making this decision for a team, the math is even simpler. Multiply my 8.5 hours by your team size. That is what self-hosting actually costs. Compare that to the API bill. The answer is usually obvious.
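As a rough sketch of that multiplication: the team size and hourly rate below are hypothetical placeholders, and scaling linearly with headcount is the simplification stated above, not a measured result.

```python
def team_time_cost(
    hours_per_person: float,
    team_size: int,
    rate_eur_per_hour: float,
) -> float:
    """Monthly cost of maintenance time, scaled linearly with team size."""
    return hours_per_person * team_size * rate_eur_per_hour


# Hypothetical team: 5 people, valued at €50 per hour each.
monthly = team_time_cost(8.5, 5, 50.0)
print(f"€{monthly:.0f}/month in time alone")
```

Put that figure next to the team's expected API bill before buying any hardware.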
The GTX 1080 is a good learning tool. It is not a production inference server. Know the difference.