Renting the Frontier
What does it cost to run your own AI coding model? I did the math. Cost is not the problem. The real risk is depending on a few vendors you can't control.
TL;DR
- I priced out running my own AI coding model. The question was simple. What would it cost to stop renting from a vendor and run the models myself?
- For one person, it doesn’t pay for itself. Around $2K gets you a Haiku-class model at home and ~$10K gets you near-Sonnet, but you almost never save money against just paying for an API.
- For a company, it can. Batching lets one GPU node serve hundreds of people at once. The real costs are staff and the quality gap, not the GPUs.
- The real issue isn’t cost. It’s control. You don’t control the models you depend on. They change under you, the vendor sets the rules, and you can’t stay on the version that worked.
- Open models have nearly caught last year’s frontier. On coding benchmarks the best open weights now sit right next to Claude Opus 4.6. What self-hosting really buys you is independence, not savings.
- You can test it for a few hundred dollars. Rent a 2xH200 for a weekend, run the real stack, and find out whether you could walk away if you had to.
The Question That Looked Simple
I was looking at the usage page in Claude Code, the row of bars that tells you how much of your week you’ve spent. And I asked myself a question I assumed had a clean dollar answer. What would it cost to host my own coding model, just for me? No vendor. No bars. A machine in my office that I own.
I started pricing GPUs the way you’d price a new laptop. And the further I got, the more I realized I was asking the wrong question.
Because the thing that actually pushed me down this road wasn’t the meter. It was watching the models change under me. Fable shipped, 4.7 and 4.8 landed, and somewhere in there I noticed something I’d been comfortably ignoring. You don’t control the model lineup you depend on. The vendor decides what ships, what changes, and what gets quietly retired. You get migrated forward on their schedule, not yours. That isn’t a complaint about Anthropic specifically. It’s the honest structure of renting capacity from someone else’s frontier.
That apparent obviousness is the tell. The question looked simple, so I kept skipping past it and never actually sat with it.
I’ve written about the relationship with these tools before. First I Was Wrong About AI, about the mindset shift that made me take them seriously. Later, in Stop Burning Tokens, I gestured at local models as a someday option and moved on. This post is me actually doing the math I waved at back then, and finding that cost was never the real question.
Name the Risk Plainly
So let me name it. Depending on a handful of frontier AI vendors, right now, is a real risk. Not a doomsday one. A boring, structural one, the kind you only notice when something breaks.
There are four pieces to it. The first is capability you can’t reproduce. The current frontier (Fable 5 around 95% on SWE-bench Verified, Opus 4.8 around 88.6%) is API-only. The open models that come anywhere near Opus-tier scores are huge, 600 billion to one trillion parameters. Served unquantized at usable speed, they need datacenter GPUs. You cannot put that in your office.
The second is the model roadmap you don’t control. You’re not just locked to a vendor. You’re locked to whatever they ship next, and to however good it turns out to be. Claude 4.7 and 4.8 posted higher benchmark numbers than 4.6, and plenty of people who use these tools for real work felt the newer models were a step back anyway. That’s a perception, not a provable fact, and it’s exactly the kind of thing benchmarks miss. The scores went up. But a long agentic session felt different, in ways that are easy to notice and hard to measure.
You can’t choose to stay on the version that worked best for you. You get migrated forward, and the vendor’s reasons for the change (cost, capacity, their own priorities) aren’t necessarily yours. Pricing, availability, and which models are allowed for which use cases can all move under you with little notice. Vendors have been fickle about it before. The control you thought you had turns out to be thin.
The third is the data promise. Enterprise API agreements, Anthropic’s included, contractually commit to not training on your data. I believe them. But a contract is a promise you trust, not a property of the system. Self-hosting is the only option where the question never even arises.
The fourth is a kill switch you don’t hold. If the terms change, you have no recourse. Self-hosting doesn’t eliminate that risk. It moves it: you trade “trust the vendor” for “trust your own ops team.” You’re choosing who you trust. That’s the honest framing, and it matters for everything that follows.
The Math for One Person
So I priced it out for me, one user, sitting at one desk. The tiers look like this.
| Tier | Hardware | Cost | What it runs | Compares to |
|---|---|---|---|---|
| Entry | Used RTX 3090 (24GB) | ~$700-1,000 used | Qwen3 30B, gpt-oss-20b, Gemma 27B at Q4, ~20-40 tok/s | Below Haiku 4.5, fine for chat and simple coding |
| Enthusiast | RTX 5090 (32GB) | ~$4-5K build | 32B models fast (~45 tok/s), gpt-oss-120b only with offloading | Haiku 4.5-ish, flashes of Sonnet on narrow tasks |
| Dual-GPU | 2x used 3090 (48GB) | ~$2-4K | 70B dense at Q4, gpt-oss-120b comfortably, ~15-30 tok/s | Solid Haiku-plus, approaching old-Sonnet on coding |
| Mac Studio | M3 Ultra, 512GB unified | ~$10K | DeepSeek 671B MoE at 4-bit, the whole model, ~17-20 tok/s at 160-180W | Near-Sonnet, the sweet spot for “frontier-ish at home” |
| Datacenter | 4-8x H100 | $100K+ to own, or rented | Kimi K2.6, DeepSeek V4, GLM-5.1 unquantized, fast | Top open models, not “personal” territory |
One thing worth flagging on that table. The RTX 5090 was supposed to be the affordable hero at a $1,999 list price. It isn’t. AI-driven memory shortages have pushed actual street prices well past $3,500, the one corner of this whole story where consumer GPU prices went the wrong way. The clean $3K build I imagined doesn’t exist anymore.
The Mac Studio is the interesting one. Ten grand, runs the full DeepSeek 671B model at 4-bit, sips 180 watts, lands somewhere near Sonnet on coding. That’s genuinely impressive. It’s also where the cost argument falls apart completely.
Here’s the math that ended the cost conversation. Spend the same ten thousand dollars on API calls instead. At Sonnet’s roughly $15 per million output tokens, that buys something like 650 to 700 million output tokens, or a couple billion if you lean on cheaper input and blended rates. Most people doing personal use will not get through that volume, so the hardware does not break even.
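Here is that arithmetic as a quick sketch. The $10K budget and the $15 per million output tokens come from the paragraph above. The $200-a-month API bill is a hypothetical figure I picked only for illustration, and the sketch ignores electricity, resale value, and input-token pricing.

```python
# Break-even sketch: what a hardware budget buys in API tokens instead.

HARDWARE_USD = 10_000            # Mac Studio tier, from the table above
PRICE_PER_M_OUTPUT_USD = 15.0    # approximate Sonnet output price, from the text
MONTHLY_API_SPEND_USD = 200.0    # ASSUMPTION: hypothetical personal API bill


def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> float:
    """Output tokens a budget buys at a flat per-million price."""
    return budget_usd / price_per_million_usd * 1_000_000


def months_to_break_even(hardware_usd: float, monthly_spend_usd: float) -> float:
    """Months of API spend it takes to equal the hardware price."""
    return hardware_usd / monthly_spend_usd


tokens = tokens_for_budget(HARDWARE_USD, PRICE_PER_M_OUTPUT_USD)
months = months_to_break_even(HARDWARE_USD, MONTHLY_API_SPEND_USD)
print(f"{tokens / 1e6:.0f}M output tokens")   # 667M
print(f"{months:.0f} months to break even")   # 50
```

At that assumed spend the hardware takes a little over four years to pay for itself, before counting power.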
There’s one honest exception. If you already own a gaming GPU or a box for other compute, or would buy one anyway, the cost stops being something you charge fully to AI. A $2K consumer card you wanted regardless, one that holds decent resale value, changes the entry-tier math. A $10K Mac Studio bought purely to run a coding model does not. The dual-use case is real for a 3090 or 5090 you already had your eye on. It doesn’t rescue the Mac Studio.
So for a dedicated purchase, cost is a dead end. Self-hosting for one person is for privacy, tinkering, offline work, or unlimited-volume experiments. It is not for savings. Once I accepted that, the project got more interesting, not less.
The Math for an Org Is a Different Animal
Single-user inference wastes the machine. That Mac Studio doing one request at a time leaves more than 95% of the hardware idle. That is the whole reason personal economics look so bad.
Production serving engines change that. vLLM and SGLang batch requests, so one GPU node serves hundreds of people at once. Throughput on a mid-sized model can hit tens of thousands of tokens per second in total. Each individual request still feels like about 30 tokens per second. (The headline 16,000 tok/s benchmarks usually involve small models, so validate with a pilot, but the shape is right.) Batched inference is the entire economic story.
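One rough way to see what batching buys is to divide aggregate throughput by per-request speed. Both numbers below come from the paragraph above. Treating them as if they describe the same model is a simplifying assumption, so read the result as an upper bound on simultaneous streams, not a measured capacity.

```python
def concurrent_streams(aggregate_tok_s: int, per_request_tok_s: int) -> int:
    """Upper bound on simultaneous generations at a given per-request speed."""
    return aggregate_tok_s // per_request_tok_s


# 16,000 tok/s headline throughput, about 30 tok/s felt by each user.
streams = concurrent_streams(16_000, 30)
print(streams)  # 533
```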
The costs that survive batching are people and the quality gap, not the GPUs. A rented 8xH100 node on a neocloud like Lambda runs about $2-3 per GPU-hour, or roughly $140-210K a year running around the clock. On top of that, someone has to own model serving, quantization, upgrades, and on-call. That’s half an engineer to two engineers of MLOps time, which runs $100-400K a year fully loaded. The GPUs are close to the cheap part.
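The rental figure is plain multiplication: eight GPUs, an hourly rate, and 8,760 hours in a year, which lands at roughly $140K to $210K. A minimal sketch, using the $2-3 per GPU-hour and the $100-400K staffing range from the text and assuming the node never shuts off:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, running around the clock


def annual_node_cost(gpus: int, usd_per_gpu_hour: float) -> float:
    """Yearly rental for one always-on node."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR


low_gpu = annual_node_cost(8, 2.0)
high_gpu = annual_node_cost(8, 3.0)
low_staff, high_staff = 100_000, 400_000  # MLOps range from the text

print(f"GPUs:  ${low_gpu:,.0f} to ${high_gpu:,.0f}")  # $140,160 to $210,240
print(f"Total: ${low_gpu + low_staff:,.0f} to ${high_gpu + high_staff:,.0f}")
```

Pairing the low ends together and the high ends together is only one way to combine the ranges. A cheap node with expensive staffing is just as plausible.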
Sizing depends entirely on workload, and coding agents are the gotcha. A chat user sips tokens. An agentic coding session, the Claude Code-style loop, chugs them continuously, 10 to 100 times the load. One node might serve 300 to 500 concurrent chat users but only 20 to 50 agent sessions. A 1,000-person company doing chat and search might floor around $200-400K a year. An engineering org running agents hard is north of a million.
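A sizing sketch makes the chat-versus-agent gap concrete. The per-node capacities are the conservative ends of the ranges above. The peak concurrency numbers are hypothetical, picked only to show how the node count moves.

```python
import math

# Per-node capacity, taking the low end of the ranges in the text.
CHAT_USERS_PER_NODE = 300
AGENT_SESSIONS_PER_NODE = 20


def nodes_needed(concurrent: int, per_node: int) -> int:
    """Whole nodes required to cover a number of concurrent sessions."""
    return math.ceil(concurrent / per_node)


# ASSUMPTION: hypothetical peak concurrency for two different orgs.
chat_nodes = nodes_needed(300, CHAT_USERS_PER_NODE)       # 300 people chatting at once
agent_nodes = nodes_needed(150, AGENT_SESSIONS_PER_NODE)  # 150 agent sessions at once
print(chat_nodes, agent_nodes)  # 1 8
```

Half as many concurrent sessions, eight times the hardware. That is the agent gotcha in one line.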
The architecture I keep seeing people land on is a hybrid gateway. Employees hit a LiteLLM proxy that handles auth, per-key budgets, logging, and routing. Behind it sits a self-hosted vLLM or SGLang cluster for sensitive data and bulk volume. External frontier APIs handle the hard tasks, the ones you opt into when nothing sensitive is involved. The gateway gives you the kill switch and the audit log either way, and it lets you swap the model underneath without anyone changing their tooling. That last part is itself an answer to the vendor-control problem. (LiteLLM is open source and free, though single sign-on lives in its paid enterprise tier, so budget for that if you need it.)
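The routing policy is simple enough to sketch in plain Python. To be clear, this is not LiteLLM's API or configuration format. Every name here (`Gateway`, `Request`, the backend labels) is hypothetical, and the policy is my reading of the description above: sensitive traffic stays in-house, frontier calls are opt-in and budgeted, and every decision is logged.

```python
from dataclasses import dataclass

SELF_HOSTED = "self-hosted"  # hypothetical label for the in-house vLLM/SGLang cluster
FRONTIER = "frontier-api"    # hypothetical label for an external vendor API


@dataclass
class Request:
    api_key: str
    sensitive: bool        # touches data that must not leave the network
    frontier_opt_in: bool  # caller explicitly asked for the frontier model
    est_cost_usd: float    # rough cost estimate for this call


class Gateway:
    def __init__(self, budgets_usd: dict):
        self.remaining = dict(budgets_usd)  # per-key budget left, in USD
        self.audit_log = []                 # (api_key, backend) for every decision

    def route(self, req: Request) -> str:
        if req.sensitive or not req.frontier_opt_in:
            # Sensitive or default traffic never leaves the network.
            backend = SELF_HOSTED
        elif self.remaining.get(req.api_key, 0.0) >= req.est_cost_usd:
            # Opt-in frontier call that fits the key's remaining budget.
            self.remaining[req.api_key] -= req.est_cost_usd
            backend = FRONTIER
        else:
            # Over budget: fall back to the self-hosted model rather than fail.
            backend = SELF_HOSTED
        self.audit_log.append((req.api_key, backend))
        return backend


gw = Gateway({"alice": 1.0})
print(gw.route(Request("alice", sensitive=True, frontier_opt_in=True, est_cost_usd=0.5)))
print(gw.route(Request("alice", sensitive=False, frontier_opt_in=True, est_cost_usd=0.6)))
```

Because callers only ever see the gateway, swapping the model behind either label changes nothing in their tooling.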
Four Paths, Each a Trade
Here’s the heart of it. You’ve named the risk and you’ve seen the numbers. The choice comes down to four paths, and each one trades away something real.
Stay on the frontier API. You get the highest quality, zero ops burden, and the newest models the day they ship, with nothing to maintain. This is also the maximum-dependence option. You don’t hold the kill switch. The vendor sets the terms. Your data leaves your network. And the no-training promise is a contract you trust, not a wall you built. For most people this is the right default, and there’s nothing wrong with that. Just know what it is.
Move to open-model APIs, OpenRouter and the like. These are the same open models (DeepSeek, Qwen, Kimi), served by third parties. They often cost far less than the proprietary frontier, somewhere between $0.10 and $3 per million tokens, depending on the model and tier. You get near-frontier capability with no hardware and no ops. That covers most of the cost-and-capability case, cheaply. The catch is that your data still leaves your network, so it only half-answers the control concern, and you’ve swapped one trust relationship for another that is often less battle-tested.
Self-host your own. This is the only path where “are they training on my data” never comes up. Privacy and independence stop being a promise and become part of how the system is built. You hold the kill switch and the audit log. The price is everything from the two sections above. You almost never break even for personal use, the quality ceiling is roughly Sonnet from six months ago, the speed feels sluggish in an agent loop, and an org needs real ops headcount. You trade “trust the vendor” for “trust your own ops team.”
Run a hybrid gateway. This is where most serious write-ups land right now. Self-host the sensitive data and the predictable bulk, then route to the frontier only for opt-in tasks where the quality gap genuinely matters. You get control, an audit trail, and the freedom to swap models underneath. The catch is that it has the most moving parts of any option. You run a serving cluster and a gateway and you still pay for frontier APIs. Utilization is destiny here. A node running around the clock to serve 40 hours a week of real traffic (168 hours paid for, 40 used) costs about four times what it should.
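The utilization penalty is just a ratio of hours paid for to hours used. A minimal sketch, assuming the node is billed for every hour of the week whether or not anyone is using it:

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_cost_multiplier(busy_hours_per_week: float) -> float:
    """How many times more each useful hour costs on an always-on node."""
    return HOURS_PER_WEEK / busy_hours_per_week


print(f"{idle_cost_multiplier(40):.1f}x")  # 4.2x
```

Scaling the node down outside working hours, or filling the idle time with batch jobs, pulls that multiplier back toward one.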
Where I Land
Here’s my position, and it’s not the cheapest one.
The benchmark gap to last-gen frontier has basically closed. The best open weights now sit right next to Claude Opus 4.6, last year’s frontier, on coding (~80.4-80.6% versus ~80.8% on SWE-bench Verified). A recent DeepSeek V3.x model at 4-bit lands in the low-to-mid 70s on the same benchmark, on a $10K Mac Studio, in your house. That would have sounded absurd eighteen months ago.
| Model | SWE-bench Verified | Self-hostable |
|---|---|---|
| Claude Fable 5 (current frontier) | ~95% | No, API only |
| Claude Opus 4.8 (current frontier) | ~88.6% | No, API only |
| Claude Opus 4.6 / Sonnet 4.6 (last-gen) | ~80.8% / ~79.6% | No, API only |
| DeepSeek V4-Pro-Max / MiniMax M3 / Qwen3.7 Max | ~80.4-80.6% | Open weights, datacenter-only |
| Recent DeepSeek V3.x (671B, 4-bit) | low-to-mid 70s | Yes, the $10K Mac Studio tier |
| Qwen3 30B / gpt-oss-120b | well below | Yes, the $1-3K tier |
These are third-party aggregator numbers for SWE-bench Verified. Treat them as directional. They move with the source and the month.
Read that table closely and the real gap shows up. It isn’t on benchmarks anymore. It’s between what fits in your house and the current frontier. Roughly “Sonnet from six months ago” is the ceiling for a $10K personal rig, and the gap is wider on long-horizon agentic work than the scores suggest. Benchmarks reward the things models are good at. They undercount the slow grind of a multi-hour agent session.
So if it’s not cheaper and it’s not as good, why back open models at all? Because what’s left to buy isn’t capability. It’s independence. Using open models, paying for them, contributing to them: that is insurance against a kill switch you don’t currently hold. It feels backwards the first few times. You’re choosing a path that isn’t strictly the most capable, on purpose, because the most capable path is one you don’t control.
The economics keep drifting your way, too. Cloud rental GPU prices have fallen something like 64-75% since late 2024, and batched inference already puts hundreds of people on one node. The thing that was clearly uneconomical last year is merely expensive this year. Next year it’s something else.
I haven’t resolved this for myself yet. I haven’t moved off the frontier API for my day-to-day, and I might not. But the question stopped being “can I save money” a while ago, and started being “how much of my workflow do I actually want to rent.”
The cheapest way to answer that isn’t more spreadsheets. It’s a weekend. Rent a 2xH200 instance at $10-15 an hour, stand up vLLM and LiteLLM, point my real tools at a real open model, and live on it. Call it twenty or thirty hours of honest use across a couple of evenings and a Saturday. A few hundred dollars all in. By the end of it, I’d know whether I could actually walk away if I had to.
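The weekend budget checks out as a range. This sketch uses the hours and hourly rates from the paragraph above, and assumes the instance is stopped between sessions so only hours of actual use are billed.

```python
def pilot_cost(hours: float, usd_per_hour: float) -> float:
    """Rental cost for a pilot billed only while the instance is running."""
    return hours * usd_per_hour


low = pilot_cost(20, 10)   # shortest run at the cheapest rate
high = pilot_cost(30, 15)  # longest run at the priciest rate
print(f"${low:.0f} to ${high:.0f}")  # $200 to $450
```

Leave the instance running from Friday evening to Sunday night and the bill grows accordingly.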
That’s probably the next thing I do. Not because I want to leave the frontier. Because I want to know what it costs to be able to.
And honestly? I’m still figuring out where that line is.
David Kerr is the founder of Kerrberry Systems. He builds custom software for businesses that want to own their systems, not rent someone else’s. Find him on LinkedIn or GitHub.