
Stop Burning Tokens. Start Allocating Them.

Token prices keep falling, but the meter that actually stops you is the window. A field guide to treating AI capacity like a budget you allocate.

David Kerr

TL;DR

  • The scarce resource isn’t money, it’s capacity. Token prices keep falling. The meter that actually stops you is the window, and your work gets hungrier every month.
  • Burning tokens is the name of the game. Burning them wisely is what makes a pro. Volume was never the problem. Thoughtless volume is.
  • Maximize the windows by spreading work across buckets. Know your bottleneck (mine is Opus) and deliberately route work into the meters you aren’t exhausting.
  • Get more from the same tokens three ways. Right-size the model, right-size the context, and use the mechanics (batch and caching) instead of watching the clock.
  • Context is a cost nobody watches. Just loading a thread can burn a fifth of a window, and useless context taxes your quality on top. Pruning is a real lever.
  • Do this well and it promotes you. Off QA and babysitting, onto vision and leadership. That’s the real return, and it’s the whole point.

Open the usage page in Claude Code and you get a row of bars. One for the session you’re in. One for the week across every model. One just for Sonnet. Lately, design work draws from the same pool too. They fill as you work, and when one tops out, you stop until it resets.

Spend a minute looking at that screen and something clicks. This is a game board. Somebody designed it. The bars, the categories, the resets, all of it is a set of rules, and there’s a way to play it well or badly.

Here’s the tell. There’s no dollar figure anywhere on that screen. No running total of what you’ve spent. Just capacity, filling up. Anthropic isn’t charging you by the sip. It’s handing you a budget and rationing how fast you can draw it down. Once you see that, the question stops being “how much is this costing me” and becomes “how do I get the most out of each window before it resets.”

I’ve been working through what AI actually changes about this job for a while now. First came “I Was Wrong About AI,” about the mindset shift that made me take these tools seriously. Then “You’re the Broker, Not the Builder,” about the habits that follow once you trust the machine to build. Then “How to Tell If You’re Using AI Well,” about reading whether any of it is actually working. This is the next thing I had to learn. Once the AI is doing the building, the resource you’re really managing is the capacity to run it. And for a while, I managed it badly.

The Meter Isn’t Your Wallet. It’s the Window.

Most people get the economics backwards, and I did too. The assumption is that tokens are getting more expensive, so you should use fewer of them. The first half of that is just wrong. The going rate for a token has fallen by more than 90% since 2023. Measured in dollars per token, AI has never been cheaper, and the line is still heading down.

So why does it feel tighter every month? Because the price isn’t the constraint. The window is. And your appetite is growing faster than the price is falling.

Think about home electricity. The price per kilowatt-hour is cheap and getting cheaper, but if your utility caps you at a fixed amount per week, the cap is what stops you, not the price. Now imagine you keep plugging in hungrier appliances. That’s exactly where AI is right now. Each token costs less, but a reasoning model thinks for thousands of tokens you never see before it answers. An agent loops for an hour. A context window runs to a million tokens. A newer model can spend a third more tokens to say the very same thing. Cheaper per unit, far more units per job.
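
A rough sketch of that arithmetic, in Python. The 90% price drop is the figure quoted earlier; the starting price and the per-job token counts are invented placeholders, not measurements.

```python
def cost_per_job(price_per_million: float, tokens_per_job: int) -> float:
    """Dollar cost of one job at a given price per million tokens."""
    return price_per_million * tokens_per_job / 1_000_000


# Hypothetical numbers, for illustration only.
old_price = 10.00              # dollars per million tokens, before the drop
new_price = old_price * 0.10   # the same token after a 90% price drop
old_tokens = 5_000             # one chat-style answer
new_tokens = 400_000           # an agent looping with hidden reasoning

old_cost = cost_per_job(old_price, old_tokens)
new_cost = cost_per_job(new_price, new_tokens)
growth = new_tokens / old_tokens   # how much hungrier the job became
```

With these placeholder numbers the price per token falls tenfold while the tokens per job grow eightyfold. The job costs more in dollars, and it also takes eighty times the bite out of a meter that counts usage instead of money.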

So the thing that runs out isn’t your wallet. It’s your window. That reframe matters because it changes what you optimize for. You’re not trying to spend less. You’re trying to get more done inside a fixed budget that refills on a clock. Which is really the whole game. Burning tokens is the name of it. Burning them wisely is what separates a pro from someone who just runs up the meter.

Spread the Work Across the Buckets.

The first move is the one the bars are practically begging you to make. That row of meters isn’t one pool, it’s several. There’s the rolling session window, five hours long, and the weekly windows sitting behind it. And inside the week, the categories have their own sub-limits. Sonnet has its own bar. Burn through one and you’re throttled there even if the others have room to spare.

That structure rewards a specific habit. Know your bottleneck, then route around it. Mine is Opus. It’s the model I reach for when the thinking is hard, which means the all-models weekly total is almost always what binds me first. Meanwhile the Sonnet bucket sits half full, because finding work that’s genuinely fine on Sonnet is harder than it sounds once you’ve gotten used to reaching for the strongest model on instinct.

So I’ve made it deliberate. Every task I can honestly hand to Sonnet instead of Opus is capacity I was otherwise going to leave on the table. Sometimes I’ll even take a design task off a local setup that could handle it, just to push the load into a bucket I’m not exhausting. It feels backwards the first few times. You’re choosing a path that isn’t strictly the most capable, on purpose, because the most capable path has a line out the door.
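
That routing habit can be sketched in a few lines of Python. Everything here is hypothetical: the `Bucket` class, the `pick_bucket` helper, and the usage fractions are illustrative, and the numbers would be read off the usage page by hand, since this sketch calls no real API. It also simplifies the meters into independent buckets, which the real ones may not be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    name: str     # label for one meter, e.g. "Sonnet weekly" (hypothetical)
    used: float   # fraction of the window already consumed, 0.0 to 1.0

    @property
    def headroom(self) -> float:
        return 1.0 - self.used


def pick_bucket(eligible: list[Bucket], reserve: float = 0.1) -> Optional[Bucket]:
    """Return the eligible bucket with the most headroom.

    `eligible` holds only the buckets that can honestly handle the task.
    Buckets within `reserve` of their cap are skipped; None means wait.
    """
    open_buckets = [b for b in eligible if b.headroom > reserve]
    if not open_buckets:
        return None
    return max(open_buckets, key=lambda b: b.headroom)


all_models_week = Bucket("all-models weekly", used=0.92)
sonnet_week = Bucket("Sonnet weekly", used=0.50)

# A task that Sonnet can genuinely handle may go to either bucket.
choice = pick_bucket([all_models_week, sonnet_week])
```

Here `choice` is the Sonnet bucket, because the all-models meter is inside the reserve. The honest part is deciding what goes into `eligible` in the first place; the code only makes the tie-break explicit.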

The short windows ladder up into the long ones, too. Max out a five-hour session and you’ve spent against the week without quite noticing. In my own use, on the 20x plan, a fully maxed session feels like it eats something like an eighth of my weekly budget. Your ratio will differ depending on how hard you lean on Opus versus Sonnet. But the lesson holds. Winning the afternoon forces you to think about the week, and winning the week is the real game.
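
Taking that one-eighth figure at face value, the ladder works out as below. This is a back-of-the-envelope sketch in Python; the fraction is a personal impression from one plan, not a published number, and the helper name is made up.

```python
from fractions import Fraction

SESSION_HOURS = 5                              # length of one session window
WEEK_SHARE_PER_MAXED_SESSION = Fraction(1, 8)  # rough personal estimate; yours will differ


def maxed_sessions_per_week(share: Fraction) -> int:
    """How many fully maxed sessions fit before the weekly bar tops out."""
    return int(1 / share)


sessions = maxed_sessions_per_week(WEEK_SHARE_PER_MAXED_SESSION)
hours = sessions * SESSION_HOURS
```

At that ratio, eight fully maxed sessions, about forty hours of flat-out use, would exhaust the week. One maxed afternoon is already a visible share of it.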

None of this is about using less. It’s portfolio management. You have several budgets refilling on different clocks, and the goal is to keep any one of them from becoming the bottleneck while the others sit idle.

Right-Size the Model.

The second move is getting more out of the tokens you do spend, and the biggest lever there is not swinging a sledgehammer at every nail.

You don’t need the flagship model for everything. Anthropic’s own guidance is a three-tier ladder. Reach for Opus when the work is genuinely hard, the multi-hour autonomous runs, the big refactors, the real planning. Let Sonnet be the daily driver. Send the high-volume, repetitive sub-tasks to Haiku. They even say, in their own docs, that tuning your prompt is often a better lever than reaching for a bigger model. That’s a refreshing thing for a model vendor to admit, and it’s true.
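
As a minimal sketch, the ladder and the prompt-first advice fit in a lookup table and one small function. The tier names below are plain labels, not API model identifiers, and `Difficulty`, `choose_tier`, and `next_step` are hypothetical names made up for this example.

```python
from enum import Enum


class Difficulty(Enum):
    HARD = "hard"        # multi-hour autonomous runs, big refactors, real planning
    ROUTINE = "routine"  # everyday work, the daily driver
    BULK = "bulk"        # high-volume, repetitive sub-tasks


TIER_FOR = {
    Difficulty.HARD: "opus",
    Difficulty.ROUTINE: "sonnet",
    Difficulty.BULK: "haiku",
}

LADDER = ["haiku", "sonnet", "opus"]  # cheapest rung first


def choose_tier(difficulty: Difficulty) -> str:
    """Map a task's difficulty to a starting tier."""
    return TIER_FOR[difficulty]


def next_step(current_tier: str, prompt_tuned: bool) -> str:
    """When results disappoint: tune the prompt first, then climb one rung."""
    if not prompt_tuned:
        return f"tune the prompt on {current_tier}"
    rung = LADDER.index(current_tier)
    if rung + 1 < len(LADDER):
        return f"move up to {LADDER[rung + 1]}"
    return "already on the top tier"
```

For example, `next_step("sonnet", prompt_tuned=False)` suggests tuning the prompt before spending any Opus capacity. Classifying the difficulty is the judgment call, and the table only records the decision.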

Some of this is already automated if you look. Claude Code ships an “opusplan” mode that thinks through the plan in Opus and then drops down to Sonnet to do the actual coding. That single switch captures most of the benefit, because planning is where the expensive reasoning earns its keep and execution mostly doesn’t. You can push it further by setting which model your subagents run on, so the cheap parallel work doesn’t quietly bill at flagship rates.

And for the genuinely trivial, you can leave the cloud entirely. Local open-weight models are real now. A capable coding model in the 24-billion-parameter range runs on a Mac with 32 gigs of memory and scores respectably on real coding benchmarks. It won’t architect your system or hold a sprawling multi-file change in its head, but it’ll handle the boilerplate, the renames, the one-off scripts, all without touching a single bar.

I’ll give you a live example. The background research for this post, checking how the usage limits actually work and what the tools really do, ran on Sonnet, not Opus. It was reading and summarizing, not reasoning hard, so there was no reason to spend Opus capacity on it. That’s the move, in miniature. Match the model to the difficulty of the task, not to your habit.

Right-Size the Context.

Here’s the lever almost nobody is watching, and I think it’s the most interesting one.

Loading context costs tokens. Not output, just loading. When you’re running a few threads against a million-token window, simply pulling in the context to get started can burn ten to twenty percent of that window before you’ve done a lick of useful work. Everyone watches the output meter. Almost nobody watches what it costs just to bring the model up to speed.

And not all of that context earns its place. Some of it is genuinely useful. Some is inert, loaded but never referenced. And some is actively harmful. Stale instructions, dead code paths, abandoned tangents from three tasks ago. That last category is the expensive one, because irrelevant context isn’t neutral. It pulls the model toward worse answers. So bloated context bills you twice. Once to load it, and again in degraded quality on the way out.

Which makes pruning a real lever, not housekeeping. Most people only prune reactively, when the window fills up and compaction kicks in to save them. The pros do it on purpose. Scoped memory instead of dragging everything along. Smaller, focused threads instead of one sprawling conversation. A distilled summary instead of the raw back-and-forth. Don’t load a million tokens of context when a hundred thousand would do.
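
Deliberate pruning can be sketched as a filter with a budget. This Python example is an illustration under loud assumptions: the `Chunk` labels are assigned by hand, the four-characters-per-token estimate is a crude placeholder for a real tokenizer, and `prune` is a hypothetical helper, not a feature of any tool.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    kind: str  # "useful", "inert", or "harmful" (labels you assign yourself)


def estimate_tokens(text: str) -> int:
    """Crude placeholder: roughly four characters per token."""
    return max(1, len(text) // 4)


def prune(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Keep useful chunks, in order, until the token budget is spent.

    Inert and harmful chunks are dropped outright.
    """
    kept: list[Chunk] = []
    spent = 0
    for chunk in chunks:
        if chunk.kind != "useful":
            continue
        cost = estimate_tokens(chunk.text)
        if spent + cost > budget:
            break
        kept.append(chunk)
        spent += cost
    return kept


thread = [
    Chunk("Plan: migrate auth to the new session store.", "useful"),
    Chunk("Old instruction: always use the legacy client.", "harmful"),
    Chunk("Full log dump from an unrelated build.", "inert"),
    Chunk("Constraint: keep the public API unchanged.", "useful"),
]
lean = prune(thread, budget=50)
```

The result keeps the plan and the constraint and drops the stale instruction and the log dump. The hard part, which this sketch skips entirely, is deciding which chunks are useful, and that is the tooling gap.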

This connects to something from the broker post. The reason a good plan survives a context reset is that it’s already pruned. It’s the distilled version, the signal with the noise thrown out. Writing a tight plan and writing lean context are the same discipline. I don’t think anyone has nailed the tooling for this yet. Actively managing what a model carries with it, trimming the dead weight automatically, is wide open. If you’re looking for a problem to solve, it’s a good one.

Stop Watching the Clock. Watch the Mechanics.

A question I kept asking was whether timing matters. Does a thousand tokens at noon buy more than the same thousand at midnight? Could I queue work for the cheap hours?

Mostly, no. None of the big Western providers price by time of day. The rate is the rate whether it’s 3pm or 3am. There was a stretch where Anthropic effectively penalized peak hours by shrinking your morning quota, but they removed it. So the romantic idea of a nightly job sneaking in cheap tokens while the world sleeps doesn’t really pay off. The clock is a dead end.

What does pay off is the mechanics, and they pay off big. Anything you can wait on can go through a batch lane that runs at half price, in exchange for up to a day of turnaround. Anything you repeat can be cached, and a cache hit costs about a tenth of the original. Those two levers stack. Queue the work that isn’t urgent, cache the context you keep reusing, and you’ve cut the real cost of a big chunk of your workload without changing a single model.

So the instinct to time your work is right. You’re just timing it against the wrong thing. It’s not the hour of the day, it’s whether the work is urgent and whether it repeats. Sort your work along those two axes and the savings are sitting right there.
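
The two axes can be sketched as a tiny triage function in Python. The multipliers are the figures from the text (half price for batch, about a tenth for a cache hit). Treating them as multiplying together, and the `plan_lane` helper itself, are assumptions made for illustration, and the combined figure applies only to reused input context, not to output tokens.

```python
BATCH_MULTIPLIER = 0.5      # "half price" for work that can wait, per the text
CACHE_HIT_MULTIPLIER = 0.1  # "about a tenth" for a cache hit, per the text


def plan_lane(urgent: bool, repeats: bool) -> dict:
    """Pick the mechanics for one piece of work from the two axes."""
    use_batch = not urgent   # anything that can wait goes to the batch lane
    use_cache = repeats      # anything reused is worth caching
    # Assumption: the discounts multiply, and only for the reused context.
    multiplier = 1.0
    if use_batch:
        multiplier *= BATCH_MULTIPLIER
    if use_cache:
        multiplier *= CACHE_HIT_MULTIPLIER
    return {"batch": use_batch, "cache": use_cache, "relative_cost": multiplier}


weekly_report = plan_lane(urgent=False, repeats=True)
live_debugging = plan_lane(urgent=True, repeats=False)
```

Under these assumptions the non-urgent, repeating job pays roughly a twentieth of the list rate on its reused context, while the urgent one-off pays full price. The sketch ignores the cost of writing the cache the first time.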

Then Build the System That Does It for You.

Do all of this by hand and you’ve just traded one kind of work for another. The deliberate routing, the context pruning, the account juggling, it’s real overhead, and at some point babysitting the resource becomes its own full-time job. The endgame isn’t doing this manually forever. It’s building a system that does it, with you involved only at the planning level and on the calls that actually need judgment.

The infrastructure for this is arriving fast. There are orchestration frameworks like BMAD-METHOD that run a whole documented lifecycle through specialized agents. There’s gastown, which coordinates dozens of parallel Claude Code instances using git as the shared memory between them. I’m running Conductor right now, which keeps a fleet of agents working in isolated worktrees. Each of these is a real commitment. Picking one and leaning in costs you time and tokens to learn, which is exactly why you can’t adopt all of them at once.

The smaller stuff is arriving too. Right now I keep an eye on three accounts and flip to a fresh one when one gets near its limit, all by hand. It works, but it’s exactly the kind of chore that shouldn’t need a human. Tools for juggling multiple accounts are already showing up (claude-swap is one I’ve seen), and that’s the tell that matters. When the ecosystem starts building utilities for a problem, the problem is real and the manual era is ending. I haven’t leaned on them yet. I will.

There’s a trap in here, though, and it’s worth naming. This stuff moves so fast that keeping up is its own tax. Shifting your whole workflow is genuinely expensive, so when you find something that works, it’s fine to sit still and just use it for a while. Don’t chase every release. But set a timer. Stagnation is comfortable, and the field will lap you if you let it. Find what works, ride it, then make yourself look up again.

The Real Return Is Getting Out of the Loop.

So why go to all this trouble over some progress bars?

Because resource management is the skill that promotes you. Every hour I spend acting as QA, as the project manager, as the person babysitting the build, is an hour I’m not spending on the work that actually needs me. Product vision. Long-term planning. The leadership calls that don’t have a clean answer. Those are the things I can’t hand to a model, and they’re the first things crowded out when I’m down in the weeds managing capacity by hand.

Get the routing and the pruning and the allocation running on their own, and you climb out of that loop. Not by lowering the bar, but by raising the floor, so the quality holds without your hand on every lever. The team doing the building might be entirely digital. That doesn’t change the shape of the job. The cycles and the standards I’m refining now are the same ones that buy me a hands-off process later, one that ships good work without me in the middle of every decision.

That’s the real return on managing the resource well. It was never really about saving tokens. It’s about what you free yourself up to do once the machine is spending them wisely. Burning tokens is the name of the game. Burning them wisely is what makes a pro. And learning to do it on purpose is how you stop being the operator and start being the one who decides where the whole thing is pointed.

I’m still figuring out where that ends. But for the first time, the bottleneck isn’t the building anymore. It’s me, and how fast I can climb up to the work that’s actually mine to do.


David Kerr is the founder of Kerrberry Systems. He builds custom software for businesses that want a partner, not a vendor. Find him on LinkedIn or GitHub.
