· AI

CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

How to budget tokens in Claude Code

A surprising Claude Code bill is almost never one big expense. It is four different cost shapes stacked up: a context window that bills every turn, subagents that each cost a fixed chunk, skills that cost almost nothing until used, and caching that can cut the recurring cost or not. Budgeting tokens means knowing the four shapes.

Key takeaways

  • Token spend has four shapes, not one - context, subagents, skills, and caching each cost differently
  • The context window is the recurring cost - every file you load is re-billed on every turn after
  • Budget per task, not per month - decide what a task is worth before you start it
  • Measure the real number - the context readout tells you where you stand mid-session

A Claude Code bill that surprises you is almost never one large, obvious expense. If it were, you would have seen it coming. It is small amounts, charged in four different ways, stacking up under a session that felt ordinary while you ran it.

That is why “use fewer tokens” is useless advice. Tokens are not one thing with one price. They are spent in four distinct shapes, and each shape behaves differently: one is recurring, one is a fixed lump, one is nearly free until it is not, and one is a discount you either earn or miss. You cannot budget a number you think of as flat. You can budget four shapes once you can tell them apart.

So this is not a post about spending less. It is a post about spending on purpose. Knowing the four shapes turns a vague unease about cost into a set of specific, answerable questions, and answerable questions are the whole of budgeting.

Where your tokens actually go

Open a long Claude Code session and the single biggest line item, almost always, is the context window. This is the cost shape people miss, because it does not feel like spending. It feels like the session just working.

Here is the mechanism. Claude Code’s context window holds the entire conversation: every message, every file Claude has read, every command output. And the model is re-sent that whole window on every turn. So a file you asked Claude to read on turn three is not paid for once on turn three. It is paid for again on turn four, and turn five, and every turn after, until the session ends or the context is compacted. A large file read early in a session is not a one-time purchase. It is a subscription you signed without noticing. This is why a session that did not feel expensive can be: nothing in it was a big move, but a dozen file reads each quietly re-billed thirty times adds up to the surprise. The first rule of token budgeting follows directly: what you load, you keep paying for. Load deliberately.

The reason this is hard to see is that the cost and the action are separated in time. You do the loading on turn three. You pay for it on turns four through forty. By the time the bill is real, the moment that caused it is long gone, and nothing on the screen connects the two. A normal expense announces itself. You make a choice, money leaves, and the link is obvious. This one does not work that way. The action is one quiet tool call and the cost is a slow drip that follows you for the rest of the session. So your instinct, the one that has kept your spending sane everywhere else in life, never fires. There was no big moment to flinch at.

Picture a session where you ask Claude to read three large files near the start so it “has the full picture.” It feels responsible. It feels like you are setting the work up well. But you have just made those three files part of every single turn that follows, and most of those turns will never touch two of the three. You paid for breadth you are not using. The fix is not to load nothing. It is to load the file you need for the step you are on, and to treat “read this so it is available later” as the expensive bet it actually is. Later is a lot of turns. Each one re-bills what you loaded for it.

Two more things drive context cost, and both surprise people. Command output counts. When Claude runs a build, a test suite, or a search and the result is a wall of text, that wall enters the context window and is re-sent on every following turn exactly like a file read. A single noisy command can weigh more than several careful file reads. And conversation itself counts. Every question you ask and every answer Claude gives stays in the window and is re-billed. A long, meandering session is not just slow. It is literally more expensive per turn than a short, direct one, because the model is dragging the entire history of the chat behind it on every reply. None of this is a flaw. It is just how the window works, and once you can see it, you stop being surprised by it.

The four cost shapes

The context window is one shape. Three more sit alongside it, and a budget is just knowing which shape each action belongs to.

The context window is the recurring shape. It grows as you work and is re-billed every turn, so its cost is roughly the size of the window multiplied by how many turns remain. A subagent is the fixed-lump shape. Every subagent you spawn pays for a whole fresh context window and a delegation message to brief it; that overhead is real and it does not shrink, which is why spawning a subagent for a tiny task is poor budgeting. A skill is the nearly-free shape. The official skills documentation is exact about this: “a skill’s body loads only when it’s used, so long reference material costs almost nothing until you need it.” A skill sits in context as a short description and costs next to nothing until something invokes it. And caching is the discount shape. A cache read costs a fraction of sending the same tokens fresh, while a cache write costs a little more than a fresh send, and the cache lives only a few minutes before it expires. Caching does not lower your token count. It changes the price of the tokens you were going to send anyway, if you arrange your session to keep hitting the cache. The mechanics of that are their own subject, covered in LLM caching strategies.

Why does the shape matter so much? Because the same action can be wise or wasteful, and the only way to tell is to know which shape it belongs to. Spawning a subagent is a good move when the task is noisy and self-contained, and a bad move when the task is tiny, and nothing about the action itself tells you which case you are in. The lump is the same either way. What changes is whether the work you delegated is large enough to be worth a whole fresh window. “Use fewer tokens” cannot answer that. “Is this lump worth paying?” can. The advice has to be shaped like the cost, or it gives you nothing to decide with.

The four shapes also fail in four different ways, which is the practical reason to keep them separate in your head. You overspend on context by loading too much and never clearing it. You overspend on subagents by reaching for one out of reflex on work that did not need isolation. You barely overspend on skills at all, because their whole design is to cost nothing until used; the only way to waste a skill is to never invoke one and keep pasting its content into chat instead. And you miss the caching discount by working in long, scattered bursts that let the cache expire between turns. Four shapes, four distinct mistakes, four distinct fixes. A single mental model of “tokens” blurs all four into one fog you cannot act on. Pull them apart and each one becomes a question with an answer.

There is one more reason this split is worth holding onto. It tells you where your attention is best spent. For most people the context window is the shape that dominates the bill, so that is where a few minutes of care returns the most. Subagents matter, but you spawn far fewer of them than you make turns. Skills are close to free by design. Caching is a discount you set up once and then mostly stop thinking about. So if you only had the energy to watch one shape, watch the recurring one. The other three are worth knowing, and the rest of this post covers them, but the context window is the shape that quietly decides whether your session was cheap or expensive.

The four shapes of Claude Code token spend: context window, subagents, skills, and caching

Budget by the task

Most people who try to control AI cost reach for a monthly number. A monthly cap tells you nothing useful in the moment, because by the time you have breached it the spending already happened. The unit of token budgeting is not the month. It is the task.

A monthly cap fails for a simple reason. It is a measurement, not a decision. It tells you, after the fact, that the total got too high. It does not tell you, at the moment a choice is in front of you, whether this particular choice is the wasteful one. By the time the cap turns red, every decision that pushed it there is already in the past and cannot be unmade. It is a smoke alarm that goes off after the room has burned. Useful as a record, useless as a control. The task is different because the task is where the decisions live. A task is a thing you are about to start, which means it is a thing you can still shape.

Before a task starts, decide what it is worth. Think in proportion rather than money, since a money figure would be meaningless here and would age badly anyway. Is this a task that deserves the whole context window filled with relevant files, several subagents, and an hour of back and forth? Or is it a task that deserves one file and a direct answer? Most tasks are the second kind and get treated like the first, because nobody set the proportion at the start. A task with a ceiling behaves differently from a task without one. You load less speculatively, you delegate only when the isolation earns its lump, you stop when the answer is good rather than when the model runs out of room. If you are setting these proportions for a whole team rather than just yourself, that is the kind of operating discipline Blue Sheen helps put in place. The point is small and it is the entire post: you cannot overspend on a task whose worth you decided in advance.

What does setting the proportion actually look like in practice? It is not a spreadsheet. It is one short thought before you type the first prompt. Say the task is a small bug fix in a file you already know. The real proportion is low: one file, a focused question, no subagents, done in a few turns. Say the task is a redesign that touches a dozen files and needs the model to hold a lot of structure in its head at once. The proportion is high, and that is fine, because the work is large enough to deserve it. The mistake is never matching effort to a big task. The mistake is pouring big-task effort into small-task work because you never paused to size the work first. The pause costs you a few seconds. Skipping it costs you a session.

Here is the trap most people fall into. Without a ceiling, a task does not end when the answer is good. It ends when something external stops it. The model runs low on room. You get tired. The context compacts and the thread loses its grip on the detail. None of those is “the work is done.” They are just the work running out of fuel. A task with a stated worth has a natural finish line: the answer is good enough for what you decided this was worth, so you stop. A task without one drifts, because there is nothing in the loop telling it to stop, and drift is where most overspend hides. It is rarely one reckless move. It is a task that should have taken ten turns quietly taking forty because nobody decided, at the start, that ten was the budget.

Watch the real number

You cannot budget what you cannot see, and Claude Code does show you. The context window has a visible usage readout, and a custom status line can keep token usage in front of you continuously while you work. The number people guess at is sitting on the screen.

Build the habit of glancing at it. When the context window is filling fast, that is information: something heavy went in, and it is now being re-billed on every turn. The readout is also how you catch the moment to act. Claude Code compacts the context automatically when it gets full, summarizing to free room, but you do not have to wait for that. Watching the number lets you compact on purpose between tasks, or clear the context when you switch to something unrelated, before the bloat has cost you twenty turns of re-billing. A budget you check is a budget. A budget you set and never look at is a wish. The difference is one glance, every few minutes, at a number that is already there.

The readout is not just a number. It is a rate, if you watch it over a few turns instead of once. A window that climbs a little each turn is the work moving along normally. A window that jumps in one turn is a signal: something heavy just entered, a big file or a noisy command result, and from now on it rides along on every reply. You do not need to react to every climb. You do need to notice the jumps, because each jump is a new line item that quietly attached itself to the rest of your session. Reading the rate, not the snapshot, is what turns the readout from a curiosity into a control.

So what do you actually do when the number is too high? You have two moves, and the choice between them is easy once you know what each one does. Compact when the work continues but the window is bloated. Compaction keeps the thread of the task and summarizes the bulk away, so you carry forward the meaning without carrying forward every raw file and command output that produced it. Clear when the work is finished and the next thing is unrelated. A fresh start drops the whole window, which is exactly right when none of it applies to what you are about to do. The mistake is doing neither: pushing a bug fix’s worth of context into a documentation task and re-billing all of it, turn after turn, for nothing. The readout is what tells you the moment has come. The two moves are what you do about it.

Letting the automatic compaction do all the work feels easier, and it is, but it is not free. By the time Claude Code compacts on its own, the window was already full, which means you already paid the full-window price on every turn that led up to it. Automatic compaction rescues you. It does not refund you. The whole advantage of watching the number is that you act before the rescue is needed, at the seam between two tasks, when clearing or compacting costs you nothing because the old context was about to become useless anyway. That is the cheapest possible moment to reset, and it is invisible unless you are looking. The glance is small. What it buys you is the timing.

Five habits that cut spend

Knowing the shapes is the theory. Five habits put it into practice, and none of them costs you any quality.

Load narrow. Ask for the specific file or the specific function, not the directory. Every file you pull in joins the recurring cost, so the cheapest token is the one you never loaded. The instinct to load wide comes from a good place. You want the model to have everything it might need. But “might need” is the expensive phrase. Each file loaded on a “might” is paid for on every turn whether the might comes true or not, and most of them never do. Loading narrow is not about starving the model. It is about loading on a “need” instead of a “might,” and reaching for the rest only when a real step actually calls for it. The directory you skipped is still there. You can pull one file from it the moment you actually need that file, and not a turn before.

Clear context between unrelated tasks. When you switch from a bug fix to a documentation pass, the bug-fix context is now pure overhead, re-billed every turn for nothing. Resetting it is the single largest saving available to most sessions. This one feels wasteful, which is why people skip it. You built up all that context; clearing it feels like throwing away work. It is not. The context for a finished task is not an asset, it is luggage, and you are paying to carry it on every turn of the next task that will never open it. The work you did is in the files you changed. The context that produced it has done its job. Letting it go at the seam between two tasks is the cheapest move in this whole post, and it is the one most sessions never make.

Delegate by the lump, not by reflex. A subagent costs a fixed chunk every time. Spend that chunk when the isolation is worth it, on a noisy, file-heavy side task, and not on something small you could have answered in the main session. The test is simple. Ask whether the side task would dump a lot of clutter into your main window if you did it inline. If yes, the lump buys you a clean main session and is worth paying. If the side task is small and tidy, a subagent just adds a fresh window and a briefing message on top of work that would have cost almost nothing in place. Delegation is a tool for keeping the main thread clean, not a default setting. Reach for it when the mess it prevents is bigger than the lump it costs.

Move recurring instructions into skills or CLAUDE.md. Anything you paste into chat repeatedly is being paid for as fresh context every time. The same text as a skill costs almost nothing until invoked. Think about what you actually retype across sessions: coding conventions, project rules, the way you want output formatted, the steps of a workflow you run often. Pasted into chat, every one of those is fresh context, re-billed for the rest of that session, every session. Moved into a skill, the same words sit as a short description that costs next to nothing and only loads its full body when something needs it. You wrote the instructions once either way. The choice is whether you pay for them once or pay for them forever.

Keep sessions tight enough to cache. Caching only helps if you keep hitting the cache, and long idle gaps let it expire. Working in focused stretches is not just good for attention. It is what keeps the discount alive. The cache has a short life by design, so the way you lose it is not dramatic. You step away, you get pulled into something else, you come back twenty minutes later, and the discount you had quietly lapsed while you were gone. Nothing warned you. The next turn just costs full price again. Tight sessions are how you stay inside the window where the discount is real. It is a rare case where the habit that is better for your focus and the habit that is better for your spend turn out to be the exact same habit.

None of these is a sacrifice. Each one is the same move: stop paying, turn after turn, for tokens that are no longer doing any work. That is all token budgeting is: refusing to keep paying, turn after turn, for what you have stopped using.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding. Read Amit's full bio →

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.

Related Posts

View All Posts »
Claude Code effort mode and where it falls short

Claude Code effort mode and where it falls short

Claude Code effort mode looks like a cost dial: turn it down to spend fewer tokens. The official docs say otherwise. Effort is a behavioral signal, not a strict budget, so low effort does not reliably cut spend and can quietly raise it. Here are the five levels, where they stop, and how to set effort with intent.

The real cost of a large context window in Claude Code

The real cost of a large context window in Claude Code

A large context window in Claude Code feels free, and it is the opposite. Every token you load is re-billed on every turn after. The prompt cache that should make that cheap expires in five minutes, turning a tenth-price read into a higher-price write. And accuracy fades as the window fills. Here is the real cost.

How Claude Code scheduled jobs actually work

How Claude Code scheduled jobs actually work

Claude Code scheduled jobs come in three forms with very different guarantees: the in-session /loop, Desktop tasks, and Cloud routines. A missed run does not queue up a backlog. And despite a common belief, none of them creates a Windows Task Scheduler entry or a .bat file. Here is how each one actually behaves.

How Claude Code stop hooks work

How Claude Code stop hooks work

A Claude Code stop hook runs the moment Claude finishes a turn and can refuse to let it stop. It is the one hook that inverts control: Claude must pass your check before the turn ends. This post covers what a stop hook is, why exit code 2 is the whole game, the infinite-loop trap to avoid, and the patterns worth wiring up.

How to ensure Claude actually follows through on a plan

How to ensure Claude actually follows through on a plan

Claude Code creates excellent plans but silently skips steps during execution. A three-layer enforcement system with CLAUDE.md rules, a structured verification protocol, and a Stop hook that physically blocks incomplete work fixed the problem after 6 bugs and 15 regression tests.

How I run my whole consulting practice with Claude

How I run my whole consulting practice with Claude

I run Blue Sheen, my AI advisory firm, through Claude and Claude Code. The practice lives in a version-controlled folder that Claude reads at the start of every session, with Close CRM as the source of truth. This is the real workflow stage by stage: prospecting, proposals, delivery, and the judgment a human still has to own.

AI advisory services via Blue Sheen.
Contact me Follow 10k+