Amit Kothari CEO of Tallyfy, AI advisor at Blue Sheen

The real cost of a large context window in Claude Code

In brief

A large context window in Claude Code feels free, and it is the opposite. Every token you load is re-billed on every turn after. The prompt cache that should make that cheap expires in five minutes, turning a tenth-price read into a higher-price write. And accuracy fades as the window fills. Here is the real cost.

Amit Kothari Follow 10k+

May 20, 2026 · Updated Jun 12, 2026 · AI

CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

The real cost of a large context window in Claude Code

The short version

A big context window looks like free storage. It behaves like a running meter. Every token you load gets re-sent, and re-billed, on every turn that follows, and the cache that is meant to soften that has a five-minute fuse.

The context window is re-sent in full on every turn, so a file loaded once is paid for many times
Prompt caching cuts that cost, but the cache expires after five minutes by default
A cache read costs a tenth of a fresh send; a cache write costs a quarter more, so a miss is a steep swing
Model accuracy fades as the window fills, so a fuller window is not a smarter one

There is a comfortable belief about large context windows that goes like this: the window is big, so I can load whatever I want, and Claude will sort it out. A million tokens of room. Pour the whole codebase in.

The window is real and the room is real. What the belief misses is that a context window is not storage. It is not a drawer you fill once. It is closer to a meter that runs. Everything you put in the window is sent to the model again on every single turn for the rest of the session, and paid for again each time. A file you loaded on turn two is not a turn-two cost. It is a cost on turn three, turn four, and every turn after.

So the real question about a large context window is not “will it fit.” Almost anything fits. The question is “what does carrying it cost,” and the answer has three parts that compound: the re-billing, a cache that expires faster than people expect, and a quieter tax where the model gets less reliable as the window fills. Put together, brute-force loading is usually the expensive choice dressed up as the easy one.

What a full window costs

Start with the mechanism, because it is the part people skip. Claude Code’s context window holds the entire conversation: every message, every file Claude has read, every command output. And the model does not get a diff each turn. It gets the whole window, resent.

That is the cost engine. A large file you load early does not cost you once. It joins the window and is re-sent on every following turn until the session ends or the context is compacted. Load a handful of big files at the start of a long session and you have not made a handful of purchases. You have signed up for a subscription that bills every turn, and you signed it without a price shown. This is why a session that never felt expensive can be: nothing in it was a dramatic move, but a dozen heavy reads, each re-sent across thirty turns, is a large number arrived at quietly. The first cost of a big window is not the loading. It is the keeping. I made the same point from the budgeting side in how to budget tokens in Claude Code; here the point is sharper, because a large window is precisely the thing that makes the re-billing large. Since I wrote this, the point got sharper still: Anthropic shipped an updated tokenizer with Opus 4.7, and the same text can now map to up to a third more tokens depending on the content. The same files carried across the same turns are more tokens than they were, so the meter runs faster, not slower.

It helps to sit with why the design works this way. A model is stateless between turns. It does not remember your last message the way a person remembers a sentence they just heard. Each turn is a fresh evaluation, and the only way the model knows what came before is that the whole conversation is handed back to it as input. There is no shorter path. The window is not a record the model keeps. It is a payload you resend. Once that lands, the cost shape stops being surprising. You are not paying for the model to store your files. You are paying, every turn, to ship them to a model that has no memory of ever having seen them.

The reason this catches people out is that the cost is shaped wrong for human intuition. A purchase, in normal life, is an event. You hand over money, you get a thing, the transaction closes. Loading a file into the window feels like that kind of event, and it is not one. It is the start of a meter, and the meter is silent. Picture a long session where you read four large files in the first ten minutes and then work for two hours. Nothing in those two hours feels like spending. You are reading code, asking questions, making small edits. But each of those two hours of turns carries all four files inside it. The early reads were not the bill. They set the rate. The bill is the two hours that followed, and you never saw a moment that looked like a decision to spend it.

The five-minute cache cliff

The obvious objection is prompt caching, and it is a fair one. Caching exists to stop you re-paying full price for the same tokens turn after turn. It works. It also has an edge that almost nobody plans for.

The verified numbers are worth holding precisely. A cache read costs 0.1 times the price of a normal input token. A cache write costs 1.25 times. So reading cached context is a tenth of the price of sending it fresh, a real saving, and writing it to the cache costs a quarter more than a fresh send. The trap is in the lifetime. The official prompt caching documentation is exact: “By default, the cache has a 5-minute lifetime.”

Five minutes. Step away to read a Slack message, think through a hard problem, take a call, and the cache behind your big context window quietly expires. The next request cannot read at 0.1 times. It has to write again, at 1.25 times. That is the cache cliff: the same chunk of context swings from a tenth of the price to a quarter above it, a more than twelvefold jump, the instant you cross five idle minutes. A 1M-token window that you imagined as cheap-to-carry because it is mostly cache reads is only cheap while you keep the cache warm. Work in long, gappy stretches and you are not getting the cached price. You are paying the write price, again and again, for a window you were told would be efficient.

Now hold five minutes against the actual rhythm of a working day. Five minutes is not a long pause. It is the length of a good think. It is one decent code review of a teammate’s pull request. It is the time it takes to reproduce a bug by hand, or to read the docs for a library you half remember, or to write a careful message in a channel. None of those feel like idleness. They feel like the work. But to the cache they are all the same thing: a gap, and a gap past five minutes resets the clock. The cache was built for a tight back-and-forth, request following request with barely a breath between them. Real engineering is not that. Real engineering has thinking in it, and thinking has pauses, and the cache does not distinguish a pause for thought from a pause for nothing.

The swing matters more on a large window than a small one, and that is the part worth being precise about. On a small context the write penalty is real but the absolute number is modest, because there is not much to write. On a 1M-token window the same penalty is applied to a far heavier payload. So the cliff is not a fixed drop. It scales with how full you let the window get. The bigger the window you are carrying, the further you fall when the cache goes cold, and the more it costs to climb back to a warm state on the next turn. A large window does not just raise the re-billing covered above. It also makes every cache miss a more expensive event. The two costs are not separate problems. They are the same window, charged twice.

The Claude Code prompt cache cliff: a pause over five minutes expires the cache and forces a costly re-write

Accuracy fades as it fills

The cost so far has been measured in tokens. There is a second cost measured in quality, and it is the one that should worry you more, because you cannot see it on a bill.

A fuller context window is not a smarter one. Anthropic’s own best-practices guide states it without hedging: performance degrades as the context fills, and “when the context window is getting full, Claude may start forgetting earlier instructions or making more mistakes.” Read that as the real penalty of brute-force loading. When you pour the whole codebase into the window, you are not handing the model more to reason with. Past a point you are handing it more to lose track of. The instruction you gave at the start and the file that actually mattered are now competing for attention with hundreds of files the task never needed. The model does not fail loudly. It just gets a little less precise, a little more forgetful, and you pay for the bigger window and get a worse answer inside it. That is the trap at its purest: you spent more to make the result worse.

Think of it as signal against noise. The thing the task actually needs is the signal. Everything else in the window is noise, and a window stuffed with the whole codebase is mostly noise by volume. The model has to find the few files that matter inside a crowd of files that do not, and the larger the crowd, the harder that search. It is the same reason a precise question gets a better answer than a vague one. You did the model a favor by narrowing what it had to consider. A bloated window does the opposite. It buries the relevant in the irrelevant and then asks the model to dig.

What makes this cost worse than the token cost is that it hides. A token cost shows up. You can look at usage and see a number, and a number can be argued with. The quality cost leaves no number. The model returns an answer, the answer looks plausible, and you have no marker telling you it would have been sharper with a cleaner window. So picture two runs of the same task. One with a tight, narrow context. One with the whole repository poured in. The narrow run might catch the edge case the wide run misses, and produce the fix the wide run gets slightly wrong, and you would never know, because you only ran one of them and it gave you something. The danger is not a wrong answer you can see. It is a slightly worse answer you cannot, on a window you paid extra to fill. You overspent and downgraded in the same move, and the bill only reported half of it.

When the window earns its cost

None of this means the large window is a mistake. It is a tool, and there are tasks it is right for. Knowing the cost is what lets you spend it well, so consider the other side.

A large context window earns its cost when the task needs breadth that cannot be chunked: tracing one behavior across many files at once, reviewing a wide change where the interactions are the point, holding a long document whole because a summary would drop the detail that matters. In those cases the alternative, repeated narrow loads, would cost more in tokens and more in your time, and the degradation is a price worth paying for work that cannot be done any other way.

The reason chunking fails for these tasks is worth spelling out, because it is the line between a window well spent and a window wasted. Some problems live in the connections, not the parts. Imagine a bug where a value is set correctly in one file, passed cleanly through a second, and corrupted by an assumption in a third. Read those files one at a time, in separate narrow loads, and each one looks fine on its own. The fault is not in any single file. It is in how they meet. To see it you have to hold all three at once, and a summary will not do, because a summary of file one drops the exact detail that file three trips over. That is a task that needs the breadth. The window is not being lazy there. It is doing the one thing only it can do.

So you are left with a judgment, and the test is simple to state. Does this task need everything loaded at the same time, or does it just feel easier to load everything than to choose? The first is a good reason. The second is the habit this post is arguing against. The two can look identical from the outside, which is why the question has to be asked plainly each time rather than answered once and assumed. A reviewer with the whole repo loaded looks the same whether they needed it or not. Only the person who loaded it knows which it was. If you are trying to set that discipline across a team rather than relying on each person to judge it, Blue Sheen runs engagements like this.

Load like it costs something

The fix is not a smaller context window. It is treating the one you have as something with a price, because it has one.

Load narrow. Ask for the specific file and the specific function, not the directory, because every token you pull in joins the re-billing. Work in focused stretches rather than long gappy ones, so the cache stays warm and you keep the tenth-price read instead of the write. Compact or clear the context between unrelated tasks, before the old material has cost you twenty turns of re-billing and started to crowd the model’s attention; the discipline of keeping a plan and a clean context pays here too. And lean on caching deliberately, which has a craft of its own worth learning properly in LLM caching strategies and the broader question of what actually saves Claude costs.

Notice that each of those four practices answers one of the costs directly. Loading narrow shrinks what gets re-billed every turn. Working in focused stretches keeps the cache warm so reads stay at the tenth-price rate. Compacting between tasks clears the window before old material both bills you and crowds the model. Leaning on caching deliberately makes the cheap path the default rather than the accident. None of them is hard. None asks you to give up the large window or to work in a cramped one. They ask for something smaller and more durable: attention to what is in the window and why.

The mindset under all four is the real change, and it is one short shift. Stop treating a load as free and start treating it as a small purchase with a recurring charge attached. Before you pull a file in, the question is not “could this be useful.” Almost anything could be useful. The question is “is this worth carrying for the rest of the session,” because that is what loading it actually commits you to. Most of the time the answer is a smaller, sharper request than the one you were about to make. You wanted the directory; you needed one function. You wanted the whole file; you needed forty lines. The window will let you take all of it. The point is that letting you is not the same as it being free, and the gap between those two is the whole cost.

The large context window is one of the best things about modern Claude, and it is most useful to the people who respect what it costs. The trap was never the window. It was the belief that filling it is free. It is not free. It is re-billed every turn, it is cheap only while the cache is warm, and it makes the model less sharp as it fills. Load it the way you would pack a bag you have to carry all day. Only what the day needs, and nothing you will not use.

claude-codeai-costcontext-window

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding. Read Amit's full bio →

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.

Contact me More about me

View All Posts »

Claude Code effort mode and where it falls short

Claude Code effort mode looks like a cost dial: turn it down to spend fewer tokens. The official docs say otherwise. Effort is a behavioral signal, not a strict budget, so low effort does not reliably cut spend and can quietly raise it. Here are the five levels, where they stop, and how to set effort with intent.

How to budget tokens in Claude Code

A surprising Claude Code bill is almost never one big expense. It is four different cost shapes stacked up: a context window that bills every turn, subagents that each cost a fixed chunk, skills that cost almost nothing until used, and caching that can cut the recurring cost or not. Budgeting tokens means knowing the four shapes.

An MCP server is unreviewed code with your file system in scope

Treat every MCP server as untrusted code that runs with the access your agent has, because that is what it is. Anthropic docs say the directory lists connectors but does not security-audit them. A registry of approved servers with nothing enforcing it is a memo. The control that binds is a managed allowlist matched by URL or command, never by name.

Your Claude Code deny rules are not a security boundary

Before you hand Claude Code to hundreds of people you add deny rules for .env and credentials and feel locked down. You are not. Those rules govern Claude own tools, not a Python one-liner that opens the same file, and the control that actually holds, the OS sandbox, reads your whole machine by default and fails open when it cannot start. The baseline worth setting is real. Its dangerous gaps are the defaults you never changed.

How to schedule Claude Code on your own machine

You want a Claude job to run every few hours on your Mac, not in the cloud. A cloud routine cannot do it, because it never touches your machine. Here are the local options that can, why launchd beats cron for this, and a working LaunchAgent that pulls every one of my repos on a schedule.

Claude Code loop is for the work you watch

Claude Code /loop reruns a prompt on an interval inside your session. It is perfect for babysitting a deploy or a test run, and wrong for anything that has to keep going while you are away. It also exposes a durable flag that, on version 2.1.185, quietly writes nothing to disk. Here is how to use it well.