CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

How to run a long autonomous Claude Code job without it drifting

The hard part of a big AI job is not the work. It is making the agent run for many sessions without drifting or claiming it is done when it is not. I used an accessibility audit across four codebases as the test. The setup that kept Claude Code on track was a git ledger, atomic parallel claims, and two verification passes.

If you remember nothing else:

  • A long AI job fails by drifting and by lying about being done, not by getting the work wrong.
  • Git is the memory. One unit of work, one commit, so any session resumes from where the last one stopped.
  • Two verification passes, one adversarial and one self-check, turn a plausible result into a trustworthy one.

The accessibility audit was the easy part. The hard problem underneath it was time. A real audit across four codebases is not a one-sitting job. It runs for many sessions, one sixteen-hour stretch among them, and the thing that breaks first is never the testing. It is the agent’s grip on what it has already done.

Point Claude Code at a job that big and it will start strong, then slowly lose the plot. It re-does work it finished yesterday. It forgets a screen it skipped. Worst of all, it tells you the job is done when a quarter of it never happened, because by then the start of the work has scrolled out of its memory. None of that is a model being dim. It is a context window being finite, and pretending otherwise is how these jobs quietly fall over.

Why long autonomous jobs drift

An agent only knows what is in front of it. A context window holds a few hundred thousand tokens, which sounds like plenty until a job runs for days. Old work scrolls off the top. The agent cannot see the screen it audited on Tuesday, so on Friday it has no idea whether that screen is done.

So it guesses, and guesses drift. The dangerous failure is not the obvious crash. It is the confident wrong answer, the “all forty routes audited” when thirty-one were and the agent lost count. If you have ever run a project from memory instead of a list, you know exactly how that ends. The fix is the one a good team already uses. Stop trusting memory. Write it down somewhere that outlives the session.

Git is the memory

The setup that works treats git as the source of truth, not the agent’s head.

The job is cut into units: one screen, one component, one self-contained piece. Each unit ends in exactly one commit. Done is not a thing the agent remembers; it is a thing you can see in the log. A small core file tracks what is left, a cursor points at the next unit, and a ledger records what is finished. None of it sits in the context window. All of it sits on disk.
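To make the core file, cursor, and ledger concrete, here is a minimal sketch of how that state could sit on disk. The file name `job_state.json`, the field names, and the helper functions are all hypothetical, chosen for illustration; the article does not specify the actual format.

```python
import json
from pathlib import Path

STATE_FILE = "job_state.json"  # hypothetical name for the "core file"


def load_state(root: Path) -> dict:
    """Read the job state from disk; start empty if no state file exists yet."""
    path = root / STATE_FILE
    if not path.exists():
        return {"remaining": [], "cursor": None, "ledger": []}
    return json.loads(path.read_text())


def save_state(root: Path, state: dict) -> None:
    """Write to a temp file, then rename it over the old one,
    so a crash mid-write does not leave a half-written state file."""
    path = root / STATE_FILE
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def finish_unit(state: dict, unit: str, commit: str) -> dict:
    """Record a finished unit against its commit and advance the cursor."""
    remaining = [u for u in state["remaining"] if u != unit]
    return {
        "remaining": remaining,
        "cursor": remaining[0] if remaining else None,
        "ledger": state["ledger"] + [{"unit": unit, "commit": commit}],
    }
```

Pairing every ledger entry with a commit hash is the point of the design: the claim "this unit is done" can be checked against the repository instead of being taken on trust.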

Figure: the loop. Pick the next unit, do it, pass a validation gate, verify twice, commit and log, then resume anytime.

The payoff is that any session can pick up cold. A fresh agent reads the core file, looks at what is committed, and knows where to start, with nothing lost and nothing repeated. The job becomes resumable, which is the only way a run measured in days ever actually finishes. It also becomes crash-safe. A power cut costs you one unit, not the whole job.
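A cold start can be sketched as a small function that works out the resume point from commit history alone. This assumes a hypothetical commit subject convention of `unit(<name>): ...`, which the article does not prescribe; `git log --format=%s` is the standard way to print one subject per line.

```python
import re
import subprocess
from typing import Optional

# Hypothetical convention: each unit's commit subject starts with "unit(<name>):"
UNIT_SUBJECT = re.compile(r"^unit\(([^)]+)\):")


def committed_subjects(repo: str = ".") -> list:
    """Return every commit subject in the repository, newest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()


def done_units(subjects: list) -> set:
    """Extract unit names from subjects that follow the convention."""
    names = set()
    for subject in subjects:
        match = UNIT_SUBJECT.match(subject)
        if match:
            names.add(match.group(1))
    return names


def resume_point(all_units: list, subjects: list) -> Optional[str]:
    """First planned unit with no matching commit, or None if all are done."""
    done = done_units(subjects)
    return next((u for u in all_units if u not in done), None)
```

A fresh session would call `resume_point(planned_units, committed_subjects())` and start there. Nothing in that call depends on what any earlier session remembered.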

One more rule keeps it moving. The cursor never points at a unit that is blocked on something outside the job: a human review, another team, a slow external check. Blocked work gets pushed to the end with a note on what it is waiting for, and the agent picks the next thing it can actually finish. Every session makes real progress instead of stalling on the one unit it cannot move.
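The blocked-work rule amounts to a stable reordering of the queue. A minimal sketch, with a hypothetical `Unit` type whose `blocked_on` field carries the note about what the unit is waiting for:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    name: str
    blocked_on: Optional[str] = None  # note on what it is waiting for, if anything


def reorder(queue: list) -> list:
    """Actionable units first, blocked units last; relative order is preserved."""
    ready = [u for u in queue if u.blocked_on is None]
    blocked = [u for u in queue if u.blocked_on is not None]
    return ready + blocked


def next_unit(queue: list) -> Optional[Unit]:
    """The unit the cursor should point at, or None if everything left is blocked."""
    ordered = reorder(queue)
    if ordered and ordered[0].blocked_on is None:
        return ordered[0]
    return None
```

When `next_unit` returns `None`, every remaining unit is waiting on something external, which is a signal for a person to step in, not for the agent to keep spinning.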

Running them in parallel

Once a job lives on disk instead of in a head, you can run several at once. Four sessions, four codebases, all going at the same time.

The trick is to never let two of them write the same thing. Each session claims a unit before it starts, using a move the filesystem guarantees is atomic: making a directory. If the directory already exists, someone else got there first, and you move on. No lock server, no database, no coordination beyond the filesystem itself. Some work really does share a resource. On a Mac the real screen reader is one of those, since only one VoiceOver can run at a time, so the sessions pass a cooperative lock between them and wait their turn. The result is four agents working in parallel that never tread on each other, held together by careful use of files and nothing more.
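A minimal sketch of the directory claim, assuming a local POSIX-style filesystem (network filesystems may not give the same guarantee) and unit names that are safe to use as directory names. The `claims` layout, the `owner` file, and the function names are hypothetical. The cooperative lock is also built on `mkdir` here, which is my assumption; the article does not say how its lock is implemented.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path
from typing import Optional


def try_claim(claims_dir: Path, unit: str, owner: str) -> bool:
    """Claim a unit by creating its directory; exactly one caller can succeed."""
    claims_dir.mkdir(parents=True, exist_ok=True)
    try:
        os.mkdir(claims_dir / unit)  # raises if the directory already exists
    except FileExistsError:
        return False  # someone else got there first
    (claims_dir / unit / "owner").write_text(owner)
    return True


def claim_next(claims_dir: Path, units: list, owner: str) -> Optional[str]:
    """Walk the queue and return the first unit this session manages to claim."""
    for unit in units:
        if try_claim(claims_dir, unit, owner):
            return unit
    return None


@contextmanager
def shared_resource(lock_dir: Path, poll_seconds: float = 1.0):
    """Cooperative lock for a resource only one session can use at a time."""
    while True:
        try:
            os.mkdir(lock_dir)
            break
        except FileExistsError:
            time.sleep(poll_seconds)  # wait our turn
    try:
        yield
    finally:
        os.rmdir(lock_dir)  # hand the resource to the next session
```

A session would wrap only the contended step, for example `with shared_resource(Path("locks/voiceover")):` (with `locks/` already created) around the screen-reader part of a unit, and do everything else outside the lock. One weakness of this sketch is that a session that crashes while holding the lock leaves the directory behind, so a real setup would need some way to spot and clear stale locks.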

The two-pass check that earns trust

Speed and scale are worth nothing if you cannot trust the output, and a long autonomous run is exactly where one agent’s mistakes pile up unseen. So every unit is checked twice before it counts as done.

First the agent checks its own work. It re-reads what it just claimed and asks what it asserted without confirming, which screens it left without a verdict, and what it might have skipped. Then a second agent goes at it adversarially, with fresh context and a single instruction: break the first one’s work. The screen-reader bug from the audit, the toggle that stayed silent while the button beside it announced, was caught exactly this way. The first pass overstated it. The second pass corrected it to the precise truth.
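The gate itself can be sketched as a small piece of bookkeeping. This is an illustration under assumptions, not the audit's actual data model: the `Claim` and `UnitReport` types are hypothetical, and the adversarial pass is represented only by the list of objections the second agent hands back.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    evidence: str = ""  # e.g. a transcript path; empty means asserted without confirming


@dataclass
class UnitReport:
    unit: str
    screens: dict = field(default_factory=dict)  # screen name -> verdict, "" if none
    claims: list = field(default_factory=list)


def self_check(report: UnitReport) -> list:
    """First pass: list what was asserted without evidence or left without a verdict."""
    problems = []
    for claim in report.claims:
        if not claim.evidence:
            problems.append(f"unconfirmed claim: {claim.statement}")
    for screen, verdict in report.screens.items():
        if not verdict:
            problems.append(f"no verdict: {screen}")
    return problems


def accept(report: UnitReport, adversarial_objections: list) -> bool:
    """A unit counts as done only if both passes come back clean."""
    return not self_check(report) and not adversarial_objections
```

The useful property is that `accept` has no override. Either pass can veto the unit, and a veto sends it back for rework before it can be committed.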

Two passes sound like overhead. They are the opposite. They are what let you walk away from the job and trust what it hands back, which is the whole point of making it autonomous. An agent you have to watch every minute is not saving you anything.

The catch with self-checking is that an agent told to review its own work will sometimes declare victory and skip it, the same way it drifts. So the check is not left to good intentions. A hook fires the moment the agent tries to stop, and refuses to let it finish until a fresh pass has re-audited the unit against its acceptance criteria. You cannot mark your own homework if the system will not let you leave the room until the marking is done.
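A sketch of what such a hook's decision logic could look like. Two assumptions are baked in and worth checking against the Claude Code hooks documentation before relying on them: that a Stop hook can keep the agent working by printing JSON with `"decision": "block"` and a `reason`, and that printing an empty object lets it stop. The `verification.json` file and its `criteria` field are hypothetical, standing in for whatever the fresh re-audit pass writes.

```python
import json
from pathlib import Path

VERDICT_FILE = "verification.json"  # hypothetical: written by the fresh re-audit pass


def stop_decision(unit_dir: Path) -> dict:
    """Decide whether the agent may stop, based only on what is on disk."""
    verdict_path = unit_dir / VERDICT_FILE
    if not verdict_path.exists():
        return {
            "decision": "block",
            "reason": "No re-audit found. Run a fresh pass against the "
                      "acceptance criteria before stopping.",
        }
    criteria = json.loads(verdict_path.read_text()).get("criteria", {})
    unmet = [name for name, ok in criteria.items() if not ok]
    if not criteria or unmet:
        detail = ", ".join(unmet) if unmet else "none recorded"
        return {"decision": "block", "reason": f"Unmet acceptance criteria: {detail}"}
    return {}  # nothing to object to; the agent may stop


def run_hook(stdout, unit_dir: Path) -> None:
    """Entry point a hook script would call, e.g. run_hook(sys.stdout, Path.cwd())."""
    stdout.write(json.dumps(stop_decision(unit_dir)))
```

The decision reads a file the agent under review did not write, so the agent cannot talk its way past the check. It has to produce the evidence.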

What this generalizes to

None of this is special to accessibility. The shape fits any job too big for one sitting: a migration across a thousand files, a content refresh over hundreds of posts, a sweep that audits every screen of a product. Cut it into units. Put the state on disk. Make done a commit, not a memory. Check the work twice. Run as many in parallel as the shared resources allow.

The accessibility audit was a good test because it is unforgiving. Real screen readers, four codebases, dozens of real bugs, no room to fudge a result. The machinery underneath it is general, though. I wrote about the audit itself in the main piece, and about the screen-reader work the two-pass check kept accurate. For the lower-level version of running Claude unattended, I covered the non-interactive mode on its own.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
