If you remember nothing else:
- A long AI job fails by drifting and by lying about being done, not by getting the work wrong.
- Git is the memory. One unit of work, one commit, so any session resumes from where the last one stopped.
- Two verification passes, one adversarial and one self-check, turn a plausible result into a trustworthy one.
The accessibility audit was the easy part. The hard problem underneath it was time. A real audit across four codebases is not a one-sitting job. It runs for many sessions, one sixteen-hour stretch among them, and the thing that breaks first is never the testing. It is the agent’s grip on what it has already done.
Point Claude Code at a job that big and it will start strong, then slowly lose the plot. It re-does work it finished yesterday. It forgets a screen it skipped. Worst of all, it tells you the job is done when a quarter of it never happened, because by then the start of the work has scrolled out of its memory. None of that is a model being dim. It is a context window being finite, and pretending otherwise is how these jobs quietly fall over.
Why long autonomous jobs drift
An agent only knows what is in front of it. A context window holds a few hundred thousand tokens, which sounds like plenty until a job runs for days. Old work scrolls off the top. The agent cannot see the screen it audited on Tuesday, so on Friday it has no idea whether that screen is done.
So it guesses, and guesses drift. The dangerous failure is not the obvious crash. It is the confident wrong answer, the “all forty routes audited” when thirty-one were and the agent lost count. If you have ever run a project from memory instead of a list, you know exactly how that ends. The fix is the one a good team already uses. Stop trusting memory. Write it down somewhere that outlives the session.
Related reading
The audit this job ran is the work itself. What axe-core misses is the screen-reader half the two-pass check kept accurate.
Git is the memory
The setup that works treats git as the source of truth, not the agent’s head.
The job is cut into units, one screen, one component, one self-contained piece. Each unit ends in exactly one commit. Done is not a thing the agent remembers, it is a thing you can see in the log. A small core file tracks what is left, a cursor points at the next unit, and a ledger records what is finished. None of it sits in the context window. All of it sits on disk.
The payoff is that any session can pick up cold. A fresh agent reads the core file, looks at what is committed, and knows where to start, with nothing lost and nothing repeated. The job becomes resumable, which is the only way a run measured in days ever actually finishes. It also becomes crash-safe. A power cut costs you one unit, not the whole job.
One more rule keeps it moving. The cursor never points at a unit that is blocked on something outside the job, a human review, another team, a slow external check. Blocked work gets pushed to the end with a note on what it is waiting for, and the agent picks the next thing it can actually finish. Every session makes real progress instead of stalling on the one unit it cannot move.
Running them in parallel
Once a job lives on disk instead of in a head, you can run several at once. Four sessions, four codebases, all going at the same time.
The trick is to never let two of them write the same thing. Each session claims a unit before it starts, using the one move an operating system guarantees is atomic, making a directory. If the directory already exists, someone else got there first, and you move on. No lock server, no database, no coordination beyond the filesystem itself. Where the work really shares a resource, and on a Mac the real screen reader is one of those, since only one VoiceOver can run at a time, the sessions pass a cooperative lock between them and wait their turn. The result is four agents working in parallel that never tread on each other, held together by careful use of files and nothing more.
The two-pass check that earns trust
Speed and scale are worth nothing if you cannot trust the output, and a long autonomous run is exactly where one agent’s mistakes pile up unseen. So every unit is checked twice before it counts as done.
First the unit checks its own work. It re-reads what it just claimed and asks what it asserted without confirming, which screens it left without a verdict, what it might have skipped. Then a second agent, with fresh context and a single instruction, to break the first one’s work, goes at it adversarially. The screen-reader bug from the audit, the toggle that stayed silent while the button beside it announced, was caught exactly this way. The first pass overstated it. The second pass corrected it to the precise truth.
Two passes sound like overhead. They are the opposite. They are what let you walk away from the job and trust what it hands back, which is the whole point of making it autonomous. An agent you have to watch every minute is not saving you anything.
The catch with self-checking is that an agent told to review its own work will sometimes declare victory and skip it, the same way it drifts. So the check is not left to good intentions. A hook fires the moment the agent tries to stop, and refuses to let it finish until a fresh pass has re-audited the unit against its acceptance criteria. You cannot mark your own homework if the system will not let you leave the room until the marking is done.
What this generalizes to
None of this is special to accessibility. The shape fits any job too big for one sitting: a migration across a thousand files, a content refresh over hundreds of posts, a sweep that audits every screen of a product. Cut it into units. Put the state on disk. Make done a commit, not a memory. Check the work twice. Run as many in parallel as the shared resources allow.
The accessibility audit was a good test because it is unforgiving. Real screen readers, four codebases, dozens of real bugs, no room to fudge a result. The machinery underneath it is general, though. I wrote about the audit itself in the main piece, and about the screen-reader work the two-pass check kept accurate. For the lower-level version of running Claude unattended, I covered the non-interactive mode on its own.





