Quick answers
Can AI do accessibility testing? It handles the automated layer that scanners cover, plus a real part of the harder judgment layer, but a person still signs off.
What does it catch that scanners miss? State that is only wrong when announced, contrast that fails in dark mode, and controls a keyboard cannot reach.
What is the biggest mistake? Trusting one layer. Treating a clean scan, which covers only about 30 percent of real failures, as the whole job is where the overlay lawsuits come from.
Can an AI agent test your product for accessibility? Partly, and the real answer matters more than the hype.
It can do the grinding work that scanners already do. It can also do a good chunk of the harder work that normally waits for a human with a screen reader. What it cannot do is vouch for its own results, and you should not let it try.
I know because I ran one against Tallyfy, the product I have built for over ten years. I pointed Claude Code at it and let it work for sixteen hours straight, across four separate codebases, checking screens against WCAG 2.2 at the AA level. It filed dozens of real bugs. Some of them were mine, sitting in plain sight for years. That stung a bit.
Why scanners catch only a third
Start with the number the whole industry quietly agrees on. Automated accessibility tools catch somewhere between 30 and 40 percent of real WCAG failures. The rest needs a person, or an agent doing a person’s job.
Axe-core, the engine inside most scanners, is good at the things it checks. Missing alt text. Empty form labels. An ARIA role that points at nothing. Colour contrast under the line. Run it in your build on every pull request and you catch a whole class of mistakes before anyone reviews the feature. I would not ship without it.
But axe will not tell you whether your alt text is a lie. It cannot decide if your custom dropdown announces its own expanded-or-collapsed state. It does not listen to how a screen reader reads the page out loud. By design it skips anything that needs judgment, because a wrong verdict is worse than no answer at all.
This gap is where the lawsuits live. The overlay companies that promised one line of JavaScript would make you compliant sold exactly this fantasy: that the visible 30 percent is the whole job. In early 2025 the FTC ordered accessiBe to pay a million dollars over that kind of claim. Hundreds of businesses running those widgets got sued anyway, because a scanner score is not a legal defence. The missing 70 percent does not disappear because you cannot see it.
It helps to know what compliance even means here, because the words get muddled. The ADA is the law in the United States, and it carries no technical spec of its own. Courts have settled on WCAG, the Web Content Accessibility Guidelines, as the bar a website gets measured against. Section 508 for federal buyers and Europe’s EN 301 549 both point straight back at the same WCAG criteria. So one audit, done right, answers all of them at once. That is the whole reason it is worth doing well rather than fast.
What I actually ran
So the interesting question is whether an agent can do part of that missing 70 percent. Not all of it. Part.
The setup was a pipeline, one screen at a time. For each route the agent ran the scanner first, then went well past it.
It read the live page, not the source code, and checked every interactive control for its real name, its role, and whether a keyboard could reach it. It measured contrast in light and dark mode separately, because a button that passes in one can fail badly in the other. Then it made calls on the thirteen WCAG criteria that no scanner will touch, things like whether a status message gets announced, or whether the page reflows at phone width without clipping content off the side.
The part I did not expect to work was the screen reader. The agent drove real VoiceOver on macOS, the same assistive tech a blind user runs, and recorded what it spoke. Not a simulation. The real thing, reading the page aloud while the agent listened to every word.
Then comes the bit that earns trust. A second agent tried to tear the first one’s work apart, and the first agent re-checked its own findings before it was allowed to call a screen done. Two passes, both adversarial. If you have ever reviewed your own code an hour later and found the obvious bug staring back, you know why that second look matters.
The reason it ran for sixteen hours is that there is no shortcut through it. Tallyfy is four codebases: an Angular client, a Laravel API, a marketing site, and a docs site. The agent walked them one screen at a time, and every screen got the full pass before it moved to the next. The state lived in git, so when a session ended the next one resumed exactly where the last had stopped, with nothing dropped. That is the only reason a job this size finishes at all instead of falling apart halfway.
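To make the resume-from-git idea concrete, here is a minimal Python sketch of a progress file that a session could read, update, and commit after each screen. It is an illustration, not the pipeline that actually ran: the file layout, the `status` values, and the function names `next_pending`, `mark_done`, and `commit_state` are all assumptions.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Read the progress file. Hypothetical layout: {"screens": [{"route": ..., "status": ...}]}."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write the progress file back with stable formatting so git diffs stay small."""
    path.write_text(json.dumps(state, indent=2) + "\n", encoding="utf-8")


def next_pending(state: dict) -> Optional[str]:
    """Return the first route not yet marked done, or None when every screen is finished."""
    for screen in state["screens"]:
        if screen["status"] != "done":
            return screen["route"]
    return None


def mark_done(state: dict, route: str) -> None:
    """Mark one route as done; raise KeyError if the route is not in the file."""
    for screen in state["screens"]:
        if screen["route"] == route:
            screen["status"] = "done"
            return
    raise KeyError(route)


def commit_state(path: Path, route: str) -> None:
    """Commit the progress file so a later session can pick up from this point."""
    repo = str(path.parent)
    subprocess.run(["git", "-C", repo, "add", path.name], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-m", f"a11y: finished {route}"], check=True)
```

Under these assumptions a new session would call `load_state`, take `next_pending`, audit that one screen, then call `mark_done`, `save_state`, and `commit_state` before moving on. Because progress is committed per screen, a session that dies mid-run loses at most the screen it was on.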
What it caught that scanners miss
One finding stuck with me for days. On the email-notifications screen there are nine toggles. A scanner saw labels near them and moved on, green tick. The live probe found that all nine labels pointed at element IDs that did not exist, so to a screen reader the toggles had no names at all. You would hear “switch, on” with no idea what you just turned on.
It got stranger. Saving a single toggle was silent to assistive tech, no announcement at all. Saving the weekly-cadence buttons right beside them did announce, through a different code path a developer had wired up by hand years earlier. Same screen, two save actions, one speaks and one says nothing. Axe passed the page. The real screen reader caught it. That is WCAG 4.1.3, Status Messages, in one screenshot, and no scanner on earth would have flagged it.

The agent’s terminal, sixteen hours in, with the live macOS VoiceOver caption box at the bottom. That caption is the real screen reader speaking, not a mock-up.
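The dangling-label failure on the toggles is the kind of thing a small check can surface once you look at the rendered page. The following is a minimal sketch in Python, assuming you already have the live DOM serialised to an HTML string (not the source template). The function name `find_dangling_references` and the sample markup are hypothetical, and this is not the probe the agent used.

```python
from html.parser import HTMLParser


class _ReferenceCollector(HTMLParser):
    """Collects every id in the page and every id that a labelling attribute points at."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()
        self.references: list[tuple[str, str, str]] = []  # (tag, attribute, target id)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "for" and tag == "label":
                self.references.append((tag, name, value))
            elif name in ("aria-labelledby", "aria-describedby"):
                # These attributes hold a space-separated list of ids.
                for target in value.split():
                    self.references.append((tag, name, target))


def find_dangling_references(html: str) -> list[tuple[str, str, str]]:
    """Return (tag, attribute, target) for every reference to an id that does not exist."""
    collector = _ReferenceCollector()
    collector.feed(html)
    collector.close()
    return [ref for ref in collector.references if ref[2] not in collector.ids]


# Hypothetical markup: the label says "notify-daily", the input says "notify_daily".
sample = (
    '<label for="notify-daily">Daily digest</label>'
    '<input type="checkbox" role="switch" id="notify_daily">'
)
print(find_dangling_references(sample))
```

A check like this only proves the reference resolves. It says nothing about whether the name that results is a sensible one, which is still a judgment call.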
The contrast results were humbling in a different way. Things that looked fine to me measured like this:
| What it looked like | Measured | WCAG minimum |
|---|---|---|
| White text on our brand green | 2.61:1 | 4.5:1 |
| A label, white on white | 1.0:1 | 4.5:1 |
| Navy text in dark mode | 1.13:1 | 4.5:1 |
White text on our own brand green came out at 2.61 to one. The minimum for normal text is 4.5. One label sat white-on-white at 1.0 to one, which is to say invisible, the result of a dark-mode rule fighting a light-mode rule. I had walked past these for years because they looked fine to my eyes. They are not fine to everyone’s, and that is the entire point.
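For readers who want to reproduce numbers like these, here is a short Python sketch of the contrast-ratio formula from WCAG's definitions of relative luminance and contrast ratio. The function names are my own, and the grey hex values in the usage lines are illustrative stand-ins, not Tallyfy's colours.

```python
def _channel_to_linear(value: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per WCAG's relative luminance definition."""
    c = value / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour written as '#rrggbb'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return (
        0.2126 * _channel_to_linear(r)
        + 0.7152 * _channel_to_linear(g)
        + 0.0722 * _channel_to_linear(b)
    )


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: str, background: str) -> bool:
    """True when the pair meets the 4.5:1 AA minimum for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5


print(contrast_ratio("#ffffff", "#ffffff"))         # identical colours give 1.0
print(passes_aa_normal_text("#767676", "#ffffff"))  # True, just above 4.5:1
print(passes_aa_normal_text("#777777", "#ffffff"))  # False, just below 4.5:1
```

The ratio has to be computed from the colours the browser actually renders in each theme, which is why a pair can pass in light mode and fail in dark mode.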
Keyboard was its own category of pain. A handful of custom widgets (a calendar, a kanban board, a file uploader) opened fine with a mouse and were dead to the Tab key. If you cannot hold a mouse, those screens did not exist for you. A scanner sees a clickable element and assumes the best. The probe pressed Tab, watched nothing happen, and wrote it down.
Where AI still needs a human
None of this makes the agent a replacement for an accessibility specialist. It makes it a fast, patient first pass that never gets bored on screen number eighty.
The thing it cannot do is speak for itself. When the agent writes up a conformance report, it has to state exactly which assistive tech it ran on each screen, and where it only inferred behaviour from the accessibility tree instead of running the real thing. Claiming screen-reader coverage you did not run is its own kind of defect, the same trick the overlay vendors got fined for, just wearing a nicer outfit. A person reads the report before it goes out. That part is not optional.
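One way to make that disclosure hard to skip is to bake it into the report's data model. The sketch below is a hypothetical Python structure, not the format of any real report: the names `Evidence`, `ScreenEvidence`, and `coverage_summary`, and the example routes, are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Evidence(Enum):
    REAL_ASSISTIVE_TECH = "ran real assistive technology"
    INFERRED_FROM_TREE = "inferred from the accessibility tree"


@dataclass(frozen=True)
class ScreenEvidence:
    route: str
    evidence: Evidence
    assistive_tech: Optional[str] = None  # e.g. "VoiceOver on macOS"

    def __post_init__(self) -> None:
        ran_real = self.evidence is Evidence.REAL_ASSISTIVE_TECH
        # A row may not claim real screen-reader coverage without naming the tool...
        if ran_real and not self.assistive_tech:
            raise ValueError(f"{self.route}: real-AT coverage claimed without naming the tool")
        # ...and may not name a tool when behaviour was only inferred.
        if not ran_real and self.assistive_tech:
            raise ValueError(f"{self.route}: tool named but behaviour was only inferred")


def coverage_summary(rows: Iterable[ScreenEvidence]) -> Counter:
    """Count how many screens fall under each kind of evidence."""
    return Counter(row.evidence for row in rows)


rows = [
    ScreenEvidence("/settings/notifications", Evidence.REAL_ASSISTIVE_TECH, "VoiceOver on macOS"),
    ScreenEvidence("/reports", Evidence.INFERRED_FROM_TREE),
]
for kind, count in coverage_summary(rows).items():
    print(f"{kind.value}: {count}")
```

The structure cannot stop anyone from lying, but it forces each row to say which kind of evidence it rests on, and the person reading the report can see the split at a glance.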
There is taste involved too. Whether a heading describes what follows it. Whether an error message tells you how to fix the problem and not only that one exists. Whether the reading order makes sense to a stranger who landed mid-page. An agent has an opinion on all three. A person still decides.
What came out of it
After those sixteen hours and the sessions that followed, the count sits at forty-five open issues across the four codebases, on top of seventy-odd already fixed. Keyboard traps. Unlabelled controls. Dark-mode contrast. A notification that never announced itself. An animation that looped forever with no way to pause it. Real bugs, in a real product, found by an agent and fixed by people.
I am showing you the specifics, brand-green contrast failure and all, on purpose. A lot of companies treat an accessibility audit as a thing to hide. I would rather show the work. The output of all of it is a VPAT, the report that says, criterion by criterion, what a product supports and what it does not. Ours is open about the gaps, because a gap you admit and fix beats a green badge you cannot trust.
Done by hand, a pass over this many screens is weeks of a specialist’s time, and you owe it again after every release that moves the interface around. That is the maths that makes most teams skip it, or do it once and never again. An agent does not make the work less real. It makes the per-release cost small enough that skipping it stops being defensible.
If you want the mechanics, I have written separately about running Claude Code unattended for long jobs like this. The instinct behind it is the same one from my three-day AI audit: watch what is real before you ask anyone a single question.
Related reading
“Why accessibility overlays do not work” digs into the overlay lawsuits and what real fixing looks like. “What axe-core misses” goes deep on the screen-reader half of the job.





