What you will learn
- Why axe-core, for all its strengths, stops at about a third of WCAG.
- The thirteen criteria a scanner cannot decide, in plain language.
- How an AI agent drives a real screen reader instead of a simulation of one.
- A bug that passed every automated check and was silent to a blind user.
Axe-core is the best automated accessibility tool there is, and it will tell you almost nothing about whether a blind person can use your product. Those two facts sit together more easily than you would expect.
A scanner reads your markup and checks rules. A screen reader reads your page out loud to a human who cannot see it. The gap between those two is where most real accessibility problems hide, and it is the part almost nobody tests, because until recently testing it meant a person with headphones working through every screen by hand.
What axe is good at
Credit where it is due. Axe-core, the engine inside most scanners and the one I run on every build, catches a whole class of mistakes fast and without complaint. A missing alt attribute. An ARIA role spelled wrong. An input with no label in the markup. Text sitting below the contrast line. These are real problems, they are common, and catching them on every pull request is the cheapest accessibility win you will ever get.
If you take one practical thing from this post, it is to run a scanner in CI. I am not here to talk you out of axe. I am here to tell you where it stops.
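For a sense of what that CI step comes down to, here is a minimal TypeScript sketch of the gate itself. It assumes the scan has already run and produced a result object with axe-core's `violations` array, where each entry carries a rule `id`, an `impact`, and the affected `nodes`. The local types are a simplified subset of that shape. The `gate` function and its `failOn` threshold are hypothetical policy, not part of axe.

```typescript
// Simplified subset of the axe-core result shape this sketch relies on.
type Impact = "minor" | "moderate" | "serious" | "critical";

interface AxeNode {
  target: string[]; // selector path to the offending element
}

interface AxeViolation {
  id: string; // rule id, for example "image-alt" or "color-contrast"
  impact: Impact | null;
  nodes: AxeNode[];
}

interface AxeResults {
  violations: AxeViolation[];
}

const severity: Record<Impact, number> = { minor: 0, moderate: 1, serious: 2, critical: 3 };

// Hypothetical policy: block the build when any violation is at or above `failOn`.
function gate(results: AxeResults, failOn: Impact = "serious"): { ok: boolean; report: string[] } {
  const blocking = results.violations.filter(
    (v) => v.impact !== null && severity[v.impact] >= severity[failOn],
  );
  // One report line per blocking rule, pointing at the first affected element.
  const report = blocking.map(
    (v) =>
      `${v.id} (${v.impact}): ${v.nodes.length} element(s), first at ` +
      `${v.nodes[0]?.target.join(" ") ?? "unknown"}`,
  );
  return { ok: blocking.length === 0, report };
}
```

In a pipeline, the calling script would print `report` and exit nonzero when `ok` is false. Where to set the threshold is a team decision; the default here is only a placeholder.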
Related reading
“How I ran a real accessibility audit” is the full process. “Why overlays do not work” covers the tools that pretend the scanner score is the whole job.
The thirteen it cannot decide
Roughly 30 to 40 percent of WCAG failures are the kind a rule engine can catch. The rest need judgment, and the people who build axe are upfront about it. The tool deliberately skips any check where a wrong pass would be worse than no answer.
Thirteen WCAG success criteria fall squarely in that gap. A few of them, in human terms.
Does a status message get announced? When something saves, or an error appears, does a screen reader say so, or does it change only on screen, where a blind user cannot see it? That is 4.1.3, and hold onto it, because it is the one that bit me.
Does the page reflow at phone width without clipping content off the side? Does a custom control announce its own on-or-off state? Does a tooltip that appears on hover also work from a keyboard, and stay put long enough to read? Is the contrast of a button’s border, not its text, high enough in dark mode as well as light? Are the headings on the page descriptive enough that someone jumping between them knows where they are? None of these can be answered by reading markup. A person, or an agent doing the person’s job, has to look and listen.
Driving a real screen reader
The part that surprised me is that an agent can run the actual screen reader. Not a model of one. The real thing.
On macOS that is VoiceOver, the same tool a blind user runs every day. An open-source harness called Guidepup lets a script start VoiceOver, move its cursor through a page, and capture every word it speaks. The agent navigates by heading, walks the page from top to bottom, and records the narration. On one settings screen it captured eighty spoken phrases, twenty-nine of them real page content, and reached the line that reads out “heading level 3, Digest Emails”.

The white box is VoiceOver speaking the page aloud while the agent records what it says. That is the half of accessibility a scanner never hears.
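The walk-and-record loop can be sketched in a few lines of TypeScript. To keep the example self-contained, the screen reader sits behind a small hypothetical `Reader` interface. Its method names, `next` and `lastSpokenPhrase`, are modeled on what Guidepup's `voiceOver` object exposes, but treat that mapping as an assumption and check the Guidepup documentation before relying on it. The `atEnd` predicate and the step cap are simplifications for this sketch.

```typescript
// Hypothetical minimal surface of a screen reader driver.
interface Reader {
  next(): Promise<void>; // move the screen reader cursor to the next item
  lastSpokenPhrase(): Promise<string>; // the text the screen reader just announced
}

// Walk forward from the current position and record every phrase spoken.
// Stops when `atEnd` recognises the phrase, or after `maxSteps` moves.
async function recordNarration(
  reader: Reader,
  atEnd: (phrase: string) => boolean,
  maxSteps = 500,
): Promise<string[]> {
  const phrases: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    await reader.next();
    const phrase = (await reader.lastSpokenPhrase()).trim();
    if (atEnd(phrase)) break;
    phrases.push(phrase); // repeats are kept: nine unnamed checkboxes are nine entries
  }
  return phrases;
}

// Keep only the phrases that announce a heading,
// for example "heading level 3, Digest Emails".
function headingsIn(phrases: string[]): string[] {
  return phrases.filter((p) => /heading level \d/i.test(p));
}
```

Repeated phrases are deliberately not collapsed, because the repetition is itself what a blind user hears.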
That recording is the evidence. If a control has no name, you do not guess it from the markup. You hear the screen reader say “checkbox” with nothing attached, nine times over if there are nine of them, and you write down exactly what a blind user would hear.
The judgment comes from comparing two things. What the screen reader actually said, and what it should have said if the control were built right. A toggle that reads as “checkbox” with no name is a missing label, full stop. A button that reads identically whether it is on or off has no state a blind user can perceive. You do not need a rulebook for that. You need the recording and the sense to know what is wrong with it, which is the part an agent turns out to be surprisingly good at.
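That comparison can also be written down as data. The TypeScript sketch below is one hypothetical way to do it: each control gets an expectation (the name it should be announced with), and the check flags phrases that carry a role but no name. The `Expectation` and `Finding` types and the list of bare role words are assumptions made for illustration. Real VoiceOver phrasing varies by control and OS version, so a check like this narrows down what to listen to rather than replacing the judgment.

```typescript
interface Expectation {
  control: string; // human-readable id for the control, used in the report
  expectedName: string; // accessible name the screen reader should speak
}

interface Finding {
  control: string;
  heard: string;
  problem: "missing-name" | "name-not-heard";
}

// Role words that, spoken alone, mean the control has no accessible name.
const bareRoles = new Set(["checkbox", "button", "switch", "link"]);

// Pair each expectation with the phrase heard at the same position in the log.
function compare(expected: Expectation[], heard: string[]): Finding[] {
  const findings: Finding[] = [];
  expected.forEach((exp, i) => {
    const phrase = (heard[i] ?? "").trim();
    const lower = phrase.toLowerCase();
    if (bareRoles.has(lower)) {
      // Only the role was spoken: the label is missing.
      findings.push({ control: exp.control, heard: phrase, problem: "missing-name" });
    } else if (lower.indexOf(exp.expectedName.toLowerCase()) === -1) {
      // Something was spoken (or nothing at all), but not the expected name.
      findings.push({ control: exp.control, heard: phrase, problem: "name-not-heard" });
    }
  });
  return findings;
}

// A control whose "on" and "off" narration is identical exposes no state.
function stateIsPerceivable(heardWhenOn: string, heardWhenOff: string): boolean {
  return heardWhenOn.trim().toLowerCase() !== heardWhenOff.trim().toLowerCase();
}
```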
The bug a screen reader found and axe passed
This is the one I promised. On the email-notifications screen there are nine on-off toggles. Flip one and a small “Saved” appears beside it. A scanner looked at that screen and passed it. Contrast fine, markup had labels, nothing tripped a rule.
The screen reader told a different story. Those nine “Saved” messages were plain text on screen with no live region behind them, so a blind user flipping a toggle heard nothing. No confirmation, and no warning if a save had failed. Silent.
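For reference, the usual remedy for this kind of silence is a live region: a container marked `role="status"`, which implies polite announcements, whose text is updated when the save completes. The TypeScript sketch below is an assumption about how such a fix might look, not the code from the audited product. `StatusTarget` is a hypothetical structural type that a browser `HTMLElement` satisfies, used here so the example runs without a DOM, and the message wording is a placeholder.

```typescript
// Structural stand-in for the parts of HTMLElement this sketch touches.
interface StatusTarget {
  textContent: string | null;
  setAttribute(name: string, value: string): void;
}

// Mark an element as a polite live region. Call this once, when the element
// is first rendered, so assistive technology is already observing the region
// before its text changes.
function prepareStatusRegion(el: StatusTarget): void {
  el.setAttribute("role", "status"); // role="status" implies aria-live="polite"
  el.textContent = "";
}

// Report the outcome of a save by changing the region's text.
// Failures are worded explicitly, so an error is never silent either.
function reportSave(el: StatusTarget, settingName: string, ok: boolean): void {
  el.textContent = ok ? `${settingName}: saved` : `${settingName}: could not save`;
}
```

The ordering matters in practice: a region that is created and filled in the same moment is often not announced, which is why the sketch separates preparing the region from writing to it.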
Then it got subtle, and this is the part I like. Right next to those toggles is a row of day buttons for the weekly digest, and saving one of those did announce, because a developer had run that one path through a different mechanism with a live region built in. Same screen. Two save actions. One speaks, one says nothing. The first pass of the audit wrote it up as “nothing on this screen announces”. The second agent, the one whose entire job is to attack the first one’s work, caught that as an overstatement and corrected it to the exact truth: the toggles are silent, the day buttons are fine. That is 4.1.3 decided correctly, and no scanner alive would have surfaced any of it.
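The corrected finding is per action, not per screen, and a check can be shaped the same way. In this TypeScript sketch, the length of the narration log is noted before each action runs, and whatever is spoken afterwards is examined for the expected confirmation. The `Action` shape, the `perform` callback, and the `log` accessor are hypothetical stand-ins for whatever the harness provides.

```typescript
interface Action {
  label: string; // which control this is, for the report
  perform(): Promise<void>; // trigger the control
  expectedAnnouncement: RegExp; // what a screen reader should say afterwards
}

interface Verdict {
  label: string;
  announced: boolean;
  heardAfter: string[];
}

// Run each action and decide, one action at a time, whether a status
// message was spoken. `log` returns every phrase spoken so far.
async function checkStatusMessages(
  actions: Action[],
  log: () => Promise<string[]>,
): Promise<Verdict[]> {
  const verdicts: Verdict[] = [];
  for (const action of actions) {
    const before = (await log()).length;
    await action.perform();
    const heardAfter = (await log()).slice(before); // only what this action caused
    verdicts.push({
      label: action.label,
      announced: heardAfter.some((p) => action.expectedAnnouncement.test(p)),
      heardAfter,
    });
  }
  return verdicts;
}
```

A real run would also need to wait for speech to settle after each action before reading the log. Keeping one verdict per action is what makes a claim like “nothing on this screen announces” checkable line by line.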
That is also why the second pass matters. A single agent can be confidently wrong in the same way a single developer can. Pointing one at the other’s work, told to disprove it, is what turns a plausible answer into a defensible one.
Putting it in your own stack
None of this replaces the scanner. It sits on top of it, the way the harder 70 percent sits on top of the easy 30. A setup that works looks like this. Run axe in CI on every pull request to catch the structural mistakes early. Then run an agent pass on the screens that matter, driving a real screen reader and ruling on the judgment criteria a rule engine skips. The scanner does the cheap, certain part. The agent covers the part the scanner was never built for.
You also do not run the agent on all nine hundred screens. You run it on the ones that matter, the sign-up, the checkout, the core task a user repeats every day, and let the scanner hold the line everywhere else. Coverage is a budget, and the judgment pass is the expensive part. Spend it where a real person would feel the difference.
The full version of how I ran this, across four codebases for sixteen hours, is in the main write-up. If the question on your mind is how an agent runs that long without losing the thread, I covered the machinery of long autonomous jobs in a separate post.





