CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

What axe-core misses, and how AI caught it with a real screen reader

Axe-core catches about a third of WCAG failures and skips anything that needs judgment. Here are the thirteen criteria a scanner cannot decide, how an AI agent drives a real VoiceOver session to cover them, and the save button that passed every automated check and was silent to a blind user.

What you will learn

  1. Why axe-core, for all its strengths, stops at about a third of WCAG.
  2. The thirteen criteria a scanner cannot decide, in plain language.
  3. How an AI agent drives a real screen reader instead of a simulation of one.
  4. A bug that passed every automated check and was silent to a blind user.

Axe-core is the best automated accessibility tool there is, and it will tell you almost nothing about whether a blind person can use your product. Those two facts sit together more easily than you would expect.

A scanner reads your markup and checks rules. A screen reader reads your page out loud to a human who cannot see it. The gap between those two is where most real accessibility problems hide, and it is the part almost nobody tests, because until recently testing it meant a person with headphones working through every screen by hand.

What axe is good at

Credit where it is due. Axe-core, the engine inside most scanners and the one I run on every build, catches a whole class of mistakes fast and without complaint. A missing alt attribute. An ARIA role spelled wrong. An input with no label in the markup. Text sitting below the contrast line. These are real problems, they are common, and catching them on every pull request is the cheapest accessibility win you will ever get.

If you take one practical thing from this post, it is to run a scanner in CI. I am not here to talk you out of axe. I am here to tell you where it stops.
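One way to wire a scanner into CI, as a minimal sketch rather than the setup used in this audit: the function below takes a results object shaped like the one axe-core reports (a `violations` array whose entries carry `id`, `impact`, and `nodes`) and decides which violations should break the build. The `AxeLikeResults` type, the function names, and the policy of blocking only on serious and critical impact are assumptions made for illustration.

```typescript
// Minimal subset of an axe-style result; only the fields used here are modelled.
type Impact = "minor" | "moderate" | "serious" | "critical";

interface AxeLikeViolation {
  id: string;            // rule id, for example "image-alt"
  impact: Impact | null; // impact can be unset on some results
  nodes: unknown[];      // one entry per offending element
}

interface AxeLikeResults {
  violations: AxeLikeViolation[];
}

// Hypothetical policy: which impact levels are allowed to fail the build.
const DEFAULT_BLOCKING: readonly Impact[] = ["serious", "critical"];

// Keep only the violations whose impact is on the blocking list.
function blockingViolations(
  results: AxeLikeResults,
  blocking: readonly Impact[] = DEFAULT_BLOCKING,
): AxeLikeViolation[] {
  return results.violations.filter(
    (v) => v.impact !== null && blocking.includes(v.impact),
  );
}

// Build a short, human-readable summary for the CI log.
function ciSummary(results: AxeLikeResults): string {
  const failing = blockingViolations(results);
  if (failing.length === 0) return "axe: no blocking violations";
  const lines = failing.map(
    (v) => `${v.id} (${v.impact}) on ${v.nodes.length} element(s)`,
  );
  return `axe: ${failing.length} blocking violation(s)\n` + lines.join("\n");
}
```

A CI step would fail the job whenever `blockingViolations(results)` is non-empty and print `ciSummary(results)` so the pull request shows which rules tripped.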

Related reading

The full process is in my write-up on how I ran a real accessibility audit. My post on why overlays do not work covers the tools that pretend the scanner score is the whole job.

The thirteen it cannot decide

Roughly 30 to 40 percent of WCAG failures are the kind a rule engine can catch. The rest need judgment, and the people who build axe are upfront about it. The tool deliberately skips any check where a wrong pass would be worse than no answer.

Thirteen WCAG success criteria fall squarely in that gap. A few of them, in human terms.

Does a status message get announced? When something saves, or an error appears, does a screen reader say so, or does it change only on screen where a blind user cannot see it? That is WCAG 4.1.3, Status Messages, and hold onto it, because it is the one that bit me.

Does the page reflow at phone width without clipping content off the side? Does a custom control announce its own on-or-off state? Does a tooltip that appears on hover also work from a keyboard, and stay put long enough to read? Is the contrast of a button’s border, not its text, high enough in dark mode as well as light? Are the headings on the page descriptive enough that someone jumping between them knows where they are? None of these can be answered by reading markup. A person, or an agent doing the person’s job, has to look and listen.

Driving a real screen reader

The part that surprised me is that an agent can run the actual screen reader. Not a model of one. The real thing.

On macOS that is VoiceOver, the same tool a blind user runs every day. An open-source harness called Guidepup lets a script start VoiceOver, move its cursor through a page, and capture every word it speaks. The agent jumps in by heading, walks the page from top to bottom, and records the narration. On one settings screen it captured eighty spoken phrases, twenty-nine of them real page content, and reached the line that reads out “heading level 3, Digest Emails”.
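A minimal sketch of that walk, under stated assumptions. The `SpeechSource` interface is hypothetical, with method names chosen to resemble the `next()` and `lastSpokenPhrase()` calls I understand Guidepup's VoiceOver driver to offer; check the Guidepup documentation before relying on that. The step cap and the end-of-page test are additions of mine so the loop always terminates, and the heading filter only assumes narration shaped like the “heading level 3, Digest Emails” line quoted above.

```typescript
// Hypothetical minimal view of a screen reader driver: move the cursor,
// then read back what was just spoken.
interface SpeechSource {
  next(): Promise<void>;
  lastSpokenPhrase(): Promise<string>;
}

// Step through the page, recording each spoken phrase, until the end
// condition matches or the step cap is reached.
async function captureNarration(
  reader: SpeechSource,
  maxSteps: number,
  isEnd: (phrase: string) => boolean,
): Promise<string[]> {
  const phrases: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    await reader.next();
    const phrase = (await reader.lastSpokenPhrase()).trim();
    phrases.push(phrase);
    if (isEnd(phrase)) break;
  }
  return phrases;
}

// Pull out the phrases that were announced as headings.
function headingsHeard(phrases: readonly string[]): string[] {
  return phrases.filter((p) => /^heading level \d/i.test(p));
}
```

Because the driver is passed in, the same loop can run against a fake in tests and against the real screen reader on a Mac. The returned array is the recording that everything after this point is judged against.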

[Figure: Claude Code terminal after a 16-hour WCAG audit, with a real macOS VoiceOver caption box reading out a toolbar.]

The white box is VoiceOver speaking the page aloud while the agent records what it says. That is the half of accessibility a scanner never hears.

That recording is the evidence. If a control has no name, you do not guess it from the markup. You hear the screen reader say “checkbox” with nothing attached, nine times over if there are nine of them, and you write down exactly what a blind user would hear.

The judgment comes from comparing two things. What the screen reader actually said, and what it should have said if the control were built right. A toggle that reads as “checkbox” with no name is a missing label, full stop. A button that reads identically whether it is on or off has no state a blind user can perceive. You do not need a rulebook for that. You need the recording and the sense to know what is wrong with it, which is the part an agent turns out to be surprisingly good at.
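The two comparisons above can be sketched as plain functions over the captured narration. This is an illustration, not the agent's actual logic: the list of role words is an assumption, real VoiceOver phrasing varies by control and OS version, and the names `findUnnamedControls` and `stateIsPerceivable` are made up here.

```typescript
interface Finding {
  index: number;   // position of the phrase in the narration log
  phrase: string;  // what the screen reader actually said
  problem: string; // why it is a failure
}

// Assumed role words that, spoken alone, mean the control has no name.
const ROLE_WORDS: readonly string[] = ["checkbox", "button", "switch", "link"];

// Flag every phrase that is only a role, with nothing attached.
function findUnnamedControls(phrases: readonly string[]): Finding[] {
  const findings: Finding[] = [];
  phrases.forEach((phrase, index) => {
    if (ROLE_WORDS.includes(phrase.trim().toLowerCase())) {
      findings.push({ index, phrase, problem: "role announced with no name" });
    }
  });
  return findings;
}

// A control has a perceivable state only if it sounds different on and off.
function stateIsPerceivable(heardWhenOn: string, heardWhenOff: string): boolean {
  const normalize = (s: string) => s.trim().toLowerCase();
  return normalize(heardWhenOn) !== normalize(heardWhenOff);
}
```

Nine unnamed toggles would produce nine findings, one per occurrence, which matches writing down exactly what a blind user would hear.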

The bug a screen reader found and axe passed

This is the one I promised. On the email-notifications screen there are nine on-off toggles. Flip one and a small “Saved” appears beside it. A scanner looked at that screen and passed it. Contrast fine, markup had labels, nothing tripped a rule.

The screen reader told a different story. Those nine “Saved” messages were plain text on screen with no live region behind them, so a blind user flipping a toggle heard nothing. No confirmation, and no warning if a save had failed. Silent.
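Once the narration is recorded, that silence can be checked mechanically. A minimal sketch, assuming the log is append-only so anything spoken after the action sits at the end of the later snapshot; the function names and the sample phrases in the usage note are hypothetical.

```typescript
// Phrases spoken after the action: the tail of the later snapshot.
function newPhrasesSince(
  logBefore: readonly string[],
  logAfter: readonly string[],
): string[] {
  return logAfter.slice(logBefore.length);
}

// True only if the expected status text was actually spoken after the action.
function statusWasAnnounced(
  logBefore: readonly string[],
  logAfter: readonly string[],
  expectedStatus: string,
): boolean {
  const needle = expectedStatus.trim().toLowerCase();
  return newPhrasesSince(logBefore, logAfter).some((phrase) =>
    phrase.toLowerCase().includes(needle),
  );
}
```

For a toggle whose log grows only by a state word such as "checked", `statusWasAnnounced(before, after, "Saved")` returns false, which is the 4.1.3 failure: the confirmation exists on screen and nowhere in the speech.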

Then it got subtle, and this is the part I like. Right next to those toggles is a row of day buttons for the weekly digest, and saving one of those did announce, because a developer had run that one path through a different mechanism with a live region built in. Same screen. Two save actions. One speaks, one says nothing. The first pass of the audit wrote it up as “nothing on this screen announces”. The second agent, the one whose entire job is to attack the first one’s work, caught that as an overstatement and corrected it to the exact truth: the toggles are silent, the day buttons are fine. That is 4.1.3 decided correctly, and no scanner alive would have surfaced any of it.

That is also why the second pass matters. A single agent can be confidently wrong in the same way a single developer can. Pointing one at the other’s work, told to disprove it, is what turns a plausible answer into a defensible one.
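The second pass in this audit was another agent, not a function, but the shape of its check can be sketched in a few lines: take a blanket claim and look for counterexamples in the per-control evidence. Everything below, including the `SaveEvidence` type and the control labels in the usage note, is a hypothetical illustration.

```typescript
// One observation per control: did saving it produce a spoken confirmation?
interface SaveEvidence {
  control: string;
  announced: boolean;
}

type Verdict =
  | { kind: "claim-holds" }
  | { kind: "overstated"; counterexamples: string[]; stillSilent: string[] };

// Attack the claim "nothing on this screen announces a save".
function challengeNothingAnnounces(evidence: readonly SaveEvidence[]): Verdict {
  const counterexamples = evidence
    .filter((e) => e.announced)
    .map((e) => e.control);
  if (counterexamples.length === 0) return { kind: "claim-holds" };
  const stillSilent = evidence
    .filter((e) => !e.announced)
    .map((e) => e.control);
  return { kind: "overstated", counterexamples, stillSilent };
}
```

Fed evidence where the toggles are silent and a day button announces, the verdict is "overstated" with the day button as the counterexample and the toggles listed as still silent. That is the narrower, defensible finding.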

Putting it in your own stack

None of this replaces the scanner. It sits on top of it, the way the harder 60 to 70 percent sits on top of the easy 30 to 40. A setup that works looks like this: run axe in CI on every pull request to catch the structural mistakes early, then run an agent pass on the screens that matter, driving a real screen reader and ruling on the judgment criteria a rule engine skips. The scanner does the cheap, certain part. The agent covers the part the scanner was never built for.

You also do not run the agent on all nine hundred screens. You run it on the ones that matter, the sign-up, the checkout, the core task a user repeats every day, and let the scanner hold the line everywhere else. Coverage is a budget, and the judgment pass is the expensive part. Spend it where a real person would feel the difference.

The full version of how I ran this, across four codebases for sixteen hours, is in the main write-up. If the question on your mind is how an agent runs that long without losing the thread, I covered the machinery of long autonomous jobs in a separate post.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
