Amit Kothari CEO of Tallyfy, AI advisor at Blue Sheen

AI errors need AI-level explanations

In brief

AI systems fail gradually and partially, not in clear binary states like traditional software. IDC research shows 88% of AI proof-of-concepts never reach production. The model gives a plausible answer that misses important context. Latency spikes but stays under timeout limits. Outputs degrade invisibly. Your AI error handling must match this reality.


Nov 8, 2025 · Updated Jun 12, 2026 · AI


AI in production fails in ways your normal error logging never anticipated.

[Figure: Where a workflow layer catches the failure your dashboard misses. Example run "Claims intake check" (Run #1,204, running now) with four steps: 1. Document arrives (intake queue; completed, on time). 2. AI extracts the fields (Claude AI agent; active now). 3. Low score routes to a reviewer (on-call reviewer; waiting, 1h left). 4. Auto-post when it validates (conditional; automatic when the schema checks).]

Phase 1: Set up

Write a success test for each step. You cannot catch a bad answer you never defined.

Set a confidence floor and a fallback. Below the floor, route to a person not a guess.

Pick the human review path. A failing step needs somewhere to land.

Phase 2: Run

AI runs the step and scores itself. A hidden confidence signal helps no one.

A validation gate rejects bad output. Valid JSON full of nonsense still fails.

Low confidence goes to a human. Guessing under the floor is the expensive mistake.

Phase 3: Track and improve

Track degradation, not just error rate. A 200 response can still be quietly wrong.

Feed every failure back as test data. The questions that broke it train the next fix.

I built Tallyfy to solve exactly this pattern.

Traditional error handling expects binary outcomes. Works or broken. Success or failure. But AI systems fail gradually, partially, inconsistently. The model gives you an answer that looks fine but misses key context. Latency spikes but stays under timeout thresholds. Output degrades in ways your metrics don’t catch. The whole thing is properly disorienting if you came from a traditional stack.

I learned this watching most AI projects fail before reaching production. The ones that make it? They fail differently. Getting AI observability and monitoring right is the prerequisite for handling errors properly. In conversations I’ve had with engineering leads at mid-size firms, the most common admission is that the team treats AI like another microservice and is then surprised when it doesn’t break like one.

Why do AI errors break your usual assumptions?

Your application crashes, you get a stack trace. Clear. Reproducible. Fixable.

Your AI hallucinates? Good luck debugging that. The thing is, the same prompt works Tuesday, fails Thursday, works again Friday. This irritates me because everyone wants a deterministic bug to chase. Context windows fill gradually until responses degrade. Your model drifts as production data diverges from training data. None of this triggers traditional error handlers. The whole discipline of building reliable agents starts from this basic mismatch.

Research from Google’s PAIR team shows what users consider an error connects deeply to their expectations. When AI fails, users don’t know if they asked wrong, if the system broke, or if the task was impossible. Traditional software sets clear boundaries. AI blurs them.

The production reality hits hard. IBM invested heavily in Watson for Oncology. The system gave dangerous treatment recommendations because it trained on hypothetical cases instead of real patient data. Nothing flagged this. Technically, everything worked fine. Medically, it was a nightmare.

What partial failure actually looks like

November 8, 2023. OpenAI’s API went down with 502 and 503 errors for over 90 minutes.

[Figure: Flowchart showing AI partial failure handling with confidence check, fallback model, cached response, and human escalation paths]

Applications built on their API experienced widespread failures at the same time. If your error handling assumed “the API works,” your users got cryptic timeout messages and nothing else.

Knight Capital learned this expensively. A failed deployment left old test code running on one of eight servers, triggering millions of erroneous trades. $440 million lost in 45 minutes. The system never crashed. It just executed perfectly wrong instructions.

Your AI error handling in production needs to anticipate these partial failures: the model returns JSON that validates but contains nonsense, your embedding service times out intermittently, the vector database returns results with confidence scores all below your threshold, your guardrails catch inappropriate content but don’t tell users what they should ask instead. I keep going back and forth on which of these matters most, and the blunt answer is whichever one your particular system hits first in production.

Since I wrote this, the failure class picked up first-party acknowledgment. Anthropic’s Claude Fable 5 ships safety classifiers that can decline a request in-band: the call completes, stop_reason comes back as “refusal”, and the response names which classifier fired. A status-code check reads that as success. The list above still holds; the vendors now design for it, down to a server-side fallbacks retry parameter (in beta) for the decline path.
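A minimal sketch of what looking past the status code can mean in practice. The response body is simplified to a plain dict with a stop_reason key, and classify_response is a hypothetical helper for illustration, not part of any SDK.

```python
def classify_response(status_code: int, body: dict) -> str:
    """Classify a completed model call by more than its HTTP status."""
    if status_code != 200:
        return "transport_error"
    stop_reason = body.get("stop_reason")
    if stop_reason == "refusal":
        # The call succeeded at the HTTP level, but the model declined in-band.
        return "declined"
    if stop_reason == "max_tokens":
        # Output was cut off; treat it as a partial answer, not a success.
        return "truncated"
    return "ok"


print(classify_response(200, {"stop_reason": "refusal"}))  # declined
```

The point of the design is that "declined" and "truncated" become first-class outcomes your code can route on, instead of hiding inside a 200.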

IDC found 88% of AI proof-of-concepts never reach production. They fail integration tests that never considered AI-specific failure modes. Recent experience produced what engineers call “Stalled Pilot” syndrome instead of the promised “Year of the Agent.” The reliability math gets worse from there. Per-step success rates multiply, so reliability decays exponentially with chain length: 95% reliability per step yields only about 36% success over 20 steps (0.95^20 ≈ 0.358), assuming the steps fail independently. Read that again. It is worse than it sounds. Production demands 99.9%+ reliability, yet the best AI agents still miss that bar badly on CRM tasks. The expected cancellation of many agentic AI projects tracks with this. Unanticipated cost, complexity, and risk pile up fast.

Let me say that better. I said above that AI fails “gradually and partially,” and that oversimplifies it. The actual pattern is that each individual call usually works, but the COMPOUND reliability across a multi-step chain collapses long before any single step fails outright. That is why the 95% per-step number looks fine on a dashboard and brilliant in a demo, but the end-to-end agent workflow still produces a broken experience.
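The compounding is easy to check yourself. This is a small sketch that assumes every step succeeds or fails independently, which real chains only approximate; the function names are mine.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach an end-to-end target."""
    return target ** (1 / steps)


print(f"{chain_success(0.95, 20):.1%}")  # 35.8%
print(required_per_step(0.999, 20))      # per-step reliability for 99.9% end to end
```

Run the second line and you will see why 99.9% end to end over 20 steps asks each individual step to be far more reliable than 99.9%.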

Degradation patterns worth building

Circuit breakers for AI calls. The pattern, popularized by Michael Nygard in Release It! (2007), is simple: after a threshold of failures, stop calling the broken service and route traffic away until health is restored. But AI needs more thought here. A circuit breaker for a traditional API might open after 5 consecutive failures. For AI, you need to track degradation over time, not just hard failures. Modern frameworks like LangGraph now offer durable state: if a server restarts mid-conversation or a workflow gets interrupted, it picks up where it left off. The OpenAI outage showed why this matters. Applications with fallback models survived. Those that assumed the API always works crashed.
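Here is a rough sketch of a breaker that opens on either signal: consecutive hard failures, or a sustained drop in a quality score. The class name, thresholds, and the idea of a single numeric score per call are all assumptions for illustration. The half-open state is simplified to "close again after a cooldown".

```python
import time
from collections import deque


class DegradationBreaker:
    """Opens on repeated hard failures OR on a sustained low quality score."""

    def __init__(self, max_failures=5, window=20, min_avg_score=0.6,
                 cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.min_avg_score = min_avg_score
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.scores = deque(maxlen=window)  # rolling window of quality scores
        self.consecutive_failures = 0
        self.opened_at = None               # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if a call to the model should be attempted."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Simplified recovery: close fully and start measuring afresh.
            self.opened_at = None
            self.consecutive_failures = 0
            self.scores.clear()
            return True
        return False

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.opened_at = self.clock()

    def record_success(self, score: float) -> None:
        self.consecutive_failures = 0
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and sum(self.scores) / len(self.scores) < self.min_avg_score:
            # Every call "worked", but quality has degraded: open anyway.
            self.opened_at = self.clock()
```

Call allow() before each model request, then record_failure() or record_success(score) afterwards. When allow() returns False, go straight to your fallback.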

Progressive feature reduction. When your primary model fails, don’t show an error. Switch to a simpler model. When that fails, fall back to cached responses. When that fails, route to human review. Research on graceful degradation shows users prefer reduced functionality over broken features. Your LLM summarization fails? Show the original text. Your classification model times out? Default to the most common category and flag for review. Your embedding search returns nothing? Fall back to keyword search.
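A minimal sketch of that ladder. The handler names and the cache are hypothetical stand-ins; each handler either returns a result, returns None, or raises.

```python
def with_fallbacks(prompt, handlers):
    """Try each (name, handler) in order; return the first usable result."""
    for name, handler in handlers:
        try:
            result = handler(prompt)
        except Exception:
            continue  # this rung failed; drop to the next one
        if result is not None:
            return name, result
    # Nothing produced an answer: hand off to a person instead of erroring.
    return "human_review", None


def primary(prompt):
    raise TimeoutError("primary model timed out")


def small_model(prompt):
    return None  # simulates a simpler model with nothing useful to say


CACHE = {"refund policy": "cached answer"}


def cached(prompt):
    return CACHE.get(prompt)


chain = [("primary", primary), ("small_model", small_model), ("cache", cached)]
print(with_fallbacks("refund policy", chain))  # ('cache', 'cached answer')
```

Returning the name of the rung that answered matters as much as the answer. It is how you later measure how often users were served degraded output.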

Intelligent retry with backoff. Retry and circuit breaking solve different problems: retry handles transient failures, circuit breakers handle persistent ones.

Transient: network hiccup, momentary rate limit, brief service degradation. Persistent: model serving failure, quota exhausted, fundamental capability limit.

Retry the first. Circuit break the second. The expensive mistake is retrying persistent failures until you hit timeout and waste the user’s time.
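A sketch of that split, assuming your client code has already sorted failures into two exception types. TransientError and PersistentError are hypothetical names; real SDKs raise their own exceptions that you would map onto these.

```python
import random
import time


class TransientError(Exception):
    """Network hiccup, momentary rate limit, brief degradation."""


class PersistentError(Exception):
    """Quota exhausted, serving failure, capability limit."""


def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Retry transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PersistentError:
            raise  # retrying will not help; let the circuit breaker handle it
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
```

The sleep argument is injectable only so the function can be tested without waiting. In production you would leave the default.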

The table below is the diagnostic shortlist I use in production-readiness reviews. When consulting with companies on AI rollouts, this is usually the first artifact I hand the on-call lead. The left column is the symptom your dashboards or users surface. The middle column is the underlying class of problem. The right column is the specific production fix, not a vague principle.

What you observe	What it tells you	Production fix
Model guesses instead of asking	Intent is underspecified at the API boundary	Required-field validation upstream; reject the call before it hits the model
Model invents facts or structure	Constraints are missing from the system prompt	Typed JSON schema + structured-output mode; reject malformed responses
Model answers differently each time	Temperature too high for the task, or context drift between calls	Drop temperature to 0.1-0.3 for structured tasks; cache deterministic prefixes
Model drops key details	Context window getting eaten; lost-in-the-middle U-curve	Chunk inputs, summarize first, put the critical instruction last in context
Model answers confidently but incorrectly	Confidence signal is hidden; no verification step	Ask for confidence with output; route low-confidence to LLM-as-judge or human review
Output quality collapses at scale	Cost or latency squeezed you onto a cheaper model	Circuit-break to fallback model; reduce task scope; cache aggressively

Every fix in the right column is one line of code or one config change away. None of these are research problems. They are operational discipline problems.
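To show how small these fixes can be, here is a sketch combining two rows of the table: rejecting output that parses but makes no sense, and routing low confidence to a person. The field names, the category list, and the 0.8 floor are hypothetical, and the confidence value is assumed to be self-reported by the model.

```python
import json
from typing import Optional, Tuple

ALLOWED_CATEGORIES = {"auto", "home", "health"}  # hypothetical claim types
CONFIDENCE_FLOOR = 0.8                           # hypothetical threshold


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def gate(raw: str) -> Tuple[str, Optional[dict]]:
    """Return ('accept' | 'human_review' | 'reject', parsed_data_or_None)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "reject", None
    if not isinstance(data, dict):
        return "reject", None
    category = data.get("category")
    amount = data.get("amount")
    confidence = data.get("confidence")
    # Valid JSON is not enough: the values must make sense for the task.
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return "reject", None
    if not _is_number(amount) or amount <= 0:
        return "reject", None
    if not _is_number(confidence) or not 0 <= confidence <= 1:
        return "reject", None
    if confidence < CONFIDENCE_FLOOR:
        return "human_review", data
    return "accept", data


print(gate('{"category": "auto", "amount": 1200, "confidence": 0.55}')[0])  # human_review
```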

Two of these rows aged fast. As of mid-2026, structured outputs are GA on the Claude API, so schema-enforced responses are a standard feature rather than a workaround. The temperature dial went the other direction: Anthropic deprecated temperature, top_p, and top_k on Opus 4.7 and later models, where non-default values return a 400. The advice still stands; on current Claude models, the schema route now carries the consistency work the temperature dial used to.


Telling users what actually went wrong

Microsoft’s Tay chatbot failed catastrophically in 16 hours. But I’d argue the real failure was communication. Users had no idea what they were teaching the system by interacting with it. The whole thing is rubbish to look back on now.

Follow this for a second. Error messages for AI need a different approach. Most teams cobble together generic 500-error strings and call it done. (Five hundred. The polite computer way of saying we have no idea.)

Don’t blame the user. “Invalid input” makes them feel stupid when they asked a reasonable question your model couldn’t handle. Try: “I can’t process questions about that topic yet, but I can help with…”

Explain the limitation. Air Canada’s chatbot gave wrong refund information and the airline paid for it. The error wasn’t the wrong answer. It was failing to communicate uncertainty. Give users somewhere to go next. Research on AI error messages shows users need paths forward, not explanations of what broke. Instead of “API timeout error 504,” try “This is taking longer than expected. Try a simpler question, or I can connect you to someone who can help.”

Is that extra sentence of explanation worth writing? Every time. Your monitoring catches the error. Your message determines whether users trust you the next time.
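One way to keep this honest is to map internal failure kinds to user-facing messages in one place. This is a sketch: the failure kinds are hypothetical labels, the supported topics are supplied by the caller, and the low-confidence and default wordings are mine rather than from any study.

```python
def user_message(kind: str, supported_topics: list[str]) -> str:
    """Map an internal failure kind to a message that offers a next step."""
    topics = ", ".join(supported_topics)
    messages = {
        "unsupported_topic": (
            "I can't process questions about that topic yet, "
            f"but I can help with {topics}."
        ),
        "timeout": (
            "This is taking longer than expected. Try a simpler question, "
            "or I can connect you to someone who can help."
        ),
        "low_confidence": (
            "I'm not confident in my answer here, "
            "so I've passed this to a person to check."
        ),
    }
    # Unknown failures still get a path forward, never a raw error code.
    return messages.get(
        kind,
        "Something went wrong on our side. "
        "I can connect you to someone who can help.",
    )


print(user_message("unsupported_topic", ["billing", "order status"]))
```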

Recovery and learning from the wreckage

This is where AI incident response becomes its own discipline, distinct from traditional ops postmortems. I’m not convinced most teams are even close to ready for it.

AI observability tools track token usage, latency, prompt-response pairs, and failure modes. 89% of teams have implemented observability for their agents, outpacing evaluation adoption at just 52%. Observability without action just gives you prettier dashboards while things break. That gap is where so much engineering bandwidth gets quietly burned.

The production pattern that works: monitor, detect, act, learn.

Monitor the right metrics. Error rate matters less than degradation rate. Your API returns 200s but confidence scores dropped 30%. That’s a failure your HTTP status codes miss. Track quality metrics like hallucination rates, relevance scores, and grounding accuracy. Not just uptime.
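A minimal sketch of a degradation check that ignores status codes entirely. It assumes you log one confidence score per call and have a baseline from a period you trust; the class name and the window size are illustrative.

```python
from collections import deque
from statistics import mean


class ConfidenceMonitor:
    """Alerts when rolling mean confidence falls too far below a baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.30):
        self.baseline = baseline
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)

    def observe(self, confidence: float) -> bool:
        """Record one score; return True if the relative drop crosses max_drop."""
        self.recent.append(confidence)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to judge
        drop = (self.baseline - mean(self.recent)) / self.baseline
        return drop >= self.max_drop
```

Every call in that window could have returned a 200. The alert fires on the quality signal alone.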

Detect drift before users complain. Production data differs from training data. Your model performs well on last year’s patterns but production moved on. Set up alerts for statistical drift in input distributions and output confidence. Teams moving to event-driven architectures catch drift in real time rather than waiting for nightly batch analysis.
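One common way to put a number on input drift is the Population Stability Index, which I am swapping in here as an example rather than claiming it is the only option. This sketch works on a single numeric feature, such as prompt length. The bin count and smoothing are assumptions, and the often-quoted 0.1 and 0.25 cutoffs are rules of thumb, not guarantees.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width if baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1  # clamp out-of-range values
        # Light smoothing so empty bins do not produce log(0).
        return [(c + 0.5) / (len(values) + 0.5 * bins) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # last year's inputs
recent = [0.5 + i / 200 for i in range(100)]   # production has moved on
print(psi(baseline, recent) > 0.25)  # True
```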

Auto-scale intelligently. The November OpenAI outage happened because routing nodes hit memory limits under unexpected load. AI traffic spikes differently than web traffic. One complex query might consume 100x the resources of a simple one.

Learn from every failure. Amazon’s AI recruiting tool discriminated against women because training data reflected existing bias. The error handling never caught it because technically, the system worked fine. Does more data fix this? No. You need human review of outputs, not just performance metrics.

Self-healing automation. The most mature teams now monitor their entire AI estate for patterns indicating impending failures: memory leaks, integration timeouts, embedding drift. Systems trigger preventive actions automatically before users notice anything wrong.

I think the teams that get this right treat every failure as training data. The questions that broke your model? That’s your next fine-tuning dataset. The contexts where confidence dropped? Gaps in your knowledge base, waiting to be filled.

Production AI isn’t about preventing all failures. It’s about failing with some dignity, communicating clearly, and recovering fast.

The companies that win with AI aren’t the ones whose models never fail. They’re the ones whose users barely notice when they do.


About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
