Amit Kothari CEO of Tallyfy, AI advisor at Blue Sheen

How to hire an applied AI engineer

In brief

A standard software interview will not tell you whether someone can do the job of an applied AI engineer. The role-defining trait, making an unreliable model dependable, needs a different loop: a real take-home, a rubric that scores failure-mode thinking, and flags you can read in the room.

May 20, 2026 · AI

Key takeaways

Standard SWE interviews miss this role - leetcode and system design do not surface whether someone can make an unreliable model dependable
Test with real work - a take-home that builds a small AI feature reveals more in an evening than four whiteboard rounds
Score the failure-mode thinking - the rubric should reward the candidate who asked how it breaks, not the one with the slickest demo
Watch the flags - capability-first talk is a yellow flag, no eval and no opinion on hallucination is a red one

Hiring an applied AI engineer with a standard software interview is a category error, and it is a common one. The reason is plain. A standard interview, leetcode rounds and a system-design whiteboard, was built to test general software skill, and an applied AI engineer does need general software skill. But the thing that makes the role distinct, the ability to take an unreliable, probabilistic model and build something dependable on top of it, is exactly the thing a standard loop never asks about. A candidate can pass every round and still leave you with no idea whether they can do the actual job.

I wrote separately about what an applied AI engineer is and why the defining trait is failure-mode thinking: reasoning about how a system breaks before reasoning about what it can do. This post is the practical follow-on. If that trait is what you are hiring for, how do you actually test for it? The answer is a different interview, and the rest of this is how to run it.

Why standard interviews miss this

Look at what a standard interview measures and the gap is obvious. A leetcode round measures algorithmic problem-solving on a closed, deterministic problem with a known correct answer. A system-design round measures whether someone can architect a service that scales. Both are worth testing, and an applied AI engineer should be reasonably good at both. Neither touches the role’s center. The center is working with a component that does not have a known correct answer, that can be wrong, that can be manipulated by hostile input, and that behaves differently on inputs nobody tested. Nothing in a deterministic algorithm puzzle exercises the judgment that handles a non-deterministic component. So a candidate can ace the standard loop and still believe, underneath, that a model that worked in the demo is a model that works. That belief is the single most expensive thing you can hire, and the standard interview is blind to it.

The cost of getting this wrong is not abstract. An applied AI engineer who cannot do the reliability part still produces things. They produce demos that win the room and features that fail quietly in production a month later, and because the demo was convincing, nobody connects the later incident to the hire. The standard interview does not just fail to find the right person. It actively rewards the wrong one, because the candidate who is fluent and capability-focused interviews beautifully. You are not screening out the expensive mistake. You are selecting for it.

It is worth being precise about why the demo deceives, because the deception is structural, not a matter of dishonest candidates. A demo runs on inputs the builder chose. Of course it works; it was shaped until it did. Production runs on inputs nobody chose, including inputs nobody imagined, and the gap between those two input distributions is the whole job of an applied AI engineer. A standard interview only ever sees the chosen-input version of a candidate’s work. It is structurally incapable of showing you how they handle the unchosen input, which means it cannot evaluate the one skill that matters most. That is not a flaw you fix with better questions in the same format. It is a reason to change the format.

The interview loop that works

A loop built for this role keeps the general-skill checks but reorganizes around evidence of building reliable AI systems. Four stages do the work. First, a screen that is really one question asked well: “Tell me about an AI system you shipped, and what went wrong with it.” The answer separates people fast. Second, a take-home, a small real AI feature to build, because the work itself reveals more than any amount of talking about the work. Third, a review of that take-home, where you walk through their solution and probe the failure modes they did and did not consider. Fourth, a discussion round, no coding, on how their system behaves at scale and under attack. Across all four, the question underneath is the same: does this person treat the model’s unreliability as the central problem? Keep one or two general software rounds if you like. Make these four the spine.

An applied AI engineer interview loop: screen what they shipped, take-home build task, review the failure modes, discuss how it breaks, hire

One caution about the loop: it should not become a marathon. Four focused stages plus maybe two general rounds is already a real ask of a candidate’s time, and the best applied AI engineers have other offers. The take-home in particular has to respect the evening you asked for; a take-home that quietly needs a weekend tells strong candidates you do not value their time, and they will act on that signal. The loop is meant to be sharper than a standard interview, not longer. Depth comes from asking the right things, not from adding rounds.

The screen question deserves more than one line, because it does a lot of work for one question. “Tell me about an AI system you shipped, and what went wrong with it” has two halves, and the second half is the real test. A candidate who shipped real AI systems has war stories without effort: the model that hallucinated a policy, the retrieval step that kept returning the wrong document. They tell them readily, because operating these systems means collecting them. A candidate who only built demos has no second half. They answer the shipped part and then go quiet, or reach for something generic. You are not grading the failure itself. You are grading whether they have lived close enough to production to have one.

Take-home assignments that reveal reliability

The take-home is the heart of the loop, so design it with care. The task should be a small, real AI feature, the kind of thing the job actually involves: a feature that answers questions from a set of documents, or a small agent that uses a tool or two to complete a task. Keep the scope to an evening; you are not buying free work, you are buying a signal. The signal is not whether the happy path works. Any competent candidate makes the happy path work. The signal is everything around it: did they handle the case where the model returns nonsense, did they treat the prompt as something to harden rather than something to get working once, did they leave any way to tell whether the feature is actually good. A take-home scoped this way turns an evening of a candidate’s time into the clearest read you will get.
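As a reviewer's reference point, here is a minimal sketch of what "handling the case where the model returns nonsense" can look like in a document question-answering take-home. Everything named here is an assumption for illustration: `call_model` stands in for whatever model client the candidate uses, and the JSON shape with `answer` and `source_id` is a hypothetical output contract, not a standard.

```python
import json
from typing import Callable, Optional

# Hypothetical output contract: the prompt asks the model for exactly these keys.
ALLOWED_KEYS = {"answer", "source_id"}


def parse_answer(raw: str, known_source_ids: set[str]) -> Optional[dict]:
    """Return a validated answer dict, or None if the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        return None  # wrong shape, missing keys, or extra keys
    answer, source_id = data["answer"], data["source_id"]
    if not isinstance(answer, str) or not answer.strip():
        return None  # empty or non-text answer
    if not isinstance(source_id, str) or source_id not in known_source_ids:
        return None  # cites a document that is not in the corpus
    return {"answer": answer.strip(), "source_id": source_id}


def answer_question(
    question: str,
    call_model: Callable[[str], str],  # hypothetical model client
    known_source_ids: set[str],
    max_attempts: int = 2,
) -> dict:
    """Retry a bounded number of times, then fall back instead of guessing."""
    for _ in range(max_attempts):
        parsed = parse_answer(call_model(question), known_source_ids)
        if parsed is not None:
            return parsed
    # Explicit fallback: the caller can show "no reliable answer found".
    return {"answer": None, "source_id": None, "fallback": True}
```

A submission does not need to look like this. The point of the sketch is the shape of the thinking: the model's output is treated as untrusted input, validated against the known documents, and given a defined fallback path.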

One detail makes the take-home fairer and the signal cleaner: tell the candidate explicitly what you are looking for. Say, in writing, that you care less about the feature working in a demo than about how they handled the ways it can fail, and that an evening is the budget. This is not giving away the answer. A candidate who can act on that brief is showing you exactly the skill you want; a candidate who still ships only a happy-path demo after being told plainly is showing you something too. The instruction removes the excuse and keeps the signal.

A few take-home mistakes quietly ruin the signal, so avoid them on purpose. Do not make it a puzzle; a clever algorithmic trick tells you nothing about AI-system reliability and just filters for puzzle practice. Do not make it open-ended enough to need a weekend; scope creep in the prompt becomes scope creep in the submission, and then you are comparing candidates who spent different amounts of time. Do not ask for production polish; you want to see thinking, not a deployment. And do not reuse a take-home a candidate could find a public solution to. The task should be small, specific, novel enough to require real thought, and clear about the evening it asks for. Get those right and the take-home does its job. Get them wrong and you have noise.

Scoring what matters

A take-home only helps if you score it for the right thing, and the default scoring instinct, does it work, is the wrong one. Build the rubric around failure-mode awareness. Give real weight to a handful of questions. Did the candidate handle the model returning something unusable? Did they treat the prompt as something to harden? Did they build, or even sketch, a way to evaluate whether the feature is good, rather than trusting a glance? Did they name the failure modes they did not have time to handle, which shows they saw them? A slick demo with none of that scores low. A rougher submission that engaged seriously with how the thing breaks scores high. That inversion, rewarding the engagement with failure over the polish of the happy path, is the whole point of the rubric, and writing it down keeps every interviewer scoring the same thing.
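For the evaluation question in particular, it helps to know what "even a sketch" might look like. The following is one minimal possibility, assuming the feature is a function from a question to a text answer. The `Case` type, the substring check, and the sample cases are all hypothetical simplifications; a real evaluation would likely use stronger checks than a substring match.

```python
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    question: str
    must_contain: str  # substring a correct answer is expected to include


def run_eval(feature: Callable[[str], str], cases: Iterable[Case]) -> dict:
    """Run every case through the feature and report the pass rate and failures."""
    total = 0
    failures = []
    for case in cases:
        total += 1
        try:
            output = feature(case.question)
        except Exception as exc:
            # A crash counts as a failure, not as a reason to stop the run.
            failures.append((case.question, f"raised {type(exc).__name__}"))
            continue
        if case.must_contain.lower() not in output.lower():
            failures.append((case.question, output))
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": failures,
    }
```

A candidate who includes something of this size, with a few cases that cover odd or hostile inputs as well as the happy path, has given the panel direct evidence for the rubric's evaluation question.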

To make the rubric concrete, give it a shape a panel can apply consistently. Score the take-home in two parts, weighted. The smaller part, perhaps a third, is general engineering: is the code sound, structured, readable. The larger part, the other two-thirds, is reliability engineering, and it is itself a short list: handling of bad model output, treatment of the prompt, presence of any evaluation, and explicit awareness of unhandled failure modes. Each of those gets a score, with notes. The exact weights matter less than the ratio, reliability outweighing general polish, and the discipline of every interviewer filling the same fields. A rubric like that turns “I liked their submission” into four specific judgments a panel can actually compare.
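The two-part weighting above can be written down as a small scoring function, which is one way to keep a panel filling the same fields. This is a sketch under stated assumptions: the 0 to 4 scale, the field names, and the equal weighting of the four reliability items are illustrative choices, not part of the rubric as described.

```python
from dataclasses import dataclass

MAX_SCORE = 4                # assumed scale: each field is scored 0-4
GENERAL_WEIGHT = 1 / 3       # general engineering: "perhaps a third"
RELIABILITY_WEIGHT = 2 / 3   # reliability engineering: the other two-thirds


@dataclass(frozen=True)
class RubricScores:
    general_engineering: int
    bad_output_handling: int
    prompt_treatment: int
    evaluation: int
    unhandled_failure_awareness: int

    def weighted_total(self) -> float:
        """Return a normalized score between 0.0 and 1.0."""
        reliability = (
            self.bad_output_handling,
            self.prompt_treatment,
            self.evaluation,
            self.unhandled_failure_awareness,
        )
        for value in (self.general_engineering, *reliability):
            if not 0 <= value <= MAX_SCORE:
                raise ValueError(f"score {value} is outside 0-{MAX_SCORE}")
        reliability_avg = sum(reliability) / len(reliability)
        raw = GENERAL_WEIGHT * self.general_engineering + RELIABILITY_WEIGHT * reliability_avg
        return raw / MAX_SCORE


# A polished demo with no reliability work, versus a rougher but careful submission.
slick_demo = RubricScores(4, 0, 0, 0, 0)
rough_but_careful = RubricScores(2, 3, 3, 3, 3)
print(round(slick_demo.weighted_total(), 2))         # 0.33
print(round(rough_but_careful.weighted_total(), 2))  # 0.67
```

With these assumed weights, perfect general engineering alone can never score above one third, which is the inversion the rubric is meant to enforce. The number does not replace the notes; it only makes the ratio explicit.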

Designing that rubric is the part teams get wrong most, because it runs against instinct, and the interviewers themselves have to be calibrated to use it, or the slick demo wins by reflex anyway. In my own hiring, the rubric is the artifact I spend the most time on, more than the questions, because it is what makes a panel agree on what good looks like. Who owns that standard matters too; the head-of-AI hiring decision sets whether the whole function rewards demos or durability. If you want help designing an interview loop and a rubric for AI roles, my door is open.

Red flags and green flags

Pull it into signals you can use in the room. Green flags: the candidate brings up failure modes unprompted; they talk about evaluation as a normal part of building, not an afterthought; they can name a time a model surprised them in production and what they changed. Red flags are sharper. A candidate who has no opinion on hallucination, or has never thought about what happens when a model reads hostile input (prompt injection, a real and documented attack class), is missing the core of the job. So is one whose every answer is a capability and never a limitation. The most expensive red flag is the subtle one: the candidate who is brilliant on model capabilities and treats reliability as someone else’s problem. That person builds impressive demos and ships fragile systems, and they interview extremely well, which is exactly why a rubric that scores failure-mode thinking, not dazzle, has to be the thing that decides.

Two things round out the playbook. First, references are unusually useful for this role, if you ask the right question. Do not ask “was she good?”; ask “what broke on something she built, and how did she handle it?” A reference who can answer that is confirming the failure-mode track record; one who cannot is telling you the candidate’s production exposure is thinner than the resume suggested. Second, remember you can grow this person as well as hire them. A strong software engineer with real curiosity can learn the reliability craft, and sometimes the best move is to hire for the engineering base and the mindset, then develop the AI reliability skills in the role. The interview still applies. You are just reading it for trajectory instead of finished expertise.

So the whole playbook compresses to one move: stop testing for who can build an AI demo and start testing for who can build an AI system that survives. The take-home reveals it, the rubric scores it, the flags confirm it. Hire that way and you also change what the role attracts over time, because the structure of an AI team and the bar it hires at compound on each other. The broader hiring guidance for AI roles rests on the same foundation as this post: the rare skill is not making AI look good in a room. It is making AI dependable in production, and an interview that does not test for that is a pleasant conversation with the wrong outcome.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
