Stop telling Claude it is an expert: describe the work, not the worker

If you remember nothing else:

Anthropic’s own prompt guidance is that role prompts control “behavior and tone.” Not accuracy. Not reasoning.
Zheng et al. (EMNLP 2024), Hu et al. (USC, March 2026), and Playing Pretend (Dec 2025) all show that persona prompts do not improve factual QA on modern models. Some show they hurt.
Persona prompts narrow the response surface. Newer models are increasingly good at leaving the lane. Persona prompting caps that.
The fix is workflow-first prompts: describe the work, not the worker. Pull in adjacent disciplines as needed.

Mid-2026 update: the models I called “coming this summer” arrived. Anthropic announced Claude Fable 5 and Mythos 5 on June 9, 2026, a Mythos class that sits above the Opus tier and is pitched at the hardest knowledge work. That cuts the same way the argument below predicts: the more capable the model, the more a persona prompt costs you by pinning it to one lane. Nothing here needs walking back. The trend got stronger.

I taught a class last week. One of the students was a studio owner renewing a commercial lease. She prompted Claude with “you are a commercial real estate lawyer” and pasted in her 47-page document. Claude went and pulled comparable rents from buildings on her block, work a broker does, not a lawyer. Her actual broker confirmed the numbers were spot on.

The persona told Claude to be a lawyer. Claude did the lawyer work and then did some broker work anyway.

This happens more than people realize, and it is the argument against persona prompting. The persona did not help her. The workflow wording inside the prompt did. As models keep getting better at cross-domain reasoning, the persona constraint becomes more of a tax and less of a benefit.

What persona prompting really does, per Anthropic’s own docs

Open the Anthropic prompt engineering docs and the line on role prompting is unambiguous. “Setting a role in the system prompt focuses Claude’s behavior and tone for your use case” (platform.claude.com).

Behavior and tone. Not accuracy. Not reasoning. Anthropic itself is not claiming persona improves answers. Most prompt-engineering posts ignore this and pitch personas as accuracy boosters anyway.

The canonical pro-persona paper is Kong et al. 2023 (arXiv:2308.07702), which is worth reading in context. It ran on GPT-3.5 and Llama 2, models that were state of the art in their time but are now three generations behind any production deployment that matters. At that vintage, persona prompts did improve zero-shot reasoning. They had something to add because the underlying models were weak at reasoning without a strong wording nudge. The persona acted as a kind of scaffolding that pushed the model toward more structured output, and the structured output was what carried the accuracy improvement, not the role label itself. That paper is the source of nearly every “you are an expert X” recommendation circulating today, even though the models it studied are no longer the models anyone is running in production. The recommendation outlived the conditions that made it work.

Run the same experiments on GPT-4 and later, and the effect mostly vanishes.

Why it gets worse, not better, as models improve

[Figure: decision tree comparing the narrow response surface of a persona prompt with the wide response surface of a workflow prompt]

Three recent papers tell the same story.

Zheng, Pei, and colleagues presented When “A Helpful Assistant” is Not Really Helpful at EMNLP 2024. They tested 162 personas against 2,410 questions across four model families. Adding personas to system prompts did not improve factual QA accuracy. In multiple cases they hurt. The authors had to reverse the abstract of their 2023 preprint after running the larger study.

Hu, Rostami, and Thomason at USC published Expert Personas Improve Alignment But Damage Accuracy in March 2026. On MMLU, the base score was 71.6%. Minimal persona dropped it to 68.0%. Detailed expert persona dropped it to 66.3%. The harder the persona tried to be expert, the worse the answer.
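To make the size of those drops concrete, here is the arithmetic as a small sketch. The three scores are the ones quoted above. The variable names are mine, and the relative-drop figure is my own illustration, not a number taken from the paper.

```python
# MMLU scores as quoted in the paragraph above.
baseline = 71.6
persona_scores = {
    "minimal persona": 68.0,
    "detailed expert persona": 66.3,
}

drops = {}
for name, score in persona_scores.items():
    # Absolute drop in percentage points, rounded to hide float noise.
    drops[name] = round(baseline - score, 1)
    # Relative drop as a share of the baseline score (illustrative only).
    relative = round(100 * (baseline - score) / baseline, 1)
    print(f"{name}: -{drops[name]} points ({relative}% of baseline)")
```

The detailed persona costs 5.3 points against 3.6 for the minimal one, which is the “harder it tries, the worse it gets” pattern in numbers.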

December 2025 brought Playing Pretend, which ran the same kind of experiment on GPT-4o, o3-mini, o4-mini, and Gemini. None improved with expert personas.

The pattern is consistent. Older models (GPT-3.5 and earlier) had weak reasoning by default. A persona prompt nudged them toward more structured output, and the structure helped. Newer models reason better out of the box. The structure is already there. The persona prompt now does mostly one thing, which is narrow the response surface to one lane.

That sounds abstract. It looks like this in practice: a lawyer persona produces output focused on legal clauses and statutes. A broker persona produces output focused on rent comparables and market data. A workflow-first prompt that names the task, “review this lease and tell me what I should ask my lawyer about,” produces output that crosses both lanes because the model is no longer being told to stay in one.

There is a second result in the literature worth knowing about. The 2024 paper Persona is a Double-edged Sword measured the downside directly. Role-play prompts degraded reasoning in 7 of the 12 datasets the authors tested on Llama 3. Their fix was telling: run a persona prompt and a plain prompt side by side, then keep whichever answer holds up better. If you have to hedge a persona against a neutral prompt to claw back the loss, the persona is about as likely to cost you as to help you. Workflow-first prompts sidestep that bet altogether because they don’t ask the model to commit to a role identity that might be wrong for the question.

The other thing happening at the model level is that newer Claude releases (Opus 4.5 onward) have been tuned for restraint and verifiability rather than performance theater. A persona told to be the best at X tends to perform confidence even when the answer is uncertain. The restraint tuning fights with that performance. The persona prompt asks Claude to act sure; the safety training asks Claude to flag uncertainty. The output you get is the compromise between those two signals, and it is usually worse than either pure mode would have been on its own. A workflow-first prompt skips that compromise by never asking Claude to commit to a role it has to defend. The safety training and the workflow prompt point in the same direction: do the work, report uncertainty where it exists, do not pretend to know what you do not know. Persona prompts pull the model the other way.

The lease story, and what cross-domain reasoning looks like in practice

Back to the studio owner.

Her prompt was “you are a commercial real estate lawyer. Review my lease and check the comparables.” Standard persona pattern. The output included four points.

First, she had to renew by July 1 (90 days before the lease end). Lawyer work.

Second, the notification had to be a certified letter, not a phone call or email. Lawyer work.

Third, the rent per square foot on her block ranged from $X to $Y, and her current rate was below the median. Broker work. Specifically: the kind of comparable-rents pull that brokers charge for.

Fourth, the fair-market analysis suggested she had room to negotiate the renewal upward. Pure judgment, pulling the first three points into a single recommendation.

Her broker confirmed the comparable rents were spot on, the kind of analysis he would have charged her several hundred dollars for. The lawyer persona did not stop Claude from doing the broker work. It just made the broker work surprising when it showed up.

The reading is straightforward but worth working through carefully. Even with the persona constraint, Claude crossed lanes when the workflow inside the prompt required it. On a less explicit prompt, it might have crossed further or earlier. The persona did not prevent the crossing; it just made the crossing look like an accident instead of intent. The persona did not help her. The workflow wording in the prompt did. Future models will lean even harder into cross-domain reasoning, because cross-domain reasoning is the capability that distinguishes one generation of model from the next, and the persona constraint will increasingly be the thing that holds them back from doing it. The cost of a persona prompt is what you give up by telling a smarter model to stay in one lane when it could have crossed three lanes for you. That cost grows every six months, and the curve is not in your favor.

I have spent 11 years building Tallyfy around the idea that workflows beat roles. You don’t tell Tallyfy “be an HR director.” You tell it “onboard this new hire.” The workflow-versus-process distinction was my first lesson there, and it applies one-for-one to LLM prompts. The same principle that makes a workflow tool work is the principle that makes Claude work better.

How to write workflow-first prompts instead

Describe the work, not the worker. Three rules.

Rule one: lead with the task. Bad opening: “You are a senior tax accountant.” Better opening: “Review my Q4 books.” The model already knows what good Q4 book review looks like. You don’t need to tell it.

Rule two: name what you want flagged. “Flag any entries that look unusual, name the category, and explain why a CPA would flag them at audit.” Specificity beats persona.

Rule three: explicitly invite adjacent disciplines. “Pull in adjacent disciplines if relevant: bookkeeping, IRS rules, internal controls.” This single line is the inverse of the persona constraint. You are telling the model that crossing lanes is welcome and naming which lanes you want it to cross into.

A workflow-first version of the studio owner’s prompt would have been: “Review this commercial lease. Identify deadlines and notification requirements. Compare the rent per square foot against typical rates for similar buildings if you can. Tell me what I should ask my lawyer about, what I should ask my broker about, and what I should negotiate myself.” That prompt names the work, invites multi-discipline reasoning, and produces the same four points plus a clean handoff to the humans she pays.
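If you build prompts in code, the three rules can be sketched as a small template. This is a minimal sketch, not a recommended library: `WorkflowPrompt` and its fields are hypothetical names I made up for illustration, and the example text is the accounting prompt from the rules above.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowPrompt:
    """Hypothetical container for the three rules; not part of any library."""

    task: str                                              # Rule one: lead with the task
    flags: list[str] = field(default_factory=list)         # Rule two: what to flag
    disciplines: list[str] = field(default_factory=list)   # Rule three: adjacent lanes

    def render(self) -> str:
        """Join the pieces into one prompt, with no role statement anywhere."""
        parts = [self.task.strip()]
        parts.extend(flag.strip() for flag in self.flags)
        if self.disciplines:
            parts.append(
                "Pull in adjacent disciplines if relevant: "
                + ", ".join(self.disciplines)
                + "."
            )
        return " ".join(parts)


prompt = WorkflowPrompt(
    task="Review my Q4 books.",
    flags=[
        "Flag any entries that look unusual, name the category, "
        "and explain why a CPA would flag them at audit."
    ],
    disciplines=["bookkeeping", "IRS rules", "internal controls"],
)
print(prompt.render())
```

The design choice is that the template has no slot for a role. If someone wants a persona, they have to add it on purpose.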

Try the same exercise on a hiring task. Old way: “You are a senior recruiter. Help me write a job description for a head of operations.” New way: “Help me write a job description for a head of operations at a 200-person SaaS company. The role needs to cover process design, vendor management, hiring, and cross-functional coordination. Tell me what should be in the JD, what red flags to watch for in applicants, and the comp band for this role at this company size. Pull in adjacent disciplines: HR norms, ops benchmarks, recruiter market intel.” The second version produces a JD plus a comp band plus a screening rubric plus a sourcing recommendation. The first version produces a JD. The difference is that the second prompt names the adjacent disciplines you want Claude to pull in, and Claude does. The recruiter persona stays in lane. The workflow wording lets HR, ops, and finance walk in if needed.

Workflow-first prompts also degrade more gracefully when the model is uncertain. A persona that doesn’t know an answer will often invent one to maintain the role (the performance problem above). A workflow-first prompt that doesn’t have enough information tends to ask clarifying questions instead. The behavior I want from Claude on a hard question is “what else do you need from me?” The behavior I get from a persona prompt is “as a senior accountant I would tell you…” even when the senior accountant should have asked.

This is the same change you make when you go from job-title management to process-aware prompting. It is also the foundation of chain-of-thought prompting for business users and the Claude-specific dos and don’ts I have written about elsewhere.

When persona still wins, the narrow counter-cases worth knowing

I am not claiming persona prompts never work. Four cases keep coming up where they do.

Safety and red teaming. Persona prompts measurably help in safety evals. A “Safety Monitor” persona has been shown to boost JailbreakBench scores by 17.7 percentage points. The persona is doing what it does best, which is constraining behavior into a narrow lane. Here you want the lane.

Tone matching. If you need Claude to write in the voice of a specific author, a specific brand, or a specific role’s typical register, persona is the right tool. This is exactly what Anthropic’s “behavior and tone” wording covers.

Creative voice. Fiction writing, screenwriting, marketing copy where the character’s perspective matters. The persona becomes a creative constraint, not a knowledge constraint.

Extraction tasks. ExpertPrompting, the auto-generated detailed persona approach, has small wins on narrow extraction tasks where the persona definition itself provides task-specific structure. The benefit there comes from the structure, not the role label.

Outside these cases, default to workflow-first. The instinct to tell Claude what role it is comes from how we think about humans. You hire for a job title. You brief by job title. You organize by job title. AI is not a human. It does not need to be told what its job title is. It needs to be told what work you need done.

In advisory work I keep running into the same pattern. A 200-person company has a prompt library with 40 templates. Maybe 30 of those templates open with “you are a senior X.” Converting the library to workflow-first is one of the highest-impact moves available in the first month of an AI rollout, and it costs nothing except a few hours of editing. The model quality is the same. The prompts are shorter. The outputs are broader. And the team stops asking why Claude keeps missing the cross-domain points the persona told it to ignore.
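A quick way to size that editing job is to scan the library for persona openers. The sketch below assumes your templates are available as a name-to-text mapping. The regular expression is a rough heuristic of mine, and the three sample templates are invented for illustration.

```python
import re

# Assumed heuristic: a template "opens with a persona" when its text starts
# with "you are a/an/the ...". Adjust the pattern to your library's wording.
PERSONA_OPENER = re.compile(r"^\s*you\s+are\s+(a|an|the)\b", re.IGNORECASE)


def find_persona_templates(templates: dict[str, str]) -> list[str]:
    """Return the names of templates whose text opens with a persona statement."""
    return [name for name, text in templates.items() if PERSONA_OPENER.match(text)]


# Hypothetical library; in practice you would load your own templates.
library = {
    "q4_review": "You are a senior tax accountant. Review my Q4 books.",
    "lease_review": "Review this commercial lease. Identify deadlines.",
    "jd_writer": "you are a senior recruiter. Help me write a job description.",
}

flagged = find_persona_templates(library)
print(f"{len(flagged)} of {len(library)} templates open with a persona: {flagged}")
```

The list it returns is only a starting point. A person still has to decide which flagged templates are tone or safety cases where the persona should stay.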

Test the change on a single workflow before you redo the library. Pick a prompt your team uses three times a week. Strip the persona. Replace it with a workflow description. Run both versions on the same input for a week and compare what you get. The data will tell you which works better for your specific task, and the data will tell you more than this post can, because the persona-versus-workflow tradeoff is task-dependent and small variations in your specific domain matter more than the abstract pattern does. If the persona version wins, keep it. Anthropic was right that role focuses behavior and tone, and some tasks need that combination of constraint and voice. For most tasks, workflow-first wins on a metric that matters in practice: it does not constrain Claude to a role description that was wrong half the time anyway, and it costs nothing to find out which half is which.
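The side-by-side test can be sketched as a small harness. Everything here is an assumption for illustration: `run_model` stands in for however you call your model, `score` stands in for whatever metric your team trusts, and the stub versions below exist only so the sketch runs. The stub ignores the prompt, so its identical scores say nothing about which style wins.

```python
from typing import Callable


def compare_prompts(
    inputs: list[str],
    persona_prompt: str,
    workflow_prompt: str,
    run_model: Callable[[str, str], str],
    score: Callable[[str], float],
) -> dict[str, float]:
    """Run both prompt versions on the same inputs and average a score for each."""
    if not inputs:
        raise ValueError("need at least one input to compare")
    totals = {"persona": 0.0, "workflow": 0.0}
    for text in inputs:
        totals["persona"] += score(run_model(persona_prompt, text))
        totals["workflow"] += score(run_model(workflow_prompt, text))
    return {name: total / len(inputs) for name, total in totals.items()}


def stub_model(prompt: str, text: str) -> str:
    # Placeholder: ignores the prompt and echoes the input. Swap in a real call.
    return text


def word_count(output: str) -> float:
    # Placeholder metric: replace with a rubric or human rating you trust.
    return float(len(output.split()))


results = compare_prompts(
    inputs=["lease renewal notice", "q4 ledger export"],
    persona_prompt="You are a senior tax accountant.",
    workflow_prompt="Review my Q4 books.",
    run_model=stub_model,
    score=word_count,
)
print(results)
```

The part that matters is the shape: same inputs, both prompts, one metric chosen before you look at the outputs.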


About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
