Vibe coding do's and don'ts for people who actually ship products
Andrej Karpathy coined vibe coding, then hand-coded his next serious project because AI agents were "net unhelpful." The METR study found developers were 19% slower with AI while believing they were 20% faster. Here are the do's and don'ts that matter when you need to ship.

Quick answers
Is vibe coding worth it? For prototypes and internal tools, absolutely. For production code touching money or health data, not without serious review.
Does it actually make you faster? The METR randomized trial found experienced developers were 19% slower with AI tools, despite believing they were 20% faster. That is a 39-point perception gap.
What is the biggest risk? Distribution, not code quality. Apple is now pulling vibe-coded apps from the App Store. If everyone can build, the only moat is getting your product in front of people who pay.
What should I do differently? Write a specification before you write a prompt. Martin Fowler calls this spec-as-source. The spec is becoming the actual source code.
Andrej Karpathy posted a tweet in February 2025 that got 4.5 million views. “There is a new kind of coding I call vibe coding,” he wrote, “where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He said he just hit Accept All in Cursor without reading the diffs. Collins Dictionary named it Word of the Year.
Eight months later, Karpathy hand-coded his next serious project, Nanochat, from scratch. His explanation? “I tried to use Claude and Codex agents a few times but they just didn’t work well enough at all and net unhelpful.”
The inventor of vibe coding chose not to vibe code.
That should give everyone pause. Not because vibe coding is rubbish. It’s brilliant for the right situations. But the gap between a weekend prototype and a product people pay for is wider than most founders think, and that gap is where all the interesting decisions live.
What the research actually shows
I keep hearing “AI makes you 10x faster” from people selling AI tools. The controlled research tells a different story.
The METR study from July 2025 is the most rigorous test I have seen. Sixteen experienced open-source developers worked on 246 real issues in repositories with 22,000+ stars and over a million lines of code. Randomized controlled trial. Compensation at $150 per hour so nobody was cutting corners. They used Cursor Pro with Claude Sonnet.
The result? AI tools made them 19% slower. Not faster. Slower.
But here is where it gets properly weird. Before starting, the developers predicted AI would reduce their time by 24%. After finishing, they still believed AI had made them about 20% faster. That is a 39-percentage-point gap between what they felt and what actually happened. Domenic Denicola, who maintains jsdom on the Google Chrome team, described the experience as “more engaging, like an interactive game” despite being measurably slower.
Now, this doesn’t mean AI coding tools are useless. A Microsoft and Accenture study across 4,867 developers showed a 26% increase in completed tasks. The GitHub Copilot study found 55.8% faster completion on a specific JavaScript task. Context matters enormously.
Fastly surveyed 791 developers in August 2025 and found senior developers with 10+ years of experience ship 2.5 times more AI-generated code than juniors. But 95% of all developers spend extra time fixing what AI produces. TechCrunch framed it as seniors becoming “AI babysitters.” One developer compared working with AI code to “hiring your stubborn, insolent teenager to help you do something.”
The implication is sort of counterintuitive. Vibe coding works best for people who already know what good code looks like. The ones who can catch the mistakes. For people who can’t tell good code from bad code, it’s the most dangerous tool in the shed.
Specification is the new competitive advantage
Here is something that changed how I think about this. Martin Fowler wrote about spec-as-source, a concept where specifications become the actual source code. Code gets marked // GENERATED FROM SPEC - DO NOT EDIT. The specification IS the source. Humans never touch the generated code.
This inverts 60 years of software engineering. Code used to be the artifact you cared about. Now the specification is the artifact, and code is just the build output.
An academic paper on vibe coding from December 2025 identified “lack of explicit design rationale” as the root cause of vibe coding failures. Not the AI itself. The missing spec. When AI can’t hold an entire codebase in context, the specification becomes the persistent memory layer that keeps everything coherent.
Simon Willison put it sharply in March 2025: “If an LLM wrote the code for you, and you then reviewed it, tested it thoroughly and made sure you could explain how it works to someone else, that isn’t vibe coding, it’s software development.” His golden rule: “I won’t commit any code to my repository if I could not explain exactly what it does to somebody else.”
Later that year he coined vibe engineering: “seasoned professionals accelerate their work with LLMs while staying proudly and confidently accountable for the software they produce.”
The practical implication? Before you write a prompt, write a spec. Requirements. Constraints. Edge cases. Acceptance criteria. What should happen when things go wrong. AI is brilliant at implementation. It’s terrible at knowing what to implement.
In my teaching, I keep seeing the same pattern. Students who spend 30 minutes writing a clear specification before prompting get dramatically better results than students who spend two hours iterating on prompts without one. The spec is the competitive advantage now, not the code.
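What does a spec like that look like in practice? As a hedged illustration, here is one way to capture it as structured data before prompting. The field names and the password-reset example are my own, not a standard format or anything from Fowler's article; the point is that the four sections from the paragraph above become a checkable artifact rather than a vague intention.

```python
# Hypothetical spec structure -- field names are illustrative, not a standard.
spec = {
    "feature": "password reset",
    "requirements": [
        "user requests a reset link via email",
        "reset link expires after 30 minutes",
    ],
    "constraints": [
        "tokens are single-use",
        "no user enumeration: identical response for unknown emails",
    ],
    "edge_cases": [
        "expired token shows a clear error, not a stack trace",
    ],
    "acceptance_criteria": [
        "a valid token sets a new password exactly once",
    ],
}

def spec_is_complete(s: dict) -> bool:
    """Reject a spec missing any of the four sections, or with empty ones."""
    required = {"requirements", "constraints", "edge_cases", "acceptance_criteria"}
    return required <= s.keys() and all(s[k] for k in required)

print(spec_is_complete(spec))  # True
```

Even this toy version forces the questions AI can't answer for you: what counts as done, and what should happen when things go wrong.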
Everyone can build now. Almost nobody can distribute.
This is the elephant in the room.
Apple began pulling vibe-coded apps from the App Store in March 2026, citing Guideline 2.5.2 on software quality. Apps built with Replit, Vibecode, and other AI tools were affected. The distribution platform itself is the gatekeeper, and the gatekeeper just got pickier.
Think about what this means. If you can’t get into the App Store, your ability to build is completely irrelevant.
A developer on DEV Community put it bluntly: “If you are vibe-coding a generic AI wrapper, you are walking straight into a meat grinder against incumbents with 100x the capital, distribution, and brand recognition. Execution is cheap. Defensibility is hard.”
Honestly, this matches what I have seen firsthand. In building Tallyfy, the product was maybe 20% of the work. Distribution, sales, building trust, customer success, and simply convincing people you will still exist in two years took the other 80%. The same 80/20 rule shows up everywhere.
The Stack Overflow 2025 survey found that 84% of developers are using AI tools, but trust in AI accuracy fell from 40% to 29%. The number one frustration, cited by 45% of respondents? “AI solutions that are almost right, but not quite.”
Before you vibe code anything, ask one question: if this works perfectly, do I have a plan to get it in front of people who will pay for it? If the answer is no, you are optimizing the wrong end of the problem.
The security and cost reality
The numbers here are genuinely alarming.
Veracode tested 100+ AI models across 80 coding tasks in July 2025. When given a choice between secure and insecure code, AI chose the insecure option 45% of the time. Java was worst at 72% failure. Cross-site scripting? 86% failure rate. Log injection? 88%.
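To make that log-injection number concrete, here is the shape of the bug, sketched in Python. This is my own illustrative example, not code from the Veracode study: if untrusted input reaches a log line unescaped, an attacker can forge entries by embedding newlines. The fix is one line of sanitisation, which is exactly the kind of line AI-generated code tends to omit.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("auth")

def log_login_insecure(username: str) -> None:
    # Vulnerable: a username like "bob\nINFO:auth:admin logged in"
    # writes a forged second line into the log.
    log.info("login attempt: %s", username)

def log_login_safe(username: str) -> str:
    # Escape CR/LF before the value reaches the log,
    # so attacker input cannot fabricate log entries.
    sanitized = username.replace("\r", "\\r").replace("\n", "\\n")
    log.info("login attempt: %s", sanitized)
    return sanitized

result = log_login_safe("bob\nINFO:auth:admin logged in")
# result contains a literal backslash-n, not a real newline,
# so the attempt stays on a single log line.
```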
It gets worse at the platform level. A security scan of Lovable’s showcase apps found 170 out of 1,645 apps had critical security flaws. The root cause was a row-level security misconfiguration in Supabase that every app inherited from the platform. Full names, email addresses, phone numbers, payment information, and API keys were exposed. It became CVE-2025-48757.
Escape.tech ran an even larger scan across 5,600 apps in October 2025. They found over 2,000 vulnerabilities, 400 exposed secrets, and 175 instances of personally identifiable information including medical records. Across multiple platforms, not just one.
The Tea app breach from July 2025 should be a case study in every CS class. A women’s dating safety app built by a developer with six months of experience using AI tools. It exposed 72,000 images including 13,000 government IDs and 1.1 million private messages covering divorce, abortion, and sexual assault. Firebase left open with defaults. Photos had location metadata mapping to military bases. Multiple class action lawsuits followed.
And then there is the Jason Lemkin incident. The SaaStr founder ran a 12-day experiment with Replit's AI agent. On day nine, despite ALL CAPS instructions not to make changes during a code freeze, the agent wiped a production database containing records on 1,206 executives, then created 4,000 fictional ones. When asked to roll back, the agent insisted recovery was impossible. Lemkin tried the rollback anyway. The data was there.
The Google DORA 2025 report found that across the industry, 90% AI adoption correlated with 9% more bugs, 91% more code review time, and 154% bigger pull requests. GitClear analyzed 211 million changed lines and found refactoring dropped from 25% to under 10% of all changes. AI doesn’t refactor. It duplicates. Code blocks with five or more copies increased eightfold.
Mind you, these are not arguments against using AI for code. They are arguments against using it without a safety net.
The practical decision framework
After looking at the research, talking to teams, and watching my students ship projects, here is how I think about it. I might be wrong on the edges, but the core holds up.
The reason a decision framework matters is that most people don’t know which mode they’re operating in. The METR study showed a 39-percentage-point gap between what developers believed about their productivity and what actually happened. That same perceptual blindness applies to risk assessment. Founders vibe-coding a payment flow genuinely believe they’re being careful because the code looks reasonable. But AI-generated code routinely skips input validation, hardcodes secrets, and ignores edge cases that experienced developers catch instinctively. The question isn’t whether you’re using AI. It’s whether you have the expertise to evaluate what AI produces, and the honesty to admit when you don’t. Without that self-awareness, the framework below is just a list you’ll ignore.
Vibe code freely. Prototypes. Internal tools. Personal projects. Learning. Hackathons. Proof-of-concepts. Throwaway experiments. Anything where the worst outcome of a bug is “we start over.”
Vibe code carefully, with review. MVPs for user testing. Internal dashboards. Automation scripts. Content sites. Tools for your own team. Have someone who knows what they are doing read the code before it touches real users.
Do not vibe code. Anything touching money or payments. Health data. Legal compliance. Authentication and authorization. Production databases. Anything Apple or Google will review for their app stores. Anything your customers trust you with.
Companies are already building governance around this. The IAPP reported that organizations are establishing “No-Fly Zones” where vibe coding is explicitly forbidden: pricing engines, payment processing, proprietary algorithms.
But the most useful mental model is Simon Willison’s spectrum. Three modes exist on a continuum:
- Vibe coding: AI generates, nobody reviews. Accept All. YOLO. Fine for throwaway projects.
- AI-assisted coding: AI generates, human reviews every line. The golden rule applies. This is where most professional work should happen.
- AI-directed coding: Human architects the system, writes the spec, AI implements the details under supervision. This is where the real productivity gains live.
Know which mode you are in. The problems start when people think they are in mode two but they are actually in mode one.
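Purely as an illustration, the framework above collapses into a lookup. The category strings and mode names follow the lists in this article; the function itself is hypothetical, and the useful part is the last line: anything not explicitly cleared defaults to the cautious mode.

```python
# Hypothetical helper mapping the article's categories to its three tiers.
NO_GO = {"payments", "health data", "authentication", "production database",
         "legal compliance", "app store release"}
REVIEW_FIRST = {"mvp", "internal dashboard", "automation script", "content site"}
FREE = {"prototype", "internal tool", "hackathon", "learning", "throwaway"}

def vibe_mode(project_type: str) -> str:
    t = project_type.strip().lower()
    if t in NO_GO:
        return "do not vibe code"
    if t in REVIEW_FIRST:
        return "vibe code carefully, with review"
    if t in FREE:
        return "vibe code freely"
    # Unlisted cases get the cautious default, not the permissive one.
    return "vibe code carefully, with review"

print(vibe_mode("payments"))   # do not vibe code
print(vibe_mode("prototype"))  # vibe code freely
```

The honest version of this function is harder than the lookup, of course: most real projects touch more than one category, and you inherit the strictest tier of anything you touch.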
Andrew Ng said it well at the LangChain Interrupt conference in May 2025: guiding AI is “a deeply intellectual exercise.” He called telling young engineers not to learn programming “some of the worst career advice ever given.” Even the biggest AI advocates still think you need to understand what the code does.
The question isn’t whether to use AI for coding. It’s whether you have the specification, the review process, and the distribution plan that turn AI-generated code into a product someone will actually pay for.
If you’ve already built something and need to figure out hosting, where to host your app after building with AI covers the architecture decisions, platform comparisons, and costs. For the tools themselves, I wrote about the real Claude features that actually matter versus the viral myths, and how Cursor and Copilot solve different problems.
Turns out, those are the same things that mattered before AI. The tools changed. The fundamentals didn’t.
About the Author
Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.
Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.