The Five Levels Are Gated by Verification, Not Intelligence

James Phoenix

Your autonomy level is a property of the repository, not the developer and not the model.

Source: Dan Shapiro – The Five Levels: From Spicy Autocomplete to the Software Factory | Date: January 2026


Dan’s Ladder

Dan Shapiro borrowed the NHTSA self-driving levels and applied them to AI-assisted development. The mapping is clean enough that I now use it as shorthand in conversations.

Level 0 is manual: not a character hits the disk without your approval, and the AI is just a better search engine. Level 1 is assisted: you delegate discrete tasks like unit tests and docstrings. Level 2 is paired: you and the model work in a flow loop inside an AI-native editor. Level 3 is supervised: you stop writing code and start reviewing it, managing multiple agent branches at once. “Your life is diffs.” Level 4 is autonomous: you write specs, leave for 12 hours, and check whether the tests pass. Level 5 is the dark factory: specs go in, software comes out, and humans are neither needed nor welcome on the floor.

It is a good ladder. Where I think most readers will go wrong is in how they expect to climb it.

The Misread

The seductive reading is that the levels are a model capability timeline. Claude 4 got you to level 2, Claude 5 gets you to level 3, and somewhere around 2028 a sufficiently smart model promotes everyone to level 5 for free. Under that reading the correct strategy is to wait.

I think that reading is wrong, and the self-driving analogy actually proves it. Waymo does not run driverless because its model got smart enough to handle any road. It runs driverless inside a geofence: mapped streets, curated operating conditions, remote operators, an environment engineered until the residual risk was acceptable. The car got better, but the deployment level was unlocked by the environment around the car.

Software is the same. Each transition up Dan’s ladder is gated by one question: how cheaply and reliably can your environment tell the agent it is wrong? That is a verification problem, and verification machinery is something you build, not something you wait for. This is the same conclusion I keep arriving at from the harness side in The Anatomy of an Agent Harness: the intelligence sits in the model, the leverage sits in the harness.

What Actually Gates Each Transition

Levels 0 to 2 are free. Moving from manual to assisted to paired requires nothing structural. You install a tool and develop some prompting taste. The feedback loop is your own eyeballs in real time, which is why nearly every developer who has tried these tools has reached level 2 and why level 2 feels like a destination. It is not. It is the last level where your codebase is allowed to be bad.

Level 2 to 3 is the first real wall. “Your life is diffs” only works if a green build means something. The moment you are reviewing five agent branches instead of pairing on one, you have replaced line-by-line attention with sampling, and sampling is only safe when the deterministic checks underneath you are dense: typed contracts at every boundary, a test suite you actually trust, and linting that encodes your taste so you stop re-litigating it in review. Teams that stall at level 3 with “the AI writes code faster than I can review it” do not have a review bottleneck. They have a codebase with no machine-checkable definition of correct, so every diff demands full human attention. The bottleneck was always there; agents just made it legible.
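One way to make "a green build means something" concrete is a single gate that runs every deterministic check and returns the failures. What follows is a minimal sketch, not a description of any particular setup: the tool choices in `DEFAULT_CHECKS` (mypy, ruff, pytest) and the names `run_check` and `gate` are assumptions for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: list[str]) -> CheckResult:
    """Run one deterministic check; exit code 0 counts as a pass."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def gate(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run every check and return only the failures (empty list means green)."""
    results = [run_check(name, command) for name, command in checks.items()]
    return [result for result in results if not result.passed]


# Assumed check set; swap in whatever your repository actually uses.
DEFAULT_CHECKS = {
    "types": [sys.executable, "-m", "mypy", "--strict", "."],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
    "tests": [sys.executable, "-m", "pytest", "-q"],
}
```

Calling `gate(DEFAULT_CHECKS)` on an agent branch gives a reviewer one question to ask before opening the diff: is the failure list empty? The denser the checks behind that list, the safer it is to sample rather than read every line.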

Level 3 to 4 is stateless verification. If you leave for 12 hours, the agent must be able to verify its own work without you in the loop. That means the entire definition of done has to be executable: tests, type checks, behavioral gates in CI, and an environment where parallel runs cannot corrupt each other. This is why I spend what looks like a silly amount of effort on boring infrastructure: worktree port offsets so parallel agents never collide, test suites that run twice against a dirty database and stay green, CI patterns built for non-deterministic workers. None of that is glamorous. All of it is load-bearing for level 4, because an overnight loop without a scoring function does not converge, it drifts (autonomous loops need a scoring function).
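The port-offset idea can be sketched in a few lines. This is an illustrative assumption about how such a scheme might look, not the exact mechanism used here: the service names, the `BLOCK_START` value, and the environment variable naming are all hypothetical. Each worktree gets an integer slot, and each slot owns a contiguous run of ports, so two slots cannot overlap by construction.

```python
SERVICES = ("web", "api", "db")  # hypothetical services each worktree runs
BLOCK_START = 20000              # assumed start of the port range reserved for agents


def ports_for_slot(
    slot: int,
    services: tuple[str, ...] = SERVICES,
    block_start: int = BLOCK_START,
) -> dict[str, int]:
    """Give worktree `slot` its own contiguous run of ports, one per service."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    first = block_start + slot * len(services)
    ports = {name: first + index for index, name in enumerate(services)}
    if any(port > 65535 for port in ports.values()):
        raise ValueError("slot is outside the valid port range")
    return ports


def env_for_slot(slot: int) -> dict[str, str]:
    """Render the ports as environment variables, e.g. WEB_PORT for 'web'."""
    return {f"{name.upper()}_PORT": str(port) for name, port in ports_for_slot(slot).items()}
```

Deriving ports from a slot index rather than from a hash of the worktree path avoids accidental collisions, at the cost of needing something (a lock file, a directory listing) to hand out slot numbers.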

Level 4 to 5 is QA so complete that humans add no information. Dan’s dark factory reference is Fanuc, and the manufacturing history is instructive. Industrial robots existed decades before lights-out factories did. The gap was never robot capability. It was metrology and process control: the factory could not run dark until measurement was automated to the point where a human inspector caught nothing the instruments missed. The software equivalent is a verification stack so dense that human review is statistically redundant. Almost nobody has this, which is why level 5 stays an asymptote. I climb toward it with stochastic methods rather than certainty: run the same review swarm multiple times, sample the latent space repeatedly, treat correctness as something you converge on (Monte Carlo QA).
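A rough sketch of the repeated-sampling idea, under stated assumptions: the reviewer is any non-deterministic callable that returns a list of findings for a diff, and a finding is kept only if it shows up in at least `min_share` of the runs. The names `monte_carlo_review` and `Reviewer`, and the threshold values, are hypothetical; a real review swarm would call a model where this sketch calls a plain function.

```python
import random
from collections import Counter
from typing import Callable, Iterable

# A reviewer takes a diff and a random source and returns its findings.
Reviewer = Callable[[str, random.Random], Iterable[str]]


def monte_carlo_review(
    diff: str,
    reviewer: Reviewer,
    runs: int = 10,
    min_share: float = 0.5,
    seed: int = 0,
) -> dict[str, float]:
    """Run the reviewer `runs` times and keep findings seen in >= min_share of runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    rng = random.Random(seed)
    counts: Counter[str] = Counter()
    for _ in range(runs):
        # Deduplicate within a run so one run counts each finding at most once.
        counts.update(set(reviewer(diff, rng)))
    return {
        finding: seen / runs
        for finding, seen in counts.items()
        if seen / runs >= min_share
    }
```

The returned share is a crude confidence estimate: a finding reported in every run is probably real, while one that appears once in ten runs is more likely noise. Raising `runs` tightens the estimate at the cost of more reviewer calls.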

Your Level Is a Property of the Repo

Here is the part I have felt most concretely. On my own infrastructure, where mypy runs strict, factories generate all test data, and CI gates behavior rather than just compilation, I operate at level 4. I dispatch agents against specs, sleep, and review results in the morning.
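For readers unfamiliar with test data factories, here is a minimal sketch of the pattern using only the standard library. The `User` model and `make_user` helper are hypothetical and stand in for whatever domain objects a repository has. The point is that every call yields valid, unique data, so a test never depends on rows left behind by an earlier run.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    email: str
    name: str
    is_active: bool


def make_user(**overrides: object) -> User:
    """Build a valid User with unique defaults; tests override only what they care about."""
    token = uuid.uuid4().hex[:12]  # random token keeps emails unique across runs
    defaults: dict[str, object] = {
        "email": f"user-{token}@example.test",
        "name": f"Test User {token}",
        "is_active": True,
    }
    return User(**{**defaults, **overrides})  # type: ignore[arg-type]
```

A test that needs an inactive user writes `make_user(is_active=False)` and nothing else. Using a random token instead of a per-process counter is a deliberate choice in this sketch: a counter restarts at 1 on every run, which is exactly what collides with leftover rows in a dirty database.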

Drop me into a legacy client codebase with no types, a flaky test suite, and tribal knowledge living in someone’s head, and I am back at level 2 instantly. Same developer. Same model. Same week. The only variable that changed is the repository.

That observation cuts both ways. It means you cannot buy your way up the ladder with a better subscription. It also means the ladder is climbable today, with current models, because the gate is infrastructure you control. Every hour spent making your codebase more verifiable is an hour spent moving up a level permanently, which is just “building the factory” restated: invest in capacity, not output.

Dan ends his post by disclosing where he is, so I will too. I live at level 3 to 4 depending on the repo, and the honest summary of this entire note is that “depending on the repo” is the whole insight. The models are ready for more autonomy than most codebases can safely grant them. If you want to move up a level, stop prompting harder and start making your definition of correct executable.

