Assume Wrong by Default: Mining LLM Latent Space for Correctness

James Phoenix

A single pass through a coding LLM is a single sample from a probability distribution. You would not ship a system tested once. Do not ship code reviewed once.

Author: James Phoenix | Date: March 2026


Summary

LLM code generation is stochastic. Every output is a sample from a trajectory space, not a deterministic computation. Any single sample has a non-trivial error rate across dimensions you care about: correctness, security, performance, edge cases. The correct default stance is to assume the output is wrong until proven otherwise, then repeatedly prompt the model to audit its own work. Each pass samples a different reasoning trajectory. The compound probability of catching any given defect approaches 1 as passes increase. This is not a hack. It is how probability works. The meta-prompt “find and fix all issues with your work” is the simplest, most effective quality intervention available to anyone using coding agents today.


The Principle

Every time a coding LLM generates output, it is sampling from a high-dimensional probability distribution over possible code. The output is not “the answer.” It is “an answer,” conditioned on the prompt, the context window, and the model’s internal state at generation time.

This means:

  1. The first output is almost never the best output. It is the most likely output given the current context, which is not the same thing.
  2. Different runs surface different failure modes. The model’s attention shifts, different heuristics activate, different edge cases get noticed.
  3. Errors are not uniformly distributed. Some bugs appear in 80% of samples (easy to catch). Others appear in 5% of samples (require many passes to find).

The practical consequence: if you accept the first output without verification, you are accepting whatever error rate that single sample carries. For production code, that is not acceptable.


The Mental Model: Trajectory Space

Think of the LLM as exploring a space of possible code trajectories. Each trajectory is a sequence of decisions: which pattern to use, how to handle errors, what edge cases to consider, how to structure tests.

Trajectory Space
================

         ┌─ Trajectory A (misses null check)
         │
Prompt ──┼─ Trajectory B (catches null, misses race condition)
         │
         ├─ Trajectory C (catches both, adds N+1 query)
         │
         └─ Trajectory D (correct implementation)

A single generation picks one trajectory. You have no way to know if it picked D without checking. But if you run the model again with “review your work and fix all issues,” it samples a new trajectory. The review pass might follow Trajectory B’s reasoning and catch the null check that Trajectory A missed.

The key insight: you are not asking the model to be smarter. You are asking it to sample a different region of the same space. Each sample is an independent draw. Combine enough draws and you cover the space.
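The independent-draws picture can be checked with a tiny Monte Carlo sketch (plain Python, no LLM involved): treat each review pass as a draw that catches a given bug with probability p, and measure how often at least one of n passes catches it.

```python
import random

def caught_in_n_passes(p_catch: float, n_passes: int, rng: random.Random) -> bool:
    """One experiment: n independent review passes, each catching the bug
    with probability p_catch. True if any pass catches it."""
    return any(rng.random() < p_catch for _ in range(n_passes))

def detection_rate(p_catch: float, n_passes: int, trials: int = 100_000) -> float:
    """Estimate P(bug caught) over many simulated experiments."""
    rng = random.Random(42)  # fixed seed so the estimate is reproducible
    hits = sum(caught_in_n_passes(p_catch, n_passes, rng) for _ in range(trials))
    return hits / trials

# A bug visible on only 10% of sampled trajectories:
print(f"1 pass:    {detection_rate(0.10, 1):.0%}")
print(f"10 passes: {detection_rate(0.10, 10):.0%}")
```

No single pass needs to be smart; repeated independent sampling alone drives the detection rate up.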


The Meta-Prompt

The prompt itself is almost embarrassingly simple:

Fix all of the issues with the work that you have been doing.
Look for critical issues and fix these.
Run agent swarms to check all of the work.
Look at integration test suites and ensure we are passing all critical flows.
Ensure we are testing critical flows with real integration tests, not mocks.
Look for performance issues, testing issues, and infrastructure issues.
Keep checking and refining the work.
Think like a principal engineer.

That is it. No sophisticated prompt engineering. No chain-of-thought scaffolding. Just “assume you made mistakes and go find them.”

Why it works: the prompt shifts the model from generation mode (optimizing for completion) to verification mode (optimizing for correctness). These activate different reasoning patterns. A model generating code thinks “what should I build?” A model reviewing code thinks “what could go wrong?” Different question, different trajectory, different bugs surfaced.


The Math

If p is the probability a single pass catches a specific bug, and you run n independent passes:

P(bug caught) = 1 - (1 - p)^n
Passes (n)   p = 0.1   p = 0.2   p = 0.3
     1         10%       20%       30%
     3         27%       49%       66%
     5         41%       67%       83%
    10         65%       89%       97%
    20         88%       99%       99.9%

Even for low-probability bugs (p = 0.1), 10 passes give you 65% detection and 20 passes give you 88%. The probability that a bug survives undetected shrinks geometrically with each pass.

No single pass finds 100% of the truth. But you can mine the latent space until the residual error rate is arbitrarily close to 0%.


Why Developers Skip This

Three reasons:

1. Anchoring bias. The model produced code that looks reasonable. It compiled. Maybe a test passed. The developer anchors on “it works” and moves on. But “it compiled” is not “it is correct.” Compilation checks syntax. It does not check logic, security, performance, or edge cases.

2. Sunk cost of context. The developer spent 20 minutes setting up the prompt, providing context, and getting the first output. Running the meta-prompt again feels like admitting the first pass failed. It did not fail. It did exactly what a single sample does: it gave you one draw from the distribution.

3. Misunderstanding the model’s confidence. LLMs produce fluent, confident-sounding text regardless of whether the underlying code is correct. The model does not know it missed a race condition. It is not withholding information. That failure mode simply was not on the trajectory it sampled.


The Practice

Level 1: Manual Re-prompting

After any significant code generation, immediately re-prompt:

Review everything you just wrote.
Assume it has bugs. Find them. Fix them.
Check edge cases, error handling, security, and performance.

Cost: 2 minutes. Catches the obvious 60-70% of issues.

Level 2: Structured Adversarial Review

Run the full meta-prompt from above. Add specificity for your domain:

Specifically check:
- All database queries use parameterized inputs
- All API endpoints validate input schemas
- All async operations have proper error handling
- All critical flows have integration tests (not mocks)
- No N+1 query patterns
- No hardcoded secrets or credentials

Cost: 5-10 minutes. Catches 80-90% of issues across 3-5 passes.
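To make the first checklist item concrete, here is a minimal sketch of the parameterized style a review pass should insist on, using Python's stdlib sqlite3; the table and values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

user_input = "ada' OR '1'='1"  # attacker-controlled string

# Flagged in review: f"... WHERE name = '{user_input}'" splices the input
# into the SQL, and the OR clause would match every row.
# Passes review: a placeholder lets the driver treat the input as data.
rows = conn.execute(
    "SELECT id FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # no match: the injection payload is just an odd username
```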

Level 3: Automated Hardening Loop

Embed the meta-prompt into a scaffold skill that runs autonomously with exit criteria. See Monte Carlo Quality Assurance for the full implementation.

Cost: Compute only. No human time. Catches 95%+ of issues.
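A minimal sketch of such a loop, assuming a hypothetical `run_agent(prompt)` callable wrapping your agent SDK and a `run_tests()` helper returning a (passed, report) pair; the exit criterion here is two consecutive green test runs within a fixed pass budget.

```python
META_PROMPT = (
    "Fix all of the issues with the work that you have been doing. "
    "Look for critical issues and fix these. "
    "Think like a principal engineer."
)

def harden(run_agent, run_tests, max_passes: int = 10) -> bool:
    """Re-run the meta-prompt until tests pass twice in a row
    or the pass budget is exhausted. Returns True on success."""
    consecutive_green = 0
    for _ in range(max_passes):
        run_agent(META_PROMPT)           # one verification pass = one new trajectory
        passing, report = run_tests()
        consecutive_green = consecutive_green + 1 if passing else 0
        if consecutive_green >= 2:       # exit criterion: two clean runs in a row
            return True
        if not passing:
            run_agent(f"The test suite failed:\n{report}\nFix the failures.")
    return False
```

The two-green-runs criterion guards against a single flaky pass being mistaken for convergence; tune both it and `max_passes` to your risk tolerance.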


When to Apply

Always apply after:

  • Implementing a new feature
  • Fixing a bug (the fix itself may introduce new bugs)
  • Refactoring (especially cross-cutting changes)
  • Any change touching authentication, payments, or data integrity

Skip for:

  • Documentation-only changes
  • Config file updates with no logic
  • Single-line typo fixes

The heuristic: if the change could break something in production, assume it did and verify.


The Deeper Point

This is not about LLMs being unreliable. It is about the fundamental nature of sampling from probability distributions. A single sample from any stochastic process carries irreducible variance. You reduce that variance by taking more samples.

Human code review works the same way. A single reviewer catches some bugs. Two reviewers catch more. Three reviewers catch more still. The difference with LLMs is that additional samples are nearly free in terms of cost and take minutes instead of hours.

The default stance for working with coding LLMs:

  1. Assume the output is wrong until you have evidence it is right.
  2. Run the meta-prompt at least once after every significant generation.
  3. Look at integration tests, not mocks. Mocks test your assumptions. Integration tests test reality.
  4. Think like a principal engineer. Would you ship this if your name was on the incident page?

The error rate of LLM-generated code is a function you control. One pass gets you roughly 70% correctness; three passes, 90%; ten passes, 99%. Choose the number of passes that matches your risk tolerance.
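Inverting the 1 - (1 - p)^n formula gives the pass budget for a target detection rate; a small helper (the 30% per-pass catch rate is an illustrative assumption):

```python
import math

def passes_needed(p_catch: float, target: float) -> int:
    """Smallest n such that 1 - (1 - p_catch)**n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_catch))

# If a single review pass catches a given bug 30% of the time (assumed),
# how many passes for 99% confidence it gets caught?
print(passes_needed(0.30, 0.99))  # 13
```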

