Monte Carlo Quality Assurance: Brute-Force Stochastic Hardening for AI-Generated Code


Repeatedly prompting an agent to “harden everything” for hours works. The question is whether you automate the loop or pay for it with your time.

Author: James Phoenix | Date: March 2026


Summary

After completing a feature with AI coding agents, repeatedly running a hardening prompt (“fix all critical issues, add integration tests, look for bugs and edge cases”) for extended periods dramatically increases the probability of catching real defects. This works because each independent pass samples a different reasoning path, and the probability of catching any given bug approaches 1 as passes increase. The technique is Monte Carlo sampling applied to code quality. It is effective but scales linearly with human time. The principled upgrade is automating the loop into a scaffold-embedded skill with concrete exit criteria, mandatory artifacts, and deterministic test infrastructure.


The Discovery

The workflow looks simple. After implementing a feature, you run a prompt like this:

Fix all critical issues with the work you have done.
Run agent swarms to review all recent changes.
Validate integration test suites cover all critical flows end to end.
Add missing integration tests. No mocks for critical flows.
Look for performance bottlenecks, test runtime issues, infra problems.
Keep iterating until critical flows are tested and features are hardened.
Think like a principal engineer.

Then you run it again. And again. For hours.

And it works. Features that go through this process are significantly more reliable than features that ship after a single pass.


Why It Works: The Math

Each run of the hardening prompt samples a different reasoning path through the model’s probability space. Different runs surface different edge cases, catch different bugs, and explore different failure modes.

If p is the probability that a single run finds a specific critical issue, and n is the number of independent runs, the probability that at least one run catches it is:

P(detected) = 1 - (1 - p)^n

Concrete example. Assume p = 0.2 (20% chance any given run finds a particular bug):

| Runs (n) | P(detected) |
|----------|-------------|
| 1        | 20%         |
| 5        | 67%         |
| 10       | 89%         |
| 20       | 99%         |
| 50       | 99.9986%    |
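Under the independence assumption, the table can be reproduced in a few lines (a minimal sketch; `detectionProbability` is an illustrative name, not from any tooling described here):

```typescript
// Probability that at least one of n independent passes catches a bug
// that each individual pass finds with probability p: P = 1 - (1 - p)^n.
function detectionProbability(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Reproduce the table for p = 0.2.
for (const n of [1, 5, 10, 20, 50]) {
  console.log(`${n} passes: ${(detectionProbability(0.2, n) * 100).toFixed(4)}%`);
}
```

The assumption doing the work is independence: if every pass shares the same blind spot, the runs are correlated and real detection probability is lower than the formula suggests.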

After 10 independent passes, you have near-90% detection probability per bug class. After 20 passes, you are at 99%. This is why “spam the prompt for 6 hours” works. You are driving detection coverage probability toward 1 through repeated sampling.

This is the same principle behind Monte Carlo integration, random testing, and stochastic search. Nothing magical. Just statistics applied to code review.


What You Are Actually Doing

When you “spam the prompt for 6 hours,” you are implicitly performing:

  1. Re-sampling the model many times across different reasoning paths
  2. Exploring different failure surfaces as the model’s attention shifts
  3. Forcing repeated refactoring cycles that compound quality improvements
  4. Increasing probability mass on correct implementations
  5. Surfacing latent bugs that only appear when the model looks from a different angle

Each run is not identical. The model generates different internal reasoning chains, attends to different parts of the code, and applies different heuristics. This variance is the source of the technique’s power.


The Problem: Linear Time Cost

The technique works, but it has a fundamental scaling problem.

Cost = human_hours_per_feature * features_per_week

If hardening takes 6 hours per feature and you ship 3 features per week, that is 18 hours per week spent watching an agent loop. That is 45% of a full work week on supervision.

The current workflow:

Human scheduler -> model -> diff -> rerun -> repeat

What you actually want:

Controller -> swarm -> scoring -> retry -> merge -> report

No human in the loop during execution. Human reads the summary at the end.


The Upgrade: Autonomous Adversarial Search

Instead of sampling language space (different phrasings of “find bugs”), sample state space (different environmental conditions, different inputs, different execution orders).

From Prompt Repetition to Failure Surface Search

The manual approach only varies the model’s reasoning. The automated approach varies everything:

| Manual (Prompt Repetition)   | Automated (State Space Search)            |
|------------------------------|-------------------------------------------|
| Re-run the same prompt       | Generate N adversarial test scenarios     |
| Hope model finds new bugs    | Mutate inputs randomly                    |
| Single execution environment | Force race conditions                     |
| Happy path infrastructure    | Force environment failures                |
| Normal CPU/memory            | Run under CPU throttling                  |
| Fixed test order             | Randomize execution order                 |
| Stable dependencies          | Kill and restart dependencies mid-test    |
| Example-based tests          | Property-based tests                      |
| Single reasoning path        | 3+ independent reasoning passes compared  |

The jump is from sampling language space to sampling state space. That gives exponential detection improvement without exponential human hours.
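As a sketch of what "mutate inputs randomly" can look like in practice (all names here are illustrative; the seeded PRNG keeps each adversarial scenario reproducible so a failing case can be replayed):

```typescript
// Small seeded PRNG (mulberry32) so every fuzz run is reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0; seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Mutations aimed at common failure surfaces, not happy paths.
const MUTATIONS: ((s: string) => string)[] = [
  (_s) => "",                                    // empty input
  (s) => s.repeat(1000),                         // oversized input
  (s) => s + "'; DROP TABLE users;--",           // injection-shaped input
  (s) => "\u0000" + s,                           // control characters
  (s) => s.split("").reverse().join(""),         // scrambled input
];

// Produce `rounds` adversarial variants of a baseline input.
function mutate(input: string, seed: number, rounds: number): string[] {
  const rand = mulberry32(seed);
  const out: string[] = [];
  for (let i = 0; i < rounds; i++) {
    const m = MUTATIONS[Math.floor(rand() * MUTATIONS.length)];
    out.push(m(input));
  }
  return out;
}
```

Feeding each mutated payload through a real critical flow, rather than asserting on one happy-path input, is what moves the search from language space into state space.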


The Hardening Skill

This is the onboarding skill that ships as part of the initial scaffold for new projects. It replaces the manual 6-hour loop with a structured, automatable process.

The Prompt

# Harden: Monte Carlo Quality Assurance

Fix and harden everything you have implemented so far.

## Phase 1: Audit

Audit the entire codebase for critical issues:
- Correctness bugs and broken invariants
- Missing edge cases and unsafe assumptions
- Incomplete features and bad DX
- Security vulnerabilities (OWASP Top 10)

## Phase 2: Adversarial Review

Run agent swarms to review all recent changes and surrounding system.
Treat this as an adversarial principal engineer review, not a happy path review.
Each reviewer agent should take a different perspective:
- Security reviewer
- Performance reviewer
- Reliability reviewer
- DX/API design reviewer

## Phase 3: Integration Test Coverage

Validate integration test suites cover all critical user flows end to end.
Add missing integration tests. No mocks for critical flows.
Every critical flow must have at least one integration test that:
- Exercises the real database
- Exercises real HTTP endpoints
- Verifies actual state changes
- Cleans up after itself deterministically

## Phase 4: Stability

Ensure every critical flow passes locally and in CI.
If anything flakes, make it deterministic. Do not move on until flakes are resolved.

## Phase 5: Performance and Infra

Look for performance bottlenecks, test runtime explosions, infra issues,
and brittle CI or environment coupling. Fix them.

## Exit Criteria

Stop when ALL of the following are true:
- 0 open P0 issues
- Bounded P1 issues (each with a tracking ticket)
- No test flakes in 3 consecutive runs
- All critical flows have integration tests
- Performance regression < 10% vs baseline
- CI passes end to end
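These exit criteria only stop the loop if they are machine-checkable. A minimal sketch of what that check could look like (the interface and every field name are assumptions for illustration, not an existing API):

```typescript
// Hypothetical snapshot of a hardening run's final state.
interface HardeningState {
  openP0: number;
  openP1: number;
  p1WithTickets: number;          // open P1s that have tracking tickets
  consecutiveGreenRuns: number;   // flake-free full-suite runs in a row
  criticalFlows: number;
  flowsWithIntegrationTests: number;
  perfRegressionPct: number;      // vs recorded baseline
  ciGreen: boolean;
}

function exitCriteriaMet(s: HardeningState): boolean {
  return (
    s.openP0 === 0 &&
    s.openP1 <= 3 &&                              // bounded, per the severity rubric
    s.openP1 === s.p1WithTickets &&               // every open P1 is ticketed
    s.consecutiveGreenRuns >= 3 &&
    s.flowsWithIntegrationTests >= s.criticalFlows &&
    s.perfRegressionPct < 10 &&
    s.ciGreen
  );
}
```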

Mandatory Artifacts

Every run of the hardening skill must produce these 4 artifacts. Without them, agents will “review” forever without converging.

1. Critical Flows Inventory

A short list of flows with names and entrypoints.

| Flow | Entrypoint | Type |
|------|-----------|------|
| User sign-up | POST /api/auth/register | HTTP |
| User sign-in | POST /api/auth/login | HTTP |
| Create organization | POST /api/orgs | HTTP |
| Process payment | worker:payment-processor | Job |
| Export report | CLI: tx export | CLI |

2. Risk Register

Every issue found, with evidence and resolution.

| Issue | Severity | Evidence | Fix | Test Added |
|-------|----------|----------|-----|------------|
| SQL injection in search | P0 | `WHERE name = '${input}'` | Parameterized query | `search-injection.test.ts` |
| No rate limit on login | P1 | No middleware on route | Added rate limiter | `login-rate-limit.test.ts` |
| N+1 query in org list | P2 | 47 queries for 10 orgs | Added join | `org-list-perf.test.ts` |

3. Integration Test Plan

For each critical flow: what is tested, how, and what is asserted.

| Test Name | Setup | Action | Assertions | Teardown | Runtime |
|-----------|-------|--------|------------|----------|---------|
| sign-up-happy-path | seed DB | POST /register | 201, user in DB, token valid | delete user | <2s |
| sign-up-duplicate | seed user | POST /register same email | 409, no duplicate | delete user | <1s |
| sign-in-rate-limit | seed user | POST /login x 6 | 429 on 6th attempt | reset limiter | <3s |

4. CI Proof

Links or logs showing tests passing in CI, plus stability improvements.

- CI Run: https://github.com/org/repo/actions/runs/12345
- Status: All green
- Flake fixes: Replaced `setTimeout` with event-driven wait in 3 tests
- Runtime: 4m 12s (down from 7m 30s after parallelization)

Hard rule: Every “fix” must come with an integration test that fails before the fix and passes after.


Determinism Requirements

“Integration tests only” can turn into slow, flaky, expensive pain if you do not design for determinism. The hardening skill is useless if it produces tests that flake.

The 4 Pillars of Deterministic Testing

1. Deterministic Test Data

No shared mutable state between tests. Each test creates its own data and cleans it up.

// BAD: Shared test user that other tests mutate
const testUser = globalFixtures.user;

// GOOD: Each test creates its own
const user = await createTestUser({ email: `test-${uuid()}@example.com` });
// ... test logic ...
await deleteTestUser(user.id);

2. Hermetic Containers

Pin versions. No network access unless explicitly allowed. Tests must pass on an airplane.

# docker-compose.test.yml
services:
  postgres:
    image: postgres:16.2  # Pinned, not :latest
    environment:
      POSTGRES_DB: test
    networks:
      - test-internal  # No external network

  app:
    build: .
    depends_on:
      - postgres
    networks:
      - test-internal
    environment:
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/test

networks:
  test-internal:
    internal: true  # No internet access

3. Time Control

Fake timers only at the boundary if absolutely needed. Otherwise assert on eventual state with bounded retries.

// BAD: Depends on wall clock timing
await sleep(1000);
expect(cache.isExpired()).toBe(true);

// GOOD: Bounded retry with timeout
await waitFor(
  () => expect(cache.isExpired()).toBe(true),
  { timeout: 5000, interval: 100 }
);
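`waitFor` above is assumed to be a bounded-retry helper (test libraries such as Testing Library ship one with a similar signature); a minimal self-contained sketch:

```typescript
// Retry an assertion until it stops throwing or the timeout elapses.
// Rethrows the last assertion error so failures stay diagnosable.
async function waitFor(
  assertion: () => void | Promise<void>,
  { timeout = 5000, interval = 100 }: { timeout?: number; interval?: number } = {}
): Promise<void> {
  const deadline = Date.now() + timeout;
  for (;;) {
    try {
      await assertion();
      return; // assertion passed
    } catch (err) {
      if (Date.now() >= deadline) throw err; // give up with the last error
      await new Promise((resolve) => setTimeout(resolve, interval));
    }
  }
}
```

The key property is that it asserts on eventual state with a hard upper bound, instead of guessing a fixed sleep duration.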

4. Consistent Environment Bootstrapping

Single script, idempotent. Running it twice produces the same result.

#!/bin/bash
# scripts/test-setup.sh - Idempotent test environment setup

set -euo pipefail

# Start containers (idempotent - no-op if already running)
docker compose -f docker-compose.test.yml up -d --wait

# Run migrations (idempotent - skips already-applied)
pnpm db:migrate

# Seed baseline data (idempotent - upserts)
pnpm db:seed:test

echo "Test environment ready"

If the scaffold does not include these 4 pillars, the hardening skill becomes a flake generator.


Severity Rubric

Without a shared definition of severity, agents argue about what matters. Define it once.

| Level | Definition | SLA | Example |
|-------|------------|-----|---------|
| P0 | Data loss, security breach, or system down | Fix before merge | SQL injection, unencrypted passwords |
| P1 | Feature broken for users, no workaround | Fix within sprint, ticket required | Sign-up flow crashes on valid input |
| P2 | Feature degraded, workaround exists | Track in backlog | Slow query on large dataset |
| P3 | Cosmetic or minor DX issue | Fix opportunistically | Inconsistent error message format |

Exit criteria for the hardening skill:

  • 0 open P0
  • <=3 open P1 (each with a ticket)
  • P2/P3 logged but not blocking

The Flake Budget Rule

If a test flakes once, it becomes the top priority until fixed. No exceptions.

Flaky tests destroy trust in the test suite. Once developers learn to ignore red CI, the entire quality gate collapses. One flake is a crack in the foundation.

Flake detected
  -> Immediately quarantine the test
  -> Create P0 ticket
  -> Fix root cause (timing, shared state, network dependency)
  -> Verify 10 consecutive green runs before un-quarantining
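The un-quarantine gate is simple to encode (a sketch; `canUnquarantine` and the boolean run history are illustrative names):

```typescript
// A quarantined test returns to the main suite only after `required`
// consecutive green runs. History is ordered oldest to newest; true = green.
function canUnquarantine(runHistory: boolean[], required = 10): boolean {
  if (runHistory.length < required) return false;
  return runHistory.slice(-required).every((green) => green);
}
```

A single red anywhere in the trailing window resets the clock, which is exactly the "one flake is a crack in the foundation" rule made executable.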

The Autonomous Hardening Loop

The end state. No human in the loop during execution.

1. Feature branch frozen
2. Spawn hardening swarm:
   a. Static analysis pass (lint, types, security scan)
   b. Integration adversarial pass (swarm of test agents)
   c. Performance regression pass (benchmarks vs baseline)
   d. Infrastructure integrity pass (containers, CI, env)
3. Aggregate issues into Risk Register
4. Auto-fix P0 and P1 issues
5. Re-run until exit criteria met:
   - 0 P0
   - Bounded P1 with tickets
   - No test flakes in 3 runs
   - Perf regression < threshold
6. Human reads summary report
7. Merge or escalate

Total human time: 5 minutes reading the report, instead of 6 hours supervising.
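A sketch of the controller driving that loop. Everything here is a hypothetical interface, not a real API: each pass returns the issues it found, and the loop re-runs until the P0/P1 exit criteria hold or an iteration budget is exhausted:

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";
interface Issue { id: string; severity: Severity; ticketed: boolean; }

// Hypothetical pass interfaces mirroring steps 2a-2d and 4 above.
interface HardeningPasses {
  staticAnalysis(): Promise<Issue[]>;
  adversarialIntegration(): Promise<Issue[]>;
  performanceRegression(): Promise<Issue[]>;
  infraIntegrity(): Promise<Issue[]>;
  autoFix(issues: Issue[]): Promise<void>;
}

async function hardenUntilConverged(
  passes: HardeningPasses,
  maxIterations = 20
): Promise<Issue[]> {
  for (let i = 0; i < maxIterations; i++) {
    // Steps 2-3: run all passes and aggregate into one risk register.
    const issues = (
      await Promise.all([
        passes.staticAnalysis(),
        passes.adversarialIntegration(),
        passes.performanceRegression(),
        passes.infraIntegrity(),
      ])
    ).flat();

    const p0 = issues.filter((x) => x.severity === "P0");
    const unticketedP1 = issues.filter((x) => x.severity === "P1" && !x.ticketed);

    // Step 5: exit criteria — no P0s, every remaining P1 ticketed.
    if (p0.length === 0 && unticketedP1.length === 0) return issues;

    // Step 4: auto-fix blocking issues, then re-run everything.
    await passes.autoFix([...p0, ...unticketedP1]);
  }
  throw new Error("Hardening did not converge; escalate to a human");
}
```

The iteration cap matters: without it, a pass that keeps regenerating the same unfixable issue loops forever, and "escalate to a human" is the correct failure mode.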


What You Are Optimizing

The objective function:

E[undetected critical bugs] = Σ P(bug_i exists) * P(bug_i not detected)

The manual prompt loop reduces the second term (detection probability) through repeated sampling.
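Plugging the earlier detection formula into this objective makes the effect of extra passes concrete (a sketch; `BugClass` and its probability fields are illustrative):

```typescript
// One bug class: how likely it exists, and how likely a single
// hardening pass is to detect it.
interface BugClass { pExists: number; pDetectPerPass: number; }

// E[undetected] = sum over classes of P(exists) * (1 - P(detected)),
// with P(detected) = 1 - (1 - p)^n, so P(not detected) = (1 - p)^n.
function expectedUndetected(bugs: BugClass[], passes: number): number {
  return bugs.reduce(
    (sum, b) => sum + b.pExists * Math.pow(1 - b.pDetectPerPass, passes),
    0
  );
}
```

For a bug class with p = 0.2, ten passes shrink its contribution to the expectation by roughly a factor of eight versus a single pass.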

But you can reduce it dramatically more by increasing:

  • Test surface area: More flows covered, more assertions per flow
  • Environmental entropy: Different conditions, different timing, different load
  • Adversarial mutation: Fuzz inputs, kill processes, corrupt state
  • Invariant checking: Formal properties that must hold under all conditions

That gives exponential detection improvement without exponential human hours.


The Meta-Insight

This technique is a way to convert AI compute budget into software reliability. Every dollar spent on repeated hardening passes is a dollar not spent on production incidents, rollbacks, and customer trust damage.

The progression:

Level 1: Manual code review (scales with human hours)
Level 2: Manual prompt repetition (scales with human supervision)
Level 3: Automated adversarial harness (scales with compute)
Level 4: Scaffold-embedded skill (scales with every new project)

Level 2 is where most people discover the pattern. Level 3 is where the leverage appears. Level 4 is where it compounds across your entire portfolio.

Turning budget into reusable process is the move.


Practical Implementation: The Decision

You have two paths:

A) 6 hours per feature, forever. Works. Catches bugs. Scales linearly with your time. Plateaus when you run out of hours.

B) 3 weeks building the adversarial harness once. Then 20 minutes per feature for the rest of the project’s life. The harness improves with every run. Compounds across projects.

If you are building one throwaway prototype, choose A. If you are building production systems or OSS frameworks, choose B.

The hardening skill described in this article is the bridge. It is structured enough to run autonomously (B) but simple enough to run manually while you build the automation (A).

