Repeatedly prompting an agent to “harden everything” for hours works. The question is whether you automate the loop or pay for it with your time.
Author: James Phoenix | Date: March 2026
Summary
After completing a feature with AI coding agents, repeatedly running a hardening prompt (“fix all critical issues, add integration tests, look for bugs and edge cases”) for extended periods dramatically increases the probability of catching real defects. This works because each independent pass samples a different reasoning path, and the probability of catching any given bug approaches 1 as passes increase. The technique is Monte Carlo sampling applied to code quality. It is effective but scales linearly with human time. The principled upgrade is automating the loop into a scaffold-embedded skill with concrete exit criteria, mandatory artifacts, and deterministic test infrastructure.
The Discovery
The workflow looks simple. After implementing a feature, you run a prompt like this:
Fix all critical issues with the work you have done.
Run agent swarms to review all recent changes.
Validate integration test suites cover all critical flows end to end.
Add missing integration tests. No mocks for critical flows.
Look for performance bottlenecks, test runtime issues, infra problems.
Keep iterating until critical flows are tested and features are hardened.
Think like a principal engineer.
Then you run it again. And again. For hours.
And it works. Features that go through this process are significantly more reliable than features that ship after a single pass.
Why It Works: The Math
Each run of the hardening prompt samples a different reasoning path through the model’s probability space. Different runs surface different edge cases, catch different bugs, and explore different failure modes.
If p is the probability that a single run finds a specific critical issue, and n is the number of independent runs, the probability that at least one run catches it is:
P(detected) = 1 - (1 - p)^n
Concrete example. Assume p = 0.2 (a 20% chance that any given run finds a particular bug) and assume runs are independent. Independence is an approximation, since every run shares the same model and the same code, so treat these numbers as an upper bound:
| Runs (n) | P(detected) |
|---|---|
| 1 | 20% |
| 5 | 67% |
| 10 | 89% |
| 20 | 99% |
| 50 | 99.999% |
After 10 independent passes, you have near-90% detection probability per bug class. After 20 passes, you are at 99%. This is why “spam the prompt for 6 hours” works. You are driving detection coverage probability toward 1 through repeated sampling.
This is the same principle behind Monte Carlo integration, random testing, and stochastic search. Nothing magical. Just statistics applied to code review.
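The detection curve is easy to reproduce. A minimal sketch (the function name `detectionProbability` is ours, and the independence assumption from above carries over):

```typescript
// Probability that at least one of n independent passes catches a bug
// that any single pass catches with probability p.
function detectionProbability(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Reproduce the table for p = 0.2.
for (const n of [1, 5, 10, 20, 50]) {
  console.log(`n=${n}: ${(100 * detectionProbability(0.2, n)).toFixed(2)}%`);
}
```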
What You Are Actually Doing
When you “spam the prompt for 6 hours,” you are implicitly performing:
- Re-sampling the model many times across different reasoning paths
- Exploring different failure surfaces as the model’s attention shifts
- Forcing repeated refactoring cycles that compound quality improvements
- Increasing probability mass on correct implementations
- Surfacing latent bugs that only appear when the model looks from a different angle
Each run is not identical. The model generates different internal reasoning chains, attends to different parts of the code, and applies different heuristics. This variance is the source of the technique’s power.
The Problem: Linear Time Cost
The technique works, but it has a fundamental scaling problem.
Cost = human_hours_per_feature * features_per_week
If hardening takes 6 hours per feature and you ship 3 features per week, that is 18 hours per week spent watching an agent loop. That is 45% of a full work week on supervision.
The current workflow:
Human scheduler -> model -> diff -> rerun -> repeat
What you actually want:
Controller -> swarm -> scoring -> retry -> merge -> report
No human in the loop during execution. Human reads the summary at the end.
The Upgrade: Autonomous Adversarial Search
Instead of sampling language space (different phrasings of “find bugs”), sample state space (different environmental conditions, different inputs, different execution orders).
From Prompt Repetition to Failure Surface Search
The manual approach only varies the model’s reasoning. The automated approach varies everything:
| Manual (Prompt Repetition) | Automated (State Space Search) |
|---|---|
| Re-run the same prompt | Generate N adversarial test scenarios |
| Hope model finds new bugs | Mutate inputs randomly |
| Single execution environment | Force race conditions |
| Happy path infrastructure | Force environment failures |
| Normal CPU/memory | Run under CPU throttling |
| Fixed test order | Randomize execution order |
| Stable dependencies | Kill and restart dependencies mid-test |
| Example-based tests | Property-based tests |
| Single reasoning path | 3+ independent reasoning passes compared |
The jump is from sampling language space to sampling state space. That gives exponential detection improvement without exponential human hours.
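To make the "mutate inputs randomly" row concrete, here is a sketch of an input mutator. The `mutate` helper and its four strategies are illustrative, not a full fuzzer:

```typescript
// Hypothetical input mutator: given a valid payload, emit an adversarial variant.
type Payload = Record<string, unknown>;

function mutate(payload: Payload, rng: () => number = Math.random): Payload {
  const keys = Object.keys(payload);
  const key = keys[Math.floor(rng() * keys.length)];
  const strategies: Array<(p: Payload) => Payload> = [
    p => ({ ...p, [key]: null }),                           // null injection
    p => ({ ...p, [key]: "x".repeat(100_000) }),            // oversized value
    p => ({ ...p, [key]: "'; DROP TABLE users; --" }),      // injection probe
    p => { const { [key]: _, ...rest } = p; return rest; }, // missing field
  ];
  return strategies[Math.floor(rng() * strategies.length)](payload);
}

// Generate N adversarial variants of one valid request body.
const variants = Array.from({ length: 20 }, () =>
  mutate({ email: "a@example.com", name: "Ada" })
);
```

Feed each variant to the real endpoint and assert the system rejects it cleanly instead of corrupting state.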
The Hardening Skill
This is the onboarding skill that ships as part of the initial scaffold for new projects. It replaces the manual 6-hour loop with a structured, automatable process.
The Prompt
# Harden: Monte Carlo Quality Assurance
Fix and harden everything you have implemented so far.
## Phase 1: Audit
Audit the entire codebase for critical issues:
- Correctness bugs and broken invariants
- Missing edge cases and unsafe assumptions
- Incomplete features and bad DX
- Security vulnerabilities (OWASP Top 10)
## Phase 2: Adversarial Review
Run agent swarms to review all recent changes and surrounding system.
Treat this as an adversarial principal engineer review, not a happy path review.
Each reviewer agent should take a different perspective:
- Security reviewer
- Performance reviewer
- Reliability reviewer
- DX/API design reviewer
## Phase 3: Integration Test Coverage
Validate integration test suites cover all critical user flows end to end.
Add missing integration tests. No mocks for critical flows.
Every critical flow must have at least one integration test that:
- Exercises the real database
- Exercises real HTTP endpoints
- Verifies actual state changes
- Cleans up after itself deterministically
## Phase 4: Stability
Ensure every critical flow passes locally and in CI.
If anything flakes, make it deterministic. Do not move on until flakes are resolved.
## Phase 5: Performance and Infra
Look for performance bottlenecks, test runtime explosions, infra issues,
and brittle CI or environment coupling. Fix them.
## Exit Criteria
Stop when ALL of the following are true:
- 0 open P0 issues
- Bounded P1 issues (each with a tracking ticket)
- No test flakes in 3 consecutive runs
- All critical flows have integration tests
- Performance regression < 10% vs baseline
- CI passes end to end
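The exit criteria are mechanical enough to encode directly. A sketch of the check, where the `Snapshot` shape and its field names are our assumptions rather than part of the skill:

```typescript
// Hypothetical snapshot of hardening state after one pass.
interface Snapshot {
  openP0: number;
  openP1: { id: string; ticket?: string }[];
  consecutiveFlakeFreeRuns: number;
  missingCriticalFlowTests: string[];
  perfRegressionPct: number; // vs baseline, e.g. 0.07 = 7% slower
  ciGreen: boolean;
}

function exitCriteriaMet(s: Snapshot): boolean {
  return (
    s.openP0 === 0 &&
    s.openP1.every(issue => issue.ticket !== undefined) && // every P1 ticketed
    s.consecutiveFlakeFreeRuns >= 3 &&
    s.missingCriticalFlowTests.length === 0 &&
    s.perfRegressionPct < 0.10 &&
    s.ciGreen
  );
}
```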
Mandatory Artifacts
Every run of the hardening skill must produce these 4 artifacts. Without them, agents will “review” forever without converging.
1. Critical Flows Inventory
A short list of flows with names and entrypoints.
| Flow | Entrypoint | Type |
|------|-----------|------|
| User sign-up | POST /api/auth/register | HTTP |
| User sign-in | POST /api/auth/login | HTTP |
| Create organization | POST /api/orgs | HTTP |
| Process payment | worker:payment-processor | Job |
| Export report | CLI: tx export | CLI |
2. Risk Register
Every issue found, with evidence and resolution.
| Issue | Severity | Evidence | Fix | Test Added |
|-------|----------|----------|-----|------------|
| SQL injection in search | P0 | `WHERE name = '${input}'` | Parameterized query | `search-injection.test.ts` |
| No rate limit on login | P1 | No middleware on route | Added rate limiter | `login-rate-limit.test.ts` |
| N+1 query in org list | P2 | 47 queries for 10 orgs | Added join | `org-list-perf.test.ts` |
3. Integration Test Plan
For each critical flow: what is tested, how, and what is asserted.
| Test Name | Setup | Action | Assertions | Teardown | Runtime |
|-----------|-------|--------|------------|----------|---------|
| sign-up-happy-path | seed DB | POST /register | 201, user in DB, token valid | delete user | <2s |
| sign-up-duplicate | seed user | POST /register same email | 409, no duplicate | delete user | <1s |
| sign-in-rate-limit | seed user | POST /login x 6 | 429 on 6th attempt | reset limiter | <3s |
4. CI Proof
Links or logs showing tests passing in CI, plus stability improvements.
- CI Run: https://github.com/org/repo/actions/runs/12345
- Status: All green
- Flake fixes: Replaced `setTimeout` with event-driven wait in 3 tests
- Runtime: 4m 12s (down from 7m 30s after parallelization)
Hard rule: Every “fix” must come with an integration test that fails before the fix and passes after.
Determinism Requirements
“Integration tests only” can turn into slow, flaky, expensive pain if you do not design for determinism. The hardening skill is useless if it produces tests that flake.
The 4 Pillars of Deterministic Testing
1. Deterministic Test Data
No shared mutable state between tests. Each test creates its own data and cleans it up.
```typescript
// BAD: Shared test user that other tests mutate
const testUser = globalFixtures.user;

// GOOD: Each test creates its own
const user = await createTestUser({ email: `test-${uuid()}@example.com` });
// ... test logic ...
await deleteTestUser(user.id);
```
2. Hermetic Containers
Pin versions. No network access unless explicitly allowed. Tests must pass on an airplane.
```yaml
# docker-compose.test.yml
services:
  postgres:
    image: postgres:16.2  # Pinned, not :latest
    environment:
      POSTGRES_DB: test
    networks:
      - test-internal     # No external network
  app:
    build: .
    depends_on:
      - postgres
    networks:
      - test-internal
    environment:
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/test
networks:
  test-internal:
    internal: true        # No internet access
```
3. Time Control
Fake timers only at the boundary if absolutely needed. Otherwise assert on eventual state with bounded retries.
```typescript
// BAD: Depends on wall clock timing
await sleep(1000);
expect(cache.isExpired()).toBe(true);

// GOOD: Bounded retry with timeout
await waitFor(
  () => expect(cache.isExpired()).toBe(true),
  { timeout: 5000, interval: 100 }
);
```
4. Consistent Environment Bootstrapping
Single script, idempotent. Running it twice produces the same result.
```bash
#!/bin/bash
# scripts/test-setup.sh - Idempotent test environment setup
set -euo pipefail

# Start containers (idempotent - no-op if already running)
docker compose -f docker-compose.test.yml up -d --wait

# Run migrations (idempotent - skips already-applied)
pnpm db:migrate

# Seed baseline data (idempotent - upserts)
pnpm db:seed:test

echo "Test environment ready"
```
If the scaffold does not include these 4 pillars, the hardening skill becomes a flake generator.
Severity Rubric
Without a shared definition of severity, agents argue about what matters. Define it once.
| Level | Definition | SLA | Example |
|---|---|---|---|
| P0 | Data loss, security breach, or system down | Fix before merge | SQL injection, unencrypted passwords |
| P1 | Feature broken for users, no workaround | Fix within sprint, ticket required | Sign-up flow crashes on valid input |
| P2 | Feature degraded, workaround exists | Track in backlog | Slow query on large dataset |
| P3 | Cosmetic or minor DX issue | Fix opportunistically | Inconsistent error message format |
Exit criteria for the hardening skill:
- 0 open P0
- <=3 open P1 (each with a ticket)
- P2/P3 logged but not blocking
The Flake Budget Rule
If a test flakes once, it becomes the top priority until fixed. No exceptions.
Flaky tests destroy trust in the test suite. Once developers learn to ignore red CI, the entire quality gate collapses. One flake is a crack in the foundation.
Flake detected
-> Immediately quarantine the test
-> Create P0 ticket
-> Fix root cause (timing, shared state, network dependency)
-> Verify 10 consecutive green runs before un-quarantining
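Verifying "10 consecutive green runs" is scriptable. A sketch, where `verifyStable` and the test command are illustrative:

```typescript
import { execSync } from "node:child_process";

// Run a quarantined test's command repeatedly; any red run means
// the root cause is not actually fixed.
function verifyStable(command: string, runs = 10): boolean {
  for (let i = 1; i <= runs; i++) {
    try {
      execSync(command, { stdio: "pipe" });
    } catch {
      console.error(`Run ${i}/${runs} failed - keep the test quarantined`);
      return false;
    }
  }
  return true; // every run green: safe to un-quarantine
}
```

Usage might look like `verifyStable("pnpm vitest run quarantined.test.ts")`, adjusted to your runner.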
The Autonomous Hardening Loop
The end state. No human in the loop during execution.
1. Feature branch frozen
2. Spawn hardening swarm:
a. Static analysis pass (lint, types, security scan)
b. Integration adversarial pass (swarm of test agents)
c. Performance regression pass (benchmarks vs baseline)
d. Infrastructure integrity pass (containers, CI, env)
3. Aggregate issues into Risk Register
4. Auto-fix P0 and P1 issues
5. Re-run until exit criteria met:
- 0 P0
- Bounded P1 with tickets
- No test flakes in 3 runs
- Perf regression < threshold
6. Human reads summary report
7. Merge or escalate
Total human time: 5 minutes reading the report, instead of 6 hours supervising.
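The loop above reduces to a small controller. Everything here is a placeholder for real agent infrastructure (`runPass` functions, `autoFix`), not an existing API:

```typescript
// Sketch of the autonomous hardening controller.
interface Issue { id: string; severity: "P0" | "P1" | "P2" | "P3"; ticket?: string }

type Pass = (branch: string) => Promise<Issue[]>;

async function hardeningLoop(
  branch: string,
  passes: Pass[],                        // static, adversarial, perf, infra
  autoFix: (issue: Issue) => Promise<void>,
  maxIterations = 10
): Promise<Issue[]> {
  for (let i = 0; i < maxIterations; i++) {
    // Aggregate every pass into one risk register.
    const register = (await Promise.all(passes.map(p => p(branch)))).flat();
    const blocking = register.filter(x =>
      x.severity === "P0" || (x.severity === "P1" && x.ticket === undefined));
    if (blocking.length === 0) return register; // exit criteria met
    for (const issue of blocking) await autoFix(issue); // fix, then re-run
  }
  throw new Error("Exit criteria not met - escalate to a human");
}
```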
What You Are Optimizing
The objective function:
E[undetected critical bugs] = Σ_i P(bug_i exists) * P(bug_i not detected)
The manual prompt loop reduces the second term (detection probability) through repeated sampling.
But you can reduce it dramatically more by increasing:
- Test surface area: More flows covered, more assertions per flow
- Environmental entropy: Different conditions, different timing, different load
- Adversarial mutation: Fuzz inputs, kill processes, corrupt state
- Invariant checking: Formal properties that must hold under all conditions
That gives exponential detection improvement without exponential human hours.
The Meta-Insight
This technique is a way to convert AI compute budget into software reliability. Every dollar spent on repeated hardening passes is a dollar not spent on production incidents, rollbacks, and customer trust damage.
The progression:
Level 1: Manual code review (scales with human hours)
Level 2: Manual prompt repetition (scales with human supervision)
Level 3: Automated adversarial harness (scales with compute)
Level 4: Scaffold-embedded skill (scales with every new project)
Level 2 is where most people discover the pattern. Level 3 is where the leverage appears. Level 4 is where it compounds across your entire portfolio.
Turning budget into reusable process is the move.
Practical Implementation: The Decision
You have two paths:
A) 6 hours per feature, forever. Works. Catches bugs. Scales linearly with your time. Plateaus when you run out of hours.
B) 3 weeks building the adversarial harness once. Then 20 minutes per feature for the rest of the project’s life. The harness improves with every run. Compounds across projects.
If you are building one throwaway prototype, choose A. If you are building production systems or OSS frameworks, choose B.
The hardening skill described in this article is the bridge. It is structured enough to run autonomously (B) but simple enough to run manually while you build the automation (A).
Related
- Actor-Critic Adversarial Coding – The two-agent review loop that forms one pass within Monte Carlo QA
- Trust But Verify Protocol – Verification through test output review, not code review
- Synthetic Loss Functions – Treating the hardening loop as loss minimization
- Six-Layer Lint Harness – Structural constraints that complement behavioral testing
- Agent Swarm Patterns – Multiple agents for coverage, the execution model for the hardening swarm
- Stateless Verification Loops – Deterministic verification without state accumulation
- Prevention Protocol – Turning every bug found during hardening into a permanent constraint
- Constraint-First Development – Making the rules that eliminate bug classes permanently
- Swarm Convergence Theory – Why the hardening loop converges (or does not)
- Integration Testing Patterns – End-to-end testing patterns used in the hardening skill
- Test Custom Infrastructure – Testing the harness itself

