Separating generation from evaluation is far more tractable than making a generator critical of its own work.
Author: James Phoenix | Date: March 2026
Summary
Anthropic’s Prithvi Rajasekaran published a new engineering post on harness design for long-running application development. The core idea: borrow the adversarial structure from GANs and split agent work into a generator that builds and an evaluator that judges. This is an interesting architecture that Anthropic recommends for autonomous app generation. I will likely try this on a future project.
Source: Harness Design for Long-Running Application Development
The Self-Evaluation Problem
The fundamental issue: when you ask an LLM to generate code and then evaluate its own output, it tends to confidently praise the work, even when the quality is obviously mediocre. This is the same problem Actor-Critic Adversarial Coding addresses. Anthropic’s solution is structural separation: one agent builds, another agent judges. Because the evaluator never sees the generation process, it carries no sunk-cost bias.
The Three-Agent Architecture
The full harness uses three specialized agents with distinct responsibilities.
Planner Agent
Takes a brief 1-4 sentence user prompt and expands it into a comprehensive product specification. The planner prioritizes ambitious scope while avoiding granular technical over-specification. It proactively weaves AI capabilities into specs. Output: detailed feature lists and technical architecture guidance.
Generator Agent
Implements features iteratively using a React + Vite + FastAPI + SQLite/PostgreSQL stack. Works in sprint-based decomposition rather than monolithic builds. Maintains git version control and self-evaluates before handing off to QA.
Evaluator Agent
Uses Playwright MCP to interact with the running application like a real user. Tests UI features, API endpoints, and database state. Produces detailed, actionable feedback rather than surface-level approvals. The evaluator negotiates “sprint contracts” with the generator before implementation begins, defining explicit testable success criteria.
User prompt (1-4 sentences)
|
v
┌──────────────────┐
│ Planner Agent │ Expands to full product spec
└──────────────────┘
|
v
┌──────────────────┐ sprint contract ┌──────────────────┐
│ Generator Agent │ <--------------------> │ Evaluator Agent │
│ (builds code) │ -----> running app ---> │ (Playwright QA) │
│ │ <----- feedback ------ │ │
└──────────────────┘ 5-15 iterations └──────────────────┘
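The loop in the diagram can be sketched as a small orchestration function. This is a hypothetical reconstruction, not the post's actual code: `generator` and `evaluator` are injected callables standing in for model calls, and the return shape is an assumption.

```python
def run_sprint(contract, generator, evaluator, max_iterations=15):
    """Iterate generator and evaluator until the sprint contract passes.

    The 15-iteration cap mirrors the 5-15 iteration range in the diagram.
    """
    feedback = None
    for i in range(1, max_iterations + 1):
        build = generator(contract, feedback)           # write or patch code
        verdict, feedback = evaluator(contract, build)  # Playwright-style QA
        if verdict == "PASS":
            return {"status": "PASS", "iterations": i}
    return {"status": "FAIL", "iterations": max_iterations}
```

The point of injecting both agents as parameters is that neither sees the other's internals: the evaluator only ever receives the contract and the running build, which is what removes the sunk-cost bias.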
Sprint Contracts
Before the generator writes code, the evaluator negotiates explicit testable outcomes. This is the mechanism that prevents vague pass/fail judgments.
Example criterion: “Rectangle fill tool allows click-drag to fill a rectangular area with selected tile.” The evaluator identified this as FAIL, citing the specific code location (LevelEditor.tsx:892) and exact issue diagnosis.
This is similar to Test-Driven Prompting, but the tests are negotiated between agents rather than written by a human.
Frontend Design: Subjective Quality as a Scoring Function
The most novel part of the article is applying this pattern to frontend design quality, which is inherently subjective.
Four Grading Criteria
| Criterion | What It Measures |
|---|---|
| Design Quality | Coherence, mood creation, unified visual identity |
| Originality | Custom decisions vs. template patterns. Penalizes “AI slop” markers |
| Craft | Typography hierarchy, spacing, color harmony, contrast ratios |
| Functionality | User comprehension, task completion, action discoverability |
The author weighted design and originality higher because Claude already scored well on craft and functionality by default. This weighting pushed the model toward more aesthetic risk-taking.
This connects directly to Evaluation-Driven Development and Synthetic Loss Functions. The grading rubric is the loss function. The evaluator is the critic. The generator minimizes the loss through iteration.
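A minimal sketch of the rubric as a scoring function. The four criteria come from the table above; the specific weights and the 10-point scale are illustrative assumptions chosen to mirror the post's emphasis on design and originality over craft and functionality.

```python
# Illustrative weights: design and originality weighted higher, per the post.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}

def rubric_score(grades: dict[str, float]) -> float:
    """Weighted aggregate on a 0-10 scale; the generator iterates against it."""
    assert set(grades) == set(WEIGHTS), "grade every criterion"
    return sum(WEIGHTS[k] * grades[k] for k in WEIGHTS)
```

Under these weights, a design that takes aesthetic risks (high design/originality, middling craft) outscores a safe template (the reverse), which is exactly the behavior the reweighting was meant to induce.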
Calibration Through Few-Shot
The evaluator required few-shot calibration to align its judgment with human preferences. Early iterations exhibited leniency: it would identify legitimate issues, then talk itself into deciding they were not a big deal. Multiple refinement cycles were necessary before producing useful QA verdicts.
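One way to picture the calibration step: prepend human-graded exemplars plus an explicit anti-leniency instruction to the evaluator prompt. The exemplar contents below are invented for illustration; the post does not publish its calibration set.

```python
# Invented exemplars: (human grade, rationale) pairs anchoring the scale.
EXEMPLARS = [
    (3, "Generic template layout; the issues you found are real, grade them"),
    (8, "Coherent visual identity with strong typographic hierarchy"),
]

def build_evaluator_prompt(task: str) -> str:
    """Assemble a few-shot evaluator prompt anchored to human judgment."""
    shots = "\n".join(f"Example, grade {g}/10: {note}" for g, note in EXEMPLARS)
    return ("You are a strict design evaluator. When you identify a "
            "legitimate issue, do not talk yourself into dismissing it.\n"
            f"{shots}\nNow grade: {task}")
```

The anti-leniency line targets the failure mode the post describes: identifying legitimate issues and then deciding they were not a big deal.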
Results: Solo Agent vs. Full Harness
For a retro game maker prompt:
| Approach | Duration | Cost | Outcome |
|---|---|---|---|
| Solo agent | 20 min | $9 | Broken. Nothing responded to input; the wiring between entity definitions and the game runtime never worked |
| Full harness | 6 hours | $200 | Playable mechanics, refineable edge cases |
The 22x cost increase bought a working product vs. a broken one. For throwaway prototypes, the solo agent is fine. For anything that needs to actually work, the harness pays for itself.
Context Management: Resets Beat Compaction (Until Opus)
Sonnet 4.5 exhibited “context anxiety,” prematurely wrapping work as it approached perceived context limits. Full context resets (clearing windows entirely) outperformed in-place compaction because they provided a clean slate without persistent anxiety. But resets added orchestration complexity and token overhead.
Opus 4.5 largely eliminated this behavior, enabling one continuous session across the whole build with automatic compaction handling context growth.
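The two strategies can be contrasted with a toy sketch. Real harnesses operate on structured message lists and tool state; plain strings here just keep the shape visible, and both function signatures are assumptions.

```python
def compact(history: list[str], keep_last: int = 4) -> list[str]:
    """In-place compaction: summarize old turns, keep recent ones verbatim."""
    if len(history) <= keep_last:
        return history
    summary = f"[summary of {len(history) - keep_last} earlier turns]"
    return [summary] + history[-keep_last:]

def full_reset(state_file: str) -> list[str]:
    """Full reset: discard the window entirely, reseed from externalized state."""
    return [f"[fresh session seeded from {state_file}]"]
```

Compaction preserves continuity inside one session; a reset trades that continuity for a clean slate, which is why it requires the extra orchestration and token overhead the post mentions (progress must live outside the context window).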
The Simplified V2 Harness (Opus 4.6)
With improved models, the author methodically removed harness components. The result with Opus 4.6:
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build Round 1 | 2 hr 7 min | $71.08 |
| QA Round 1 | 8.8 min | $3.24 |
| Build Round 2 | 1 hr 2 min | $36.89 |
| Build Round 3 | 10.9 min | $5.88 |
| Total | 3 hr 50 min | $124.70 |
This is a key insight about harness design in general: every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions decay as models improve. What was load-bearing scaffolding six months ago might be dead weight today.
Key Takeaways
Separation works. Splitting generation from evaluation is more effective than self-evaluation. This validates the adversarial pattern from Actor-Critic Adversarial Coding at a larger scale.
Rubrics are loss functions. Defining weighted grading criteria for subjective qualities (design, originality) turns aesthetic judgment into a scoring function the generator can optimize against.
Sprint contracts prevent drift. Negotiating explicit testable outcomes before coding prevents both premature victory declarations and vague evaluations.
Harness assumptions decay. As models improve, complexity does not shrink. It moves. Components that were essential become unnecessary, but new capabilities open new harness possibilities. Stress-test your assumptions regularly.
Language encodes values. Including phrases like “the best designs are museum quality” in prompts steered generations toward particular visual convergence. Prompt language does more than instruct. It shapes the distribution of outputs.
What I Want to Try
This pattern maps well onto the existing harness infrastructure described in Building the Harness. The planner-generator-evaluator trio could slot into the Meta Engineering layer as a specialized workflow for full-stack app generation. The sprint contract pattern in particular looks like a practical upgrade over the simpler pass/fail verification in Trust But Verify Protocol.
The frontend design grading rubric is the most transferable idea here: define weighted criteria for subjective quality, then use few-shot calibration to align the evaluator with human judgment. That pattern could apply to documentation quality, API design, or any domain where “good enough” is hard to define programmatically.
Related
- Actor-Critic Adversarial Coding – The two-agent adversarial pattern this scales up
- Building the Harness – The four-layer harness where this pattern fits
- Evaluation-Driven Development – AI vision for qualitative evaluation
- Synthetic Loss Functions – Rubrics as loss functions for agent optimization
- Trust But Verify Protocol – Simpler verification that sprint contracts upgrade
- Test-Driven Prompting – Human-written test specs, complementary to agent-negotiated sprint contracts
- Monte Carlo QA – Repeated evaluation passes, same principle applied differently
- Long-Running Agent Patterns – Shell, skills, and compaction for extended runs

