Generator-Evaluator Harness Design: Anthropic’s GAN-Inspired Architecture for Long-Running Apps


Separating generation from evaluation is far more tractable than asking a generator to critique its own work.

Author: James Phoenix | Date: March 2026


Summary

Anthropic’s Prithvi Rajasekaran published a new engineering post on harness design for long-running application development. The core idea: borrow the adversarial structure from GANs and split agent work into a generator that builds and an evaluator that judges. This is an interesting architecture that Anthropic recommends for autonomous app generation. I will likely try this on a future project.

Source: Harness Design for Long-Running Application Development


The Self-Evaluation Problem

The fundamental issue: when you ask an LLM to generate code and then evaluate its own output, it tends to confidently praise the work, even when quality is obviously mediocre. This is the same problem Actor-Critic Adversarial Coding addresses. Anthropic’s solution is structural separation. One agent builds, another agent judges. The evaluator never sees the generation process, so it carries no sunk-cost bias.


The Three-Agent Architecture

The full harness uses three specialized agents with distinct responsibilities.

Planner Agent

Takes a brief 1-4 sentence user prompt and expands it into a comprehensive product specification. The planner prioritizes ambitious scope while avoiding granular technical over-specification. It proactively weaves AI capabilities into specs. Output: detailed feature lists and technical architecture guidance.

Generator Agent

Implements features iteratively using a React + Vite + FastAPI + SQLite/PostgreSQL stack. Works in sprint-based decomposition rather than monolithic builds. Maintains git version control and self-evaluates before handing off to QA.

Evaluator Agent

Uses Playwright MCP to interact with the running application like a real user. Tests UI features, API endpoints, and database state. Produces detailed, actionable feedback rather than surface-level approvals. The evaluator negotiates “sprint contracts” with the generator before implementation begins, defining explicit testable success criteria.

User prompt (1-4 sentences)
        |
        v
┌───────────────────┐
│   Planner Agent   │  Expands to full product spec
└───────────────────┘
        |
        v
┌───────────────────┐     sprint contract      ┌───────────────────┐
│  Generator Agent  │ <----------------------> │  Evaluator Agent  │
│  (builds code)    │ ----> running app -----> │  (Playwright QA)  │
│                   │ <----- feedback -------- │                   │
└───────────────────┘     5-15 iterations      └───────────────────┘
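To make the loop concrete, here is a minimal orchestration sketch. Everything in it (the run_agent helper, the prompt wording, the pass signal) is my own placeholder rather than Anthropic's harness code; it only shows how the three roles hand off to each other.

```python
# Minimal orchestration sketch of the planner -> generator <-> evaluator loop.
# run_agent(), the prompt wording, and the pass signal are hypothetical placeholders.

MAX_ITERATIONS = 15  # the post reports roughly 5-15 generator/evaluator rounds


def run_agent(role: str, prompt: str) -> str:
    """Stand-in for an agent invocation (e.g. a Claude session with role-specific tools)."""
    raise NotImplementedError("wire this to your model/agent runtime")


def build_app(user_prompt: str) -> str:
    # 1. Planner expands a 1-4 sentence prompt into a full product spec.
    spec = run_agent("planner", f"Expand this into a full product spec:\n{user_prompt}")

    # 2. Evaluator proposes the sprint contract: explicit, testable success criteria.
    contract = run_agent("evaluator", f"Propose testable success criteria for:\n{spec}")

    feedback = ""
    app_state = ""
    for _ in range(MAX_ITERATIONS):
        # 3. Generator builds or revises the app against spec, contract, and prior QA feedback.
        app_state = run_agent(
            "generator",
            f"Spec:\n{spec}\n\nSprint contract:\n{contract}\n\nPrior QA feedback:\n{feedback}",
        )

        # 4. Evaluator drives the running app (the post uses Playwright MCP) and returns
        #    a per-criterion verdict plus actionable feedback.
        feedback = run_agent("evaluator", f"Test the running app against:\n{contract}")
        if "ALL CRITERIA PASS" in feedback:  # hypothetical pass signal
            break
    return app_state
```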

Sprint Contracts

Before the generator writes code, the evaluator negotiates explicit testable outcomes. This is the mechanism that prevents vague pass/fail judgments.

Example criterion: “Rectangle fill tool allows click-drag to fill a rectangular area with selected tile.” The evaluator identified this as FAIL, citing the specific code location (LevelEditor.tsx:892) and exact issue diagnosis.
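As a rough illustration of what a contract entry might look like: the Criterion shape and field names below are mine, and the post cites the failing location but does not quote the diagnosis text itself.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One negotiated, testable outcome in a sprint contract (illustrative shape only)."""
    description: str         # what the evaluator will attempt as a user
    status: str = "PENDING"  # PENDING | PASS | FAIL
    diagnosis: str = ""      # evaluator's notes: code location plus the exact issue


contract = [
    Criterion("Rectangle fill tool allows click-drag to fill a rectangular area with selected tile"),
]

# After QA, the evaluator records the verdict it reached by driving the app.
contract[0].status = "FAIL"
contract[0].diagnosis = "LevelEditor.tsx:892 - <exact issue diagnosis from the evaluator>"
```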

This is similar to Test-Driven Prompting, but the tests are negotiated between agents rather than written by a human.


Frontend Design: Subjective Quality as a Scoring Function

The most novel part of the article is applying this pattern to frontend design quality, which is inherently subjective.


Four Grading Criteria

| Criterion      | What It Measures                                                     |
|----------------|----------------------------------------------------------------------|
| Design Quality | Coherence, mood creation, unified visual identity                    |
| Originality    | Custom decisions vs. template patterns. Penalizes “AI slop” markers  |
| Craft          | Typography hierarchy, spacing, color harmony, contrast ratios        |
| Functionality  | User comprehension, task completion, action discoverability          |

The author weighted design and originality higher because Claude already scored well on craft and functionality by default. This weighting pushed the model toward more aesthetic risk-taking.

This connects directly to Evaluation-Driven Development and Synthetic Loss Functions. The grading rubric is the loss function. The evaluator is the critic. The generator minimizes the loss through iteration.
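Here is a minimal sketch of what that looks like as code. The weights and the 1-10 scale are illustrative guesses on my part; the article describes the weighting direction but not exact numbers.

```python
# Illustrative weighted rubric: the evaluator scores each criterion (1-10 here) and the
# harness collapses them into the single number the generator iterates against.
# The weights are my assumption; the post only says design and originality were
# weighted higher than craft and functionality.
RUBRIC_WEIGHTS = {
    "design_quality": 0.35,  # coherence, mood creation, unified visual identity
    "originality":    0.35,  # custom decisions vs. template patterns, "AI slop" markers
    "craft":          0.15,  # typography hierarchy, spacing, color harmony, contrast
    "functionality":  0.15,  # comprehension, task completion, action discoverability
}


def design_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores into the scalar the generator tries to maximize."""
    return sum(RUBRIC_WEIGHTS[name] * scores[name] for name in RUBRIC_WEIGHTS)


# Example verdict from one frontend iteration: strong craft, middling originality.
print(design_score({"design_quality": 6, "originality": 5, "craft": 9, "functionality": 8}))
```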

Calibration Through Few-Shot

The evaluator required few-shot calibration to align its judgment with human preferences. Early iterations exhibited leniency: it would identify legitimate issues, then talk itself into deciding they were not a big deal. Multiple refinement cycles were necessary before producing useful QA verdicts.
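One plausible way to implement that calibration is to prepend human-graded reference examples to the evaluator prompt so its scores anchor to human judgment. The examples and helper below are invented placeholders, not the author's actual few-shot set.

```python
# Hypothetical calibration set: past observations paired with the scores a human
# assigned them, prepended to the evaluator prompt so it stops excusing real issues.
CALIBRATION_EXAMPLES = [
    {"observation": "Generic card grid, default shadows, stock gradient hero",
     "human_scores": {"design_quality": 4, "originality": 3},
     "note": "Template look; identified issues must lower the score, not be excused."},
    {"observation": "Custom type scale, unusual but coherent palette, consistent spacing",
     "human_scores": {"design_quality": 8, "originality": 8},
     "note": "Aesthetic risk-taking that still reads clearly."},
]


def evaluator_prompt(rubric: dict, observation: str) -> str:
    """Build a grading prompt anchored by human-graded reference examples."""
    shots = "\n\n".join(
        f"Observation: {ex['observation']}\nScores: {ex['human_scores']}\nWhy: {ex['note']}"
        for ex in CALIBRATION_EXAMPLES
    )
    return (
        f"Grade the frontend against this rubric: {rubric}\n\n"
        f"Calibration examples (match this strictness):\n{shots}\n\n"
        f"Now grade:\nObservation: {observation}\nScores:"
    )
```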


Results: Solo Agent vs. Full Harness

For a retro game maker prompt:

| Approach     | Duration | Cost | Outcome                                                                                            |
|--------------|----------|------|-----------------------------------------------------------------------------------------------------|
| Solo agent   | 20 min   | $9   | Broken. Nothing responded to input; wiring between entity definitions and game runtime was broken |
| Full harness | 6 hours  | $200 | Playable mechanics, refineable edge cases                                                           |

The 22x cost increase bought a working product vs. a broken one. For throwaway prototypes, the solo agent is fine. For anything that needs to actually work, the harness pays for itself.


Context Management: Resets Beat Compaction (Until Opus)

Sonnet 4.5 exhibited “context anxiety,” prematurely wrapping up work as it approached perceived context limits. Full context resets (clearing the window entirely) outperformed in-place compaction because they provided a clean slate without persistent anxiety. But resets added orchestration complexity and token overhead.

Opus 4.5 largely eliminated this behavior, enabling one continuous session across the whole build with automatic compaction handling context growth.
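A rough sketch of the difference in orchestration terms; none of these helpers come from the post, they just make the trade-off concrete.

```python
# Rough sketch of reset vs. compaction between sprints. All helpers are stand-ins.

def new_session(system: str) -> dict:
    """Stand-in for starting a fresh agent session seeded only with durable state."""
    return {"system": system, "history": []}


def summarize(history: list[str]) -> list[str]:
    """Stand-in for whatever summarization the runtime applies during compaction."""
    return history[-1:]  # naive: keep only the most recent entry


def reset_session(spec: str, contract: str, repo_summary: str) -> dict:
    """Full reset: a clean window with no 'context anxiety', at the cost of
    re-seeding tokens and extra orchestration."""
    seed = f"Spec:\n{spec}\n\nSprint contract:\n{contract}\n\nRepo state:\n{repo_summary}"
    return new_session(system=seed)


def compact_session(session: dict) -> dict:
    """In-place compaction: the same session continues against a shortened history."""
    session["history"] = summarize(session["history"])
    return session

# Per the post: with Sonnet 4.5, resetting between sprints worked better; with
# Opus 4.5 and later, one continuous session with compaction was enough.
```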


The Simplified V2 Harness (Opus 4.6)

With improved models, the author methodically removed harness components. The result with Opus 4.6:

| Phase         | Duration    | Cost    |
|---------------|-------------|---------|
| Planner       | 4.7 min     | $0.46   |
| Build Round 1 | 2 hr 7 min  | $71.08  |
| QA Round 1    | 8.8 min     | $3.24   |
| Build Round 2 | 1 hr 2 min  | $36.89  |
| Build Round 3 | 10.9 min    | $5.88   |
| Total         | 3 hr 50 min | $124.70 |

This is a key insight about harness design in general: every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions decay as models improve. What was load-bearing scaffolding six months ago might be dead weight today.


Key Takeaways

Separation works. Splitting generation from evaluation is more effective than self-evaluation. This validates the adversarial pattern from Actor-Critic Adversarial Coding at a larger scale.

Rubrics are loss functions. Defining weighted grading criteria for subjective qualities (design, originality) turns aesthetic judgment into a scoring function the generator can optimize against.

Sprint contracts prevent drift. Negotiating explicit testable outcomes before coding prevents both premature victory declarations and vague evaluations.

Harness assumptions decay. As models improve, complexity does not shrink. It moves. Components that were essential become unnecessary, but new capabilities open new harness possibilities. Stress-test your assumptions regularly.

Language encodes values. Including phrases like “the best designs are museum quality” in prompts steered generations toward converging on a particular visual style. Prompt language does more than instruct. It shapes the distribution of outputs.


What I Want to Try

This pattern maps well onto the existing harness infrastructure described in Building the Harness. The planner-generator-evaluator trio could slot into the Meta Engineering layer as a specialized workflow for full-stack app generation. The sprint contract pattern in particular looks like a practical upgrade over the simpler pass/fail verification in Trust But Verify Protocol.

The frontend design grading rubric is the most transferable idea here: define weighted criteria for subjective quality, then use few-shot calibration to align the evaluator with human preferences. That pattern could apply to documentation quality, API design, or any domain where “good enough” is hard to define programmatically.

