Generator-Evaluator Harness Design: Anthropic’s GAN-Inspired Architecture for Long-Running Apps


Separating generation from evaluation is far more tractable than asking a generator to critique its own work.

Author: James Phoenix | Date: March 2026


Summary

Anthropic’s Prithvi Rajasekaran published a new engineering post on harness design for long-running application development. The core idea: borrow the adversarial structure from GANs and split agent work into a generator that builds and an evaluator that judges. This is an interesting architecture that Anthropic recommends for autonomous app generation. I will likely try this on a future project.

Source: Harness Design for Long-Running Application Development


The Self-Evaluation Problem

The fundamental issue: when you ask an LLM to generate code and then evaluate its own output, it tends to confidently praise the work, even when quality is obviously mediocre. This is the same problem Actor-Critic Adversarial Coding addresses. Anthropic’s solution is structural separation. One agent builds, another agent judges. The evaluator never sees the generation process, so it carries no sunk-cost bias.


The Three-Agent Architecture

The full harness uses three specialized agents with distinct responsibilities.

Planner Agent

Takes a brief 1-4 sentence user prompt and expands it into a comprehensive product specification. The planner prioritizes ambitious scope while avoiding granular technical over-specification. It proactively weaves AI capabilities into specs. Output: detailed feature lists and technical architecture guidance.

Generator Agent

Implements features iteratively using a React + Vite + FastAPI + SQLite/PostgreSQL stack. Works in sprint-based decomposition rather than monolithic builds. Maintains git version control and self-evaluates before handing off to QA.

Evaluator Agent

Uses Playwright MCP to interact with the running application like a real user. Tests UI features, API endpoints, and database state. Produces detailed, actionable feedback rather than surface-level approvals. The evaluator negotiates “sprint contracts” with the generator before implementation begins, defining explicit testable success criteria.

User prompt (1-4 sentences)
        |
        v
┌───────────────────┐
│   Planner Agent   │  Expands to full product spec
└───────────────────┘
        |
        v
┌───────────────────┐     sprint contract      ┌───────────────────┐
│  Generator Agent  │ <----------------------> │  Evaluator Agent  │
│  (builds code)    │ ----> running app -----> │  (Playwright QA)  │
│                   │ <----- feedback -------- │                   │
└───────────────────┘     5-15 iterations      └───────────────────┘
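To make the loop concrete, here is a minimal orchestration sketch. Everything in it (the run_agent helper, the prompt wording, the pass signal) is my own placeholder rather than Anthropic's harness code; it only shows how the three roles hand off to each other.

```python
# Minimal orchestration sketch of the planner -> generator <-> evaluator loop.
# run_agent(), the prompt wording, and the pass signal are hypothetical placeholders.

MAX_ITERATIONS = 15  # the post reports roughly 5-15 generator/evaluator rounds


def run_agent(role: str, prompt: str) -> str:
    """Stand-in for an agent invocation (e.g. a Claude session with role-specific tools)."""
    raise NotImplementedError("wire this to your model/agent runtime")


def build_app(user_prompt: str) -> str:
    # 1. Planner expands a 1-4 sentence prompt into a full product spec.
    spec = run_agent("planner", f"Expand this into a full product spec:\n{user_prompt}")

    # 2. Evaluator proposes the sprint contract: explicit, testable success criteria.
    contract = run_agent("evaluator", f"Propose testable success criteria for:\n{spec}")

    feedback = ""
    app_state = ""
    for _ in range(MAX_ITERATIONS):
        # 3. Generator builds or revises the app against spec, contract, and prior QA feedback.
        app_state = run_agent(
            "generator",
            f"Spec:\n{spec}\n\nSprint contract:\n{contract}\n\nPrior QA feedback:\n{feedback}",
        )

        # 4. Evaluator drives the running app (the post uses Playwright MCP) and returns
        #    a per-criterion verdict plus actionable feedback.
        feedback = run_agent("evaluator", f"Test the running app against:\n{contract}")
        if "ALL CRITERIA PASS" in feedback:  # hypothetical pass signal
            break
    return app_state
```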

Sprint Contracts

Before the generator writes code, the evaluator negotiates explicit testable outcomes. This is the mechanism that prevents vague pass/fail judgments.

Example criterion: “Rectangle fill tool allows click-drag to fill a rectangular area with selected tile.” The evaluator identified this as FAIL, citing the specific code location (LevelEditor.tsx:892) and exact issue diagnosis.
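As a rough illustration of what a contract entry might look like: the Criterion shape and field names below are mine, and the post cites the failing location but does not quote the diagnosis text itself.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One negotiated, testable outcome in a sprint contract (illustrative shape only)."""
    description: str         # what the evaluator will attempt as a user
    status: str = "PENDING"  # PENDING | PASS | FAIL
    diagnosis: str = ""      # evaluator's notes: code location plus the exact issue


contract = [
    Criterion("Rectangle fill tool allows click-drag to fill a rectangular area with selected tile"),
]

# After QA, the evaluator records the verdict it reached by driving the app.
contract[0].status = "FAIL"
contract[0].diagnosis = "LevelEditor.tsx:892 - <exact issue diagnosis from the evaluator>"
```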

This is similar to Test-Driven Prompting, but the tests are negotiated between agents rather than written by a human.


Frontend Design: Subjective Quality as a Scoring Function

The most novel part of the article is applying this pattern to frontend design quality, which is inherently subjective.


Four Grading Criteria

| Criterion      | What It Measures                                                     |
|----------------|----------------------------------------------------------------------|
| Design Quality | Coherence, mood creation, unified visual identity                    |
| Originality    | Custom decisions vs. template patterns. Penalizes “AI slop” markers  |
| Craft          | Typography hierarchy, spacing, color harmony, contrast ratios        |
| Functionality  | User comprehension, task completion, action discoverability          |

The author weighted design and originality higher because Claude already scored well on craft and functionality by default. This weighting pushed the model toward more aesthetic risk-taking.

This connects directly to Evaluation-Driven Development and Synthetic Loss Functions. The grading rubric is the loss function. The evaluator is the critic. The generator minimizes the loss through iteration.
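Here is a minimal sketch of what that looks like as code. The weights and the 1-10 scale are illustrative guesses on my part; the article describes the weighting direction but not exact numbers.

```python
# Illustrative weighted rubric: the evaluator scores each criterion (1-10 here) and the
# harness collapses them into the single number the generator iterates against.
# The weights are my assumption; the post only says design and originality were
# weighted higher than craft and functionality.
RUBRIC_WEIGHTS = {
    "design_quality": 0.35,  # coherence, mood creation, unified visual identity
    "originality":    0.35,  # custom decisions vs. template patterns, "AI slop" markers
    "craft":          0.15,  # typography hierarchy, spacing, color harmony, contrast
    "functionality":  0.15,  # comprehension, task completion, action discoverability
}


def design_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores into the scalar the generator tries to maximize."""
    return sum(RUBRIC_WEIGHTS[name] * scores[name] for name in RUBRIC_WEIGHTS)


# Example verdict from one frontend iteration: strong craft, middling originality.
print(design_score({"design_quality": 6, "originality": 5, "craft": 9, "functionality": 8}))
```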

Calibration Through Few-Shot

The evaluator required few-shot calibration to align its judgment with human preferences. Early iterations exhibited leniency: it would identify legitimate issues, then talk itself into deciding they were not a big deal. Multiple refinement cycles were necessary before producing useful QA verdicts.
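One plausible way to implement that calibration is to prepend human-graded reference examples to the evaluator prompt so its scores anchor to human judgment. The examples and helper below are invented placeholders, not the author's actual few-shot set.

```python
# Hypothetical calibration set: past observations paired with the scores a human
# assigned them, prepended to the evaluator prompt so it stops excusing real issues.
CALIBRATION_EXAMPLES = [
    {"observation": "Generic card grid, default shadows, stock gradient hero",
     "human_scores": {"design_quality": 4, "originality": 3},
     "note": "Template look; identified issues must lower the score, not be excused."},
    {"observation": "Custom type scale, unusual but coherent palette, consistent spacing",
     "human_scores": {"design_quality": 8, "originality": 8},
     "note": "Aesthetic risk-taking that still reads clearly."},
]


def evaluator_prompt(rubric: dict, observation: str) -> str:
    """Build a grading prompt anchored by human-graded reference examples."""
    shots = "\n\n".join(
        f"Observation: {ex['observation']}\nScores: {ex['human_scores']}\nWhy: {ex['note']}"
        for ex in CALIBRATION_EXAMPLES
    )
    return (
        f"Grade the frontend against this rubric: {rubric}\n\n"
        f"Calibration examples (match this strictness):\n{shots}\n\n"
        f"Now grade:\nObservation: {observation}\nScores:"
    )
```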


Results: Solo Agent vs. Full Harness

For a retro game maker prompt:

| Approach     | Duration | Cost | Outcome                                                                                            |
|--------------|----------|------|-----------------------------------------------------------------------------------------------------|
| Solo agent   | 20 min   | $9   | Broken. Nothing responded to input; wiring between entity definitions and game runtime was broken |
| Full harness | 6 hours  | $200 | Playable mechanics, refineable edge cases                                                           |

The 22x cost increase bought a working product vs. a broken one. For throwaway prototypes, the solo agent is fine. For anything that needs to actually work, the harness pays for itself.


Context Management: Resets Beat Compaction (Until Opus)

Sonnet 4.5 exhibited “context anxiety,” prematurely wrapping up work as it approached perceived context limits. Full context resets (clearing the window entirely) outperformed in-place compaction because they provided a clean slate without persistent anxiety. But resets added orchestration complexity and token overhead.

Opus 4.5 largely eliminated this behavior, enabling one continuous session across the whole build with automatic compaction handling context growth.
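A rough sketch of the difference in orchestration terms; none of these helpers come from the post, they just make the trade-off concrete.

```python
# Rough sketch of reset vs. compaction between sprints. All helpers are stand-ins.

def new_session(system: str) -> dict:
    """Stand-in for starting a fresh agent session seeded only with durable state."""
    return {"system": system, "history": []}


def summarize(history: list[str]) -> list[str]:
    """Stand-in for whatever summarization the runtime applies during compaction."""
    return history[-1:]  # naive: keep only the most recent entry


def reset_session(spec: str, contract: str, repo_summary: str) -> dict:
    """Full reset: a clean window with no 'context anxiety', at the cost of
    re-seeding tokens and extra orchestration."""
    seed = f"Spec:\n{spec}\n\nSprint contract:\n{contract}\n\nRepo state:\n{repo_summary}"
    return new_session(system=seed)


def compact_session(session: dict) -> dict:
    """In-place compaction: the same session continues against a shortened history."""
    session["history"] = summarize(session["history"])
    return session

# Per the post: with Sonnet 4.5, resetting between sprints worked better; with
# Opus 4.5 and later, one continuous session with compaction was enough.
```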


The Simplified V2 Harness (Opus 4.6)

With improved models, the author methodically removed harness components. The result with Opus 4.6:

| Phase         | Duration    | Cost    |
|---------------|-------------|---------|
| Planner       | 4.7 min     | $0.46   |
| Build Round 1 | 2 hr 7 min  | $71.08  |
| QA Round 1    | 8.8 min     | $3.24   |
| Build Round 2 | 1 hr 2 min  | $36.89  |
| Build Round 3 | 10.9 min    | $5.88   |
| Total         | 3 hr 50 min | $124.70 |

This is a key insight about harness design in general: every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions decay as models improve. What was load-bearing scaffolding six months ago might be dead weight today.


Key Takeaways

Separation works. Splitting generation from evaluation is more effective than self-evaluation. This validates the adversarial pattern from Actor-Critic Adversarial Coding at a larger scale.

Rubrics are loss functions. Defining weighted grading criteria for subjective qualities (design, originality) turns aesthetic judgment into a scoring function the generator can optimize against.

Sprint contracts prevent drift. Negotiating explicit testable outcomes before coding prevents both premature victory declarations and vague evaluations.

Harness assumptions decay. As models improve, complexity does not shrink. It moves. Components that were essential become unnecessary, but new capabilities open new harness possibilities. Stress-test your assumptions regularly.

Language encodes values. Including phrases like “the best designs are museum quality” in prompts steered generations toward converging on a particular visual style. Prompt language does more than instruct. It shapes the distribution of outputs.


What I Want to Try

This pattern maps well onto the existing harness infrastructure described in Building the Harness. The planner-generator-evaluator trio could slot into the Meta Engineering layer as a specialized workflow for full-stack app generation. The sprint contract pattern in particular looks like a practical upgrade over the simpler pass/fail verification in Trust But Verify Protocol.

The frontend design grading rubric is the most transferable idea here: define weighted criteria for subjective quality, then use few-shot calibration to align the evaluator with human preferences. That pattern could apply to documentation quality, API design, or any domain where “good enough” is hard to define programmatically.

