The future of agent systems is agents that write their own constraint layers, not humans hand-coding guardrails.
Source: AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness (Lou et al., Google DeepMind, March 2026)
The Problem
LLMs are brittle actors. They reason well but constantly violate the rules of the environment they operate in. In the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were from illegal moves, not bad strategy. The model understood chess. It just couldn’t reliably produce valid moves.
The standard fix: humans write “harnesses” around the LLM. Code that intercepts the model’s output, validates it against the environment’s rules, and rejects or corrects invalid actions. This works but doesn’t scale. Every new environment needs a new hand-coded harness. The harness itself is brittle and labour-intensive to maintain.
The Insight
What if the LLM writes its own harness?
AutoHarness does exactly this. You give the model a function signature like is_legal_action(board, action) -> bool and let it iteratively synthesise the implementation through a code-search process. The LLM proposes code, runs it against the environment, gets feedback on what failed, and refines. The search uses Thompson sampling over a tree of code hypotheses to balance exploration (trying new approaches) and exploitation (refining what partially works).
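The search loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the Beta-posterior bookkeeping, and the `propose_refinement`/`evaluate` callbacks are all assumptions, standing in for the LLM refinement step and the environment rollouts.

```python
import random

# Sketch of Thompson sampling over a tree of code hypotheses.
# Each node tracks a Beta posterior over its legal-move rate.
class Node:
    def __init__(self, code, parent=None):
        self.code = code      # candidate harness implementation (source text)
        self.parent = parent
        self.legal = 1        # Beta prior pseudo-counts: legal moves...
        self.illegal = 1      # ...and illegal moves

    def sample_value(self):
        # Thompson sampling: draw a plausible legal-move rate from the posterior
        return random.betavariate(self.legal, self.illegal)

def search(root, propose_refinement, evaluate, iterations=15):
    nodes = [root]
    for _ in range(iterations):
        # Exploration vs exploitation: refine the node whose sampled value wins
        node = max(nodes, key=Node.sample_value)
        child = Node(propose_refinement(node.code), parent=node)
        legal, total = evaluate(child.code)   # run rollouts in the environment
        child.legal += legal
        child.illegal += total - legal
        nodes.append(child)
        if legal == total:                    # perfect legal-action rate: done
            return child
    return max(nodes, key=lambda n: n.legal / (n.legal + n.illegal))
```

Sampling from each node's posterior (rather than taking the empirical mean) is what lets under-explored branches occasionally win the argmax, which is the exploration half of the trade-off.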
Two modes:
- Harness-as-action-verifier. The LLM still reasons at inference time, but every proposed action gets checked by the synthesised is_legal_action() function. Invalid actions get rejected and the model retries.
- Harness-as-policy. The entire decision-making policy is compiled into code. No LLM call is needed at inference time: pure Python with numpy. The model's strategic knowledge gets distilled into an executable program.
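The action-verifier mode reduces to a short retry loop. A minimal sketch, assuming `llm_propose_action` stands in for the model call and `is_legal_action` for the synthesised harness (both names hypothetical):

```python
# Reject-and-retry wrapper: the LLM reasons, the harness gatekeeps.
def act(board, llm_propose_action, is_legal_action, max_retries=5):
    feedback = None
    for _ in range(max_retries):
        action = llm_propose_action(board, feedback)  # LLM still reasons here
        if is_legal_action(board, action):            # hard, deterministic check
            return action
        feedback = f"illegal action: {action!r}; try again"
    raise RuntimeError("no legal action produced within retry budget")
```

Note that the model never sees the harness internals; it only sees the rejection feedback, which is what makes the harness model-agnostic.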
Results That Matter
- Small model + auto-harness beats large model without one. Gemini-2.5-Flash with AutoHarness beat Gemini-2.5-Pro across 145 TextArena games. A cheaper, faster model with the right constraints outperforms a more expensive model running naked.
- 100% legal action rate. Across all 145 games, the synthesised harness achieved a perfect legal action rate. Hand-tuning couldn't do better.
- Harness-as-policy beats everything. When pushed to the extreme (compiling the entire policy into code), the system achieved higher average reward than Gemini-2.5-Pro, GPT-5.2, and GPT-5.2-High on 1-player games, with near-zero inference cost.
- Fast convergence. Training ends after an average of 14.5 tree-search iterations; 19 of 32 evaluated games converged in under 10. This isn't months of training. It's an afternoon.
Why Auto-Harness Is the Future Direction
1. Constraints are the real product, not the model
The paper proves something practitioners already suspect: the gap between models is smaller than the gap between harnesses. A well-constrained small model beats a poorly-constrained large model. This means the competitive advantage in agent systems shifts from “which model do you use?” to “how good are your constraints?” And if agents can write their own constraints, the entire harness layer becomes auto-generated infrastructure.
2. It generalises beyond games
The paper uses TextArena games, but the pattern is universal. Any domain where an agent must produce valid outputs has the same structure:
- API calls with required schemas and valid parameter ranges
- Database mutations that must satisfy integrity constraints
- Infrastructure changes that must pass policy checks
- Code generation that must compile and pass tests
In each case, you could hand-code the validation layer, or you could let the model synthesise it from environment feedback. The second approach scales.
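To make the generalisation concrete, here is what a synthesised validation layer could look like for the first case, an API call with a required schema and valid parameter ranges. The field names, types, and ranges are invented for illustration; the point is the shape, a deterministic predicate the agent's output must pass:

```python
# Hypothetical synthesised harness for an API-call environment:
# reject malformed payloads before they ever reach the service.
def is_valid_request(payload: dict) -> bool:
    required = {"user_id": int, "amount": float}
    # Schema check: every required field present with the right type
    for field, ftype in required.items():
        if not isinstance(payload.get(field), ftype):
            return False
    # Range check: valid parameter bounds
    return 0 < payload["amount"] <= 10_000
```

Whether this predicate is hand-written or synthesised from API error feedback, the inference-time contract is identical, which is why swapping in a synthesised version is cheap.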
3. Code is the right representation for constraints
The harness is Python code, not a prompt. This matters. Prompts are probabilistic. Code is deterministic. Once the harness is synthesised and verified, it acts as a hard boundary that the model cannot violate regardless of how it reasons. This is the difference between “please don’t make illegal moves” (prompt) and if not is_legal(action): reject() (code).
4. Policy distillation into code eliminates inference cost
The harness-as-policy result is the most radical finding. The model’s entire strategic knowledge for a domain gets compiled into an executable Python program. At inference time you run numpy, not an LLM. This is the endgame: use expensive models to synthesise cheap, fast, deterministic programs that run forever without API calls.
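A toy illustration of what "running numpy, not an LLM" means in practice. The scoring heuristic below is invented, not from the paper; the point is that the distilled policy is an ordinary deterministic function with zero API calls:

```python
import numpy as np

# Distilled policy sketch: strategy compiled into plain numpy.
# The position-weight heuristic is a stand-in for whatever the
# synthesiser actually distilled from the model's play.
def policy(board: np.ndarray, legal_moves: list) -> int:
    weights = np.arange(board.size).reshape(board.shape)
    scores = [weights.flat[m] for m in legal_moves]
    return legal_moves[int(np.argmax(scores))]
```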
5. It inverts the development loop
Traditional agent development: human writes harness, then LLM operates within it. AutoHarness: LLM writes harness, then (optionally) LLM operates within it. The model bootstraps its own operating constraints. This connects directly to Function-Driven Development. The agent doesn’t just express what tools it needs. It writes the tools.
Connection to Compound Engineering
This paper validates a core thesis: the meta-layer compounds faster than the object layer. Building a chess harness by hand is object-level work. Building a system that synthesises harnesses automatically is meta-level work. The meta-layer produces harnesses for 145 games. The hand-coded approach produces a harness for one.
The right investment is always in the harness-generation machinery, not in individual harnesses. This is the same principle behind owning your control plane. You don’t want to own the agents. You want to own the system that produces and constrains agents.
Key Technical Details
- Search method: Thompson sampling over a tree of code hypotheses. Each node is a code version with a heuristic value (fraction of legal moves). The tree balances trying new code structures vs refining promising ones.
- Feedback loop: 10 parallel environments, rollouts up to 1000 steps. Failed steps get fed to a Critic that consolidates error types. The Critic output plus original code goes to the Refiner to produce improved code.
- Training model: Gemini-2.5-Flash (cheap) for synthesis. The resulting harness works with any model at inference time.
- Convergence: Average 14.5 iterations. Most games under 10. Chess and Othello take longest (~60 iterations).
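The rollout/Critic/Refiner loop above can be sketched as a single training function. `run_rollouts`, `critic`, and `refiner` are placeholder callables, the latter two standing in for the LLM roles described in the paper:

```python
# Sketch of the synthesis feedback loop: rollouts -> critic -> refiner.
def synthesise(code, run_rollouts, critic, refiner, max_iters=15):
    for _ in range(max_iters):
        failures = run_rollouts(code)      # e.g. 10 parallel envs, up to 1000 steps
        if not failures:
            return code                    # perfect legal-action rate: converged
        error_summary = critic(failures)   # consolidate failed steps into error types
        code = refiner(code, error_summary)  # produce improved harness code
    return code
```

The Critic's consolidation step matters: feeding raw failed rollouts to the Refiner would blow up the context, whereas a summary of error types keeps each refinement call cheap.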
Related
- Function-Driven Development – Auto-harness is FDD taken to completion: the agent doesn’t just spec tools, it writes them
- Agent-Driven Development
- AI-Native Principles
- Own Your Control Plane
- Zero-Cost Knowledge Extraction

