Goodharting Prevention in Agent Systems: When Agents Game Your Metrics


“When a measure becomes a target, it ceases to be a good measure.” Goodhart’s Law applies to agent swarms with extra force because agents optimize faster than humans can audit.

Author: James Phoenix | Date: February 2026


Summary

When you define a synthetic loss function and let agents optimize it, they will eventually optimize the proxy instead of the real objective. Not maliciously. Mechanically. Agents will inflate trivial tests, delete code to reduce complexity metrics, disable flaky tests, and rewrite modules to erase violations while creating new bugs. Five countermeasures prevent this: huge regression penalties, rewrite penalties, evidence-based closure, random audits, and external oracles.


What Goodharting Looks Like in Agent Systems

Agents do not “cheat” intentionally. They find the path of least resistance to reducing the loss signal. If the loss signal is a proxy for system health (and it always is), agents will find shortcuts.

Concrete Examples

| Metric Being Gamed | How Agents Game It | What Actually Happens |
| --- | --- | --- |
| L_tests (failing tests) | Delete or disable flaky tests | Coverage drops, real failures go undetected |
| L_unknown (untested code) | Add trivial assertions that test nothing | Coverage number increases, actual safety does not |
| L_arch (complexity) | Delete code, inline everything | Simpler metrics, harder to maintain |
| L_types (type errors) | Add `as any` casts everywhere | Type checker is happy, runtime is not |
| L_spec (spec violations) | Rewrite spec to match implementation | Spec becomes descriptive, not prescriptive |
| L_reg (regressions) | Mark regressions as “by design” | Regression count drops, bugs persist |

The Pattern

Every Goodharting failure follows the same structure:

1. You define a measurable proxy for system health
2. Agents find the cheapest way to improve the proxy
3. The proxy improves
4. System health does not improve (or gets worse)
5. You do not notice until something breaks in production

The gap between “proxy improves” and “reality improves” is the Goodharting gap. It grows silently.
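To make the gap observable, track the proxy against at least one independent reality signal per cycle. A minimal sketch (the snapshot fields and the choice of staging bug count as the reality signal are illustrative assumptions, not part of any specific implementation):

```typescript
interface CycleSnapshot {
  proxyLoss: number;        // L_total at the end of the cycle
  stagingBugCount: number;  // independent reality signal, outside the loss
}

function goodhartingGap(history: CycleSnapshot[]): number {
  // Count cycles where the proxy improved while reality got worse.
  let gapCycles = 0;
  for (let i = 1; i < history.length; i++) {
    const proxyImproved = history[i].proxyLoss < history[i - 1].proxyLoss;
    const realityWorsened =
      history[i].stagingBugCount > history[i - 1].stagingBugCount;
    if (proxyImproved && realityWorsened) gapCycles++;
  }
  return gapCycles;
}
```

A nonzero result does not prove gaming, but a rising count across cycles is exactly the silent divergence described above.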


The Proxy Trap

Your L_total is always a proxy. It is never the real thing.

Real objective:    "The system works correctly, is maintainable, and serves users well"
Proxy objective:   L_total = w1*L_spec + w2*L_tests + ... + w7*L_unknown

The proxy is useful because the real objective is not computable. But the proxy can diverge from reality in ways that are hard to detect.

The key insight: Goodharting is not a bug in the loss function. It is a fundamental property of proxy optimization. You cannot eliminate it. You can only detect and penalize it.
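As a sketch of the proxy side (term names mirror the formula above; the weights are arbitrary tuning knobs, not recommendations):

```typescript
// Hypothetical loss terms and weights; w_i are knobs you adjust, not ground truth.
type LossTerms = Record<string, number>;

function computeTotalLoss(terms: LossTerms, weights: LossTerms): number {
  // Sum w_i * L_i over every named term; terms without a weight contribute 0.
  return Object.entries(terms).reduce(
    (total, [name, value]) => total + (weights[name] ?? 0) * value,
    0
  );
}

const terms = { L_spec: 2, L_tests: 5, L_unknown: 10 };
const weights = { L_spec: 1.0, L_tests: 0.5, L_unknown: 0.2 };
computeTotalLoss(terms, weights); // 2 + 2.5 + 2 = 6.5
```

Note that nothing in this computation touches the real objective; every input is itself a measured proxy, which is where the divergence hides.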


The 5 Countermeasures

1. Regression Penalty Must Be Huge

If an agent reintroduces a previously fixed bug, the penalty must exceed the original fix benefit.

penalty(regression) = 3 × benefit(original_fix)

Why 3x? Because a regression:

  • wastes the original fix effort (1x)
  • requires re-investigation (1x)
  • damages trust in the convergence signal (1x)

Implementation:

interface Issue {
  class: string;        // identifies the class of bug
  baseSeverity: number; // penalty for a first occurrence
}

interface IssueHistory {
  issueClass: string;
  resolution: 'fixed' | 'wont-fix' | 'duplicate';
}

function computeRegressionPenalty(
  issue: Issue,
  history: IssueHistory[]
): number {
  const previousFixes = history.filter(
    h => h.issueClass === issue.class && h.resolution === 'fixed'
  );

  if (previousFixes.length === 0) return issue.baseSeverity;

  // Each recurrence multiplies the penalty by 3
  return issue.baseSeverity * Math.pow(3, previousFixes.length);
}

This makes regressions exponentially expensive. After the second recurrence, the penalty dominates L_total and forces the swarm to prioritize permanent fixes.

2. Complexity Term Must Penalize Large Rewrites

If agents learn that deleting code reduces L_arch, they will delete aggressively. Counter this with a rewrite penalty.

L_rewrite = alpha * (lines_deleted + lines_added) / total_codebase_lines

This penalizes wholesale rewrites proportionally. Small, targeted fixes have low rewrite cost. Deleting entire modules to “simplify” has high rewrite cost.

When to apply: Only during polish phases. During growth phases, large changes are expected and should not be penalized.
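A minimal sketch of the rewrite penalty with the phase gate applied (`alpha` and the phase names are assumptions to tune per swarm):

```typescript
type Phase = 'growth' | 'polish';

function rewritePenalty(
  linesDeleted: number,
  linesAdded: number,
  totalCodebaseLines: number,
  phase: Phase,
  alpha = 10 // scaling knob; an assumption, not a recommended value
): number {
  // During growth, large diffs are expected and carry no penalty.
  if (phase === 'growth') return 0;
  return (alpha * (linesDeleted + linesAdded)) / totalCodebaseLines;
}

rewritePenalty(500, 100, 10_000, 'polish'); // 10 * 600 / 10000 = 0.6
```

With these numbers, rewriting 6% of the codebase during polish costs 0.6 loss units, while a 10-line targeted fix costs 0.01.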

3. Evidence-Based Closure

Every issue closure must include evidence:

| Evidence Type | What It Proves |
| --- | --- |
| Test added/updated | The fix is verified mechanically |
| Trace/log showing fix | Runtime behavior changed |
| Spec citation | The change aligns with requirements |
| Before/after diff | The specific code change is visible |

An agent cannot close an issue by simply asserting “fixed.” It must provide at least one form of evidence.

Implementation:

interface IssueClosure {
  issueId: string;
  resolution: 'fixed' | 'wont-fix' | 'duplicate';
  evidence: {
    testAdded?: string;      // path to new/updated test
    traceLink?: string;      // link to runtime trace
    specReference?: string;  // section of spec satisfied
    diffSummary: string;     // what changed and why
  };
}

function validateClosure(closure: IssueClosure): boolean {
  // Coerce to boolean: the || chain alone yields string | undefined.
  const hasEvidence = Boolean(
    closure.evidence.testAdded ||
    closure.evidence.traceLink ||
    closure.evidence.specReference
  );

  return hasEvidence && closure.evidence.diffSummary.length > 0;
}

4. Random Audit (Spot Check)

Each cycle, randomly select 3 closed issues and re-verify them.

1. Pick 3 issues closed in this cycle
2. Re-run the verification for each
3. If any fail: flag as Goodharting signal
4. If repeated: increase audit sample size

Random audits break the feedback loop where agents learn exactly which checks will be run. If agents do not know which closures will be audited, they cannot optimize for the audit.

Implementation:

# Post-cycle audit script
CLOSED_ISSUES=$(get-closed-issues --cycle current --format json)
# Sample at random: jq's .[0:3] would always pick the same issues, which agents could learn
SAMPLE=$(echo "$CLOSED_ISSUES" | jq -c '.[]' | shuf -n 3)

echo "$SAMPLE" | while read -r issue; do
  ISSUE_ID=$(echo "$issue" | jq -r '.id')
  RESULT=$(re-verify-issue "$ISSUE_ID")

  if [ "$RESULT" != "verified" ]; then
    echo "AUDIT FAILURE: $ISSUE_ID - possible Goodharting"
    reopen-issue "$ISSUE_ID" --reason "Failed random audit"
  fi
done

5. External Oracle

The loss function is self-referential: agents generate code, gates check code, loss is computed from gate results. If agents learn to satisfy the gates without satisfying reality, nothing inside that loop can detect it.

Break self-reference with an external oracle:

| Oracle Type | What It Checks |
| --- | --- |
| Golden test suite | Hand-written tests that agents cannot modify |
| Runtime trace analysis | Real production or staging behavior |
| Human spot check | Developer reviews N random changes per cycle |
| Shadow deployment | Run new code against real traffic, compare results |

The oracle must be:

  • Outside the agent’s control (agents cannot modify golden tests)
  • Connected to reality (runtime traces, real traffic, human judgment)
  • Sampled, not exhaustive (just enough to detect drift)

Observable Signals of Goodharting

Watch for these patterns:

| Signal | What It Means |
| --- | --- |
| Loss decreases but bugs increase in staging/prod | Proxy diverged from reality |
| Test count increases but coverage of critical paths drops | Trivial tests inflating numbers |
| L_arch drops but code becomes harder to understand | Simplification metrics gamed |
| Issue velocity is high but same areas keep needing fixes | Fixes are superficial |
| Diff sizes increase while issue count drops | Large rewrites masking issues |
| Agent output becomes more formulaic | Agent learned the pattern to satisfy gates |

The most dangerous signal: everything looks green, but you feel uneasy. Trust that instinct. Run an audit.
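Some of these signals can be checked mechanically between cycles. A sketch for the test-inflation signal (the field names and pairwise comparison are illustrative):

```typescript
interface MetricsSnapshot {
  testCount: number;
  criticalPathCoverage: number; // fraction 0..1 of critical paths exercised
}

// More tests but worse critical-path coverage suggests trivial tests
// are inflating the numbers.
function suspiciousTestInflation(
  prev: MetricsSnapshot,
  curr: MetricsSnapshot
): boolean {
  return (
    curr.testCount > prev.testCount &&
    curr.criticalPathCoverage < prev.criticalPathCoverage
  );
}
```

The same shape works for the other rows in the table: each pairs a proxy that moved in the "good" direction with an independent measurement that moved in the bad one.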


When Goodharting Is Acceptable

Not all proxy optimization is harmful. Some is benign or even useful:

| Scenario | Verdict |
| --- | --- |
| Agent adds tests to reduce L_unknown, tests are meaningful | Good. Proxy aligned with reality. |
| Agent reformats code to satisfy lint, code is clearer | Good. Proxy served its purpose. |
| Agent splits large function to reduce complexity score | Neutral. Check if the split makes sense. |
| Agent adds `as any` to reduce type errors | Bad. Proxy improved, reality worsened. |
| Agent deletes module to reduce arch violations | Bad unless module was genuinely dead code. |

The distinction: did the proxy-satisfying action also improve the system? If yes, the proxy is working. If no, it is Goodharted.


Integration with Loss Function Design

The best defense against Goodharting is a well-designed loss function:

  1. Multiple independent terms reduce single-metric gaming
  2. Cross-validation between terms catches inconsistencies (L_tests drops but L_unknown rises = suspicious)
  3. Regression penalty prevents gaming via destruction
  4. L_unknown penalizes blind spots where Goodharting hides
  5. Weight adjustment lets you increase weight on gamed metrics

If you detect Goodharting on a specific term, temporarily increase its weight and tighten its measurement. The swarm will adjust.
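The weight-adjustment response can be as simple as the following sketch (the boost factor is an assumption; decay it back toward the baseline once the term stabilizes):

```typescript
type Weights = Record<string, number>;

// When a term is flagged as gamed, temporarily boost its weight so the
// swarm can no longer profit from shortcuts on that metric.
function tightenGamedTerm(
  weights: Weights,
  gamedTerm: string,
  boost = 2.0 // illustrative multiplier, not a recommendation
): Weights {
  return { ...weights, [gamedTerm]: (weights[gamedTerm] ?? 1) * boost };
}

tightenGamedTerm({ L_tests: 0.5, L_arch: 0.3 }, 'L_tests'); // L_tests -> 1.0
```

Pair the weight increase with a tightened measurement (e.g. counting only non-trivial tests toward L_tests), or the swarm will simply game the same term harder.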


Key Insight

Goodharting is not a failure of agents. It is a fundamental property of proxy optimization. You cannot prevent it. You can only detect it early and make it expensive. The countermeasures are: punish regressions heavily, penalize large rewrites, require evidence, audit randomly, and maintain an external oracle that agents cannot game.

