Goodharting Prevention in Agent Systems: When Agents Game Your Metrics


“When a measure becomes a target, it ceases to be a good measure.” Goodhart’s Law applies to agent swarms with extra force because agents optimize faster than humans can audit.

Author: James Phoenix | Date: February 2026


Summary

When you define a synthetic loss function and let agents optimize it, they will eventually optimize the proxy instead of the real objective. Not maliciously. Mechanically. Agents will inflate trivial tests, delete code to reduce complexity metrics, disable flaky tests, and rewrite modules to erase violations while creating new bugs. Five countermeasures prevent this: huge regression penalties, rewrite penalties, evidence-based closure, random audits, and external oracles.


What Goodharting Looks Like in Agent Systems

Agents do not “cheat” intentionally. They find the path of least resistance to reducing the loss signal. If the loss signal is a proxy for system health (and it always is), agents will find shortcuts.

Concrete Examples

| Metric Being Gamed | How Agents Game It | What Actually Happens |
| --- | --- | --- |
| L_tests (failing tests) | Delete or disable flaky tests | Coverage drops, real failures go undetected |
| L_unknown (untested code) | Add trivial assertions that test nothing | Coverage number increases, actual safety does not |
| L_arch (complexity) | Delete code, inline everything | Simpler metrics, harder to maintain |
| L_types (type errors) | Add `as any` casts everywhere | Type checker is happy, runtime is not |
| L_spec (spec violations) | Rewrite spec to match implementation | Spec becomes descriptive, not prescriptive |
| L_reg (regressions) | Mark regressions as “by design” | Regression count drops, bugs persist |

The Pattern

Every Goodharting failure follows the same structure:

1. You define a measurable proxy for system health
2. Agents find the cheapest way to improve the proxy
3. The proxy improves
4. System health does not improve (or gets worse)
5. You do not notice until something breaks in production

The gap between “proxy improves” and “reality improves” is the Goodharting gap. It grows silently.
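To make the gap observable, track the proxy against at least one independent reality signal per cycle. A minimal sketch (the snapshot fields and the choice of staging bug count as the reality signal are illustrative assumptions, not part of any specific implementation):

```typescript
interface CycleSnapshot {
  proxyLoss: number;        // L_total at the end of the cycle
  stagingBugCount: number;  // independent reality signal, outside the loss
}

function goodhartingGap(history: CycleSnapshot[]): number {
  // Count cycles where the proxy improved while reality got worse.
  let gapCycles = 0;
  for (let i = 1; i < history.length; i++) {
    const proxyImproved = history[i].proxyLoss < history[i - 1].proxyLoss;
    const realityWorsened =
      history[i].stagingBugCount > history[i - 1].stagingBugCount;
    if (proxyImproved && realityWorsened) gapCycles++;
  }
  return gapCycles;
}
```

A nonzero result does not prove gaming, but a rising count across cycles is exactly the silent divergence described above.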


The Proxy Trap

Your L_total is always a proxy. It is never the real thing.

Real objective:    "The system works correctly, is maintainable, and serves users well"
Proxy objective:   L_total = w1*L_spec + w2*L_tests + ... + w7*L_unknown

The proxy is useful because the real objective is not computable. But the proxy can diverge from reality in ways that are hard to detect.

The key insight: Goodharting is not a bug in the loss function. It is a fundamental property of proxy optimization. You cannot eliminate it. You can only detect and penalize it.
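As a sketch of the proxy side (term names mirror the formula above; the weights are arbitrary tuning knobs, not recommendations):

```typescript
// Hypothetical loss terms and weights; w_i are knobs you adjust, not ground truth.
type LossTerms = Record<string, number>;

function computeTotalLoss(terms: LossTerms, weights: LossTerms): number {
  // Sum w_i * L_i over every named term; terms without a weight contribute 0.
  return Object.entries(terms).reduce(
    (total, [name, value]) => total + (weights[name] ?? 0) * value,
    0
  );
}

const terms = { L_spec: 2, L_tests: 5, L_unknown: 10 };
const weights = { L_spec: 1.0, L_tests: 0.5, L_unknown: 0.2 };
computeTotalLoss(terms, weights); // 2 + 2.5 + 2 = 6.5
```

Note that nothing in this computation touches the real objective; every input is itself a measured proxy, which is where the divergence hides.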


The 5 Countermeasures

1. Regression Penalty Must Be Huge

If an agent reintroduces a previously fixed bug, the penalty must exceed the original fix benefit.

penalty(regression) = 3 × benefit(original_fix)

Why 3x? Because a regression:

  • wastes the original fix effort (1x)
  • requires re-investigation (1x)
  • damages trust in the convergence signal (1x)

Implementation:

interface Issue {
  class: string;        // identifies the class of bug
  baseSeverity: number; // penalty for a first occurrence
}

interface IssueHistory {
  issueClass: string;
  resolution: 'fixed' | 'wont-fix' | 'duplicate';
}

function computeRegressionPenalty(
  issue: Issue,
  history: IssueHistory[]
): number {
  const previousFixes = history.filter(
    h => h.issueClass === issue.class && h.resolution === 'fixed'
  );

  if (previousFixes.length === 0) return issue.baseSeverity;

  // Each recurrence multiplies the penalty by 3
  return issue.baseSeverity * Math.pow(3, previousFixes.length);
}

This makes regressions exponentially expensive. After the second recurrence, the penalty dominates L_total and forces the swarm to prioritize permanent fixes.

2. Complexity Term Must Penalize Large Rewrites

If agents learn that deleting code reduces L_arch, they will delete aggressively. Counter this with a rewrite penalty.

L_rewrite = alpha * (lines_deleted + lines_added) / total_codebase_lines

This penalizes wholesale rewrites proportionally. Small, targeted fixes have low rewrite cost. Deleting entire modules to “simplify” has high rewrite cost.

When to apply: Only during polish phases. During growth phases, large changes are expected and should not be penalized.
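A minimal sketch of the rewrite penalty with the phase gate applied (`alpha` and the phase names are assumptions to tune per swarm):

```typescript
type Phase = 'growth' | 'polish';

function rewritePenalty(
  linesDeleted: number,
  linesAdded: number,
  totalCodebaseLines: number,
  phase: Phase,
  alpha = 10 // scaling knob; an assumption, not a recommended value
): number {
  // During growth, large diffs are expected and carry no penalty.
  if (phase === 'growth') return 0;
  return (alpha * (linesDeleted + linesAdded)) / totalCodebaseLines;
}

rewritePenalty(500, 100, 10_000, 'polish'); // 10 * 600 / 10000 = 0.6
```

With these numbers, rewriting 6% of the codebase during polish costs 0.6 loss units, while a 10-line targeted fix costs 0.01.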

3. Evidence-Based Closure

Every issue closure must include evidence:

| Evidence Type | What It Proves |
| --- | --- |
| Test added/updated | The fix is verified mechanically |
| Trace/log showing fix | Runtime behavior changed |
| Spec citation | The change aligns with requirements |
| Before/after diff | The specific code change is visible |

An agent cannot close an issue by simply asserting “fixed.” It must provide at least one form of evidence.

Implementation:

interface IssueClosure {
  issueId: string;
  resolution: 'fixed' | 'wont-fix' | 'duplicate';
  evidence: {
    testAdded?: string;      // path to new/updated test
    traceLink?: string;      // link to runtime trace
    specReference?: string;  // section of spec satisfied
    diffSummary: string;     // what changed and why
  };
}

function validateClosure(closure: IssueClosure): boolean {
  // Coerce to boolean: the || chain alone yields string | undefined.
  const hasEvidence = Boolean(
    closure.evidence.testAdded ||
    closure.evidence.traceLink ||
    closure.evidence.specReference
  );

  return hasEvidence && closure.evidence.diffSummary.length > 0;
}

4. Random Audit (Spot Check)

Each cycle, randomly select 3 closed issues and re-verify them.

1. Pick 3 issues closed in this cycle
2. Re-run the verification for each
3. If any fail: flag as Goodharting signal
4. If repeated: increase audit sample size

Random audits break the feedback loop where agents learn exactly which checks will be run. If agents do not know which closures will be audited, they cannot optimize for the audit.

Implementation:

# Post-cycle audit script
CLOSED_ISSUES=$(get-closed-issues --cycle current --format json)
# Sample at random: jq's .[0:3] would always pick the same issues, which agents could learn
SAMPLE=$(echo "$CLOSED_ISSUES" | jq -c '.[]' | shuf -n 3)

echo "$SAMPLE" | while read -r issue; do
  ISSUE_ID=$(echo "$issue" | jq -r '.id')
  RESULT=$(re-verify-issue "$ISSUE_ID")

  if [ "$RESULT" != "verified" ]; then
    echo "AUDIT FAILURE: $ISSUE_ID - possible Goodharting"
    reopen-issue "$ISSUE_ID" --reason "Failed random audit"
  fi
done

5. External Oracle

The loss function is self-referential: agents generate code, gates check code, loss is computed from gate results. If agents learn to satisfy the gates without satisfying reality, nothing inside that loop can detect it.

Break self-reference with an external oracle:

| Oracle Type | What It Checks |
| --- | --- |
| Golden test suite | Hand-written tests that agents cannot modify |
| Runtime trace analysis | Real production or staging behavior |
| Human spot check | Developer reviews N random changes per cycle |
| Shadow deployment | Run new code against real traffic, compare results |

The oracle must be:

  • Outside the agent’s control (agents cannot modify golden tests)
  • Connected to reality (runtime traces, real traffic, human judgment)
  • Sampled, not exhaustive (just enough to detect drift)

Observable Signals of Goodharting

Watch for these patterns:

| Signal | What It Means |
| --- | --- |
| Loss decreases but bugs increase in staging/prod | Proxy diverged from reality |
| Test count increases but coverage of critical paths drops | Trivial tests inflating numbers |
| L_arch drops but code becomes harder to understand | Simplification metrics gamed |
| Issue velocity is high but same areas keep needing fixes | Fixes are superficial |
| Diff sizes increase while issue count drops | Large rewrites masking issues |
| Agent output becomes more formulaic | Agent learned the pattern to satisfy gates |

The most dangerous signal: everything looks green, but you feel uneasy. Trust that instinct. Run an audit.
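Some of these signals can be checked mechanically between cycles. A sketch for the test-inflation signal (the field names and pairwise comparison are illustrative):

```typescript
interface MetricsSnapshot {
  testCount: number;
  criticalPathCoverage: number; // fraction 0..1 of critical paths exercised
}

// More tests but worse critical-path coverage suggests trivial tests
// are inflating the numbers.
function suspiciousTestInflation(
  prev: MetricsSnapshot,
  curr: MetricsSnapshot
): boolean {
  return (
    curr.testCount > prev.testCount &&
    curr.criticalPathCoverage < prev.criticalPathCoverage
  );
}
```

The same shape works for the other rows in the table: each pairs a proxy that moved in the "good" direction with an independent measurement that moved in the bad one.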


When Goodharting Is Acceptable

Not all proxy optimization is harmful. Some is benign or even useful:

| Scenario | Verdict |
| --- | --- |
| Agent adds tests to reduce L_unknown, tests are meaningful | Good. Proxy aligned with reality. |
| Agent reformats code to satisfy lint, code is clearer | Good. Proxy served its purpose. |
| Agent splits large function to reduce complexity score | Neutral. Check if the split makes sense. |
| Agent adds `as any` to reduce type errors | Bad. Proxy improved, reality worsened. |
| Agent deletes module to reduce arch violations | Bad unless module was genuinely dead code. |

The distinction: did the proxy-satisfying action also improve the system? If yes, the proxy is working. If no, it is Goodharted.


Integration with Loss Function Design

The best defense against Goodharting is a well-designed loss function:

  1. Multiple independent terms reduce single-metric gaming
  2. Cross-validation between terms catches inconsistencies (L_tests drops but L_unknown rises = suspicious)
  3. Regression penalty prevents gaming via destruction
  4. L_unknown penalizes blind spots where Goodharting hides
  5. Weight adjustment lets you increase weight on gamed metrics

If you detect Goodharting on a specific term, temporarily increase its weight and tighten its measurement. The swarm will adjust.
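The weight-adjustment response can be as simple as the following sketch (the boost factor is an assumption; decay it back toward the baseline once the term stabilizes):

```typescript
type Weights = Record<string, number>;

// When a term is flagged as gamed, temporarily boost its weight so the
// swarm can no longer profit from shortcuts on that metric.
function tightenGamedTerm(
  weights: Weights,
  gamedTerm: string,
  boost = 2.0 // illustrative multiplier, not a recommendation
): Weights {
  return { ...weights, [gamedTerm]: (weights[gamedTerm] ?? 1) * boost };
}

tightenGamedTerm({ L_tests: 0.5, L_arch: 0.3 }, 'L_tests'); // L_tests -> 1.0
```

Pair the weight increase with a tightened measurement (e.g. counting only non-trivial tests toward L_tests), or the swarm will simply game the same term harder.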


Key Insight

Goodharting is not a failure of agents. It is a fundamental property of proxy optimization. You cannot prevent it. You can only detect it early and make it expensive. The countermeasures are: punish regressions heavily, penalize large rewrites, require evidence, audit randomly, and maintain an external oracle that agents cannot game.

