How to Protect Your Coding Harness from Longer Integration Test Runs

James Phoenix

Summary

Integration test suites get slower under LLM-assisted development. Not because the tests themselves are fundamentally slow, but because agents have every incentive to bump the timeout when they hit it and zero incentive to refactor the slow test underneath. Each session adds a little rope. The baseline only moves in one direction. The fix is a hook that blocks the escape hatch for agents while leaving it open for humans, and a redirect message that points the agent at the real fix instead of just rejecting the work.

The moment it almost happened to me

I was deep in a billing migration session. Eighteen integration tests failing, suite budget at 120 seconds, the runner printing “slow tests detected” on timeout. My agent was one keystroke away from the obvious move:

INTEGRATION_TIMEOUT_SECONDS=180 pnpm test:integration api

One flag. One character of effort. The suite would pass, the migration would land, and the slow test would keep being slow. Nobody would notice because nobody would look.

If I let the agent take that shortcut once, it took it every session forever. That is the entire argument of this article.

The ratchet

The baseline of an integration suite is a one-way valve. Every agent session adds tests. Every new test inherits whichever budget the suite currently runs under. Nobody refactors the slow ones because the suite is “passing.” The only force pushing the baseline back down is deliberate human attention, and that force is weak and episodic.

The integration test ratchet

Five sessions in, the suite is four to five times slower than when it started and the budget has been bumped three times. Each bump felt locally justified. Each bump was reversible in theory. In practice, nobody walks it back.

The ratchet is not a test-hygiene failure. It is the default outcome of a system where adding rope is cheap and tightening rope is expensive.

Why agents make this worse than humans do

Humans feel the pain of a slow suite. A 200-second feedback loop on a refactor is awful enough that the person doing the refactor will eventually profile the slow test and fix it. The negative feedback loop exists. It runs slowly but it runs.

Agents have no such loop. Investigating a slow test costs tokens, context window, and a detour from the task they were sent to do. Bumping the timeout costs nothing. Gradient descent through the harness picks the zero-cost path every time.

This is not a model alignment problem. It is a harness design problem. Whichever knob you leave open is the one the agent pulls. The escape hatch was not placed there for the agent, but the agent does not know that, and the model has no way to weigh the long-term cost of baseline drift against the immediate cost of spending another thirty thousand tokens on profiling a Vitest file.

The protection pattern

The fix is not removing the escape hatch. Humans legitimately need it: diagnosing a real hang in a new test, debugging a CI flake, running a long soak test on a migration. In all of these cases a thoughtful human should be able to extend the budget for one invocation.

The fix is placing a hook in front of the escape hatch that behaves differently for agents than for humans. A PreToolUse hook pattern-matches on INTEGRATION_TIMEOUT_SECONDS= in any Bash command. If the agent tries the shortcut, the hook rejects the call with a redirect message, not a veto. The message does not say “no.” It says “here is what to do instead.”

Hooks as trajectory corrections

The left side of the diagram is the default ratchet: agent runs the suite, suite times out, agent bumps the timeout, baseline climbs. The right side is the same flow with the hook in place: the agent tries to bump the timeout, the hook intercepts, the redirect message tells the agent that the runner already printed the slow tests and the fix is to refactor the one over five seconds. The agent reads the message and does the actual work.


The key insight: the hook is not a locked gate. It is a trajectory correction. The agent is not told to stop, it is told where to go.

Concrete implementation

Claude Code hooks live in .claude/settings.json under the hooks.PreToolUse array. A shell script reads the tool call off stdin as JSON, pattern-matches on the command, and returns a permission decision as JSON on stdout. The redirect lives in the permissionDecisionReason field. For my billing project the essential piece is:
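The registration itself is short. A sketch of the `.claude/settings.json` entry as the schema stands at the time of writing; the matcher string targets Bash tool calls, and the script path points at the hook file named in the comment below:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/block-integration-timeout-bump.sh"
          }
        ]
      }
    ]
  }
}
```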

#!/usr/bin/env bash
# .claude/hooks/block-integration-timeout-bump.sh
# Claude Code sends the tool call as JSON on stdin.
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command // empty')

if echo "$cmd" | grep -qE 'INTEGRATION_TIMEOUT_SECONDS='; then
  cat <<'EOF'
{
  "hookSpecificOutput": {
    "hookEventName": "PreToolUse",
    "permissionDecision": "deny",
    "permissionDecisionReason": "Integration suite must stay under budget. The runner already printed the slow tests on timeout. Refactor the one over 5s: shared fixtures, event-driven waits, drop wall-clock sleeps. Target is <90s full suite."
  }
}
EOF
fi
exit 0  # anything else passes through untouched

Codex has the same primitive under a different name, configured through its hooks file. The shape is identical: intercept the command, return a redirect, trust the model to read the message and change course.

The one piece that makes the redirect actually useful is the runner itself printing which test was slow when the suite times out. Without that, “refactor the slow test” is hand-waving. With it, the agent has a filename and a line number and can immediately start the right work.
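One cheap way to get that output, assuming your runner can emit a Jest-style JSON report (Vitest's `--reporter=json --outputFile=...` produces one at the time of writing): filter the report for tests over budget. The report below is a hand-made sample in that shape, and the 5s threshold is this project's budget, not a universal one.

```shell
# Hand-made sample in the Jest-style shape a JSON reporter emits (durations in ms).
cat > /tmp/report.json <<'EOF'
{"testResults":[{"name":"tests/billing.spec.ts","assertionResults":[
  {"title":"reconciles invoices","duration":7200,"status":"passed"},
  {"title":"fast path","duration":120,"status":"passed"}]}]}
EOF

# Print every test over the 5s budget: file, title, seconds.
jq -r '.testResults[]
       | .name as $file
       | .assertionResults[]
       | select(.duration != null and .duration > 5000)
       | "\($file): \(.title) (\(.duration / 1000)s)"' /tmp/report.json
# -> tests/billing.spec.ts: reconciles invoices (7.2s)
```

Wire that into the timeout path of your runner and the redirect message stops being advice and starts being a work order.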

The pattern generalizes

This is not really about integration test budgets. The same shape applies to every knob in your harness that lets an agent trade correctness for progress:

  • git commit --no-verify skips pre-commit hooks
  • as any in TypeScript silences the type checker
  • .skip() and .only() on tests mute the suite without fixing it
  • --force on anything git-adjacent overwrites state instead of reconciling it
  • eslint-disable-next-line locally pretends a rule does not apply

Each of these is a knob humans sometimes need. Each is a shortcut agents will take unless you put friction on the path that leads there. The friction does not need to be a lock. It just needs to be a message that redirects the agent to the thing you actually wanted.
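One hook can cover several of these knobs at once. A sketch, with the pattern list and redirect messages as illustrative placeholders; a real PreToolUse hook would read the tool call as JSON from stdin, which is factored out here so the matcher is easy to test and extend:

```shell
# Emit a PreToolUse deny decision carrying a redirect message.
deny() {
  printf '{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"deny","permissionDecisionReason":"%s"}}\n' "$1"
}

# Print a deny decision for known escape hatches; print nothing otherwise.
# In the real hook, $1 would come from `jq -r .tool_input.command` on stdin.
check_command() {
  case "$1" in
    *--no-verify*)                  deny "Fix what the pre-commit hooks flag, then commit." ;;
    *INTEGRATION_TIMEOUT_SECONDS=*) deny "Refactor the slow test instead of raising the budget." ;;
    *eslint-disable*)               deny "Fix the lint error rather than silencing the rule." ;;
  esac
}

check_command "git commit --no-verify -m wip"
```

Each new escape hatch you notice becomes one more case branch, not a new hook.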

Name the pattern for yourself so you can apply it to the next knob you notice: leave the hatch, trap the path, redirect on rejection.

The honest tradeoff

Sometimes a test legitimately needs more budget. A genuine external API round-trip, irreducible fixture setup, a one-off investigation into a flaky test that is not yet reproducible under normal conditions. The hook blocks all of these too.

I handle the real cases with a per-test exemption that I grant deliberately, not with a blanket env var. The default is hard, the exception is auditable, and every exception becomes a small todo to either refactor the test or remove the exemption. Friction is the feature, not a bug. The point of the hook is to make sure the ratchet only turns when a human deliberately turns it.
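A minimal sketch of what "auditable" can mean in practice; the file path, format, and helper name are all invented for illustration:

```shell
# Committed exemption list: one test file per line, plus a reason a reviewer can audit.
# (A temp file keeps the sketch self-contained; in the repo this would be a
# committed file such as a hypothetical .claude/integration-timeout-exemptions.txt.)
EXEMPTIONS=$(mktemp)
cat > "$EXEMPTIONS" <<'EOF'
tests/stripe-webhook.spec.ts   real external round-trip, revisit next quarter
EOF

# Succeeds only when the named test file has a deliberate exemption.
is_exempt() {
  grep -q "^$1[[:space:]]" "$EXEMPTIONS"
}

is_exempt "tests/stripe-webhook.spec.ts" && echo "exempt"
is_exempt "tests/billing.spec.ts" || echo "not exempt: refactor it"
```

Because the list lives in version control, every exemption shows up in review, and deleting a line is the visible act of turning the ratchet back.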
