Mock the LLM, Keep the Tools Real

James Phoenix

Agent systems have exactly one non-deterministic component: the model’s choice of tool call. Stub that. Let everything else run.


The Setup

I hand-rolled a state machine for a slideshow pipeline. Idea approval, render proposal approval, render approval, publish. Rejections route back to the right earlier stage. Human in the loop at every boundary. The whole thing is about three hundred lines of reducer plus an effects interpreter.
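The shape of that machine can be sketched as a discriminated union plus a reducer. The stage and event names below are illustrative, not the real pipeline's, and the real reducer carries far more state than a bare stage string:

```typescript
// Hypothetical stage and event names; the real machine has more states.
type Stage =
  | "IDEA_REVIEW"
  | "RENDER_PROPOSAL_REVIEW"
  | "RENDER_REVIEW"
  | "PUBLISHED";

type ReviewEvent =
  | { type: "APPROVE" }
  | { type: "REJECT"; reason: string };

// Approvals advance; rejections route back to the stage that owns the fix.
function reduce(stage: Stage, event: ReviewEvent): Stage {
  if (event.type === "APPROVE") {
    switch (stage) {
      case "IDEA_REVIEW": return "RENDER_PROPOSAL_REVIEW";
      case "RENDER_PROPOSAL_REVIEW": return "RENDER_REVIEW";
      case "RENDER_REVIEW": return "PUBLISHED";
      default: return stage;
    }
  }
  // A rejected render re-opens the proposal; a rejected proposal re-opens the idea.
  switch (stage) {
    case "RENDER_REVIEW": return "RENDER_PROPOSAL_REVIEW";
    case "RENDER_PROPOSAL_REVIEW": return "IDEA_REVIEW";
    default: return stage;
  }
}
```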

Once the machine existed, the next problem was testing it. Every transition fires an effect that calls a large model. A run to PUBLISHED involves four or five generations, three of which are agent loops with tool registries attached. Running the real pipeline against a provider for each test takes ninety seconds and costs real money. I wanted a test that exercised the full flow, boots to boots, and cost nothing.

The Wrong Fix

The reflex move is to mock everything. Mock the LLM, mock the tools, stub the renderer, fake the filesystem writes. The test becomes fast and green.

The problem is that passing tells me very little. I’ve covered the state transitions but none of the wiring that matters. If the render tool’s Satori layout changes, if the theme resolver breaks, if the deck mutation tools forget to update slide ordering, if the schema validation regresses, the fully mocked test sails through.

At that point the test is a ritual, not a check. It runs in CI, it turns green, and it protects nothing that’s actually load-bearing.

The Right Split

The split I landed on is narrower and more useful. Mock the LLM’s decisions. Keep every tool real.

The LLM’s job inside an agent step is to choose tool calls. That’s all it is. Given a prompt and a tool registry, the model emits a sequence of {tool, arguments} tuples, interleaved with tool outputs, until a stop condition. That sequence is the trajectory.

A trajectory is data. Once I write it down, I don’t need a model to produce it.

type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };
type Trajectory = ToolCallEvent[];

My test harness installs a fake OpenRouter client. When the state machine calls getClient().callModel({...}) with tools attached, the fake pops the next trajectory off a queue and streams it back as if the LLM had just produced it. Each step is run through the real tool registry. The returned stream has the exact function_call and function_call_output items the real client would have emitted, so the state machine’s harvest code sees them and doesn’t know the difference.

async function* replayTrajectory(trajectory: Trajectory, registry: ToolRegistry, ctx: Ctx) {
  for (const step of trajectory) {
    const tool = registry[step.tool];
    // Validate against the real input schema, then execute the real tool.
    const parsedInput = tool.function.inputSchema.parse(step.arguments);
    const output = await tool.function.execute(parsedInput, ctx);
    // Emit the same stream items a live client would.
    yield { type: "function_call", name: step.tool, arguments: JSON.stringify(parsedInput) };
    yield { type: "function_call_output", name: step.tool, output: JSON.stringify(output) };
  }
}

The tool executes for real. Satori runs. JPGs land on disk. The input schema validates. The reducer harvests the output the same way it would with a live model. The only thing I substituted is the decision of which tool to call.
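The queue mechanic itself is simple. A minimal sketch, with `FakeClient` and its method names as stand-ins for whatever the real client interface looks like:

```typescript
type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };
type Trajectory = ToolCallEvent[];

// Each agent-loop generation in a test pops exactly one scripted trajectory.
class FakeClient {
  private queue: Trajectory[] = [];

  enqueue(trajectory: Trajectory): void {
    this.queue.push(trajectory);
  }

  // Stands in for the real callModel: replays the next scripted trajectory
  // instead of calling a provider. Tool execution (elsewhere) stays real.
  nextTrajectory(): Trajectory {
    const trajectory = this.queue.shift();
    if (!trajectory) {
      throw new Error("Test asked for more generations than were scripted");
    }
    return trajectory;
  }
}
```

Failing loudly when the queue runs dry matters: a test that silently returns an empty trajectory would pass for the wrong reason.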

Shared Context Keeps Fixtures Lean

A trajectory with one tool call is easy. A revision trajectory with three chained mutations like remove_slide, then move_slide, then set_slide_text is where naive fixture threading falls apart, because the second tool needs the deck as it looked after the first, and the third needs it as it looked after the second.

I avoided threading this by hand. The real tool implementations already read and write a shared state object passed through ctx.shared. The harness reuses the same object reference across every step of a trajectory:

const ctx = { shared: (req.context as any)?.shared ?? {} };
for (const step of trajectory) {
  const tool = registry[step.tool];
  const parsedInput = tool.function.inputSchema.parse(step.arguments);
  await tool.function.execute(parsedInput, ctx); // mutates ctx.shared.state.deck
}

Because ctx.shared.state.deck is the same object across every step, a mutation from step one is visible to step two without any glue. The trajectory stays lean. The arguments list the intent of each step and nothing about the current deck.

[
  { tool: "remove_slide",   arguments: { slideIndex: 1 } },
  { tool: "move_slide",     arguments: { slideIndex: 0, position: "end" } },
  { tool: "set_slide_text", arguments: { slideIndex: 1, text: "Mutated primary text." } },
]

A fixture trajectory reads like a storyboard. I can eyeball what the agent is supposed to do on this turn, and the test asserts that the state machine drove the right tools with the right shape of inputs at the right time, against the right live deck.
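The shared-reference mechanic is easy to verify in isolation. A self-contained sketch with two toy tools (names and shapes are illustrative, much simpler than the real registry) mutating the same deck object:

```typescript
type Deck = { slides: string[] };
type Ctx = { shared: { state: { deck: Deck } } };

// Toy tools that mutate the shared deck in place, like the real mutation tools.
const registry: Record<string, (args: any, ctx: Ctx) => void> = {
  remove_slide: ({ slideIndex }, ctx) => {
    ctx.shared.state.deck.slides.splice(slideIndex, 1);
  },
  set_slide_text: ({ slideIndex, text }, ctx) => {
    ctx.shared.state.deck.slides[slideIndex] = text;
  },
};

const ctx: Ctx = {
  shared: { state: { deck: { slides: ["intro", "filler", "outro"] } } },
};

// Step two sees the deck as step one left it: after remove_slide(1),
// index 1 now points at what used to be the third slide.
for (const step of [
  { tool: "remove_slide", arguments: { slideIndex: 1 } },
  { tool: "set_slide_text", arguments: { slideIndex: 1, text: "Mutated primary text." } },
]) {
  registry[step.tool](step.arguments, ctx);
}
```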

What the Test Actually Proves

When a test built this way passes, I know:

  • The reducer routed the event to the right stage.
  • The effects interpreter fired the right generation effect.
  • The real tools ran, validated their inputs, and produced real outputs.
  • The render actually landed bytes on disk, above a minimum size threshold.
  • The state machine harvested the tool outputs back into typed state.

That is integration coverage all the way down, for the cost of some JSON fixtures and a few tens of milliseconds per test.

What the test does not prove is whether a real model, given my prompt and tool registry, would actually emit that trajectory. Trajectory adherence is its own problem. I evaluate it separately, against a fixed benchmark, with live calls budgeted explicitly. Conflating the two was the mistake I kept making early on. A passing fixture test does not say the agent behaves. It says the scaffolding behaves when the agent behaves.
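When I do spend live budget, the coarse adherence gate is just a comparison of the tool sequence the live model emitted against the fixture. A minimal version (argument-level scoring is a separate, fuzzier problem):

```typescript
type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };

// True when the live run called the same tools, in the same order, as the fixture.
function toolSequenceMatches(expected: ToolCallEvent[], actual: ToolCallEvent[]): boolean {
  return (
    expected.length === actual.length &&
    expected.every((step, i) => step.tool === actual[i].tool)
  );
}
```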

Hill-Climbable Tests Buy Refactor Confidence

Here is the payoff I did not expect. Once a suite of trajectory tests exists around a hand-rolled core, the core becomes safe to hand to an agent for refactoring.

Before, I would not let Claude or Codex near the reducer. Any rewrite that compiled might ship a subtle state leak I’d only find in production under human feedback. With trajectory tests in place, the agent has a concrete surface to climb against. Run the suite, see which transitions fail, fix, rerun. The tests encode the behaviour I care about. The agent does not need to guess my intent from comments. The intent is checked.

This is evaluation-driven development pitched at the state-machine layer instead of the prompt layer. I hand-roll the shape. I fixture the trajectories. The tests become the spec. The agent can rewrite the interior as long as the spec holds.

It also flips the economics of maintenance. Before, every change to the state machine risked a live-budget run to build confidence. Now I iterate entirely offline and spend the live budget on the one thing it’s actually for, which is measuring whether the real model follows the real trajectory.

The Shape of the Technique

The rule of thumb I keep coming back to is this. Pick the smallest slice of a system to stub, not the biggest. If I only stub the decision, every execution path stays live. If I stub the execution too, I lose the thing most likely to break silently.

For agent systems the smallest slice is almost always the LLM’s choice of tool call. Everything downstream is code I wrote, which means it is code I can run in a test. Everything upstream is code I also wrote, which means it is code I want the test to cover.


Mock the decision. Keep the tools. Test the state machine.

