Mock the LLM, Keep the Tools Real

James Phoenix

Agent systems have exactly one non-deterministic component: the model’s choice of tool call. Stub that. Let everything else run.


The Setup

I hand-rolled a state machine for a slideshow pipeline. Idea approval, render proposal approval, render approval, publish. Rejections route back to the right earlier stage. Human in the loop at every boundary. The whole thing is about three hundred lines of reducer plus an effects interpreter.
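The shape of that machine can be sketched as a discriminated union plus a reducer. The stage and event names below are illustrative, not the real pipeline's, and the real reducer carries far more state than a bare stage string:

```typescript
// Hypothetical stage and event names; the real machine has more states.
type Stage =
  | "IDEA_REVIEW"
  | "RENDER_PROPOSAL_REVIEW"
  | "RENDER_REVIEW"
  | "PUBLISHED";

type ReviewEvent =
  | { type: "APPROVE" }
  | { type: "REJECT"; reason: string };

// Approvals advance; rejections route back to the stage that owns the fix.
function reduce(stage: Stage, event: ReviewEvent): Stage {
  if (event.type === "APPROVE") {
    switch (stage) {
      case "IDEA_REVIEW": return "RENDER_PROPOSAL_REVIEW";
      case "RENDER_PROPOSAL_REVIEW": return "RENDER_REVIEW";
      case "RENDER_REVIEW": return "PUBLISHED";
      default: return stage;
    }
  }
  // A rejected render re-opens the proposal; a rejected proposal re-opens the idea.
  switch (stage) {
    case "RENDER_REVIEW": return "RENDER_PROPOSAL_REVIEW";
    case "RENDER_PROPOSAL_REVIEW": return "IDEA_REVIEW";
    default: return stage;
  }
}
```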

Once the machine existed, the next problem was testing it. Every transition fires an effect that calls a large model. A run to PUBLISHED involves four or five generations, three of which are agent loops with tool registries attached. Running the real pipeline against a provider for each test takes ninety seconds and costs real money. I wanted a test that exercised the full flow, boots to boots, and cost nothing.

The Wrong Fix

The reflex move is to mock everything. Mock the LLM, mock the tools, stub the renderer, fake the filesystem writes. The test becomes fast and green.

The problem is that passing tells me very little. I’ve covered the state transitions but none of the wiring that matters. If the render tool’s Satori layout changes, if the theme resolver breaks, if the deck mutation tools forget to update slide ordering, if the schema validation regresses, the fully mocked test sails through.

At that point the test is a ritual, not a check. It runs in CI, it turns green, and it protects nothing that’s actually load-bearing.

The Right Split

The split I landed on is narrower and more useful. Mock the LLM’s decisions. Keep every tool real.

The LLM’s job inside an agent step is to choose tool calls. That’s all it is. Given a prompt and a tool registry, the model emits a sequence of {tool, arguments} tuples, interleaved with tool outputs, until a stop condition. That sequence is the trajectory.

A trajectory is data. Once I write it down, I don’t need a model to produce it.

type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };
type Trajectory = ToolCallEvent[];

My test harness installs a fake OpenRouter client. When the state machine calls getClient().callModel({...}) with tools attached, the fake pops the next trajectory off a queue and streams it back as if the LLM had just produced it. Each step is run through the real tool registry. The returned stream has the exact function_call and function_call_output items the real client would have emitted, so the state machine’s harvest code sees them and doesn’t know the difference.

async function* replayTrajectory(trajectory: Trajectory, registry: ToolRegistry, ctx: Ctx) {
  for (const step of trajectory) {
    const tool = registry[step.tool];
    // Validate against the real input schema, then execute the real tool.
    const parsedInput = tool.function.inputSchema.parse(step.arguments);
    const output = await tool.function.execute(parsedInput, ctx);
    // Emit the same stream items a live client would.
    yield { type: "function_call", name: step.tool, arguments: JSON.stringify(parsedInput) };
    yield { type: "function_call_output", name: step.tool, output: JSON.stringify(output) };
  }
}

The tool executes for real. Satori runs. JPGs land on disk. The input schema validates. The reducer harvests the output the same way it would with a live model. The only thing I substituted is the decision of which tool to call.
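The queue mechanic itself is simple. A minimal sketch, with `FakeClient` and its method names as stand-ins for whatever the real client interface looks like:

```typescript
type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };
type Trajectory = ToolCallEvent[];

// Each agent-loop generation in a test pops exactly one scripted trajectory.
class FakeClient {
  private queue: Trajectory[] = [];

  enqueue(trajectory: Trajectory): void {
    this.queue.push(trajectory);
  }

  // Stands in for the real callModel: replays the next scripted trajectory
  // instead of calling a provider. Tool execution (elsewhere) stays real.
  nextTrajectory(): Trajectory {
    const trajectory = this.queue.shift();
    if (!trajectory) {
      throw new Error("Test asked for more generations than were scripted");
    }
    return trajectory;
  }
}
```

Failing loudly when the queue runs dry matters: a test that silently returns an empty trajectory would pass for the wrong reason.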

Shared Context Keeps Fixtures Lean

A trajectory with one tool call is easy. A revision trajectory with three chained mutations like remove_slide, then move_slide, then set_slide_text is where naive fixture threading falls apart, because the second tool needs the deck as it looked after the first, and the third needs it as it looked after the second.

I avoided threading this by hand. The real tool implementations already read and write a shared state object passed through ctx.shared. The harness reuses the same object reference across every step of a trajectory:

const ctx = { shared: (req.context as any)?.shared ?? {} };
for (const step of trajectory) {
  const tool = registry[step.tool];
  const parsedInput = tool.function.inputSchema.parse(step.arguments);
  await tool.function.execute(parsedInput, ctx); // mutates ctx.shared.state.deck
}

Because ctx.shared.state.deck is the same object across every step, a mutation from step one is visible to step two without any glue. The trajectory stays lean. The arguments list the intent of each step and nothing about the current deck.

[
  { tool: "remove_slide",   arguments: { slideIndex: 1 } },
  { tool: "move_slide",     arguments: { slideIndex: 0, position: "end" } },
  { tool: "set_slide_text", arguments: { slideIndex: 1, text: "Mutated primary text." } },
]

A fixture trajectory reads like a storyboard. I can eyeball what the agent is supposed to do on this turn, and the test asserts that the state machine drove the right tools with the right shape of inputs at the right time, against the right live deck.
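The shared-reference mechanic is easy to verify in isolation. A self-contained sketch with two toy tools (names and shapes are illustrative, much simpler than the real registry) mutating the same deck object:

```typescript
type Deck = { slides: string[] };
type Ctx = { shared: { state: { deck: Deck } } };

// Toy tools that mutate the shared deck in place, like the real mutation tools.
const registry: Record<string, (args: any, ctx: Ctx) => void> = {
  remove_slide: ({ slideIndex }, ctx) => {
    ctx.shared.state.deck.slides.splice(slideIndex, 1);
  },
  set_slide_text: ({ slideIndex, text }, ctx) => {
    ctx.shared.state.deck.slides[slideIndex] = text;
  },
};

const ctx: Ctx = {
  shared: { state: { deck: { slides: ["intro", "filler", "outro"] } } },
};

// Step two sees the deck as step one left it: after remove_slide(1),
// index 1 now points at what used to be the third slide.
for (const step of [
  { tool: "remove_slide", arguments: { slideIndex: 1 } },
  { tool: "set_slide_text", arguments: { slideIndex: 1, text: "Mutated primary text." } },
]) {
  registry[step.tool](step.arguments, ctx);
}
```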

What the Test Actually Proves

When a test built this way passes, I know:

  • The reducer routed the event to the right stage.
  • The effects interpreter fired the right generation effect.
  • The real tools ran, validated their inputs, and produced real outputs.
  • The render actually landed bytes on disk, above a minimum size threshold.
  • The state machine harvested the tool outputs back into typed state.

That is integration coverage all the way down, for the cost of some JSON fixtures and a few tens of milliseconds per test.

What the test does not prove is whether a real model, given my prompt and tool registry, would actually emit that trajectory. Trajectory adherence is its own problem. I evaluate it separately, against a fixed benchmark, with live calls budgeted explicitly. Conflating the two was the mistake I kept making early on. A passing fixture test does not say the agent behaves. It says the scaffolding behaves when the agent behaves.
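When I do spend live budget, the coarse adherence gate is just a comparison of the tool sequence the live model emitted against the fixture. A minimal version (argument-level scoring is a separate, fuzzier problem):

```typescript
type ToolCallEvent = { tool: string; arguments: Record<string, unknown> };

// True when the live run called the same tools, in the same order, as the fixture.
function toolSequenceMatches(expected: ToolCallEvent[], actual: ToolCallEvent[]): boolean {
  return (
    expected.length === actual.length &&
    expected.every((step, i) => step.tool === actual[i].tool)
  );
}
```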

Hill-Climbable Tests Buy Refactor Confidence

Here is the payoff I did not expect. Once a suite of trajectory tests exists around a hand-rolled core, the core becomes safe to hand to an agent for refactoring.

Before, I would not let Claude or Codex near the reducer. Any rewrite that compiled might ship a subtle state leak I’d only find in production under human feedback. With trajectory tests in place, the agent has a concrete surface to climb against. Run the suite, see which transitions fail, fix, rerun. The tests encode the behaviour I care about. The agent does not need to guess my intent from comments. The intent is checked.

This is evaluation-driven development pitched at the state-machine layer instead of the prompt layer. I hand-roll the shape. I fixture the trajectories. The tests become the spec. The agent can rewrite the interior as long as the spec holds.

It also flips the economics of maintenance. Before, every change to the state machine risked a live-budget run to build confidence. Now I iterate entirely offline and spend the live budget on the one thing it’s actually for, which is measuring whether the real model follows the real trajectory.

The Shape of the Technique

The rule of thumb I keep coming back to is this. Pick the smallest slice of a system to stub, not the biggest. If I only stub the decision, every execution path stays live. If I stub the execution too, I lose the thing most likely to break silently.

For agent systems the smallest slice is almost always the LLM’s choice of tool call. Everything downstream is code I wrote, which means it is code I can run in a test. Everything upstream is code I also wrote, which means it is code I want the test to cover.


Mock the decision. Keep the tools. Test the state machine.

