LLM VCR and Agent Trace Hierarchy: Deterministic Replay for Agent Pipelines

James Phoenix

Summary

Three patterns that turn agent pipelines from opaque prompt chains into debuggable, reproducible engineering systems: (1) an LLM VCR that records and replays model interactions, (2) a Run > Step > Message hierarchy for structured observability, and (3) an LLM Gateway abstraction that switches between live and replay modes.

The Problem

Agent pipelines call LLMs dozens of times per run. Without recording, you face:

  • “Why did this output change?” with no way to answer
  • Expensive integration tests that hit model providers on every run
  • Non-reproducible failures that vanish on re-run
  • Flat trace logs where 50 messages blur together with no semantic structure

Traditional HTTP mocking (Jest mocks, MSW) does not fit because LLM responses are non-deterministic and context-dependent. You need recording at the model interaction level, not the HTTP level.

Pattern 1: LLM VCR

The same idea as Ruby’s VCR gem or Python’s betamax, but for model interactions instead of HTTP responses.

Concept

Live mode:
  test → agent pipeline → model provider → record trace → return response

Replay mode:
  test → agent pipeline → cache lookup → return recorded response

During development or CI, run live once to record. All subsequent runs replay from cache. No model calls, no cost, deterministic output.
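The record-on-miss, replay-on-hit flow can be expressed as one small wrapper. A minimal sketch, assuming an in-memory cassette keyed by request hash (`withVCR` and `Cassette` are illustrative names, not from an existing library):

```typescript
// A cassette maps a deterministic request key to a recorded response.
type Cassette = Map<string, string>

// Replay on cache hit; otherwise go live, record, and return the fresh response.
async function withVCR(
  key: string,
  cassette: Cassette,
  live: () => Promise<string>,
): Promise<string> {
  const hit = cassette.get(key)
  if (hit !== undefined) return hit // replay mode: no model call, no cost
  const fresh = await live()        // live mode: hit the provider once...
  cassette.set(key, fresh)          // ...and record for every later run
  return fresh
}
```

In practice the cassette would be persisted to disk or a database so CI runs can share recordings.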

What gets recorded per interaction

  • model
  • prompt (system + user + tool results)
  • parameters (temperature, seed, max_tokens)
  • response (content + tool calls)
  • token counts
  • latency
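That recording can be captured in a single structure, keyed by a hash of the request so identical requests replay the same entry. A sketch, mirroring the fields above (`RecordedInteraction` and `cacheKey` are illustrative names):

```typescript
import { createHash } from "node:crypto"

// Everything the VCR stores for one model interaction.
type RecordedInteraction = {
  model: string
  prompt: string // system + user + tool results, serialized
  parameters: { temperature?: number; seed?: number; maxTokens?: number }
  response: { content: string; toolCalls?: unknown[] }
  tokens: { input: number; output: number }
  latencyMs: number
}

// Deterministic cache key over everything that affects the response.
// (A real implementation should canonicalize object key order before hashing.)
function cacheKey(req: {
  model: string
  prompt: string
  parameters: RecordedInteraction["parameters"]
}): string {
  const canonical = JSON.stringify([req.model, req.prompt, req.parameters])
  return createHash("sha256").update(canonical).digest("hex")
}
```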

When to re-record

  • Prompt changed
  • Model version changed
  • Temperature or parameters changed
  • You explicitly want fresh data

This is analogous to deleting VCR cassettes when the API contract changes.

Pattern 2: Run > Step > Message Hierarchy

Agent observability needs three levels of abstraction, not one.

The hierarchy

Run
 ├── Step
 │    ├── Message (LLM call)
 │    ├── Message (LLM call)
 │    └── Tool call
 ├── Step
 │    └── Message (LLM call)
 └── Step
      └── Render job

What each level represents

Level    Represents                  Example                            Maps to
────────────────────────────────────────────────────────────────────────────────
Run      Whole pipeline execution    generate_campaign_2026_03_13       Business event
Step     Logical unit of agent work  script_agent, idea_agent           Agent responsibility
Message  Single model interaction    System + user prompt → completion  LLM I/O

Why steps are not messages

Messages are content exchanged with the model. Steps are units of work in the system. A single step (e.g. script_agent) may contain multiple messages:

script_agent step
  message 1 → initial prompt
  message 2 → follow-up refinement
  tool call  → keyword extraction

If you collapse steps and messages, you lose:

  • Semantic meaning: “script_agent failed” vs “message 17 failed”
  • Step-level metrics: latency, cost, success rate per agent
  • Evaluation targets: you evaluate agent outputs (step level), not individual prompts

Mapping to Langfuse

Langfuse       →  Your system
─────────────────────────────
Trace          →  Run
Span           →  Step
Generation     →  Message

Mapping to OpenTelemetry

OTEL           →  Your system
─────────────────────────────
Trace          →  Run
Span           →  Step
Event          →  Message

Storage schema

runs
  id
  workflow
  started_at

steps
  id
  run_id
  agent
  input_snapshot
  output_snapshot
  started_at
  duration_ms
  status

messages
  id
  step_id
  model
  prompt
  response
  tokens
  latency_ms

Input/output snapshots on steps make replay easier. You can re-run a single step without replaying the entire pipeline.
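Single-step replay falls out of those snapshots directly: load the step record, look up its agent, and feed it the recorded input. A sketch, assuming an agent registry keyed by name (`StepRecord`, `Agent`, and `replayStep` are illustrative names):

```typescript
// Mirrors the steps table above: the snapshots make the step self-contained.
type StepRecord = {
  id: string
  runId: string
  agent: string
  inputSnapshot: unknown
  outputSnapshot: unknown
}

type Agent = (input: unknown) => Promise<unknown>

// Re-run one step from its recorded input, without touching the rest of the run.
async function replayStep(
  step: StepRecord,
  agents: Record<string, Agent>,
): Promise<unknown> {
  const agent = agents[step.agent]
  if (!agent) throw new Error(`Unknown agent: ${step.agent}`)
  return agent(step.inputSnapshot)
}
```

Comparing the new output against `outputSnapshot` then gives you a regression check per step.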

Pattern 3: LLM Gateway Abstraction

Instead of calling model providers directly, route all calls through a gateway that handles recording and replay.

The type

type LLMRequest = {
  model: string
  prompt: string
  temperature?: number
  seed?: number
}

type LLMResponse = {
  output: string
  tokens: number
}

type LLMClient = (req: LLMRequest) => Promise<LLMResponse>

Live client (record mode)

const liveLLMClient: LLMClient = async (req) => {
  const res = await callProvider(req)

  await langfuse.record({
    prompt: req.prompt,
    model: req.model,
    response: res.output,
  })

  return res
}

Replay client (test mode)

const replayLLMClient: LLMClient = async (req) => {
  const cached = await vcrLookup(req)
  if (!cached) throw new Error("Missing VCR recording")
  return cached
}

Tests pick the client via environment or config. Same pipeline code, different execution mode.
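Selecting the client at startup might look like the sketch below; the `LLM_MODE` variable name is an assumption, and the types repeat the ones defined earlier so the snippet stands alone:

```typescript
type LLMRequest = { model: string; prompt: string; temperature?: number; seed?: number }
type LLMResponse = { output: string; tokens: number }
type LLMClient = (req: LLMRequest) => Promise<LLMResponse>

// Choose the execution mode once; pipeline code never branches on it.
function selectClient(
  mode: string | undefined,
  live: LLMClient,
  replay: LLMClient,
): LLMClient {
  return mode === "replay" ? replay : live
}

// e.g. const llm = selectClient(process.env.LLM_MODE, liveLLMClient, replayLLMClient)
```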

Reproducibility: Deterministic Run Seeds

For maximum reproducibility, store these per recording:

  • Model identifier + version
  • Temperature
  • Seed (if provider supports it)
  • Prompt version hash

This lets you know exactly when a replay cassette is stale.
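Staleness then becomes a field-by-field comparison between the stored fingerprint and the current configuration. A sketch (`Fingerprint` and its field names are illustrative):

```typescript
// The reproducibility fields stored alongside each recording.
type Fingerprint = {
  model: string      // model identifier + version
  temperature: number
  seed?: number      // only if the provider supports it
  promptHash: string // hash of the prompt template version
}

// A cassette is stale if any reproducibility field has drifted.
function isStale(recorded: Fingerprint, current: Fingerprint): boolean {
  return (
    recorded.model !== current.model ||
    recorded.temperature !== current.temperature ||
    recorded.seed !== current.seed ||
    recorded.promptHash !== current.promptHash
  )
}
```

A stale result can trigger automatic re-recording in CI rather than a silent replay of outdated responses.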

Langfuse as Observability, Not Source of Truth

Langfuse is the debugging UI and trace viewer. Your own database is the ground truth.

Langfuse           →  Your DB
──────────────────────────────────────────
Trace viewer       →  Run/step/message storage
Prompt management  →  Evaluation datasets
Debugging UI       →  Analytics and metrics

This separation matters because evaluation, golden datasets, and long-term analytics should live in infrastructure you control.

Golden Dataset Workflow

Periodically snapshot recorded traces into a golden dataset:

Record traces (live runs)
  → Curate best examples
    → Snapshot as golden dataset
      → Run evals against golden dataset on every change

Evaluators can include: precision, recall, semantic similarity, LLM-as-judge.
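The final stage, running evals against the golden dataset, can be a loop over examples with a pluggable scorer. A minimal sketch with exact match as the evaluator (swap in semantic similarity or LLM-as-judge; `GoldenExample`, `Evaluator`, and `runEvals` are illustrative names):

```typescript
type GoldenExample = { input: string; expected: string }
type Evaluator = (expected: string, actual: string) => number // score in [0, 1]

// Exact match: the simplest possible evaluator.
const exactMatch: Evaluator = (expected, actual) => (expected === actual ? 1 : 0)

// Run the pipeline over every golden example and return the mean score.
async function runEvals(
  dataset: GoldenExample[],
  pipeline: (input: string) => Promise<string>,
  evaluate: Evaluator,
): Promise<number> {
  let total = 0
  for (const example of dataset) {
    total += evaluate(example.expected, await pipeline(example.input))
  }
  return dataset.length ? total / dataset.length : 0
}
```

Running this on every change turns the curated recordings into a regression gate for the whole pipeline.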
