Summary
Three patterns that turn agent pipelines from opaque prompt chains into debuggable, reproducible engineering systems: (1) an LLM VCR that records and replays model interactions, (2) a Run > Step > Message hierarchy for structured observability, and (3) an LLM Gateway abstraction that switches between live and replay modes.
The Problem
Agent pipelines call LLMs dozens of times per run. Without recording, you face:
- “Why did this output change?” with no way to answer
- Expensive integration tests that hit model providers on every run
- Non-reproducible failures that vanish on re-run
- Flat trace logs where 50 messages blur together with no semantic structure
Traditional HTTP mocking (Jest mocks, MSW) does not fit because LLM responses are non-deterministic and context-dependent. You need recording at the model interaction level, not the HTTP level.
Pattern 1: LLM VCR
The same idea as Ruby’s VCR gem or Python’s betamax, but for model interactions instead of HTTP responses.
Concept
Live mode:
test → agent pipeline → model provider → record trace → return response
Replay mode:
test → agent pipeline → cache lookup → return recorded response
During development or CI, run live once to record. All subsequent runs replay from cache. No model calls, no cost, deterministic output.
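The record-or-replay flow above can be sketched as a single wrapper, VCR-style: check the cassette store first, otherwise call the provider once and record the result. This is a minimal in-memory sketch; `withVCR` and the `Map`-backed cache are illustrative stand-ins for your own cassette storage.

```typescript
type Req = { model: string; prompt: string }
type Res = { output: string; tokens: number }

// Record-or-replay wrapper: cache hit → replay, cache miss → live call + record.
function withVCR(
  callProvider: (req: Req) => Promise<Res>,
  cache: Map<string, Res>, // in-memory stand-in for a persistent cassette store
) {
  return async (req: Req): Promise<Res> => {
    const key = JSON.stringify([req.model, req.prompt])
    const hit = cache.get(key)
    if (hit) return hit                 // replay: no provider call, no cost
    const res = await callProvider(req) // live: call the model once...
    cache.set(key, res)                 // ...and record the interaction
    return res
  }
}
```

On the first run every request misses the cache and hits the provider; every later run with the same requests replays deterministically.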
What gets recorded per interaction
- model
- prompt (system + user + tool results)
- parameters (temperature, seed, max_tokens)
- response (content + tool calls)
- token counts
- latency
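As a type, the recorded fields above map directly onto a record shape like this (field names are illustrative, not a fixed schema):

```typescript
// One recorded model interaction -- the unit a VCR cassette stores.
type RecordedInteraction = {
  model: string
  prompt: string // system + user + tool results, serialized
  params: { temperature?: number; seed?: number; maxTokens?: number }
  response: { content: string; toolCalls?: unknown[] }
  tokens: { input: number; output: number }
  latencyMs: number
}
```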
When to re-record
- Prompt changed
- Model version changed
- Temperature or parameters changed
- You explicitly want fresh data
This is analogous to deleting VCR cassettes when the API contract changes.
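One way to make the cassette analogy automatic is to derive the cache key from everything that affects the output. Then a changed prompt, model, or parameter never finds the old recording — the same effect as deleting the cassette, with no manual step. A sketch using Node's built-in `crypto` (the `cassetteKey` name is illustrative):

```typescript
import { createHash } from "node:crypto"

// Deterministic cassette key: any change to model, prompt, or parameters
// produces a different key, so stale recordings are simply never matched.
function cassetteKey(req: {
  model: string
  prompt: string
  temperature?: number
  seed?: number
}): string {
  const canonical = JSON.stringify([
    req.model,
    req.prompt,
    req.temperature ?? null,
    req.seed ?? null,
  ])
  return createHash("sha256").update(canonical).digest("hex")
}
```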
Pattern 2: Run > Step > Message Hierarchy
Agent observability needs three levels of abstraction, not one.
The hierarchy
Run
├── Step
│ ├── Message (LLM call)
│ ├── Message (LLM call)
│ └── Tool call
├── Step
│ └── Message (LLM call)
└── Step
└── Render job
What each level represents
| Level | Represents | Example | Maps to |
|---|---|---|---|
| Run | Whole pipeline execution | generate_campaign_2026_03_13 | Business event |
| Step | Logical unit of agent work | script_agent, idea_agent | Agent responsibility |
| Message | Single model interaction | System + user prompt → completion | LLM I/O |
Why steps are not messages
Messages are content exchanged with the model. Steps are units of work in the system. A single step (e.g. script_agent) may contain multiple messages:
script_agent step
  message 1 → initial prompt
  message 2 → follow-up refinement
  tool call → keyword extraction
If you collapse steps and messages, you lose:
- Semantic meaning: “script_agent failed” vs “message 17 failed”
- Step-level metrics: latency, cost, success rate per agent
- Evaluation targets: you evaluate agent outputs (step level), not individual prompts
Mapping to Langfuse
Langfuse → Your system
─────────────────────────────
Trace → Run
Span → Step
Generation → Message
Mapping to OpenTelemetry
OTEL → Your system
─────────────────────────────
Trace → Run
Span → Step
Event → Message
Storage schema
runs
  id
  workflow
  started_at

steps
  id
  run_id
  agent
  input_snapshot
  output_snapshot
  started_at
  duration_ms
  status

messages
  id
  step_id
  model
  prompt
  response
  tokens
  latency_ms
Input/output snapshots on steps make replay easier. You can re-run a single step without replaying the entire pipeline.
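The same schema expressed as row types, if you want compile-time checks on whatever persistence layer you use (names are illustrative; snapshots are kept as opaque values):

```typescript
// Row types mirroring the runs / steps / messages tables above.
interface RunRow {
  id: string
  workflow: string
  startedAt: string
}

interface StepRow {
  id: string
  runId: string
  agent: string
  inputSnapshot: unknown  // enables re-running this step in isolation
  outputSnapshot: unknown
  startedAt: string
  durationMs: number
  status: "success" | "error"
}

interface MessageRow {
  id: string
  stepId: string
  model: string
  prompt: string
  response: string
  tokens: number
  latencyMs: number
}
```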
Pattern 3: LLM Gateway Abstraction
Instead of calling model providers directly, route all calls through a gateway that handles recording and replay.
The type
type LLMRequest = {
model: string
prompt: string
temperature?: number
seed?: number
}
type LLMResponse = {
output: string
tokens: number
}
type LLMClient = (req: LLMRequest) => Promise<LLMResponse>
Live client (record mode)
const liveLLMClient: LLMClient = async (req) => {
  const res = await callProvider(req)
  await vcrStore(req, res) // persist the cassette so replay mode can find it
  await langfuse.record({
    prompt: req.prompt,
    model: req.model,
    response: res.output,
  })
  return res
}
Replay client (test mode)
const replayLLMClient: LLMClient = async (req) => {
const cached = await vcrLookup(req)
if (!cached) throw new Error("Missing VCR recording")
return cached
}
Tests pick the client via environment or config. Same pipeline code, different execution mode.
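A minimal sketch of that selection, with stub clients standing in for the real live and replay implementations (the `LLM_MODE` variable name is an assumption; use whatever config mechanism you already have):

```typescript
type LLMRequest = { model: string; prompt: string }
type LLMResponse = { output: string; tokens: number }
type LLMClient = (req: LLMRequest) => Promise<LLMResponse>

// Stubs standing in for the live and replay clients defined above.
const liveLLMClient: LLMClient = async (req) =>
  ({ output: `live:${req.prompt}`, tokens: 1 })
const replayLLMClient: LLMClient = async (req) =>
  ({ output: `replay:${req.prompt}`, tokens: 0 })

// Choose the execution mode once, at startup; pipeline code never checks it.
const llmClient: LLMClient =
  process.env.LLM_MODE === "replay" ? replayLLMClient : liveLLMClient
```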
Reproducibility: Deterministic Run Seeds
For maximum reproducibility, store these per recording:
- Model identifier + version
- Temperature
- Seed (if provider supports it)
- Prompt version hash
This lets you know exactly when a replay cassette is stale.
Langfuse as Observability, Not Source of Truth
Langfuse is the debugging UI and trace viewer. Your own database is the ground truth.
| Langfuse | Your DB |
|---|---|
| Trace viewer | Run/step/message storage |
| Prompt management | Evaluation datasets |
| Debugging UI | Analytics and metrics |
This separation matters because evaluation, golden datasets, and long-term analytics should live in infrastructure you control.
Golden Dataset Workflow
Periodically snapshot recorded traces into a golden dataset:
Record traces (live runs)
→ Curate best examples
→ Snapshot as golden dataset
→ Run evals against golden dataset on every change
Evaluators can include: precision, recall, semantic similarity, LLM-as-judge.
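The simplest of those evaluators, exact match over the golden dataset, can be sketched in a few lines; semantic similarity or LLM-as-judge scorers would slot into the same loop (all names here are illustrative):

```typescript
type GoldenExample = { input: string; expected: string }

// Exact-match evaluation: fraction of golden examples the pipeline reproduces.
// Swap the equality check for a similarity or judge scorer as needed.
async function evaluate(
  dataset: GoldenExample[],
  run: (input: string) => Promise<string>,
): Promise<number> {
  let correct = 0
  for (const ex of dataset) {
    if ((await run(ex.input)) === ex.expected) correct++
  }
  return correct / dataset.length
}
```

Run it against the replay client on every change and a score drop points directly at the step whose recorded behavior regressed.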
Related
- Integration Testing Patterns – End-to-end testing for agents
- Evaluation-Driven Development – AI-driven evaluation loops
- Closed-Loop Telemetry Optimization – OTEL as control input
- Systems Thinking & Observability – Tests vs telemetry, OTEL tracing
- Building the Harness – The meta-engineering layer

