Summary
Three patterns that turn agent pipelines from opaque prompt chains into debuggable, reproducible engineering systems: (1) an LLM VCR that records and replays model interactions, (2) a Run > Step > Message hierarchy for structured observability, and (3) an LLM Gateway abstraction that switches between live and replay modes.
The Problem
Agent pipelines call LLMs dozens of times per run. Without recording, you face:
- “Why did this output change?” with no way to answer
- Expensive integration tests that hit model providers on every run
- Non-reproducible failures that vanish on re-run
- Flat trace logs where 50 messages blur together with no semantic structure
Traditional HTTP mocking (Jest mocks, MSW) does not fit because LLM responses are non-deterministic and context-dependent. You need recording at the model interaction level, not the HTTP level.
Pattern 1: LLM VCR
The same idea as Ruby’s VCR gem or Python’s betamax, but for model interactions instead of HTTP responses.
Concept
Live mode:
test → agent pipeline → model provider → record trace → return response
Replay mode:
test → agent pipeline → cache lookup → return recorded response
During development or CI, run live once to record. All subsequent runs replay from cache. No model calls, no cost, deterministic output.
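The record-or-replay flow above can be sketched as a single wrapper, VCR-style: check the cassette store first, otherwise call the provider once and record the result. This is a minimal in-memory sketch; `withVCR` and the `Map`-backed cache are illustrative stand-ins for your own cassette storage.

```typescript
type Req = { model: string; prompt: string }
type Res = { output: string; tokens: number }

// Record-or-replay wrapper: cache hit → replay, cache miss → live call + record.
function withVCR(
  callProvider: (req: Req) => Promise<Res>,
  cache: Map<string, Res>, // in-memory stand-in for a persistent cassette store
) {
  return async (req: Req): Promise<Res> => {
    const key = JSON.stringify([req.model, req.prompt])
    const hit = cache.get(key)
    if (hit) return hit                 // replay: no provider call, no cost
    const res = await callProvider(req) // live: call the model once...
    cache.set(key, res)                 // ...and record the interaction
    return res
  }
}
```

On the first run every request misses the cache and hits the provider; every later run with the same requests replays deterministically.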
What gets recorded per interaction
- model
- prompt (system + user + tool results)
- parameters (temperature, seed, max_tokens)
- response (content + tool calls)
- token counts
- latency
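As a type, the recorded fields above map directly onto a record shape like this (field names are illustrative, not a fixed schema):

```typescript
// One recorded model interaction -- the unit a VCR cassette stores.
type RecordedInteraction = {
  model: string
  prompt: string // system + user + tool results, serialized
  params: { temperature?: number; seed?: number; maxTokens?: number }
  response: { content: string; toolCalls?: unknown[] }
  tokens: { input: number; output: number }
  latencyMs: number
}
```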
When to re-record
- Prompt changed
- Model version changed
- Temperature or parameters changed
- You explicitly want fresh data
This is analogous to deleting VCR cassettes when the API contract changes.
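One way to make the cassette analogy automatic is to derive the cache key from everything that affects the output. Then a changed prompt, model, or parameter never finds the old recording — the same effect as deleting the cassette, with no manual step. A sketch using Node's built-in `crypto` (the `cassetteKey` name is illustrative):

```typescript
import { createHash } from "node:crypto"

// Deterministic cassette key: any change to model, prompt, or parameters
// produces a different key, so stale recordings are simply never matched.
function cassetteKey(req: {
  model: string
  prompt: string
  temperature?: number
  seed?: number
}): string {
  const canonical = JSON.stringify([
    req.model,
    req.prompt,
    req.temperature ?? null,
    req.seed ?? null,
  ])
  return createHash("sha256").update(canonical).digest("hex")
}
```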
Pattern 2: Run > Step > Message Hierarchy
Agent observability needs three levels of abstraction, not one.
The hierarchy
Run
├── Step
│ ├── Message (LLM call)
│ ├── Message (LLM call)
│ └── Tool call
├── Step
│ └── Message (LLM call)
└── Step
└── Render job
What each level represents
| Level | Represents | Example | Maps to |
|---|---|---|---|
| Run | Whole pipeline execution | generate_campaign_2026_03_13 | Business event |
| Step | Logical unit of agent work | script_agent, idea_agent | Agent responsibility |
| Message | Single model interaction | System + user prompt → completion | LLM I/O |
Why steps are not messages
Messages are content exchanged with the model. Steps are units of work in the system. A single step (e.g. script_agent) may contain multiple messages:
script_agent step
  message 1 → initial prompt
  message 2 → follow-up refinement
  tool call → keyword extraction
If you collapse steps and messages, you lose:
- Semantic meaning: “script_agent failed” vs “message 17 failed”
- Step-level metrics: latency, cost, success rate per agent
- Evaluation targets: you evaluate agent outputs (step level), not individual prompts
Mapping to Langfuse
Langfuse → Your system
─────────────────────────────
Trace → Run
Span → Step
Generation → Message
Mapping to OpenTelemetry
OTEL → Your system
─────────────────────────────
Trace → Run
Span → Step
Event → Message
Storage schema
runs
  id
  workflow
  started_at

steps
  id
  run_id
  agent
  input_snapshot
  output_snapshot
  started_at
  duration_ms
  status

messages
  id
  step_id
  model
  prompt
  response
  tokens
  latency_ms
Input/output snapshots on steps make replay easier. You can re-run a single step without replaying the entire pipeline.
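The same schema expressed as row types, if you want compile-time checks on whatever persistence layer you use (names are illustrative; snapshots are kept as opaque values):

```typescript
// Row types mirroring the runs / steps / messages tables above.
interface RunRow {
  id: string
  workflow: string
  startedAt: string
}

interface StepRow {
  id: string
  runId: string
  agent: string
  inputSnapshot: unknown  // enables re-running this step in isolation
  outputSnapshot: unknown
  startedAt: string
  durationMs: number
  status: "success" | "error"
}

interface MessageRow {
  id: string
  stepId: string
  model: string
  prompt: string
  response: string
  tokens: number
  latencyMs: number
}
```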
Pattern 3: LLM Gateway Abstraction
Instead of calling model providers directly, route all calls through a gateway that handles recording and replay.
The type
type LLMRequest = {
model: string
prompt: string
temperature?: number
seed?: number
}
type LLMResponse = {
output: string
tokens: number
}
type LLMClient = (req: LLMRequest) => Promise<LLMResponse>
Live client (record mode)
const liveLLMClient: LLMClient = async (req) => {
  const res = await callProvider(req)
  await vcrStore(req, res) // persist the cassette so replay mode can find it
  await langfuse.record({
    prompt: req.prompt,
    model: req.model,
    response: res.output,
  })
  return res
}
Replay client (test mode)
const replayLLMClient: LLMClient = async (req) => {
const cached = await vcrLookup(req)
if (!cached) throw new Error("Missing VCR recording")
return cached
}
Tests pick the client via environment or config. Same pipeline code, different execution mode.
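A minimal sketch of that selection, with stub clients standing in for the real live and replay implementations (the `LLM_MODE` variable name is an assumption; use whatever config mechanism you already have):

```typescript
type LLMRequest = { model: string; prompt: string }
type LLMResponse = { output: string; tokens: number }
type LLMClient = (req: LLMRequest) => Promise<LLMResponse>

// Stubs standing in for the live and replay clients defined above.
const liveLLMClient: LLMClient = async (req) =>
  ({ output: `live:${req.prompt}`, tokens: 1 })
const replayLLMClient: LLMClient = async (req) =>
  ({ output: `replay:${req.prompt}`, tokens: 0 })

// Choose the execution mode once, at startup; pipeline code never checks it.
const llmClient: LLMClient =
  process.env.LLM_MODE === "replay" ? replayLLMClient : liveLLMClient
```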
Reproducibility: Deterministic Run Seeds
For maximum reproducibility, store these per recording:
- Model identifier + version
- Temperature
- Seed (if provider supports it)
- Prompt version hash
This lets you know exactly when a replay cassette is stale.
Langfuse as Observability, Not Source of Truth
Langfuse is the debugging UI and trace viewer. Your own database is the ground truth.
| Langfuse | Your DB |
|---|---|
| Trace viewer | Run/step/message storage |
| Prompt management | Evaluation datasets |
| Debugging UI | Analytics and metrics |
This separation matters because evaluation, golden datasets, and long-term analytics should live in infrastructure you control.
Golden Dataset Workflow
Periodically snapshot recorded traces into a golden dataset:
Record traces (live runs)
→ Curate best examples
→ Snapshot as golden dataset
→ Run evals against golden dataset on every change
Evaluators can include: precision, recall, semantic similarity, LLM-as-judge.
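The simplest of those evaluators, exact match over the golden dataset, can be sketched in a few lines; semantic similarity or LLM-as-judge scorers would slot into the same loop (all names here are illustrative):

```typescript
type GoldenExample = { input: string; expected: string }

// Exact-match evaluation: fraction of golden examples the pipeline reproduces.
// Swap the equality check for a similarity or judge scorer as needed.
async function evaluate(
  dataset: GoldenExample[],
  run: (input: string) => Promise<string>,
): Promise<number> {
  let correct = 0
  for (const ex of dataset) {
    if ((await run(ex.input)) === ex.expected) correct++
  }
  return correct / dataset.length
}
```

Run it against the replay client on every change and a score drop points directly at the step whose recorded behavior regressed.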
Related
- Integration Testing Patterns – End-to-end testing for agents
- Evaluation-Driven Development – AI-driven evaluation loops
- Closed-Loop Telemetry Optimization – OTEL as control input
- Systems Thinking & Observability – Tests vs telemetry, OTEL tracing
- Building the Harness – The meta-engineering layer

