If you cannot trace what an agent did and why, you cannot debug it or improve it.
Definition
Agentic observability is the practice of instrumenting agent systems so you can reconstruct the full sequence of decisions, tool calls, and outcomes after the fact. It collapses four concerns into one discipline: when something happened, what happened, why it happened, and how to fix it.
Mental Model
Traditional software has request/response logs. Agents have decision traces.
An agent makes a chain of choices: which tool to call, what arguments to pass, how to interpret results, when to stop. Without tracing, a failure looks like “it gave a bad answer.” With tracing, you see: “It called search with the wrong query on step 3, got irrelevant results, then hallucinated from those results on step 5.”
The trace is the debugging unit for agents, not the final output.
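To make that concrete, here is a decision trace rendered as plain data. This is a minimal sketch in Python, not a standard schema; the field names (`step`, `action`, `tool`, `args`) are illustrative assumptions, and the contents mirror the search failure described above.

```python
# A decision trace as an ordered list of step records (hypothetical schema).
trace = [
    {"step": 1, "action": "plan", "detail": "decompose the user question"},
    {"step": 2, "action": "tool", "tool": "search",
     "args": {"query": "relevant query"}, "result": "useful documents"},
    {"step": 3, "action": "tool", "tool": "search",
     "args": {"query": "wrong query"}, "result": "irrelevant documents"},
    {"step": 4, "action": "reason", "detail": "interpret step 3 results"},
    {"step": 5, "action": "answer", "detail": "answer built on step 3 results"},
]

# Reading the trace localizes the failure: a bad query at step 3 and an
# answer grounded in its irrelevant results at step 5.
for record in trace:
    print(record)
```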
The Four Questions
Every observability system for agents must answer:
| Question | What it requires |
|---|---|
| When did it happen? | Timestamped event log |
| What happened? | Full tool call sequence with inputs/outputs |
| Why did it happen? | Model reasoning, prompt state at decision point |
| How do I fix it? | Ability to replay, modify prompts, and re-run |
If your system answers all four, you can debug anything. If it answers fewer, you are guessing.
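One way to see the table as data is a single trace event whose fields line up with the four questions. A sketch under assumed field names, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TraceEvent:
    timestamp: float      # When did it happen?
    tool: str             # What happened: the tool that was called...
    args: dict[str, Any]  # ...its inputs...
    result: Any           # ...and its output
    reasoning: str        # Why: the model's stated reasoning at this step
    prompt_state: str     # Why: the exact prompt at the decision point
    session_id: str       # How to fix: replay needs to locate the session
```

The fourth question is answered by a capability rather than a field: with `prompt_state` stored, you can modify the prompt and re-run from this exact point.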
Tool Call Patterns
The highest-signal thing to trace is the tool call sequence. Patterns to watch for:
- Loops: agent calling the same tool repeatedly with similar inputs (stuck; a detector sketch appears below)
- Cascading errors: one bad tool result poisoning all downstream decisions
- Missing calls: agent skipping a tool it should have used
- Argument drift: tool arguments degrading in quality over a multi-step chain
Tracing tool calls gives you a structural view of agent behavior that raw token output never will.
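Of these, the loop pattern is mechanical enough to detect directly from a trace. A minimal sketch, assuming the event shape used earlier (a tool name plus its arguments) and approximating "similar inputs" as identical ones; a real detector would use a fuzzier similarity check:

```python
import json

def find_loops(events: list[dict], window: int = 3) -> list[int]:
    """Return indices of tool calls that repeat an earlier identical
    (tool, args) call within `window` steps: the 'stuck' signature."""
    recent: list[tuple[int, str]] = []  # (index, call signature)
    loops = []
    for i, ev in enumerate(events):
        sig = ev["tool"] + json.dumps(ev["args"], sort_keys=True)
        if any(s == sig and i - j <= window for j, s in recent):
            loops.append(i)
        recent.append((i, sig))
    return loops

# Two identical search calls in a row flag step 1 as a loop.
events = [
    {"tool": "search", "args": {"q": "agent tracing"}},
    {"tool": "search", "args": {"q": "agent tracing"}},
]
print(find_loops(events))  # -> [1]
```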
Implementation Principles
- Trace everything by default. Storage is cheap. Missing data during a post-mortem is expensive.
- Make traces queryable. A trace you cannot search is a trace you will not use.
- Tie traces to evals. When an eval fails, the first thing you do is pull the trace. If that path is not smooth, your eval loop is broken.
- Instrument at the tool boundary. Every tool call in, every result out, with timestamps and the prompt state that triggered the call; a decorator sketch follows this list.
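A minimal sketch of tool-boundary instrumentation as a decorator. The `emit` sink and the `prompt_state` keyword are assumptions about how your agent passes context; adapt both to your framework.

```python
import functools
import time

def emit(event: dict) -> None:
    print(event)  # placeholder sink; in practice, write to your trace store

def traced_tool(fn):
    """Wrap a tool so every call in and every result out is recorded,
    with timestamps and the prompt state that triggered the call."""
    @functools.wraps(fn)
    def wrapper(*args, prompt_state: str = "", **kwargs):
        start = time.time()
        event = {"tool": fn.__name__, "args": args, "kwargs": kwargs,
                 "prompt_state": prompt_state, "ts": start}
        try:
            result = fn(*args, **kwargs)
            event.update(result=result, error=None)
            return result
        except Exception as exc:
            event.update(result=None, error=repr(exc))
            raise
        finally:
            event["duration_s"] = time.time() - start
            emit(event)

    return wrapper

@traced_tool
def search(query: str) -> list[str]:
    return [f"result for {query}"]  # stand-in tool body
```

The decorator shape keeps the tool body oblivious to tracing, which is the point of instrumenting at the boundary rather than inside each tool.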
Gotchas
- Logging only final outputs. The failure is almost never in the last step.
- Tracing without querying. If you cannot filter traces by tool name, error type, or session, they will rot; a query sketch follows this list.
- Treating observability as optional until production. Instrument from day one. Debugging in prod without traces is guessing.
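As a sketch of the query path the second gotcha demands (again assuming the event shape used above), filtering by tool name, error type, or session is a few lines once events are structured:

```python
def query(events: list[dict], tool: str | None = None,
          error: str | None = None, session_id: str | None = None) -> list[dict]:
    """Filter trace events; any criterion left as None is ignored."""
    return [e for e in events
            if (tool is None or e.get("tool") == tool)
            and (error is None or error in (e.get("error") or ""))
            and (session_id is None or e.get("session_id") == session_id)]

# Example: all failed search calls, e.g. query(events, tool="search", error="Timeout")
```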
Related Concepts
Sources
- Personal notes from Agentic Engineering session, 2026

