I built an agent that diagnoses a coding transcript far too big for any context window. The trick was never letting it read the transcript. The model drives computation; the file stays on disk.
The Problem
A coding agent’s transcript is enormous. A long-running agent can emit millions of tool calls, and a full run can reach hundreds of millions of tokens. You cannot paste it into a model. You cannot even paste a meaningful slice of it. And yet the most valuable question you can ask about an agent is exactly the one that requires reading the whole thing: where does this agent keep going wrong, and what should I change in its policy?
I wanted to answer that without a context window big enough to hold the evidence. So I built a small harness on the OpenAI Agents SDK that does it, and put it on GitHub. The idea is borrowed from Zhang and Khattab’s Recursive Language Models: the root agent never reads the transcript. It treats the file as an environment it queries, slices, and recurses over.
The interesting part was not the agent. It was the three constraints I had to enforce in plain Python to stop the model from doing the obvious, expensive, wrong thing.
The Narrow Waist
The first constraint is the whole game: every tool truncates its output to about six kilobytes before it returns to the model.
That sounds like a limitation. It is actually the mechanism. If a search tool could return ten thousand matching rows, the model would happily ask for them and try to reason over the dump. By capping the output, I make that physically impossible. The model cannot pull the transcript into context through the back door of a tool result. It is forced to compute a small answer instead: a count, a handful of representative rows, a reduced summary.
I think of this as the narrow waist, the same shape as the hourglass in network design. A huge file on one side, a small model context on the other, and a deliberately thin channel between them. Everything the model learns about the file has to squeeze through that channel as a computed result, not as raw bytes. Once you accept that the waist is non-negotiable, the rest of the design writes itself: you need tools that reduce, not tools that fetch.
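A minimal sketch of that cap, assuming a plain helper that every tool passes its result through before returning. The name `truncate_output`, the constant `MAX_TOOL_BYTES`, and the exact 6 KiB figure are illustrative assumptions, not the project's actual code.

```python
MAX_TOOL_BYTES = 6 * 1024  # assumed value for "about six kilobytes"


def truncate_output(text: str, limit: int = MAX_TOOL_BYTES) -> str:
    """Cap a tool result at `limit` UTF-8 bytes and say how much was dropped."""
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half.
    kept = raw[:limit].decode("utf-8", errors="ignore")
    dropped = len(raw) - len(kept.encode("utf-8"))
    # The marker itself adds a few bytes on top of the cap.
    return f"{kept}\n[truncated {dropped} bytes]"
```

The marker matters as much as the cut: it tells the model the result was incomplete, which nudges it toward a narrower query or a count rather than trusting a partial dump.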
Byte Offsets Beat Re-Parsing
To make windowed access cheap, I index the file once. The transcript is JSONL, so I serialise every row to canonical bytes a single time and record, for each row, its [start, end) offset in the concatenated buffer. Two parallel arrays are the entire index.
After that, reading row i is a slice of the byte buffer and one json.loads. Mapping a byte offset from a substring hit back to its owning row is a bisect over the start offsets, which is logarithmic rather than a linear rescan. That last detail matters more than it looks: substring search calls it once per hit, so on a multi-million-row trace the difference between O(log n) and O(n) per hit is the difference between snappy and unusable.
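The index described above can be sketched roughly as follows. The function names, and the choice of sorted keys with compact separators as the "canonical" form, are assumptions made for illustration.

```python
import bisect
import json


def build_index(rows):
    """Serialise each row once; return (buffer, starts, ends), [start, end) per row."""
    chunks, starts, ends = [], [], []
    pos = 0
    for row in rows:
        # Assumed canonical form: sorted keys, compact separators, newline-terminated.
        data = json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8") + b"\n"
        starts.append(pos)
        pos += len(data)
        ends.append(pos)
        chunks.append(data)
    return b"".join(chunks), starts, ends


def read_row(buf, starts, ends, i):
    """Reading row i is one slice and one json.loads."""
    return json.loads(buf[starts[i]:ends[i]])


def row_for_offset(starts, offset):
    """Map a byte offset to its owning row with a binary search over the starts."""
    return bisect.bisect_right(starts, offset) - 1
```

A substring hit from `buf.find(...)` then resolves to a row with `row_for_offset(starts, hit)`, with no rescan of the file.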
The same index powers a coverage tracker. Every read and search records the byte range it touched, the harness merges those ranges, and a tool called unsearched_ranges returns the complement. That is how the agent answers “what haven’t I looked at yet?” instead of guessing when it is done. Coverage becomes a number, not a vibe.
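A sketch of the coverage arithmetic, assuming touched ranges are stored as half-open `(start, end)` byte pairs. Only the tool name `unsearched_ranges` comes from the text; `merge_ranges` and `coverage` are hypothetical helpers.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def unsearched_ranges(touched, total_size):
    """Return the complement of the touched ranges within [0, total_size)."""
    gaps, cursor = [], 0
    for start, end in merge_ranges(touched):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_size:
        gaps.append((cursor, total_size))
    return gaps


def coverage(touched, total_size):
    """Fraction of the file's bytes that some read or search has touched."""
    covered = sum(end - start for start, end in merge_ranges(touched))
    return covered / total_size if total_size else 1.0
```

With touched ranges `[(0, 100), (50, 200), (400, 500)]` over a 1000-byte file, the gaps are `[(200, 400), (500, 1000)]` and coverage is 0.3.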
Don’t Trust the Model to Count
The second constraint: the model produces findings, but it never aggregates them.
Each agent in the tree returns the same Pydantic shape: a list of failure modes, each with evidence and a count. The merge across the whole tree (deduping by failure mode, summing occurrences, unioning evidence, keeping the highest severity) happens in deterministic Python. The LLM only ever emits leaves. The reduce is code.
This is a habit worth generalising. Anywhere you ask a model to both find things and total them, the total is the part it gets wrong, because totalling is arithmetic dressed as language. Keep the judgement in the model and move the accounting into a function you can unit-test. I have a whole test file asserting that the merge dedupes case-insensitively, caps evidence, and sorts by frequency. None of it needs an API key, because none of it needs the model.
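A minimal sketch of such a reduce. To stay dependency-free it uses a dataclass where the harness uses a Pydantic model. The field names, the integer severity scale, and the evidence cap of five are assumptions.

```python
from dataclasses import dataclass, field

MAX_EVIDENCE = 5  # assumed cap on evidence items kept per failure mode


@dataclass
class Finding:
    mode: str
    count: int
    severity: int  # assumed scale: higher means worse
    evidence: list = field(default_factory=list)


def merge_findings(findings, max_evidence=MAX_EVIDENCE):
    """Dedupe by case-insensitive mode, sum counts, union evidence, keep max severity."""
    merged = {}
    for f in findings:
        key = f.mode.strip().lower()
        cur = merged.setdefault(key, Finding(key, 0, f.severity))
        cur.count += f.count
        cur.severity = max(cur.severity, f.severity)
        for item in f.evidence:
            if item not in cur.evidence and len(cur.evidence) < max_evidence:
                cur.evidence.append(item)
    # Most frequent failure modes first.
    return sorted(merged.values(), key=lambda f: f.count, reverse=True)
```

Because this is a pure function over plain data, its tests run without a model or network access.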
Recursion Capped in Code, Not in the Prompt
The third constraint: the depth limit is enforced in Python, not requested in the prompt.
The agent can delegate a slice of the file to a recursive copy of itself. Left to its own devices a model will cheerfully recurse forever, or refuse to recurse when it should. So the delegate tool reads the current depth from a context object threaded through every run, and at the cap it returns an error payload and refuses to spawn another child. The prompt mentions the limit too, but the prompt is advisory. The gate is load-bearing. A model that decides to ignore the limit simply gets told no by the function.
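The gate can be sketched in plain Python without the SDK. Here `RunContext`, `MAX_DEPTH`, the error payload's shape, and the `run_child` callback are all hypothetical stand-ins for whatever the harness threads through its runs.

```python
from dataclasses import dataclass

MAX_DEPTH = 2  # assumed cap; the real value is a harness setting


@dataclass(frozen=True)
class RunContext:
    depth: int = 0


def delegate(ctx, start, end, run_child):
    """Spawn a child over bytes [start, end), or refuse once the cap is reached."""
    if ctx.depth >= MAX_DEPTH:
        # The refusal is a normal tool result, so the model sees it and moves on.
        return {"error": f"depth limit {MAX_DEPTH} reached; analyse [{start}, {end}) directly"}
    child_ctx = RunContext(depth=ctx.depth + 1)
    return {"findings": run_child(child_ctx, start, end)}
```

Even a child that always tries to delegate again bottoms out at the cap, because the check lives in the function rather than in the model's willingness to follow instructions.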
The general principle, which I keep relearning: assume the model will do the wrong thing and put the guarantee in the harness. Anything you only ask for in the system prompt is a preference, not an invariant.
What the Live Run Actually Showed
Here is the honest part. I ran it against a fifty-thousand-line transcript with five failure modes planted in known quantities. The harness delegated two byte-range slices to sub-agents, merged their findings, and recovered the real problems: patching files that were never read, declaring tasks done with no verification, calling a function that does not exist, retrying the identical failing command. The mechanism works. The recursion works. The structured output is clean.
It also failed in two instructive ways.
The merge fragmented. One sub-agent named a failure mode stale-context-hunk, another called the same thing stale-patch-hunks, and because my merge key is an exact string, they did not fold together. Deterministic reduction is only as good as the key, and free-text keys from independent agents drift. The fix is a controlled vocabulary: give the agents a fixed enum of failure-mode names instead of letting each invent its own.
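One way to sketch that controlled vocabulary is a string-valued enum plus a strict parser. The member names below are invented for illustration from the failure modes described in this post; they are not the project's actual labels.

```python
from enum import Enum


class FailureMode(str, Enum):
    # Hypothetical labels; the real vocabulary is whatever the harness fixes.
    PATCH_WITHOUT_READ = "patch-without-read"
    UNVERIFIED_DONE = "unverified-done"
    NONEXISTENT_FUNCTION = "nonexistent-function"
    IDENTICAL_RETRY = "identical-retry"
    STALE_PATCH_HUNK = "stale-patch-hunk"


def parse_mode(label: str) -> FailureMode:
    """Accept only names from the fixed vocabulary; reject free-text variants."""
    try:
        return FailureMode(label.strip().lower())
    except ValueError:
        allowed = ", ".join(m.value for m in FailureMode)
        raise ValueError(f"unknown failure mode {label!r}; expected one of: {allowed}") from None
```

An invented name such as `stale-context-hunk` now fails loudly at the leaf instead of silently producing a second merge key. Using the enum as the field type in the structured-output model should push the same constraint into the schema the agents fill in.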
And the counts inflated. A mode planted eleven times got reported as sixteen, because the model counted hits it could see in truncated search previews from ranges it had delegated away, despite being told not to. The deterministic merge was exact; the leaf counts feeding it were not. The lesson is the same as before, pushed one level down: even when you move the arithmetic into code, the inputs to that arithmetic still come from a model, so they still need verification. The honest version of this harness counts occurrences in Python from the matched offsets, and uses the model only to label and explain.
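A sketch of what counting in Python could look like, assuming the byte buffer and start-offset array from the index, and assuming one occurrence means one row containing the pattern. The function name and that definition of "occurrence" are assumptions.

```python
import bisect


def count_matching_rows(buf, starts, needle, lo=0, hi=None):
    """Count distinct rows with at least one hit for `needle` inside buf[lo:hi]."""
    hi = len(buf) if hi is None else hi
    rows = set()
    pos = buf.find(needle, lo, hi)
    while pos != -1:
        # Map each hit back to its owning row; repeated hits in a row collapse.
        rows.add(bisect.bisect_right(starts, pos) - 1)
        pos = buf.find(needle, pos + 1, hi)
    return len(rows)
```

Passing the agent's own `[lo, hi)` slice keeps hits from delegated ranges out of its count by construction, so the instruction no longer has to be obeyed to be true.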
That is the real output of a portfolio project. Not “it works”, but a precise map of where the seams are and what the next version does differently.
Why This Compounds
The pattern outlives the toy. Any time you have to reason over something larger than a context window (a log, a monorepo, a year of tickets), the move is the same. Index it so access is cheap, force every tool through a narrow waist so the model computes instead of reads, keep the aggregation in deterministic code, and enforce your limits in the harness rather than the prompt. The model becomes a driver of computation over an environment, not a bucket you pour data into.
“Build a harness” is genuinely the new “reverse a linked list”. This one taught me more about agent design than any amount of prompt tuning, precisely because the lessons were about the plumbing, not the prompt.
References
- Recursive Language Models – Zhang and Khattab, the environment-plus-sub-model-calls pattern this mirrors
- recursive-lm-harness on GitHub – the project, MIT licensed
- OpenAI Agents SDK – Agent + Runner with structured output

