If I cannot answer “what exactly was running when this experiment scored 0.74”, I do not have an experiment. I have an anecdote.
Author: James Phoenix | Date: May 2026
Why This Matters
Every eval result is a function of two things: the scenario it ran against, and the program that produced the output. Most teams obsess over the scenario (datasets, prompts, golden answers) and treat the program as a free variable. Six weeks later they look at a winning result and realise they cannot recreate the exact code path that produced it.
This is the provenance problem. And the cheapest place to solve it is upstream of the run, in how I express my config. The trace tools tell me what happened. They are not the source of truth for what my program is.
There are three escalating ways to track this, and each one closes a gap the previous one leaves open.
Way 1: Flat Config Object
The most basic move is just having a config object at all, with every parameter you tweak between runs spelled out as a field.
const config = {
  model: "claude-sonnet-4-6",
  temperature: 0.2,
  maxIterations: 10,
  systemPrompt: SYSTEM_PROMPT_V3,
  toolset: ["bash", "filesystem", "search"],
};
I serialise it, commit it, hash the whole blob, and store it alongside the eval result. The provenance I get is “this object existed when I ran the eval”.
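A minimal sketch of that step, assuming Node's built-in crypto module; stableStringify is a hypothetical helper that sorts keys recursively so the hash does not depend on insertion order:

import { createHash } from "crypto";

// Sort keys recursively so two semantically identical configs serialise to the
// same string and therefore the same hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    return `{${Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",")}}`;
  }
  return JSON.stringify(value);
}

const configHash = createHash("sha256").update(stableStringify(config)).digest("hex");
// Store configHash alongside the eval result, e.g. { configHash, score: 0.74 }.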
The limitation is what is not in the object. Ad-hoc orchestration logic, sub-agent topology, retry policies, helper prompts buried in utility functions. These tend to live in code rather than in the config. So my “config hash” captures only the part of the program I remembered to expose. Two runs with the same flat config can still produce wildly different distributions because the code around them changed.
Way 2: Config Layer Generator with addField() and hash()
The next step up is to stop hand-rolling the config literal and build it through a generator that accumulates fields and gives you a deterministic hash for free.
const builder = configLayer({ scenario: "support-tickets" })
  .addField("model", "claude-sonnet-4-6")
  .addField("temperature", 0.2)
  .addField("maxIterations", 10)
  .addField("toolset", ["bash", "fs", "search"])
  .addField("systemPrompt", SYSTEM_PROMPT_V3);

const id = builder.hash(); // canonicalised SHA-256
Two wins.
First, the hash is the experiment ID. Same fields, same hash, same experiment. I can drop a row in a results table keyed by this hash and trivially compare runs that share an ID. If two scores diverge under the same ID, the bug is non-determinism in the program, not in the config.
Second, the builder forces me to be explicit about what counts. addField becomes a checklist. Every parameter that could meaningfully change between runs has to be added by name, or it is silently outside the experiment. This is the moment most teams realise their orchestration logic was never in the config to begin with. Roll a hash with a missing field, get a stable ID over an unstable program, learn the hard way.
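A minimal sketch of what a builder like configLayer could look like; configLayer, addField, and hash mirror the calls above, everything else is illustrative:

import { createHash } from "crypto";

type ConfigBuilder = {
  addField(name: string, value: unknown): ConfigBuilder;
  hash(): string;
};

function configLayer(base: Record<string, unknown>): ConfigBuilder {
  const fields: Record<string, unknown> = { ...base };
  const api: ConfigBuilder = {
    // Every parameter that can change between runs has to enter by name,
    // or it sits silently outside the experiment.
    addField(name, value) {
      fields[name] = value;
      return api;
    },
    // Canonicalise top-level key order before hashing, so the same fields always
    // yield the same ID. (A production version would canonicalise nested objects
    // too, e.g. with the stableStringify sketch above.)
    hash() {
      const sorted = Object.fromEntries(
        Object.entries(fields).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      );
      return createHash("sha256").update(JSON.stringify(sorted)).digest("hex");
    },
  };
  return api;
}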
Way 3: SHA256 of the Algorithm or Code Itself
The third move is to track the code the config drives, not just the values it contains. This always lives inside Way 1 or Way 2 (it is just another field) but it is worth calling out separately because most people forget to do it.
If my “orchestration” is MAX_ITERATIONS = 10, fine. That is a primitive, the config object captures it. But the moment my orchestration is an actual function — a sub-agent dispatcher, a custom retry policy, a dynamic tool allow-list, a generator-evaluator loop — I need a label for that algorithm.
The fix is mechanical. Hash the source of the function and add the hash as a field.
import { createHash } from "crypto";

// Label an algorithm by hashing its source text; the first 12 hex chars are
// enough to tell versions apart.
function hashFn(fn: Function) {
  return createHash("sha256").update(fn.toString()).digest("hex").slice(0, 12);
}

builder.addField("orchestrator", {
  name: "fanout-critic",
  sha256: hashFn(orchestrator),
});
Now when I change the orchestrator’s behaviour, the hash changes, and the experiment ID changes. I can trivially answer “did the algorithm change between run A and run B?” without diffing git history or relying on memory.
The general rule. Every piece of custom logic that meaningfully affects the output gets a label and a hash. Sub-agent topology. Retry policies. Custom prompts buried in helper functions. The dispatcher that decides which model to route to. If it could plausibly change between runs and it is not just a primitive value, it earns a hash.
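Applied to the rule, that might look like the lines below; retryWithBackoff and routeModel are hypothetical stand-ins for whatever custom logic the program actually carries:

// Each behaviour-shaping piece gets a name and a source hash, same pattern as the orchestrator.
builder
  .addField("retryPolicy", { name: "retry-with-backoff", sha256: hashFn(retryWithBackoff) })
  .addField("modelDispatcher", { name: "route-by-ticket-type", sha256: hashFn(routeModel) });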
Why Not Just Lean on Langfuse or LangSmith
Tools like Langfuse and LangSmith are excellent at capturing what happened at runtime: traces, spans, latencies, token counts, errors. They are not the right home for the program definition itself, and using them as one quietly causes two problems.
The first problem is that config and traces blur together. A trace is a record of execution. A config is a definition of the program that produced it. Mixing them means I cannot reason about either cleanly. Querying “what programs have I run on this scenario” turns into a trace query when it should be a config lookup.
The second problem is that only the instrumented surface is captured. The trace tool sees the spans I wired up. It does not see the full Program(config). The un-traced parts of orchestration, the function bodies, the assembly logic, the bits glue-coded together at the edges. Those parts stay un-tracked even when traces look full. I get a strong feeling of observability and a weak guarantee of reproducibility.
The right shape is: config tracking lives upstream of the trace, owned by my own code, addressable by hash. The trace tool sees the config because I log it on the run, but it is not the source of truth for it. I should be able to answer “what is this program” from the config alone, with no Langfuse query required.
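In practice that is a single line at the start of the run. A sketch of the shape, where traceClient stands in for whichever tracing SDK is in use:

const configHash = builder.hash();

// The trace carries the hash as metadata for cross-referencing,
// but the config itself stays defined and stored in my own code.
traceClient.trace({
  name: "support-tickets-eval",
  metadata: { configHash },
});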
How This Plugs Into Evals
This article is the engine room of the six evergreen levers. The lever framing assumes I can hold one variable still and change another. That assumption only holds if my config makes the variables explicit. Without provenance, every “lever pull” is an anecdote, and I cannot tell whether the win came from the change I intended or from drift in something I forgot to track.
The unified shape:
evals(scenario) over Program(config)
  -> Experiment<configHash, result>
Two runs with the same configHash over the same scenario should produce the same distribution of results. If they do not, the config is missing a field. The discipline is iterative. Every experiment I cannot reproduce is a hint that another piece of the program needs to enter the config and pick up a hash.
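As a sketch, the record the unified shape implies is small; the field names here are illustrative:

interface Experiment {
  configHash: string; // Program(config), addressable by hash
  scenario: string;   // what the eval ran against
  score: number;      // the measured result
}

// Two experiments are only comparable as "same program, same scenario" when both match.
// A persistent score gap under a shared configHash means the program changed
// in a way the config never captured.
function sameProgram(a: Experiment, b: Experiment): boolean {
  return a.configHash === b.configHash && a.scenario === b.scenario;
}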
This is the difference between running experiments and running rituals. Rituals look like science but the variables are uncontrolled. Experiments have an addressable program ID and a measurable scenario, and the difference between two results means something. Provenance is what closes that gap.

