The two most important ideas in applied AI right now share the same skeleton. Strip away the specifics and you get one pattern: build a stable evaluator, then let agents search the space until they beat it.
The Pattern
Andrej Karpathy’s autoresearch repo showed the simplest version. An agent tries an experiment. You measure the result. If it is better than the current best, you keep it. If not, you discard it. Repeat.
Rob C’s Darwin Derby generalizes this into a framework: take anything that can be measured, and let a swarm of agents auto-tune it. The key property is that the best-so-far never regresses, yet the search is not strictly greedy hill-climbing, so it can escape local optima.
Both are instances of the same two-phase loop:
- Evaluator: A function that takes an output and returns a number. This never changes between runs.
- Optimizer: An agent (or swarm of agents) that proposes changes, gets scored, and keeps only improvements.
The evaluator is the fixed measuring stick. The optimizer is the search process. Separate them cleanly and you can swap either side independently.
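The loop can be sketched in a few lines of Python. Everything here is a toy stand-in: `evaluate` is a fixed numeric score in place of a real metric, and `propose` is a random perturbation in place of a real agent. The point is the structure, not the specifics:

```python
import random

def evaluate(candidate: float) -> float:
    """Fixed evaluator: higher is better. Toy stand-in for a real metric."""
    return -abs(candidate - 42.0)  # best possible score at 42

def propose(current: float, rng: random.Random) -> float:
    """Optimizer step: toy stand-in for an agent proposing a change."""
    return current + rng.uniform(-5.0, 5.0)

def optimize(start: float, steps: int, seed: int = 0) -> float:
    """Keep a challenger only if it beats the current best."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        challenger = propose(best, rng)
        score = evaluate(challenger)
        if score > best_score:  # keep improvements, discard everything else
            best, best_score = challenger, score
    return best
```

Because the evaluator never changes, swapping in a smarter `propose` (a real agent) changes how fast the loop converges, not what it converges toward.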
Evaluations with Goldens: Policy Enforcement for Subjective Domains
The hard part has always been: how do you evaluate things that don’t have objective metrics? Writing quality, visual style, persona calibration, design taste. These are subjective. Traditional metrics fail.
Without goldens, an LLM judge is working from vibes. You give it a rubric, it scores the output, and you get a number. But that number is lossy. The judge is projecting its own sense of “good” onto the task, and that sense is a broad distribution. Ask it to score ten brand-voice paragraphs on a 1-10 scale and you will get a spread that reflects the model’s general understanding of quality, not yours. The scores cluster around what the model thinks good writing looks like in general. Your specific taste, your specific standards, your specific policy: those are somewhere inside that distribution, but the evaluator has no way to find the right region without a reference point.
This is the core problem: a rubric-only evaluator models a generic distribution of quality. Goldens let you model the specific distribution you actually want.
The answer is golden references. A golden reference is a human-curated example of what “good” looks like. It is the target, and it is how you enforce a policy. Your goldens define the policy. The evaluator enforces it. Every candidate output is judged against those goldens, and the evaluation becomes a comparative question: does this new output match the golden reference better than the current best?
Goldens collapse the evaluation distribution. Instead of the judge asking “is this generally good?”, it asks “is this more like the thing we know is good?” That is a much tighter question. The variance drops. The signal sharpens. You go from scoring against a fuzzy concept to scoring against concrete reality.
The structure is simple:
- Golden reference: The human-created example representing the policy you want to enforce
- Incumbent: The current best output
- Challenger: The new candidate
The LLM judge answers one question: which is closer to the golden reference, the incumbent or the challenger? If the challenger wins more than 50% of the time across your reference set, it becomes the new incumbent.
This is policy enforcement through examples rather than rules. Instead of writing a ten-page style guide and hoping the system follows it, you show 20 examples of what compliance looks like. The goldens are the policy. The evaluator is the enforcement mechanism.
The critical property: the judge never changes, and the golden references never change. This gives you a stable evaluation landscape. Every run is comparable to every other run. Progress is real, not an artifact of a shifting scoring function.
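A pairwise evaluator of this shape might look like the sketch below. The `closer_to_golden` function is a stand-in for an LLM judge call; here it uses plain string similarity (`difflib`) so the example runs without a model, but the structure is the same: golden, incumbent, and challenger go in, a winner comes out, and the challenger is promoted only on a majority vote across the reference set:

```python
from difflib import SequenceMatcher

def closer_to_golden(golden: str, incumbent: str, challenger: str) -> str:
    """Stand-in for an LLM judge: which candidate is closer to the golden?"""
    sim_inc = SequenceMatcher(None, golden, incumbent).ratio()
    sim_chal = SequenceMatcher(None, golden, challenger).ratio()
    return "challenger" if sim_chal > sim_inc else "incumbent"

def challenger_wins(goldens: list[str], incumbent: str, challenger: str) -> bool:
    """Challenger replaces the incumbent only on a majority of goldens."""
    wins = sum(
        closer_to_golden(g, incumbent, challenger) == "challenger"
        for g in goldens
    )
    return wins > len(goldens) / 2
```

Note that the goldens and the comparison logic are constants of the system; only the incumbent and challenger change between runs.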
Golden References Can Be Anything: Few-Shot as Evaluation Targets
Here is where it gets interesting. Golden references are not limited to text. They are few-shot learning examples of what you want. They can be:
- Text: A paragraph written in the exact tone you want. A product description in your brand voice. A code snippet in your preferred style.
- Images: A design mockup showing the visual direction. A photograph with the lighting and composition you are targeting. A UI screenshot demonstrating the layout pattern.
- Video: A clip showing the pacing and editing style you want. A screen recording demonstrating the interaction flow. A motion graphic with the animation feel you are after.
The golden reference answers the question “what does good look like?” in the most literal way possible. You show the evaluator an example of the target, and it judges candidates against that example.
This is policy enforcement through few-shot examples rather than written specifications. Instead of trying to describe quality in words (which is lossy and ambiguous), you point at concrete examples. The LLM judge can compare a candidate image against a reference image. It can compare a candidate video frame against a reference. It can compare generated text against a style exemplar.
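For non-text modalities, the judge prompt carries the golden and both candidates as attachments. The message shape below is hypothetical (field names like `role`, `parts`, and `image_path` are illustrative, not any specific provider’s API); it only shows how a pairwise multimodal comparison might be assembled:

```python
def build_judge_prompt(
    golden_path: str, incumbent_path: str, challenger_path: str
) -> list[dict]:
    """Assemble a pairwise comparison prompt for a hypothetical multimodal judge."""
    instruction = (
        "You will see a GOLDEN reference image, then candidates A and B. "
        "Answer with exactly 'A' or 'B': which candidate is closer to the golden?"
    )
    return [
        {"role": "system", "parts": [{"text": instruction}]},
        {"role": "user", "parts": [
            {"text": "GOLDEN:"}, {"image_path": golden_path},
            {"text": "Candidate A (incumbent):"}, {"image_path": incumbent_path},
            {"text": "Candidate B (challenger):"}, {"image_path": challenger_path},
        ]},
    ]
```

Forcing a single-token A/B answer keeps the judge comparative rather than letting it drift back into open-ended scoring.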
The more modalities your judge can process, the more domains you can optimize.
The Full Loop: Evolutionary Search to Target
Put the pieces together and you get a system that can optimize toward any target you can show it:
1. Define your golden references (the examples of what "good" looks like)
2. Build your evaluator (pairwise comparison against the golden references)
3. Set your initial state (the starting point)
4. Launch agents that:
a. Propose a change to the current best
b. Get scored by the evaluator
c. Replace the incumbent if they win
5. Repeat until convergence or budget exhaustion
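The five steps above can be sketched end to end. Everything model-shaped here is a stub: string similarity stands in for the LLM judge, and a random word-level mutation stands in for an agent. The golden, vocabulary, and initial state are invented for the example:

```python
import random
from difflib import SequenceMatcher

# Step 1: golden references (a single toy golden here)
GOLDENS = ["the quick brown fox jumps over the lazy dog"]
VOCAB = "the quick brown fox jumps over a lazy sleepy dog cat".split()

def challenger_wins(incumbent: str, challenger: str) -> bool:
    """Step 2: pairwise evaluator. Similarity stands in for an LLM judge."""
    wins = sum(
        SequenceMatcher(None, g, challenger).ratio()
        > SequenceMatcher(None, g, incumbent).ratio()
        for g in GOLDENS
    )
    return wins > len(GOLDENS) / 2

def mutate(text: str, rng: random.Random) -> str:
    """Step 4a: stand-in agent. Replaces one word with a random vocabulary word."""
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)

def run(budget: int = 3000, seed: int = 0) -> str:
    incumbent = "cat cat cat cat cat cat cat cat cat"  # step 3: initial state
    rng = random.Random(seed)
    for _ in range(budget):                        # step 5: repeat until budget
        challenger = mutate(incumbent, rng)        # step 4a: propose a change
        if challenger_wins(incumbent, challenger): # step 4b: get scored
            incumbent = challenger                 # step 4c: replace on a win
    return incumbent
```

Note that `mutate` never sees `challenger_wins` or `GOLDENS`; it only proposes and gets an accept/reject signal back, which is the no-gaming property described below.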
The agents never see the scoring code. They only see the current state and get back a number. This prevents gaming. They cannot reverse-engineer the evaluator. They can only try things and see what works. This is the same dynamic that makes biological evolution robust: organisms cannot hack the fitness function; they can only adapt to it.
Early iterations show high win rates because the bar is low. As the incumbent improves, subsequent challengers face a higher bar. Each replacement jumps to the next hill. The system ratchets upward.
What This Actually Looks Like in Practice
Optimizing a brand voice: Your golden references are 20 paragraphs from your best blog posts. The evaluator compares AI-generated paragraphs against these examples. Agents iterate on the prompt, the system prompt, the few-shot examples, the temperature. Each morning you wake up to a stronger writer.
Optimizing visual design: Your golden references are screenshots of designs you admire. The evaluator compares generated UI components against these targets. Agents iterate on CSS, layout parameters, color palettes. The output converges toward your taste.
Optimizing video thumbnails: Your golden references are your top-performing thumbnails. The evaluator compares candidates against these. Agents iterate on composition, text placement, color grading. You get thumbnails that match the pattern of what already works.
Optimizing code style: Your golden references are functions written the way you like. The evaluator compares generated code against these exemplars. Agents iterate on formatting rules, naming conventions, abstraction patterns. The output matches your taste in code.
The Regime Shift
We were previously in a regime where optimization required objective metrics. Loss functions, accuracy scores, bits-per-byte. Now we are in a regime where we can optimize anything a multimodal LLM can judge, so long as we give it stable references to judge against.
The combination is simple:
- Golden references give you a stable target (the “what”)
- LLM-as-judge gives you a stable evaluator (the “how to measure”)
- Evolutionary search gives you a stable optimization process (the “how to improve”)
Goodhart’s law still applies. If you only value one dimension, you will over-optimize for it. But biological evolution managed to ratchet from bacteria to fish to primates building semiconductors, all under a single fitness function. The key is that the fitness function was rich enough. Your golden references encode richness. Twenty examples of great writing carry more signal than any written rubric.
The evaluator-optimizer loop is the general pattern. Golden references are what make it work for subjective domains. Evolutionary search is what makes it automatic. Together they let you point at examples of what you want and say “more like this,” then walk away while agents converge on it.

