Autonomous Loops Need a Scoring Function


Without a benchmark, a RALPH loop is a chaos engine. With one, it becomes automated hill climbing. The difference is not the loop. It is the objective.

Author: James Phoenix | Date: March 2026


The Fascination Phase

Most engineers go through a phase of fascination with autonomous loops. RALPH loops, agent swarms, overnight coding. The promise is seductive:

Start loop → wake up to finished product.

Then they build real systems and realise something important.

Engineering is a high-entropy search problem. And LLM loops are very bad searchers.


Why Loops Fail at Open-Ended Engineering

LLM loops lack four things that make engineering hard:

  • Long-term system context: cannot hold the full architectural picture across iterations
  • Architectural judgement: cannot evaluate trade-offs between competing designs
  • Subtle invariant detection: cannot spot when a change violates an unwritten rule
  • Knowing when to stop: cannot distinguish “done” from “stuck” from “wrong direction”

What actually happens in practice:

Human engineer → chooses direction
Agent → executes local steps
Human → evaluates results
Agent → continues

This is human-directed search with machine execution. Not autonomy.

Tools like Claude Code, Codex, and Augment Code have all converged on this pattern. They are tight feedback harnesses, not autonomous developers.


The Search Space Problem

Real codebases are not suited to undirected search. Even small changes have huge branching factors:

architecture decision
   ↓
schema design
   ↓
api surface
   ↓
integration tests
   ↓
observability

Autonomous loops wander in this space without a strong heuristic. You have the heuristic. The loop does not.


LLMs Are Good Generators, Bad Critics

A loop needs two roles: a generator and a critic.

LLMs are good generators. They are mediocre critics.

So loops often collapse into:

write code → review own code → approve own code → repeat

This is the fundamental quality problem. The generator and critic are the same model with the same blind spots. There is no adversarial pressure. Quality decays silently.

The human engineer supplies the real signal: architectural constraints, product intuition, system invariants, simplification pressure. Agents cannot yet replicate that.


Where Loops Actually Work

Here is the distinction that matters. Autonomous loops are bad at open-ended engineering but powerful at bounded search problems.

The difference is subtle but fundamental.

Open-ended engineering (loops fail)

Designing architecture. Deciding schema boundaries. Evolving APIs. Balancing product trade-offs.

f(architecture, product, constraints, time, infra)

No clear scoring function. The objective is constantly being redefined. Humans dominate here.

Benchmark optimisation (loops shine)

Improving prompt accuracy. Tuning retrieval strategies. Evaluating pipelines. Optimising evaluation metrics.

maximize score(model_config)

There is a measurable objective. The loop can generate variants, run benchmarks, keep the best, and repeat.

That is automated hill climbing.


The Math: What Loops Actually Do

Most autonomous loops reduce to a form of black-box optimisation. They search for parameters theta that maximise a metric.

theta* = argmax_theta f(theta)

Breaking this down:

  • theta: the parameters being searched (prompt wording, chain structure, temperature, retrieval k)
  • f(theta): a scoring function (benchmark accuracy, engagement rate, latency)
  • argmax: the value of theta that produces the highest score

Examples in practice: DSPy optimisers, prompt search, RAG pipeline tuning, eval-driven development.
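The argmax above can be sketched as a greedy keep-the-best search. This is a toy illustration with a made-up 1-D objective, not any particular optimiser's implementation: `score`, `mutate`, and the quadratic `f` are all stand-ins.

```python
import random

def hill_climb(score, mutate, theta0, iters=200, seed=0):
    """Black-box search: theta* = argmax_theta f(theta).
    Generate a variant, score it, keep it only if it improves."""
    rng = random.Random(seed)
    best, best_score = theta0, score(theta0)
    for _ in range(iters):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:  # greedy: keep only strict improvements
            best, best_score = candidate, s
    return best, best_score

# Toy objective: a metric peaked at theta = 3.0.
f = lambda t: -(t - 3.0) ** 2
theta, val = hill_climb(f, lambda t, rng: t + rng.uniform(-0.5, 0.5), theta0=0.0)
```

The structure is the same whether theta is a temperature, a prompt string, or a retrieval k; only `mutate` and `score` change.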


Three Conditions That Make Loops Powerful

  • Clear metric: the loop knows what “better” means (accuracy = correct / total)
  • Cheap evaluation: if a benchmark takes 1 minute, a loop can run 1,000 experiments overnight; humans cannot
  • Small parameter space: prompt wording, few-shot examples, retrieval k, ranking threshold; manageable

When all three conditions hold, loops outperform humans by orders of magnitude. When any condition is missing, loops degrade into expensive random walks.


The Core Thesis: Benchmarks Prevent Spec Drift

This is the key insight. The best autonomous loops have a benchmark.

Without a scoring function, each iteration drifts further from the original intent. The loop has no way to measure whether it is making progress or just making changes. This is spec drift: the gradual divergence between what you wanted and what the loop produces.

A benchmark anchors the loop. It provides:

  1. Direction. The loop knows which way is “better.”
  2. Termination. The loop knows when to stop (score plateaus).
  3. Regression detection. The loop catches when a change makes things worse.
  4. Accountability. You can measure the loop’s actual contribution.

Without these, the loop is just a stochastic process. With them, it is an optimiser.

No benchmark:   loop → drift → chaos → "why is this broken?"
With benchmark:  loop → measure → improve → converge
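The four anchoring properties can be folded into one loop skeleton. This is a sketch with hypothetical `propose` and `score` callables, using a capped toy metric so the plateau is visible:

```python
import random

def run_anchored_loop(propose, score, theta, patience=5, eps=1e-3):
    """A benchmark-anchored loop: accept a change only if the score
    improves (direction + regression detection), and stop once the
    score plateaus for `patience` proposals (termination)."""
    best = score(theta)
    stale = 0
    while stale < patience:
        candidate = propose(theta)
        s = score(candidate)
        if s > best + eps:      # measurable progress: keep the change
            theta, best, stale = candidate, s, 0
        else:                   # regression or no-op: reject, count staleness
            stale += 1
    return theta, best

rng = random.Random(1)
score = lambda t: min(t, 1.0)                    # toy metric, plateaus at 1.0
propose = lambda t: t + rng.uniform(0.01, 0.1)   # toy proposal: nudge upward
theta, best = run_anchored_loop(propose, score, 0.0)
```

Remove the `score` comparison and the loop becomes exactly the stochastic process described above: it keeps making changes with no way to tell progress from drift.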

The Clean Separation

The mature workflow splits cleanly along the boundary of “is there a scoring function?”

Human work (no scoring function)

  • Architecture
  • Specs
  • System boundaries
  • Invariants
  • Dataset design
  • Evaluation criteria definition

Loop work (clear scoring function)

  • Prompt tuning
  • Retrieval tuning
  • Ranking experiments
  • Evaluation sweeps
  • Parameter search
  • Content optimisation

The human designs the experiment. The agent runs thousands of trials. The human evaluates whether the metric itself is still the right one.


How This Applies in Practice

Think of agents as research assistants, not engineers.

They are strong at:

  • Systematic exploration of parameter spaces
  • Benchmark optimisation
  • Large parameter sweeps
  • A/B variant generation

Humans still dominate:

  • Architecture
  • Taste
  • Product judgement
  • Defining what “good” means

Example: Content Optimisation

Content optimisation is almost perfectly suited to loops.

maximize engagement(post_parameters)

Parameters: hook wording, structure, topic framing, post length, format. A loop could test thousands of variants against a scoring function overnight.

Example: Prompt Engineering

maximize accuracy(prompt_variant, eval_dataset)

The loop generates prompt variants, runs them against an eval set, keeps the best. This is where DSPy, ADAS, and similar tools operate.
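A minimal sketch of that generate-score-keep step, with a fake predictor standing in for a model call (the dataset, prompts, and `make_predictor` are all hypothetical; a real harness would call an LLM here):

```python
def accuracy(predict, dataset):
    """Fraction of eval examples the prompt variant answers correctly."""
    correct = sum(1 for question, answer in dataset if predict(question) == answer)
    return correct / len(dataset)

def best_prompt(variants, make_predictor, dataset):
    """maximize accuracy(prompt_variant, eval_dataset):
    score every variant on the frozen eval set, keep the winner."""
    scored = [(accuracy(make_predictor(p), dataset), p) for p in variants]
    return max(scored)

# Stand-in for a model call: one prompt "unlocks" correct answers.
dataset = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]
def make_predictor(prompt):
    return lambda q: str(eval(q)) if "EXACT" in prompt else "?"

score, prompt = best_prompt(["be brief", "answer EXACTLY"], make_predictor, dataset)
```

The eval set stays frozen across variants; that is what makes the scores comparable between iterations.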

Example: RAG Pipeline Tuning

maximize relevance(chunk_size, overlap, embedding_model, reranker_config)

Small parameter space. Cheap evaluation. Clear metric. Perfect loop territory.
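With a space this small, the loop does not even need to be clever; an exhaustive sweep works. A sketch with a synthetic relevance surrogate in place of a real retrieval benchmark (the parameter names and peak values are illustrative):

```python
from itertools import product

def grid_search(score, space):
    """Exhaustive sweep over a small parameter space: cheap evaluation
    plus a clear metric means every configuration can be scored."""
    keys = list(space)
    best = max(
        (dict(zip(keys, values)) for values in product(*space.values())),
        key=score,
    )
    return best, score(best)

# Synthetic surrogate peaked at chunk_size=512, overlap=64.
space = {"chunk_size": [256, 512, 1024], "overlap": [0, 64, 128]}
relevance = lambda cfg: -abs(cfg["chunk_size"] - 512) - abs(cfg["overlap"] - 64)
cfg, s = grid_search(relevance, space)
```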


The Vectorised Programming Analogy

Think of agents like vectorised programming.

Old way:

for (i=0; i<n; i++) {
  compute()
}

Vectorised way:

compute_all()

Agents are vectorised engineering work. But you still design the algorithm. The loop executes. You define what to optimise.
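One concrete way to read "compute_all" for agent trials, sketched with the standard library: the human defines the `score` function and the variant set; the batch is dispatched as a whole rather than stepped through one experiment at a time. The objective here is a toy.

```python
from concurrent.futures import ThreadPoolExecutor

def run_trials(score, variants, workers=8):
    """Dispatch a whole batch of trials at once (the vectorised form),
    then reduce to the best-scoring variant."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, variants))
    return max(zip(scores, variants))  # (best_score, best_variant)

# Toy objective: variant 7 scores highest.
best_score, best = run_trials(lambda v: -abs(v - 7), range(20))
```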


The Trap to Avoid

The temptation is to build:

  • Self-improving agent swarms
  • Autonomous dev loops
  • Meta-orchestration
  • Agent schedulers

Instead of shipping the product.

Strong engineers fall into this regularly. You start optimising the meta-system instead of the system. Building the loop runner instead of defining the benchmark.

Stop trying to automate engineering completely. Run tight human-agent loops for architecture work, and reserve autonomous loops for bounded optimisation problems with clear metrics. You will ship faster.


The Workflow That Works

For engineering work (no benchmark)

spec → agent implementation → human review → agent fixups → merge

This is 10x faster than solo coding. 100x more reliable than full autonomy.

For optimisation work (clear benchmark)

define metric → define parameter space → run loop → evaluate results → ship winner

This is where overnight runs make sense. Because the loop has an objective.


One Sentence

If you cannot write score = f(output), do not put it in a loop.

