Without a benchmark, a RALPH loop is a chaos engine. With one, it becomes automated hill climbing. The difference is not the loop. It is the objective.
Author: James Phoenix | Date: March 2026
The Fascination Phase
Most engineers go through a phase of fascination with autonomous loops. RALPH loops, agent swarms, overnight coding. The promise is seductive:
Start loop → wake up to finished product.
Then they build real systems and realise something important.
Engineering is a high-entropy search problem. And LLM loops are very bad searchers.
Why Loops Fail at Open-Ended Engineering
LLM loops lack four things that make engineering hard:
| Missing Capability | Why It Matters |
|---|---|
| Long-term system context | Cannot hold full architectural picture across iterations |
| Architectural judgement | Cannot evaluate trade-offs between competing designs |
| Subtle invariant detection | Cannot spot when a change violates an unwritten rule |
| Knowing when to stop | Cannot distinguish “done” from “stuck” from “wrong direction” |
What actually happens in practice:
Human engineer → chooses direction
Agent → executes local steps
Human → evaluates results
Agent → continues
This is human-directed search with machine execution. Not autonomy.
Tools like Claude Code, Codex, and Augment Code have all converged on this pattern. They are tight feedback harnesses, not autonomous developers.
The Search Space Problem
Real codebases are not suited to undirected search. Even small changes have huge branching factors:
architecture decision
↓
schema design
↓
api surface
↓
integration tests
↓
observability
Autonomous loops wander in this space without a strong heuristic. You have the heuristic. The loop does not.
LLMs Are Good Generators, Bad Critics
A loop needs two roles: a generator and a critic.
LLMs are good generators. They are mediocre critics.
So loops often collapse into:
write code → review own code → approve own code → repeat
This is the fundamental quality problem. The generator and critic are the same model with the same blind spots. There is no adversarial pressure. Quality decays silently.
The human engineer supplies the real signal: architectural constraints, product intuition, system invariants, simplification pressure. Agents cannot yet replicate that.
Where Loops Actually Work
Here is the distinction that matters. Autonomous loops are bad at open-ended engineering but powerful at bounded search problems.
The difference is subtle but fundamental.
Open-ended engineering (loops fail)
Designing architecture. Deciding schema boundaries. Evolving APIs. Balancing product trade-offs.
f(architecture, product, constraints, time, infra)
No clear scoring function. The objective is constantly being redefined. Humans dominate here.
Benchmark optimisation (loops shine)
Improving prompt accuracy. Tuning retrieval strategies. Evaluating pipelines. Optimising evaluation metrics.
maximize score(model_config)
There is a measurable objective. The loop can generate variants, run benchmarks, keep the best, and repeat.
That is automated hill climbing.
The Math: What Loops Actually Do
Most autonomous loops reduce to a form of black-box optimisation. They search for parameters theta that maximise a metric.
theta* = argmax_theta f(theta)
Breaking this down:
- theta: the parameters being searched (prompt wording, chain structure, temperature, retrieval k)
- f(theta): a scoring function (benchmark accuracy, engagement rate, latency)
- argmax: the value of theta that produces the highest score
Examples in practice: DSPy optimisers, prompt search, RAG pipeline tuning, eval-driven development.
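The argmax above can be sketched as naive random search, the simplest black-box optimiser. This is an illustrative sketch: `score` is a made-up stand-in for a real benchmark, and the two parameters (temperature, retrieval k) are borrowed from the list above.

```python
import random

def score(theta):
    # Stand-in benchmark: pretend the optimum is temperature 0.2 and
    # retrieval_k 5. A real f(theta) would run an eval suite.
    return -abs(theta["temperature"] - 0.2) - abs(theta["retrieval_k"] - 5) * 0.1

def random_search(n_trials=200, seed=0):
    rng = random.Random(seed)
    best_theta, best_score = None, float("-inf")
    for _ in range(n_trials):
        theta = {
            "temperature": rng.uniform(0.0, 1.0),
            "retrieval_k": rng.randint(1, 20),
        }
        s = score(theta)
        if s > best_score:  # keep the argmax over sampled thetas
            best_theta, best_score = theta, s
    return best_theta, best_score

theta_star, f_star = random_search()
print(theta_star, f_star)
```

Swap in a real eval harness for `score` and this is recognisably what prompt-search tools do under the hood.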
Three Conditions That Make Loops Powerful
| Condition | Why It Matters |
|---|---|
| Clear metric | The loop knows what “better” means. accuracy = correct / total |
| Cheap evaluation | If a benchmark run takes a minute, a loop can execute hundreds of experiments overnight. Humans cannot. |
| Small parameter space | Prompt wording, few-shot examples, retrieval k, ranking threshold. Manageable. |
When all three conditions hold, loops outperform humans by orders of magnitude. When any condition is missing, loops degrade into expensive random walks.
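As a concrete instance of the first two conditions, here is a minimal clear metric. The eval examples are invented for illustration; the point is that scoring a batch costs microseconds, so the loop can afford thousands of trials.

```python
def accuracy(predictions, labels):
    # Clear metric: the loop knows exactly what "better" means.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Cheap evaluation: scoring is effectively free compared to generation,
# so evaluation never becomes the bottleneck of the loop.
preds = ["yes", "no", "yes", "yes"]
labels = ["yes", "no", "no", "yes"]
print(accuracy(preds, labels))  # → 0.75
```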
The Core Thesis: Benchmarks Prevent Spec Drift
This is the key insight. The best autonomous loops have a benchmark.
Without a scoring function, each iteration drifts further from the original intent. The loop has no way to measure whether it is making progress or just making changes. This is spec drift: the gradual divergence between what you wanted and what the loop produces.
A benchmark anchors the loop. It provides:
- Direction. The loop knows which way is “better.”
- Termination. The loop knows when to stop (score plateaus).
- Regression detection. The loop catches when a change makes things worse.
- Accountability. You can measure the loop’s actual contribution.
Without these, the loop is just a stochastic process. With them, it is an optimiser.
No benchmark: loop → drift → chaos → "why is this broken?"
With benchmark: loop → measure → improve → converge
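Those four properties map directly onto the loop's control flow. A hedged sketch, where `propose` and `benchmark` are hypothetical stand-ins rather than any specific library's API:

```python
def optimise(propose, benchmark, initial, patience=3):
    """Hill-climb with the benchmark as the anchor.

    propose(candidate)   -> a mutated candidate (e.g. a new prompt variant)
    benchmark(candidate) -> a score; higher is better
    """
    best, best_score = initial, benchmark(initial)
    stale = 0
    while stale < patience:        # termination: stop when the score plateaus
        candidate = propose(best)
        s = benchmark(candidate)
        if s > best_score:         # direction: only move "uphill"
            best, best_score = candidate, s
            stale = 0
        else:
            stale += 1             # regression detection: discard losers
    return best, best_score        # accountability: a measurable result

# Toy usage: a one-dimensional objective with its peak at x = 3.
import random
rng = random.Random(1)
best, score = optimise(
    propose=lambda x: x + rng.uniform(-1, 1),
    benchmark=lambda x: -(x - 3) ** 2,
    initial=0.0,
)
print(best, score)
```

Remove the `benchmark` argument and there is nothing left but `propose`: a stochastic process, exactly as described above.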
The Clean Separation
The mature workflow splits cleanly along the boundary of “is there a scoring function?”
Human work (no scoring function)
- Architecture
- Specs
- System boundaries
- Invariants
- Dataset design
- Evaluation criteria definition
Loop work (clear scoring function)
- Prompt tuning
- Retrieval tuning
- Ranking experiments
- Evaluation sweeps
- Parameter search
- Content optimisation
The human designs the experiment. The agent runs thousands of trials. The human evaluates whether the metric itself is still the right one.
How This Applies in Practice
Think of agents as research assistants, not engineers.
They are strong at:
- Systematic exploration of parameter spaces
- Benchmark optimisation
- Large parameter sweeps
- A/B variant generation
Humans still dominate:
- Architecture
- Taste
- Product judgement
- Defining what “good” means
Example: Content Optimisation
Content optimisation is almost perfectly suited to loops.
maximize engagement(post_parameters)
Parameters: hook wording, structure, topic framing, post length, format. A loop could test thousands of variants against a scoring function overnight.
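A sketch of that variant sweep, under loud assumptions: the parameter values are invented, and `engagement` is a simulated scorer standing in for real engagement data.

```python
import itertools
import random

# Invented parameter space for a post variant.
HOOKS = ["question", "bold_claim", "statistic"]
LENGTHS = ["short", "medium", "long"]
FORMATS = ["thread", "single", "listicle"]

def engagement(hook, length, fmt, rng):
    # Simulated stand-in: in reality this score comes from measured
    # engagement, which is why the metric is noisy.
    base = {"question": 0.4, "bold_claim": 0.6, "statistic": 0.5}[hook]
    bonus = 0.2 if (length, fmt) == ("short", "thread") else 0.0
    return base + bonus + rng.uniform(0, 0.05)

rng = random.Random(42)
variants = itertools.product(HOOKS, LENGTHS, FORMATS)
best = max(variants, key=lambda v: engagement(*v, rng))
print(best)
```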
Example: Prompt Engineering
maximize accuracy(prompt_variant, eval_dataset)
The loop generates prompt variants, runs them against an eval set, keeps the best. This is where DSPy, ADAS, and similar tools operate.
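A minimal sketch of generate-evaluate-keep, with heavy caveats: `run_model` is a toy stand-in for an LLM call, and the three-item eval set and prompt variants are invented for illustration.

```python
# Toy eval set: (question, expected answer).
EVAL_SET = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]

def run_model(prompt, question):
    # Stand-in for an LLM call: pretend a prompt that mentions
    # "arithmetic" steers the model to the right answer.
    if "arithmetic" in prompt:
        return str(eval(question))  # toy solver, only for the toy eval set
    return "unsure"

def accuracy(prompt):
    correct = sum(run_model(prompt, q) == a for q, a in EVAL_SET)
    return correct / len(EVAL_SET)

variants = [
    "Answer the question.",
    "You are an arithmetic assistant. Answer the question.",
    "Reply briefly.",
]
best_prompt = max(variants, key=accuracy)
print(best_prompt, accuracy(best_prompt))
```

In a real setup the variant list would itself be generated by a model each round, but the keep-the-best logic is unchanged.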
Example: RAG Pipeline Tuning
maximize relevance(chunk_size, overlap, embedding_model, reranker_config)
Small parameter space. Cheap evaluation. Clear metric. Perfect loop territory.
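With a space this small, exhaustive grid search is enough. A sketch over the parameters named above; `relevance` is a simulated stand-in for running a retrieval eval suite, and its weights are invented.

```python
import itertools

GRID = {
    "chunk_size": [256, 512, 1024],
    "overlap": [0, 64, 128],
    "embedding_model": ["small", "large"],
    "reranker": [None, "cross-encoder"],
}

def relevance(cfg):
    # Simulated eval: each "good" choice adds a fixed bonus.
    score = 0.5
    score += 0.1 if cfg["chunk_size"] == 512 else 0.0
    score += 0.05 if cfg["overlap"] == 64 else 0.0
    score += 0.1 if cfg["embedding_model"] == "large" else 0.0
    score += 0.15 if cfg["reranker"] == "cross-encoder" else 0.0
    return score

keys = list(GRID)
configs = [dict(zip(keys, values)) for values in itertools.product(*GRID.values())]
best = max(configs, key=relevance)  # 36 configs: exhaustive search is cheap
print(best, relevance(best))
```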
The Vectorised Programming Analogy
Think of agents like vectorised programming.
Old way:
for (i = 0; i < n; i++) {
    compute();
}
Vectorised way:
compute_all()
Agents are vectorised engineering work. But you still design the algorithm. The loop executes. You define what to optimise.
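The same shift, sketched with a hypothetical `run_trial` standing in for one agent-run experiment: instead of iterating one trial at a time, you hand the whole batch to a harness and collect results.

```python
from concurrent.futures import ThreadPoolExecutor

def run_trial(theta):
    # Stand-in for one experiment (e.g. evaluating one prompt variant).
    return theta * theta

thetas = range(8)

# Old way: one trial per loop iteration, human-paced.
results_loop = [run_trial(t) for t in thetas]

# "Vectorised" way: dispatch the whole batch; the harness runs the trials.
with ThreadPoolExecutor(max_workers=4) as pool:
    results_batch = list(pool.map(run_trial, thetas))

assert results_loop == results_batch
print(results_batch)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

You still wrote `run_trial` and chose the batch. The executor just removed you from the inner loop.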
The Trap to Avoid
The temptation is to build:
- Self-improving agent swarms
- Autonomous dev loops
- Meta-orchestration
- Agent schedulers
Instead of shipping the product.
Strong engineers fall into this regularly. You start optimising the meta-system instead of the system. Building the loop runner instead of defining the benchmark.
Stop trying to automate engineering completely. Run tight human-agent loops for architecture work, and reserve autonomous loops for bounded optimisation problems with clear metrics. You will ship faster.
The Workflow That Works
For engineering work (no benchmark)
spec → agent implementation → human review → agent fixups → merge
In practice this is roughly an order of magnitude faster than solo coding, and dramatically more reliable than full autonomy.
For optimisation work (clear benchmark)
define metric → define parameter space → run loop → evaluate results → ship winner
This is where overnight runs make sense. Because the loop has an objective.
One Sentence
If you cannot write score = f(output), do not put it in a loop.
Related
- The RALPH Loop – The iteration pattern these loops are based on
- Agent-Driven Development – Human-directed search with machine execution
- Two Camps of Agentic Coding – Conversational vs spec-driven engineering
- Evaluation-Driven Development – Building around eval metrics
- Synthetic Loss Functions for Agent Swarms – Defining scoring functions for agent work
- Human-in-the-Loop Patterns – The feedback harness model
- Goodharting Prevention – When the benchmark becomes the wrong target
- Building the Harness – The meta-engineering layer

