Loop Engineering

James Phoenix

A loop is engineered, not launched. Every durable one has three parts: a trigger that wakes it, a bounded runner that does the work, and a gate that decides whether the work lands. The whole skill is matching each loop’s autonomy to whether you can write down what “done” means.

Date: June 2026

The Difference Between A Loop And A Toy

When engineers discover agentic loops, they reach straight for the dramatic version: overnight swarms, self-improving systems, wake up to a finished product. I wrote a whole note on why that fails without a scoring function (see Autonomous Loops Need a Scoring Function). Loop engineering is the unglamorous opposite. It means taking the recurring toil you already do by hand (pruning branches, triaging errors, chasing flakes, bumping dependencies) and turning each chore into a loop with a clear trigger, a tightly scoped runner, and a gate that decides whether its output is allowed to ship.

A loop that merely runs is a toy. A loop you would let touch your repo while you sleep is engineered. The three parts:

Trigger. What wakes the loop. A cron schedule, a webhook, a metric crossing a threshold, or a queue with items in it.
Runner. The bounded set of actions it may take. Pushed down to deterministic code wherever possible, with an LLM reserved for the one genuinely ambiguous step.
Gate. The verification that decides land or discard. Tests green, a metric improved, a human approval. No gate, no autonomy.

Anatomy of an engineered loop: a trigger (webhook or cron) wakes a bounded runner (code plus an agent), which hands to a gate (score = f(output) or a human), and the work either lands or loops back to the runner

The whole art lives in the gate. Where you can write score = f(output), the loop can run itself. Where you cannot, a human stands at the gate and the loop stops at “here is a draft.” Hold that distinction, because all six examples below sort along it.
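The anatomy can be sketched in a few lines. This is a minimal illustration, not a real harness: run_loop, LoopResult, and the max_attempts cap are hypothetical names, and the trigger is assumed to be whatever calls the function (a cron job, a webhook handler, a queue consumer).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    landed: bool
    output: Optional[str]
    attempts: int


def run_loop(
    runner: Callable[[str], str],
    gate: Callable[[str], bool],
    work_item: str,
    max_attempts: int = 3,
) -> LoopResult:
    """Run one unit of work through a bounded runner and a gate.

    The runner produces a candidate; the gate alone decides whether it lands.
    Attempts are capped (hypothetical default of 3) so the loop cannot spin.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = runner(work_item)
        if gate(candidate):
            return LoopResult(landed=True, output=candidate, attempts=attempt)
    # Nothing passed the gate: discard rather than ship unverified work.
    return LoopResult(landed=False, output=None, attempts=max_attempts)
```

Swapping the gate callable is the whole autonomy decision: a function that runs the tests makes the loop self-gating, while a function that blocks on a human approval turns the same loop into a draft generator.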

Deterministic vs LLM runner: Branch GC pushes the work down to code and self-gates with a dry-run, so it runs unattended; Sentry-to-PR is LLM-driven with no correctness metric, so a human is the gate and the loop stops at a reviewable PR

Goal, Loop, Routine: One Anatomy, Three Verbs

The three parts are the anatomy. The thing you actually type is a verb, and there are exactly three, which is worth getting straight because most of the confusion in the discourse is people reaching for the wrong one. The cleanest split: a goal runs until an outcome is true and then stops, a loop repeats a task while you are sitting there, and a routine keeps working while you are gone.

In commands:

/goal <condition> runs until a verifiable condition holds, then halts, with a separate fast model checking after each turn whether you are actually done. This is the “fix it until the tests pass” verb, and it is the one both tools now share (Claude Code shipped it in v2.1.139, Codex in its CLI v0.128.0).
/loop <interval> <prompt> repeats on a timer while your session is open, like /loop 5m check the deploy. It is hands-on, right now, you watching. Codex has no /loop yet, so its equivalent is codex exec wrapped in a shell loop.
/schedule <description> creates a cloud routine that runs while your laptop is shut, like /schedule daily PR review at 9am. Codex’s equivalent is Automations in its app.

The trap that catches everyone: there is no /routine command. The scheduler is /schedule in Claude Code and Automations in Codex. Get the verb right and the rest follows.

What makes this more than terminology is that the verb you reach for is a statement about where the gate lives. A /loop puts you at the gate: you are the verification, which is why it is for watching. A /goal says the gate is a machine check a judge model can run without you, which is the only reason it is safe to walk away from. A /schedule routine had better have a machine-checkable gate, because you are asleep when it fires. Same anatomy, three verbs, sorted by exactly the question the rest of this note turns on: can you write down what “done” means.

Two Ways To Wake A Loop: Events vs Cron

The trigger comes in two flavors, and picking the wrong one is the most common way a loop ends up either too slow or pure waste.

Events push. A webhook fires the loop the instant something happens: a new Sentry issue, a PR getting approved, a deploy finishing, a row landing in a queue. This is how the Sentry-to-PR loop works. The error happens, the webhook fires, and an agent on my own machine is already reproducing it before I have finished reading the alert. Events react with the lowest possible latency and cost nothing when there is no work, but they need a listener, and they need de-duplication, because the same issue can fire a hundred times and you want one fix, not a hundred PRs.
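The de-duplication step is small enough to sketch. This assumes events can be keyed by a project name and an issue identifier; the class name, the key format, and the one-hour window are all hypothetical, and a real listener would persist the table rather than keep it in memory.

```python
import hashlib
import time
from typing import Dict, Optional


class EventDeduper:
    """Drop repeat events for the same issue within a time window."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen: Dict[str, float] = {}

    @staticmethod
    def fingerprint(project: str, issue_id: str) -> str:
        # Hypothetical key: one loop run per (project, issue) pair.
        return hashlib.sha256(f"{project}:{issue_id}".encode()).hexdigest()

    def should_run(
        self, project: str, issue_id: str, now: Optional[float] = None
    ) -> bool:
        now = time.time() if now is None else now
        key = self.fingerprint(project, issue_id)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # Handled recently: one fix, not a hundred PRs.
        self._last_seen[key] = now
        return True
```

The webhook handler would call should_run before waking the runner, so a burst of identical alerts costs one agent run.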

Cron pulls. The loop wakes on a fixed schedule and goes looking for work. Branch GC runs weekly; dependency bumps and flake sweeps run nightly. There is no infrastructure beyond a scheduler, but the loop runs whether or not anything needs doing, and its latency is bounded by the interval. A branch deleted up to a week late is fine. A production error fixed up to a week late is not.

The rule I use: if you can name the event, use a webhook. If the trigger is really just “every so often, go check,” use cron. Most operational loops are cron. The few that have to react to the outside world are events.

Events vs Cron: event-driven loops push, a webhook fires them the instant something happens (Sentry errors, a PR approval), best when latency matters; scheduled loops poll, a clock wakes them to check for work (branch GC, dependency bumps), best for a periodic pass over accumulated state

One Shot or a Swarm

There is a second choice hiding in the runner: does one agent do the whole job, or do many?

One shot. Most loops are a single agent in a fresh context that takes one unit of work, does it, passes the gate, and exits. One trigger, one task, one PR. This is the RALPH inner loop, and it is the right default: it is cheap, it is easy to reason about, and the gate has exactly one output to judge. Branch GC, a single dependency bump, one Sentry fix, all one-shot.

A swarm. When the work decomposes into independent sub-tasks, the runner can fan out into many agents in parallel, each in its own worktree so they do not collide, coordinating only through shared state in git and task files. Upgrading thirty dependencies becomes thirty agents, each on its own branch. A brute-force test sweep becomes one agent per attack vector. The merger is the coordination layer for a swarm of PRs.

The gate changes shape with the structure. A one-shot loop gates one output. A swarm needs a gate per agent plus a merge gate that decides whether the parallel results actually compose, which is exactly why the merger re-runs CI after each rebase instead of trusting that green-in-isolation survives contact. More agents is more throughput and more coordination cost, so reach for a swarm only when the sub-tasks are genuinely independent. (See agent swarm patterns and agent sprawl for when parallelism pays and when it just burns tokens.)

Six Loops Worth Building

1. Branch garbage collection (push the work down)

Remove every merged remote branch, and every branch whose PR was closed, but only branches older than 30 days.

Trigger: weekly cron. Runner: this is ninety-five percent deterministic, and that is the lesson. git for-each-ref plus the GitHub API gives you merge status, PR state, and last-commit age. Filter, diff, delete. The LLM earns its place only on the residue: a branch that was never merged, has no PR, and has sat untouched for a year. That is a judgment call, so hand that one to the model with context and keep everything else as code. Gate: a protected-branch allowlist and a dry-run diff you can eyeball, because deletion is hard to walk back.
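The deterministic part of that runner is a plain filter. The sketch below assumes the branch facts (merge status, PR state, last-commit date) have already been gathered from git and the GitHub API; the Branch record, the PROTECTED allowlist, and the bucket names are hypothetical. It only produces a plan, which doubles as the dry-run: nothing is deleted here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional

PROTECTED = {"main", "develop"}      # hypothetical protected-branch allowlist
MIN_AGE = timedelta(days=30)         # only branches older than 30 days
RESIDUE_AGE = timedelta(days=365)    # untouched for a year: judgment call


@dataclass(frozen=True)
class Branch:
    name: str
    last_commit: datetime
    merged: bool
    pr_state: Optional[str]  # "open", "closed", "merged", or None if no PR


def plan_branch_gc(
    branches: Iterable[Branch], now: datetime
) -> Dict[str, List[str]]:
    """Sort branches into delete / ask_model / keep without touching git."""
    plan: Dict[str, List[str]] = {"delete": [], "ask_model": [], "keep": []}
    for b in branches:
        age = now - b.last_commit
        if b.name in PROTECTED or age < MIN_AGE:
            plan["keep"].append(b.name)
        elif b.merged or b.pr_state in ("closed", "merged"):
            plan["delete"].append(b.name)
        elif b.pr_state is None and age >= RESIDUE_AGE:
            # The residue: never merged, no PR, stale. Hand to the model.
            plan["ask_model"].append(b.name)
        else:
            plan["keep"].append(b.name)
    return plan
```

Printing the plan is the eyeball check; a separate step would act on plan["delete"] only after it has been reviewed or has matched expectations for a few weeks.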

The point of loop engineering is to push work down the ladder. The model is the last resort, not the first reach. (Branch hygiene is itself a discipline, see Keeping Parallel Agentic Development Tidy.)

2. Sentry error to PR (no metric, so human gate)

New Sentry error, custom runner, fix, PR.

Trigger: a Sentry webhook on a genuinely new issue, not a regression of one you already know. Runner: a local agent on your own machine pulls the stack trace, reproduces the failure, writes a test that captures it, and fixes until that test is green and the suite stays green. Gate: there is no clean scoring function for “is this fix correct,” so the gate is human review on the PR plus the new regression test guarding against recurrence.

The loop’s job here is to deliver a reviewable draft, never to merge on its own. That is still an enormous win: triage to drafted PR with zero of your attention spent. When correctness is not measurable, the output is a draft and you are the gate.

There is a hosted version of this in Sentry’s Seer, and I genuinely like it as a triage signal. But I keep the runner on my own machine, because the fix has to happen at pure dev parity: reproduced and patched against my exact SDKs, my CLIs, and my local environment, not a cloud sandbox that only approximates them. A patch that reproduces in someone else’s runner but not in mine is one I cannot trust, extend, or debug at 2am.

3. Reduce flake in the test suite (clean metric)

Trigger: nightly, or whenever the flake rate crosses a threshold. Runner: rerun the suite many times, rank tests by failure variance, isolate the flakiest, diagnose the cause (order dependence, wall-clock time, network, shared state), then fix or quarantine. (See Flaky Test Diagnosis Script.) Gate: the flake rate itself, flakes / runs. That is a real scoring function, so the loop can hill-climb it overnight and you wake to a ranked list of fixed tests and a lower number. And the gate should demand a streak, not a single clean pass: one green run is luck, ten consecutive green runs is the signal the flake is actually dead. A new failure resets the count.
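Both halves of that gate fit in a short sketch. It assumes each test's history is a chronological list of booleans (True for a pass); the function names are hypothetical, and the default streak of ten follows the number used above.

```python
from typing import Iterable


def flake_rate(results: Iterable[bool]) -> float:
    """Fraction of runs that failed: flakes / runs."""
    results = list(results)
    if not results:
        raise ValueError("need at least one run")
    return sum(1 for passed in results if not passed) / len(results)


def streak_gate(results: Iterable[bool], required_streak: int = 10) -> bool:
    """Pass only if the most recent runs form a long enough green streak.

    Any failure resets the count, so one lucky pass never counts as fixed.
    """
    streak = 0
    for passed in results:
        streak = streak + 1 if passed else 0
    return streak >= required_streak
```

For example, flake_rate([True, False, True, True]) is 0.25, and a history of ten passes, one failure, then nine passes does not clear the gate because the failure reset the streak.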

A measurable objective is exactly what converts “an agent that does stuff” into an optimizer.

4. Reduce total wall-clock time of the suite (metric plus a guardrail)

Trigger: scheduled. Runner: profile the slowest tests, then attack them: parallelize, replace live calls with fixtures, cache expensive setup, split fat tests. Gate: two numbers, not one. Wall-clock time must go down and coverage must not. The second metric is the guardrail against the obvious cheat, deleting slow tests so the clock drops while the suite gets worse. That is Goodhart insurance in its purest form.
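A two-number gate might look like the sketch below. The SuiteMetrics record, the minimum speedup, and the coverage tolerance are hypothetical parameters; how the two measurements are collected is left to whatever profiler and coverage tool the suite already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteMetrics:
    wall_clock_seconds: float
    coverage_percent: float


def speedup_gate(
    before: SuiteMetrics,
    after: SuiteMetrics,
    min_speedup_seconds: float = 1.0,
    coverage_tolerance: float = 0.0,
) -> bool:
    """Land a change only if the suite got faster AND coverage held."""
    faster = (
        before.wall_clock_seconds - after.wall_clock_seconds
        >= min_speedup_seconds
    )
    # Guardrail: deleting slow tests drops the clock but also drops coverage.
    coverage_held = (
        after.coverage_percent >= before.coverage_percent - coverage_tolerance
    )
    return faster and coverage_held
```

A change that cuts the suite from 300 to 200 seconds while coverage falls from 85% to 70% is rejected, which is the cheat the second metric exists to catch.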

Every optimization loop needs the metric you want plus the metric that stops it from gaming you.

5. Upgrade dependencies (small isolated units)

Trigger: scheduled, or a Dependabot / Renovate alert. Runner: one dependency at a time. Bump it, run the full suite, and when something breaks, fix the breakage. The loop’s real value is reading the changelog and patching the call sites, not the version bump itself. Gate: green CI, one PR per dependency.

The per-dependency isolation is the design decision that makes this safe. A batched “upgrade everything” PR is unreviewable and impossible to bisect when it goes red. Small, isolated units of work are what make a loop’s output trustable, the same task-sizing discipline that the RALPH loop depends on.

6. The merger (a shape that is already a product)

A loop that merges multiple PRs.

Trigger: a queue of approved, CI-green PRs. Runner: take the queue in order, rebase each onto the latest main, re-run CI against the freshly rebased code (because green in isolation is not the same as green after the PR in front of you landed), merge if it holds, kick it back if it does not. Gate: CI green after rebase, evaluated per PR.
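The runner's control flow can be sketched with the side effects injected. The rebase, run_ci, and merge callables are hypothetical stand-ins for real git and CI calls; hosted merge queues differ in detail, so this is only the shape described above.

```python
from typing import Callable, Iterable, List, Tuple


def process_merge_queue(
    queue: Iterable[str],
    rebase: Callable[[str], bool],
    run_ci: Callable[[str], bool],
    merge: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Process PRs in order; return (merged, kicked_back).

    Each PR is rebased onto the latest main and CI is re-run on the rebased
    code, because green in isolation does not guarantee green after the PR
    ahead of it has landed.
    """
    merged: List[str] = []
    kicked_back: List[str] = []
    for pr in queue:
        # CI runs only if the rebase applied cleanly.
        if rebase(pr) and run_ci(pr):
            merge(pr)
            merged.append(pr)
        else:
            kicked_back.append(pr)
    return merged, kicked_back
```

Because the gate is evaluated per PR, one failure kicks back a single PR and the rest of the queue keeps moving.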

This is a merge queue, and the fact that GitHub and friends ship it as a product is the tell: clear trigger, deterministic runner, unambiguous gate. The best loops often already exist as products. Recognizing the shape is half of loop engineering.

The Pattern Underneath

Lay the six side by side and the spectrum is obvious:

Loop	Scoring function	Autonomy
Branch GC	deterministic predicate	full, behind a dry-run
Flake reduction	flake rate	full
Wall-clock	time, with a coverage guard	full
Dep upgrade	green CI per dependency	full, review optional
Merger	green CI per rebase	full
Sentry fix	none for “is it correct”	draft only, human gate

The loops with a writable scoring function earn full autonomy. The one without (nothing can score whether a given bug fix is actually right) stops at a draft. This is the same line from the benchmark note, applied to operational toil instead of feature work. If you can write score = f(output), the loop runs itself. If you cannot, you are the scoring function and the loop hands you a draft.

What People Are Actually Running

Those six are loops I run or have built. To sanity-check the shape against the field, I went through a month of what people are actually posting, and the reassuring part is that every loop worth stealing sorts cleanly into the three verbs. A curated set, grouped that way:

Loops (you are at the gate, watching):

The build-test-fix pair (raycfu). Two agents: a builder that writes code, and a checker that runs the tests, types, and lint and reports exactly what broke. They pass work back and forth until it is clean. The most-demoed loop in the whole scan, and the cleanest illustration of the gate being a separate agent from the runner.
Peter Steinberger’s five-minute maintainer. Every five minutes while he works, an agent makes one small verified upkeep change: a flaky test, a stale comment, a missing type. One commit, tests green, nothing risky. He has described merging on the order of 859 PRs across his repos in a month at a ~95% acceptance rate this way. It is my branch-GC instinct turned into a live heartbeat.
The human-in-the-loop approval queue (from the n8n crowd). The workflow runs, then pauses and pings you with approve / revise / skip before anything ships. Same loop shape, but the stop condition is your approval instead of a passing test, which is exactly the Sentry-loop pattern: no machine gate, so a human stands at it.

Goals (run until a condition holds, then stop):

The capped plan-verify-fix (qbuilder). Plan, implement, verify against the tests, fix what failed, save state to files each pass, and hard-cap at five iterations. You read only the final version. The cap is what makes it safe to leave, the budget point made concrete.
The goal-meta-skill (evgenii.arsentev). A skill whose only job is to rewrite a vague ask into a rigorous goal: the exact end state, how it will be verified, what not to touch, and the stop condition. His line is the whole thesis in seven words, “your agent is not dumb, your instructions are vague.” This is the front half of every good loop, and the part most people skip.
The completion contract (3goblack). Before any work starts, the agent writes down what “complete” means and what evidence proves each requirement, then refuses to claim success without that evidence. It exists to kill the most common lie an agent tells: “done,” when it is not.
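The completion contract in particular reduces to a small data structure. The sketch below is an illustration of the idea, not the implementation from that post: the class, its method names, and the example requirement strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompletionContract:
    """Requirements written down up front, each needing recorded evidence."""

    requirements: List[str]
    evidence: Dict[str, str] = field(default_factory=dict)

    def record(self, requirement: str, proof: str) -> None:
        if requirement not in self.requirements:
            raise KeyError(f"unknown requirement: {requirement}")
        self.evidence[requirement] = proof

    def missing(self) -> List[str]:
        # A blank proof string does not count as evidence.
        return [
            r for r in self.requirements
            if not self.evidence.get(r, "").strip()
        ]

    def claim_done(self) -> bool:
        """Refuse to claim success while any requirement lacks evidence."""
        return not self.missing()
```

The agent fills the contract as it works, and "done" becomes a query against recorded evidence instead of a sentence the agent is free to assert.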

Routines (run while you are gone):

The overnight PR routine. The line that started all of this: the people deepest in this stop writing code and write loops that write the code while they sleep, reportedly up to ~30% of it. The shape is a scheduled routine that watches your open PRs overnight, auto-fixes build failures, answers review comments in a fresh worktree, rebases what is stale, and leaves anything ambiguous for you, with state in git so a crash loses nothing.
The 15,000-emails-a-day triage (posted to r/LangChain). A production routine that pulls new guest emails on an interval, classifies each, drafts replies for the routine ones, and escalates only what needs a human, never auto-sending a refund or a booking change. The rare public post that ships the whole production loop rather than a demo, and a clean example of a routine whose gate is “is this sensitive? then queue it for a person.”

Two things stand out across the set. The strongest ones are almost all somebody else’s gate made explicit: a checker agent, a written contract, an approval button, a five-iteration cap. And not one of the durable ones trusts the worker to grade itself.

Loops Stack By Time Scale

The six recipes above all sit at one altitude: an operational chore that fires, runs, and gates somewhere between a minute and a week. Step back, though, and each one is nested inside slower loops and wraps faster ones. Agentic work is loops all the way down, sorted by how long a single turn of each takes.

Every loop sits inside a slower one: a token loop that emits in milliseconds runs inside a tool-call loop measured in seconds, inside a /goal loop that judges in hours, inside a human-review loop over days, inside a strategic loop measured in quarters, and each loop’s exit condition is a single step of the loop above it

At the bottom, the token loop samples the next token until a stop token, in milliseconds. Wrap it and you get the tool-call loop: the model calls a tool, reads the result, and goes again until there are no calls left, which is one agent turn of a few seconds (the inner loop of RALPH). Wrap that in a /goal loop that runs the agent, judges the result against the goal, and retries until it is met, and you are into hours. Around that sits the human loop, you reviewing and asking for another pass over days, and around everything the strategic loop: set goals, allocate, cull, while the company is not bankrupt, measured in quarters.

The structure is fractal. Each loop is itself a trigger, a runner, and a gate, and one loop’s gate is a single step of the loop above it. “Goal reached” is one tick of the human loop; “human satisfied” is one tick of the strategy loop.

This is why autonomy is easy at the bottom and impossible at the top. The token and tool loops have a crisp, mechanical exit, so they run unattended and unwatched. Climb the stack and the exit gets fuzzier, until the strategic loop has no scoring function at all and a person has to stand at its gate. The loops worth automating sit in the readable middle band, low enough that you can still write down what “done” means.

Where The Leverage Actually Is

Most people over-invest in the runner, the clever agent, and under-invest in the trigger and the gate. That is backwards. The runner is the commodity now: any decent harness can write the fix. The trigger is what makes the loop fire without you remembering it exists, and the gate is what makes its output safe to ship without you watching. Those two are the engineering. The agent in the middle is the easy part.

The strongest loops people actually run all make the same move: they put a second, independent set of eyes inside the loop instead of letting the worker grade its own homework. Boris Cherny describes running the agent alongside a separate verifier model and only advancing a task when it passes. Lukas Kucinski’s Clodex has Codex review Claude’s PRs before merge, so two different model families have to agree before code lands. roborev installs a post-commit hook that reviews every commit and feeds the findings back into a fix loop while the context is still warm. The reason this pattern keeps reappearing is the failure mode it prevents: an agent grading itself will, given the chance, delete the failing test and call the build green. A loop that cannot tell good output from bad does not save you work, it produces wrong answers faster.

The one caveat is context. The agent is a commodity; the context you hand it is not. Give the runner the MCP servers it needs to see your systems, and the secrets it needs through the 1Password CLI, so the loop runs against real dev and staging at the same parity you have at your own terminal. The cleverest agent pointed at the wrong context just writes a confident patch for a system it never actually touched.

And start at the bottom of the ladder, not the top. The moment loops click, the temptation is to point them at feature work: “generate the whole next feature overnight.” That is the seductive version, and it is the one that fails, because feature work has no scoring function and the loop has nothing to climb toward (the whole argument of the benchmark note). Operational chores are the opposite. They are bounded, they recur, and most of them come with a metric for free. Branch GC, flake reduction, dependency upgrades: this is the low-hanging fruit, and it pays you back every week without ever putting a feature at risk. Win there first. The boring loops are where loop engineering compounds. The dramatic ones are where it embarrasses you.

So start concrete. List the operational chores you did by hand this month. Each one is a candidate loop. For each, ask the three questions: what wakes it, what is the smallest deterministic-plus-LLM runner that does it, and can I write down what “done” means? If the answer to the last is yes, you have an autonomous loop. If it is no, you have a draft generator with you at the gate, which is still most of the win.

Budget Is a Stop Condition Too

There is a second way a loop ends that the trigger/runner/gate frame can hide: it runs out of money. A cap is just a gate that fires on cost instead of correctness, and leaving it off is how the romantic version of loops (“a thousand agents build my company overnight”) turns into a bill. Uber reportedly capped its engineers at $1,500 per tool per month after burning its annual AI budget in four months. People have torched thousands of dollars overnight with a single command. The honest one-liner going around the threads is while (you have tokens): burn them in a loop.

So every goal carries an “or stop after N turns,” and every routine runs under a daily ceiling, set before you walk away rather than discovered when the invoice arrives. This matters most for the loops that cannot tell they are stuck. The well-designed ones add explicit anti-spin stops: no-progress detection, a retry cap, flip-flop detection between two approaches, and a budget. Without them a loop will happily spend the whole cap retrying the same broken fix, or do the thing that should terminate it on the spot: quietly edit the failing test until it passes (the Goodhart failure in its purest form).
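The four anti-spin stops can be sketched as one check that runs after every turn. Everything here is an assumption for illustration: the function name, the thresholds, the convention that a higher score is better, and the idea that each turn is tagged with a short label for the approach it tried.

```python
from typing import Optional, Sequence


def stop_reason(
    scores: Sequence[float],
    approaches: Sequence[str],
    spent_usd: float,
    budget_usd: float,
    max_turns: int = 20,
    patience: int = 3,
) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if spent_usd >= budget_usd:
        return "budget"
    if len(scores) >= max_turns:
        return "retry_cap"
    # No progress: the last `patience` turns never beat the earlier best.
    if len(scores) > patience and (
        max(scores[-patience:]) <= max(scores[:-patience])
    ):
        return "no_progress"
    # Flip-flop: the last four turns alternate between two approaches.
    last = list(approaches[-4:])
    if (
        len(last) == 4
        and last[0] == last[2]
        and last[1] == last[3]
        and last[0] != last[1]
    ):
        return "flip_flop"
    return None
```

Checking the budget first is deliberate: cost is the one stop that must fire even when every other signal says the loop is making progress.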

It is worth being clear about what is genuinely new here, because the skeptics have half a point: the scheduling layer really is just cron. What cron never had is a decision-maker in the body, something that reads the state, acts, checks whether it worked, and decides whether to keep going. That decision is the whole new thing. Everything else, the trigger and the schedule, is plumbing we have had for decades.

One Sentence

Do not automate the work. Engineer the loop around it: a trigger to wake it, a bounded runner to do it, and a gate that decides whether it lands.

Autonomous Loops Need a Scoring Function – The gate is the scoring function; this is the theory the recipes apply
The RALPH Loop – Task-sizing and fresh-context iteration, the inner loop these wrap
Sentry Errors Should Spawn Agents on Your Own Machine – The error-to-PR runner in detail
Flaky Test Diagnosis Script – The runner inside the flake-reduction loop
Goodharting Prevention – Why an optimization loop needs a guardrail metric
Agent Sprawl and the Two Constraint Modes – Bounded versus free-form runners and why unbounded tasks burn tokens
24/7 Development Strategy – Loops as the backbone of continuous, unattended work

Part of the field guide

This is one of my field notes in AI Native Software Engineering, a plain-English guide to building software with AI agents. The terms behind it are defined in the AI Coding Dictionary.