A loop is engineered, not launched. Every durable one has three parts: a trigger that wakes it, a bounded runner that does the work, and a gate that decides whether the work lands. The whole skill is matching each loop’s autonomy to whether you can write down what “done” means.
Date: June 2026
The Difference Between A Loop And A Toy
When engineers discover agentic loops, they reach straight for the dramatic version: overnight swarms, self-improving systems, wake up to a finished product. I wrote a whole note on why that fails without a scoring function (see Autonomous Loops Need a Scoring Function). Loop engineering is the unglamorous opposite. It takes the recurring toil you already do by hand (pruning branches, triaging errors, chasing flakes, bumping dependencies) and turns each chore into a loop with a clear trigger, a tightly scoped runner, and a gate that decides whether its output is allowed to ship.
A loop that merely runs is a toy. A loop you would let touch your repo while you sleep is engineered. The three parts:
- Trigger. What wakes the loop. A cron schedule, a webhook, a metric crossing a threshold, or a queue with items in it.
- Runner. The bounded set of actions it may take. Pushed down to deterministic code wherever possible, with an LLM reserved for the one genuinely ambiguous step.
- Gate. The verification that decides land or discard. Tests green, a metric improved, a human approval. No gate, no autonomy.

The whole art lives in the gate. Where you can write score = f(output), the loop can run itself. Where you cannot, a human stands at the gate and the loop stops at “here is a draft.” Hold that distinction, because all six examples below sort along it.
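To make the three parts concrete, here is a minimal sketch of the shape, not a real harness. The `Loop` class, its field names, and the `threshold` parameter are all hypothetical names chosen for illustration. The one design point it tries to capture is that a missing scoring function routes output to a draft pile instead of landing it.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, List, Optional, TypeVar

Work = TypeVar("Work")
Output = TypeVar("Output")


@dataclass
class Loop(Generic[Work, Output]):
    """Hypothetical skeleton: a trigger, a bounded runner, and a gate."""

    trigger: Callable[[], Iterable[Work]]        # yields work items when the loop wakes
    runner: Callable[[Work], Output]             # bounded action on one unit of work
    score: Optional[Callable[[Output], float]]   # None means "done" cannot be written down
    threshold: float = 0.0                       # assumed pass mark for the gate

    def run_once(self) -> dict:
        landed: List[Output] = []
        drafts: List[Output] = []
        discarded: List[Output] = []
        for item in self.trigger():
            output = self.runner(item)
            if self.score is None:
                drafts.append(output)            # no scoring function: a human is the gate
            elif self.score(output) >= self.threshold:
                landed.append(output)            # gate passed: allowed to land
            else:
                discarded.append(output)         # gate failed: discard
        return {"landed": landed, "drafts": drafts, "discarded": discarded}
```

With `score=None` nothing ever reaches `landed`, which is the "here is a draft" case.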

Two Ways To Wake A Loop: Events vs Cron
The trigger comes in two flavors, and picking the wrong one is the most common way a loop ends up either too slow or pure waste.
Events push. A webhook fires the loop the instant something happens: a new Sentry issue, a PR getting approved, a deploy finishing, a row landing in a queue. This is how the Sentry-to-PR loop works. The error happens, the webhook fires, and an agent on my own machine is already reproducing it before I have finished reading the alert. Events react with the lowest possible latency and cost nothing when there is no work, but they need a listener, and they need de-duplication, because the same issue can fire a hundred times and you want one fix, not a hundred PRs.
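The de-duplication step can be as small as a keyed time window. The sketch below is one assumed way to do it: `Deduplicator`, `should_run`, and the one-hour default window are illustrative names and values, and the state lives in memory, so a real listener would need to persist it across restarts.

```python
import time
from typing import Dict, Optional


class Deduplicator:
    """Lets the first event for an issue through and drops repeats inside a window."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._last_started: Dict[str, float] = {}  # issue id -> time a run was started

    def should_run(self, issue_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_started.get(issue_id)
        if last is not None and now - last < self.window_seconds:
            return False                           # same issue fired again: one fix is enough
        self._last_started[issue_id] = now
        return True
```

The webhook handler would call `should_run(issue_id)` before spawning an agent, so a hundred deliveries of the same issue produce one run.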
Cron pulls. The loop wakes on a fixed schedule and goes looking for work. Branch GC runs weekly; dependency bumps and flake sweeps run nightly. There is no infrastructure beyond a scheduler, but the loop runs whether or not anything needs doing, and its latency is bounded by the interval. A branch deleted up to a week late is fine. A production error fixed up to a week late is not.
The rule I use: if you can name the event, use a webhook. If the trigger is really just “every so often, go check,” use cron. Most operational loops are cron. The few that have to react to the outside world are events.

One Shot or a Swarm
There is a second choice hiding in the runner: does one agent do the whole job, or do many?
One shot. Most loops are a single agent in a fresh context that takes one unit of work, does it, passes the gate, and exits. One trigger, one task, one PR. This is the RALPH inner loop, and it is the right default: it is cheap, it is easy to reason about, and the gate has exactly one output to judge. Branch GC, a single dependency bump, and one Sentry fix are all one-shot.
A swarm. When the work decomposes into independent sub-tasks, the runner can fan out into many agents in parallel, each in its own worktree so they do not collide, coordinating only through shared state in git and task files. Upgrading thirty dependencies becomes thirty agents, each on its own branch. A brute-force test sweep becomes one agent per attack vector. The merger is the coordination layer for a swarm of PRs.
The gate changes shape with the structure. A one-shot loop gates one output. A swarm needs a gate per agent plus a merge gate that decides whether the parallel results actually compose, which is exactly why the merger re-runs CI after each rebase instead of trusting that green-in-isolation survives contact with everything merged ahead of it. More agents mean more throughput and more coordination cost, so reach for a swarm only when the sub-tasks are genuinely independent. (See agent swarm patterns and agent sprawl for when parallelism pays and when it just burns tokens.)
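The fan-out half of a swarm, with its per-agent gate, might look like the sketch below. `fan_out`, `run_agent`, and `agent_gate` are hypothetical names. The threads here stand in for agents that would each work in their own worktree, and the merge gate is deliberately left out because it belongs to the step that composes the surviving branches.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def fan_out(
    tasks: Sequence[str],
    run_agent: Callable[[str], str],    # assumed: takes a task, returns the branch it produced
    agent_gate: Callable[[str], bool],  # assumed: per-agent check, e.g. tests green on that branch
    max_workers: int = 4,
) -> Tuple[List[str], List[str]]:
    """Run independent sub-tasks in parallel, then gate each result on its own."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_agent, tasks))  # map returns results in task order
    passed: List[str] = []
    failed: List[str] = []
    for branch in results:
        (passed if agent_gate(branch) else failed).append(branch)
    return passed, failed
```

Only the branches in `passed` go on to the merge gate. This only makes sense when the tasks share no state, which is the independence condition above.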
Six Loops Worth Building
1. Branch garbage collection (push the work down)
Remove every merged remote branch, and every branch whose PR was closed, but only branches older than 30 days.
Trigger: weekly cron. Runner: this is ninety-five percent deterministic, and that is the lesson. git for-each-ref plus the GitHub API gives you merge status, PR state, and last-commit age. Filter, diff, delete. The LLM earns its place only on the residue: a branch that was never merged, has no PR, and has sat untouched for a year. That is a judgment call, so hand that one to the model with context and keep everything else as code. Gate: a protected-branch allowlist and a dry-run diff you can eyeball, because deletion is hard to walk back.
The point of loop engineering is to push work down the ladder. The model is the last resort, not the first reach. (Branch hygiene is itself a discipline, see Keeping Parallel Agentic Development Tidy.)
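The deterministic part of the runner can be sketched as a pure filter over branch records. Everything here is an assumption for illustration: the `Branch` record, the `PROTECTED` allowlist contents, and the `plan_gc` function. Fetching the records (from git for-each-ref and the GitHub API) is left out. The function returns a plan and deletes nothing, which is what makes a dry-run diff possible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence, Tuple

PROTECTED = frozenset({"main", "develop"})  # assumed allowlist, adjust to your repo


@dataclass(frozen=True)
class Branch:
    name: str
    merged: bool
    pr_state: Optional[str]  # "open", "closed", "merged", or None when no PR exists
    last_commit: datetime


def plan_gc(
    branches: Sequence[Branch],
    now: datetime,
    min_age: timedelta = timedelta(days=30),
    stale_after: timedelta = timedelta(days=365),
) -> Tuple[List[str], List[str]]:
    """Return (branches safe to delete, residue that needs a judgment call)."""
    delete: List[str] = []
    ask_model: List[str] = []
    for b in branches:
        if b.name in PROTECTED:
            continue                      # gate: never touch protected branches
        age = now - b.last_commit
        if age <= min_age:
            continue                      # only branches older than 30 days
        if b.merged or b.pr_state in ("closed", "merged"):
            delete.append(b.name)         # deterministic predicate, no model needed
        elif b.pr_state is None and age >= stale_after:
            ask_model.append(b.name)      # never merged, no PR, untouched for a year
    return delete, ask_model
```

Printing `delete` before acting on it is the dry-run diff. Only the `ask_model` list ever reaches the LLM.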
2. Sentry error to PR (no metric, so human gate)
New Sentry error, custom runner, fix, PR.
Trigger: a Sentry webhook on a genuinely new issue, not a regression of one you already know. Runner: a local agent on your own machine pulls the stack trace, reproduces the failure, writes a test that captures it, and fixes until that test is green and the suite stays green. Gate: there is no clean scoring function for “is this fix correct,” so the gate is human review on the PR plus the new regression test guarding against recurrence.
The loop’s job here is to deliver a reviewable draft, never to merge on its own. That is still an enormous win: triage to drafted PR with zero of your attention spent. When correctness is not measurable, the output is a draft and you are the gate.
There is a hosted version of this in Sentry’s Seer, and I genuinely like it as a triage signal. But I keep the runner on my own machine, because the fix has to happen at pure dev parity: reproduced and patched against my exact SDKs, my CLIs, and my local environment, not a cloud sandbox that only approximates them. A patch that reproduces in someone else’s runner but not in mine is one I cannot trust, extend, or debug at 2am.
3. Reduce flake in the test suite (clean metric)
Trigger: nightly, or whenever the flake rate crosses a threshold. Runner: rerun the suite many times, rank tests by failure variance, isolate the flakiest, diagnose the cause (order dependence, wall-clock time, network, shared state), then fix or quarantine. (See Flaky Test Diagnosis Script.) Gate: the flake rate itself, flakes / runs. That is a real scoring function, so the loop can hill-climb it overnight and you wake to a ranked list of fixed tests and a lower number.
A measurable objective is exactly what converts “an agent that does stuff” into an optimizer.
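As a rough sketch of that scoring function, the code below ranks tests by failure variance and computes a flake rate from repeated runs of the same commit. Two assumptions are mine and not the post's: a test counts as flaky when it both passed and failed across the runs, and "flakes" is counted as the number of runs in which at least one flaky test failed. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Run = Dict[str, bool]  # test name -> passed, for one full suite run


def rank_by_failure_variance(runs: Sequence[Run]) -> List[Tuple[str, float]]:
    """Rank tests by p * (1 - p), where p is the observed failure probability."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for run in runs:
        for name, passed in run.items():
            totals[name] += 1
            failures[name] += 0 if passed else 1
    ranked = []
    for name, total in totals.items():
        p = failures[name] / total
        variance = p * (1 - p)
        if variance > 0:  # always-pass and always-fail tests have zero variance
            ranked.append((name, variance))
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


def flake_rate(runs: Sequence[Run]) -> float:
    """flakes / runs, counting a run as a flake if any flaky test failed in it."""
    flaky = {name for name, _ in rank_by_failure_variance(runs)}
    flaked = sum(1 for run in runs if any(not run.get(name, True) for name in flaky))
    return flaked / len(runs) if runs else 0.0
```

A test that fails every time is broken, not flaky, so it drops out of the ranking and does not move the rate.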
4. Reduce total wall-clock time of the suite (metric plus a guardrail)
Trigger: scheduled. Runner: profile the slowest tests, then attack them: parallelize, replace live calls with fixtures, cache expensive setup, split fat tests. Gate: two numbers, not one. Wall-clock time must go down and coverage must not. The second metric is the guardrail against the obvious cheat, deleting slow tests so the clock drops while the suite gets worse. That is Goodhart insurance in its purest form.
Every optimization loop needs the metric you want plus the metric that stops it from gaming you.
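A two-number gate of this kind is only a few lines. In the sketch below, `SuiteMetrics`, `speedup_gate`, and both tolerance parameters are hypothetical names, and how the numbers get measured is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteMetrics:
    wall_clock_seconds: float
    coverage_percent: float


def speedup_gate(
    before: SuiteMetrics,
    after: SuiteMetrics,
    min_gain_seconds: float = 0.0,    # assumed: ignore gains smaller than this
    coverage_tolerance: float = 0.0,  # assumed: allowed coverage drop, default none
) -> bool:
    """Land the change only if the suite got faster and coverage did not fall."""
    faster = before.wall_clock_seconds - after.wall_clock_seconds > min_gain_seconds
    coverage_held = after.coverage_percent >= before.coverage_percent - coverage_tolerance
    return faster and coverage_held
```

Deleting slow tests makes `faster` true and `coverage_held` false, so the cheat fails the gate.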
5. Upgrade dependencies (small isolated units)
Trigger: scheduled, or a Dependabot / Renovate alert. Runner: one dependency at a time. Bump it, run the full suite, and when something breaks, fix the breakage. The loop’s real value is reading the changelog and patching the call sites, not the version bump itself. Gate: green CI, one PR per dependency.
The per-dependency isolation is the design decision that makes this safe. A batched “upgrade everything” PR is unreviewable and impossible to bisect when it goes red. Small, isolated units of work are what make a loop’s output trustable, the same task-sizing discipline that the RALPH loop depends on.
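The one-at-a-time shape, with a bounded number of fix attempts, could be sketched as follows. `plan_upgrades`, the `ci_green` and `fix` callables, the `deps/` branch naming, and the attempt cap are all assumptions for illustration. In a real loop `fix` is where the agent reads the changelog and patches call sites.

```python
from typing import Callable, Dict, List, Tuple


def plan_upgrades(
    outdated: Dict[str, str],             # package name -> target version
    ci_green: Callable[[str, str], bool], # assumed: runs the full suite on that one bump
    fix: Callable[[str, str], None],      # assumed: one attempt at patching the breakage
    max_fix_attempts: int = 2,            # bound the runner so it cannot spin forever
) -> Tuple[List[str], List[str]]:
    """Handle each dependency in isolation: one branch, one gate, one PR."""
    ready_for_pr: List[str] = []
    gave_up: List[str] = []
    for name, version in sorted(outdated.items()):
        attempts = 0
        green = ci_green(name, version)
        while not green and attempts < max_fix_attempts:
            fix(name, version)
            attempts += 1
            green = ci_green(name, version)
        branch = f"deps/{name}-{version}"
        (ready_for_pr if green else gave_up).append(branch)
    return ready_for_pr, gave_up
```

Because each branch carries exactly one bump, a red result points at one dependency and nothing needs bisecting.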
6. The merger (a shape that is already a product)
A loop that merges multiple PRs.
Trigger: a queue of approved, CI-green PRs. Runner: take the queue in order, rebase each onto the latest main, re-run CI against the freshly rebased code (because green in isolation is not the same as green after the PR in front of you landed), merge if it holds, kick it back if it does not. Gate: CI green after rebase, evaluated per PR.
This is a merge queue, and the fact that GitHub and friends ship it as a product is the tell: clear trigger, deterministic runner, unambiguous gate. The best loops often already exist as products. Recognizing the shape is half of loop engineering.
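For readers who have not looked inside one, the core of a merge queue fits in a dozen lines. This is a toy model, not how any hosted product implements it: `run_merge_queue` and `rebase_and_test` are hypothetical names, and `rebase_and_test` stands in for rebasing a PR onto everything landed so far and re-running CI.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def run_merge_queue(
    approved: Iterable[str],
    rebase_and_test: Callable[[str, Tuple[str, ...]], bool],  # (pr, already landed) -> CI green?
) -> Tuple[List[str], List[str]]:
    """Process approved PRs in order, re-checking each against the latest main."""
    landed: List[str] = []
    kicked_back: List[str] = []
    pending = deque(approved)
    while pending:
        pr = pending.popleft()
        if rebase_and_test(pr, tuple(landed)):
            landed.append(pr)       # green after rebase: merge
        else:
            kicked_back.append(pr)  # green in isolation, red after rebase: back to the author
    return landed, kicked_back
```

The second argument to `rebase_and_test` is the whole point: the gate sees what landed ahead of the PR, not just the PR on its own.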
The Pattern Underneath
Lay the six side by side and the spectrum is obvious:
| Loop | Scoring function | Autonomy |
|---|---|---|
| Branch GC | deterministic predicate | full, behind a dry-run |
| Flake reduction | flake rate | full |
| Wall-clock | time, with a coverage guard | full |
| Dep upgrade | green CI per dependency | full, review optional |
| Merger | green CI per rebase | full |
| Sentry fix | none for “is it correct” | draft only, human gate |
The loops with a writable scoring function earn full autonomy. The one without, the Sentry fix, stops at a draft, because nothing scores whether a given bug fix is actually right. This is the same line from the benchmark note, applied to operational toil instead of feature work. If you can write score = f(output), the loop runs itself. If you cannot, you are the scoring function and the loop hands you a draft.
Where The Leverage Actually Is
Most people over-invest in the runner, the clever agent, and under-invest in the trigger and the gate. That is backwards. The runner is the commodity now: any decent harness can write the fix. The trigger is what makes the loop fire without you remembering it exists, and the gate is what makes its output safe to ship without you watching. Those two are the engineering. The agent in the middle is the easy part.
The one caveat is context. The agent is a commodity; the context you hand it is not. Give the runner the MCP servers it needs to see your systems, and the secrets it needs through the 1Password CLI, so the loop runs against real dev and staging at the same parity you have at your own terminal. The cleverest agent pointed at the wrong context just writes a confident patch for a system it never actually touched.
And start at the bottom of the ladder, not the top. The moment loops click, the temptation is to point them at feature work: “generate the whole next feature overnight.” That is the seductive version, and it is the one that fails, because feature work has no scoring function and the loop has nothing to climb toward (the whole argument of the benchmark note). Operational chores are the opposite. They are bounded, they recur, and most of them come with a metric for free. Branch GC, flake reduction, dependency upgrades: this is the low-hanging fruit, and it pays you back every week without ever putting a feature at risk. Win there first. The boring loops are where loop engineering compounds. The dramatic ones are where it embarrasses you.
So start concrete. List the operational chores you did by hand this month. Each one is a candidate loop. For each, ask the three questions: what wakes it, what is the smallest deterministic-plus-LLM runner that does it, and can I write down what “done” means? If the answer to the last is yes, you have an autonomous loop. If it is no, you have a draft generator with you at the gate, which is still most of the win.
One Sentence
Do not automate the work. Engineer the loop around it: a trigger to wake it, a bounded runner to do it, and a gate that decides whether it lands.
Related
- Autonomous Loops Need a Scoring Function – The gate is the scoring function; this is the theory the recipes apply
- The RALPH Loop – Task-sizing and fresh-context iteration, the inner loop these wrap
- Sentry Errors Should Spawn Agents on Your Own Machine – The error-to-PR runner in detail
- Flaky Test Diagnosis Script – The runner inside the flake-reduction loop
- Goodharting Prevention – Why an optimization loop needs a guardrail metric
- Agent Sprawl and the Two Constraint Modes – Bounded versus free-form runners and why unbounded tasks burn tokens
- 24/7 Development Strategy – Loops as the backbone of continuous, unattended work

