Traditional development measures progress by output (features shipped). The right measure is whether system error decreases over time.
Author: James Phoenix | Date: February 2026
Summary
In ML, you do not ask “is the model good?” You define a loss function, then optimise it down over time. Apply the same principle to software production with agent swarms. Define a composite scalar L_total from measurable terms (spec violations, test failures, architectural drift, type errors, regressions, operational gaps, unknowns). Track it per cycle. Stop when it is low, flat, and bounce-free. This turns “is the system done?” from a vibes question into a measurable one.
Why Software Needs a Scalar Loss Function
Agent swarms can produce enormous output. Output volume is not a signal of progress. A swarm that generates 10,000 lines of code per cycle might be making the system worse.
The problem:
"the code feels messy" → not quantified
"this is risky" → not quantified
"the architecture is drifting" → not quantified
"we should refactor" → not quantified
None of these are continuously optimised. Progress is measured indirectly via features shipped, not system health.
ML systems solved this decades ago with an explicit loss function. The system is judged by whether loss decreases over time. Software production needs the same thing.
The Formula
Define total system loss as a weighted sum of 7 measurable terms:
L_total = w₁·L_spec + w₂·L_tests + w₃·L_arch + w₄·L_types + w₅·L_reg + w₆·L_ops + w₇·L_unknown
Where each term is a non-negative scalar counting unresolved issues, and each weight wᵢ encodes how much you care about that category.
The system objective:
Minimise L_total over time
L_total(t+1) ≤ L_total(t)
If loss does not trend downward, the system is not improving, regardless of how many features are shipped.
The 7 Loss Terms
1. Spec Loss (L_spec)
Divergence between PRDs/design docs and the actual codebase.
| Signal | Example |
|---|---|
| Feature implemented differently from spec | Auth flow uses sessions when spec says JWT |
| Missing required behaviour | Rate limiting not implemented |
| Extra behaviour not in spec | Unauthorized admin endpoint added |
L_spec = Σᵢ w_specᵢ (weighted by severity)
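As a sketch, the severity-weighted sum reads: each open spec violation contributes a weight that scales with how severe it is. The severity tiers and values below are illustrative, not prescribed by the formula:

```typescript
// Illustrative severity tiers and weights; tune these per project.
type Severity = "low" | "medium" | "high";
const severityWeight: Record<Severity, number> = { low: 1, medium: 3, high: 9 };

// L_spec = Σᵢ w_specᵢ: sum of severity weights over open spec violations.
function specLoss(openViolations: Severity[]): number {
  return openViolations.reduce((sum, s) => sum + severityWeight[s], 0);
}
```

The same shape applies to the other six terms; only the issue source and the weights differ.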
2. Test Loss (L_tests)
Failing, flaky, or missing tests.
| Signal | Example |
|---|---|
| Failing tests | 3 unit tests broken after refactor |
| Flaky tests | Payment test passes 80% of runs |
| Missing critical path coverage | No test for session expiry |
L_tests = Σᵢ w_testᵢ
Analogous to training error in ML.
3. Architectural Loss (L_arch)
Structural decay that compounds future loss.
| Signal | Example |
|---|---|
| Layer violations | UI component imports from database layer |
| Circular dependencies | Module A depends on B depends on A |
| Broken domain boundaries | Payment logic leaks into auth module |
L_arch = Σᵢ w_archᵢ
This term is the most dangerous because it compounds. A layer violation today means 10 bugs tomorrow.
4. Type and Invariant Loss (L_types)
Violations of static types, runtime invariants, Effect error contracts, ESLint domain rules.
| Signal | Example |
|---|---|
| TypeScript errors | 12 type errors after dependency update |
| Effect channel violations | Unhandled error in Effect pipeline |
| Lint violations | 8 architectural lint rule breaches |
L_types = Σᵢ w_typeᵢ
This term collapses the error search space. It is the cheapest loss to reduce.
5. Regression Loss (L_reg)
Previously fixed issues that reappear. Regressions indicate unstable convergence.
| Signal | Example |
|---|---|
| Reopened issues | Bug #342 reintroduced by refactor |
| Reverted fixes | Rate limiter broken after auth changes |
| Test destabilisation | Stable test now flaky |
L_reg = Σᵢ w_regᵢ
A rising L_reg means the system is not learning. Weight this term heavily.
6. Operational Loss (L_ops)
Runtime health signals.
| Signal | Example |
|---|---|
| Performance regressions | p99 latency doubled |
| Memory leaks | Heap grows 5MB/hour |
| Observability gaps | No metrics on payment endpoint |
| Missing alerts | No alert for database connection failures |
L_ops = Σᵢ w_opsᵢ
7. Unknown Loss (L_unknown)
Uncertainty about system correctness. This is the term most people miss.
| Signal | Example |
|---|---|
| Code changed without tests | 200 lines in payment module, no test |
| High-risk module with low observability | Auth has no structured logging |
| New code paths without spec references | Endpoint added with no design doc |
L_unknown = Σᵢ w_unknownᵢ
You want L_unknown near zero before calling anything “done.” It is a penalty for “we do not know if this is safe.”
Weighting Strategies
Weights encode your priorities. There is no universal weighting, but a reasonable starting point:
| Term | Weight | Rationale |
|---|---|---|
| L_reg | 3.0 | Regressions destroy trust in convergence |
| L_spec | 2.0 | Spec violations mean the wrong thing was built |
| L_arch | 2.0 | Architectural decay compounds future loss |
| L_types | 1.5 | Types are cheap to fix, expensive to ignore |
| L_tests | 1.0 | Baseline correctness signal |
| L_unknown | 1.0 | Uncertainty penalty |
| L_ops | 0.5 | Important but less urgent pre-release |
Adjust per phase. During a polish phase, increase L_reg weight. During early growth, tolerate higher L_unknown temporarily.
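As a sketch, the starting point above can be encoded directly, with phase presets layered on top. The preset values are illustrative, not from the table; the object shape matches the `weights` argument that `computeLoss` takes later in this article:

```typescript
// Suggested starting weights from the table above.
const baseWeights = {
  reg: 3.0,     // regressions destroy trust in convergence
  spec: 2.0,    // spec violations mean the wrong thing was built
  arch: 2.0,    // architectural decay compounds future loss
  types: 1.5,   // cheap to fix, expensive to ignore
  tests: 1.0,   // baseline correctness signal
  unknown: 1.0, // uncertainty penalty
  ops: 0.5,     // important but less urgent pre-release
};

// Phase presets (illustrative values): polish emphasises regressions,
// early growth temporarily tolerates more uncertainty.
const polishWeights = { ...baseWeights, reg: 5.0 };
const growthWeights = { ...baseWeights, unknown: 0.5 };
```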
The Stop Condition
Do not stop when L_total is low once. Stop when it is low and stable.
Stop if ALL of:
1. L_total < T (below threshold)
2. slope(L_total, last K cycles) ≈ 0 (no meaningful improvement left)
3. regressions in last K cycles == 0 (no bounce)
4. L_unknown < U (uncertainty is low)
In words: low, flat, and no bounce.
This avoids the classic failure: loss dips, you stop, then it pops back up next cycle.
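The four criteria can be sketched as a single check. T, U, and K are the thresholds from the criteria above; the flatness tolerance (1% of T per cycle) and the endpoint-based slope estimate are illustrative simplifications:

```typescript
// Hypothetical stop-condition check over per-cycle loss history.
function shouldStop(
  lossHistory: number[],       // L_total per cycle
  regressionHistory: number[], // regression count per cycle
  unknownLoss: number,         // current L_unknown
  T: number, U: number, K: number,
): boolean {
  if (lossHistory.length < K) return false;
  const recent = lossHistory.slice(-K);
  const latest = recent[recent.length - 1];
  // 1. Low: below threshold
  if (latest >= T) return false;
  // 2. Flat: no meaningful improvement across the window
  const slope = (latest - recent[0]) / (K - 1);
  if (Math.abs(slope) > 0.01 * T) return false;
  // 3. No bounce: zero regressions in the window
  if (regressionHistory.slice(-K).some(r => r > 0)) return false;
  // 4. Uncertainty is low
  return unknownLoss < U;
}
```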
Stop Condition Visualized
Healthy (ready to stop):

```
L_total
│
│╲
│ ╲
│  ╲
│   ╲___________   ← flat, below threshold
│
└───────────────── cycles
```

Unhealthy (not ready):

```
L_total
│
│╲  ╱╲  ╱╲
│ ╲╱  ╲╱  ╲    ← sawtooth, regressions
│          ╲╱
└───────────────── cycles
```

Chaotic (no convergence):

```
L_total
│
│ ╱╲╱╲╱╲╱╲╱╲╱╲   ← noisy flat, no progress
│
└───────────────── cycles
```
Implementation: Per-Cycle Tracking
Store per cycle:
```typescript
interface CycleMetrics {
  cycle: number;
  timestamp: string;

  // Raw counts per loss term
  specViolations: number;
  failingTests: number;
  flakyTests: number;
  archViolations: number;
  typeErrors: number;
  lintViolations: number;
  regressions: number;
  opsIssues: number;
  unknownRiskAreas: number;

  // Computed
  L_total: number;

  // Deltas (what changed this cycle)
  issuesClosed: string[];
  issuesOpened: string[];
  regressionEvents: string[];

  // Churn metrics
  filesChanged: number;
  linesAdded: number;
  linesRemoved: number;
}
```
Compute L_total each cycle:
```typescript
// One weight per loss term (the wᵢ in the formula).
interface Weights {
  spec: number;
  tests: number;
  arch: number;
  types: number;
  reg: number;
  ops: number;
  unknown: number;
}

function computeLoss(m: CycleMetrics, weights: Weights): number {
  return (
    weights.spec * m.specViolations +
    weights.tests * (m.failingTests + m.flakyTests) +
    weights.arch * m.archViolations +
    weights.types * (m.typeErrors + m.lintViolations) +
    weights.reg * m.regressions +
    weights.ops * m.opsIssues +
    weights.unknown * m.unknownRiskAreas
  );
}
```
Tracking the Trend
```typescript
function isConverging(history: CycleMetrics[], windowSize: number): boolean {
  const recent = history.slice(-windowSize);
  const losses = recent.map(m => m.L_total);

  // Flat or decreasing: least-squares slope over the window is non-positive
  const slope = linearRegressionSlope(losses);
  // No bounce: zero regressions anywhere in the window
  const noRegressions = recent.every(m => m.regressions === 0);
  // Low: latest loss is below the stop threshold T
  const belowThreshold = losses[losses.length - 1] < THRESHOLD;

  return slope <= 0 && noRegressions && belowThreshold;
}
```
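The `linearRegressionSlope` helper referenced above is not defined in this article; a minimal least-squares version might look like:

```typescript
// Least-squares slope of y-values against their index (0, 1, 2, ...).
// A negative slope means loss is trending down across the window.
function linearRegressionSlope(ys: number[]): number {
  const n = ys.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  const yMean = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let x = 0; x < n; x++) {
    num += (x - xMean) * (ys[x] - yMean);
    den += (x - xMean) * (x - xMean);
  }
  return num / den;
}
```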
Loss Curve Patterns
Smooth Decay (Healthy)
Loss decreases monotonically. Swarm is converging. Constraints are biting. Regressions are rare.
What this looks like in practice:
- Issue count drops each cycle
- Same bug class does not reappear
- Diff sizes shrink over time
- Agent output becomes more repetitive (running out of things to fix)
Sawtooth (Regressions)
Loss drops then spikes repeatedly. The swarm is fixing things and breaking things at the same rate.
Root causes:
- Insufficient regression tests
- Agents rewriting modules without understanding dependencies
- No accept/reject gate on changes
Fix: Increase regression penalty weight, add regression memory, tighten acceptance gate.
Noisy Flat (No Progress)
Loss oscillates around a fixed value. No real improvement despite high activity.
Root causes:
- Agents performing random walks (no gradient signal)
- Constraints too weak to guide behaviour
- Loss function missing a key term
- Goodharting on proxy metrics
Fix: Review whether all loss terms are actually being measured. Add the missing term. Tighten constraints.
Monotonic Rise (Diverging)
Loss increases each cycle. The swarm is making things actively worse.
Root causes:
- No acceptance gate (all changes applied)
- Agents introducing complexity faster than they resolve it
- Architectural violations compounding
Fix: Stop the swarm. Audit constraints. Add an explicit accept/reject gate.
Integration with Existing Quality Gates
This framework does not replace quality gates. It wraps them. Each quality gate contributes to reducing one or more loss terms:
| Quality Gate | Loss Term Reduced |
|---|---|
| Type checker | L_types |
| ESLint | L_types, L_arch |
| Unit tests | L_tests |
| Integration tests | L_tests, L_spec |
| E2E tests | L_spec |
| Security scan | L_ops |
| Architecture linter | L_arch |
| Spec review | L_spec |
| Coverage check | L_unknown |
The loss function aggregates what gates measure into a single trend line.
Key Insight
Agents do not need to be correct. They need to be biased toward negative delta-L. Individual agents can be wrong as long as the aggregate signal drives loss down.
This is why parallelism works. Each agent samples a noisy estimate of where loss can be reduced. The orchestrator applies only the updates that actually reduce loss. The swarm converges not because any single agent is smart, but because the accept/reject gate filters for improvement.
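A minimal sketch of that gate, under stated assumptions: the `Change` type and the `evaluate` callback are placeholders, and in practice `evaluate` would apply the candidate in a sandbox and recompute L_total from real metrics. Candidates are evaluated sequentially against the running baseline:

```typescript
// Accept/reject gate: keep a candidate change only if it reduces total
// loss relative to the current accepted baseline.
function acceptIfImproving<Change>(
  currentLoss: number,
  candidates: Change[],
  evaluate: (c: Change) => number, // L_total if the change were applied
): { accepted: Change[]; loss: number } {
  const accepted: Change[] = [];
  let loss = currentLoss;
  for (const c of candidates) {
    const newLoss = evaluate(c);
    if (newLoss < loss) {
      // Negative delta-L: keep the change, move the baseline down.
      accepted.push(c);
      loss = newLoss;
    }
    // Otherwise reject: a noisy agent proposal that did not help.
  }
  return { accepted, loss };
}
```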
Related
- Quality Gates as Information Filters – Gates that reduce state space (individual loss term reducers)
- Agent Swarm Patterns – How to run swarms (this article explains what to measure)
- Constraint-First Development – Constraints define the feasible region for loss minimization
- Swarm Convergence Theory – Why swarms converge or diverge
- Goodharting Prevention – When agents optimize the proxy instead of the real objective
- Growth vs Polish Phases – Adjusting loss weights per development phase
- Compounding Effects of Quality Gates – Why layered gates produce exponential quality improvement

