Buy Orchestration, Own Semantics

James Phoenix

The five layers of an agent system are not mutually exclusive. You do not need alpha in all of them. But you need to know which ones are hard and which ones are solved.


The Five Layers

Every agent-powered development system decomposes into five layers. They compound when combined, but each stands alone. You can outsource some and invest deeply in others.

Layer | What It Does | Example Tools
1. Orchestration | Schedules agents, manages budgets, parallelises, routes, retries | Paperclip, OpenClaw, custom DAGs
2. Tasks | Defines the work queue, dependencies, state transitions, claims | tx, Linear, custom task graphs
3. Specs | Defines what “correct” means, acceptance criteria, invariants, PRDs | tx specs, design docs, typed schemas
4. Boilerplate / Harness | The codebase environment agents operate inside: folder structure, typed boundaries, ESLint rules, test harness, patterns | tx-agent-kit, CLAUDE.md, custom starters
5. Runtime | Actually executes work: workflows, persistence, domain adapters, tool access | Effect, Temporal, Drizzle, domain code

These layers are not mutually exclusive. You could outsource orchestration entirely to Paperclip and focus all your energy on making specs incredibly precise. You could build a boilerplate with a world-class test harness and use someone else’s task system. You could go deep on runtime and treat specs as lightweight markdown files. The layers compound when combined, but you pick where to invest based on where you have the most leverage.


Orchestration Is Mainly Solved

Agent orchestration is converging to commodity. The core primitives are well-understood:

  • Task queues with priority and claim semantics
  • DAGs for dependency resolution
  • Heartbeats for liveness detection
  • Schedulers for cron, event-driven, and on-demand dispatch
  • Budget controls for token/cost limits
  • Dashboards for visibility

These are distributed systems problems with known solutions. Multiple frameworks already solve them well. The orchestration layer is important, but it is not where the difficulty lives.
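To see how well-trodden these primitives are, here is a minimal sketch of a priority queue with claim semantics and heartbeat-based liveness. All names are illustrative, not taken from any specific framework; real orchestrators add persistence and atomic claims on top of the same shape.

```typescript
// A task can be claimed by an agent; a claim goes stale if the agent
// stops heartbeating, at which point another agent may reclaim it.
type Task = { id: string; priority: number; claimedBy?: string; lastBeat?: number };

class TaskQueue {
  private tasks: Task[] = [];
  constructor(private staleMs = 30_000) {}

  enqueue(id: string, priority: number): void {
    this.tasks.push({ id, priority });
  }

  // Claim the highest-priority task that is unclaimed or whose claim is stale.
  claim(agent: string, now = Date.now()): Task | undefined {
    const available = this.tasks
      .filter(t => !t.claimedBy || (t.lastBeat !== undefined && now - t.lastBeat > this.staleMs))
      .sort((a, b) => b.priority - a.priority);
    const task = available[0];
    if (task) {
      task.claimedBy = agent;
      task.lastBeat = now;
    }
    return task;
  }

  // Agents heartbeat to keep their claim; silence past staleMs releases it.
  heartbeat(agent: string, taskId: string, now = Date.now()): void {
    const t = this.tasks.find(x => x.id === taskId && x.claimedBy === agent);
    if (t) t.lastBeat = now;
  }
}
```

The point is not this implementation; it is that everything here is a textbook distributed-systems pattern you can buy rather than build.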

You can buy orchestration and lose very little. Paperclip, OpenClaw, or even a custom cron + task queue will get you 90% of the way. The marginal return on building your own orchestrator is low unless orchestration itself is your product.


Boilerplate and Specs Are the Hard Layers

The hardest problem is not “make an agent do something.” It is: make an agent do the right thing, repeatedly, inside a complex evolving system.

That difficulty lives almost entirely in two places: the boilerplate/harness and the specs.

Why Boilerplate Is Hard

Consider the agent’s job as navigating a state space S (all possible repo states) by choosing actions A (edits, refactors, commands). Without strong boilerplate, the agent does blind search over S. That is hopeless.

Your boilerplate is a reduction of the search space: S → S′, where S′ is smaller, more structured, and more predictable.

When you use Effect layers, repository patterns, strict ESLint rules, consistent folder structure, typed errors, and deterministic interfaces, you collapse the action space. Valid actions become obvious. Invalid actions get rejected by the compiler or linter before the agent even finishes generating them. The agent is not “intelligent.” It is operating in a heavily regularized space.
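A back-of-envelope illustration of why this matters (the numbers are made up, only the exponential shape is the point): shrinking the per-step action space shrinks the total search space exponentially.

```typescript
// A 10-step task where each step picks one action from the action space.
const steps = 10;

// Unconstrained repo: say ~50 plausible edits at each step.
const blindSearch = Math.pow(50, steps);    // ~9.8e16 candidate paths

// With types, lint rules, and fixed patterns: ~5 valid moves per step.
const harnessedSearch = Math.pow(5, steps); // ~9.8e6 candidate paths

// The harness makes the search space ten billion times smaller.
const reduction = blindSearch / harnessedSearch;
```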

This is why an engineer with strong boilerplate barely writes code themselves and still ships. The environment does the work.

Examples of boilerplate that makes agents dramatically better:

  • A test harness that catches regressions before merge
  • ESLint rules that enforce architectural boundaries
  • Typed error hierarchies that prevent silent failures
  • Repository patterns that make data access predictable
  • CLAUDE.md files that encode domain knowledge into agent context

Why Specs Are Hard

Specs define the target. Without precise specs, even a perfect agent in a perfect harness will build the wrong thing.

The difficulty is not writing a document. The difficulty is:

  • Defining “correct” precisely enough that a machine can verify it
  • Decomposing ambiguous requirements into testable invariants
  • Mapping business intent to acceptance criteria an agent can check
  • Keeping specs updated as the system evolves

Specs and boilerplate form a feedback loop. Better specs tell agents what to build. Better boilerplate constrains how they build it. Together they collapse the search space from “anything an LLM could generate” to “the small set of correct implementations.”
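What “correct, precisely enough for a machine” can look like in practice: a spec expressed as a set of invariants rather than prose. Everything here is a hypothetical sketch; the pattern, not the names, is the point.

```typescript
// A spec is a list of named predicates the implementation must satisfy.
type Invariant<T> = { name: string; holds: (value: T) => boolean };

// Spec for a URL slug, written as checkable invariants instead of
// prose like "slugs should look clean".
const slugSpec: Invariant<string>[] = [
  { name: "lowercase", holds: s => s === s.toLowerCase() },
  { name: "url-safe", holds: s => /^[a-z0-9-]+$/.test(s) },
  { name: "no edge dashes", holds: s => !s.startsWith("-") && !s.endsWith("-") },
];

// The agent's candidate implementation:
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Verification: return the names of any invariants the implementation violates.
function verify(inputs: string[]): string[] {
  return slugSpec
    .filter(inv => !inputs.every(t => inv.holds(slugify(t))))
    .map(inv => inv.name);
}
```

This is the feedback loop in miniature: the spec says what must hold, the harness runs the check, and the agent iterates until the violation list is empty.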

Difficulty Comparison

Layer | Difficulty | Why
Orchestration | 5/10 | Solved problem: tasks + DAGs + heartbeats + schedulers
Tasks | 6/10 | State machines, dependency graphs; well-understood patterns
Specs | 9/10 | Defining truth precisely, mapping to testable invariants
Boilerplate / Harness | 10/10 | Correctness under ambiguity, evolving constraints, long-range dependencies
Runtime | 7/10 | Domain-specific but built on proven foundations (Effect, Temporal, etc.)

You Do Not Need Alpha in All Five

This is the key insight. The layers compound, but you can build a strong position by going deep on just one or two.

Scenario A: Spec-focused. You build the best spec tooling in a niche. Your PRDs, design docs, and acceptance criteria are so precise that any agent framework can execute them reliably. The specs are the product.

Scenario B: Harness-focused. You build a boilerplate/starter with an incredible test harness, typed boundaries, and ESLint rules. Any agent dropped into this environment immediately becomes more productive. The harness is the product.

Scenario C: Full-stack. You invest in specs, boilerplate, and runtime together. They compound. This is the most powerful position but the most expensive to build.

Scenario D: Orchestration-focused. You build the best coordinator, dashboard, budget system. This is where Paperclip and OpenClaw live. It is a valid business, but it is the most commoditized layer.


The strategic question is: where do you have the most alpha? Invest there. Outsource the rest.


The Human Loop Is the Alpha

The most important layer is not in the table above. It is the human in the loop.

Agents optimize for local completion. They finish the task in front of them. They do not know whether the task matters. They do not know whether the feature should exist. They do not know whether the market shifted last week. They cannot tell the difference between “technically correct” and “strategically right.”

The human provides:

  • Taste. Which of the ten valid implementations is the one customers actually want?
  • Strategy. What to build next, what to stop building, what to ignore entirely.
  • Judgment under ambiguity. When the spec is unclear, agents guess. Humans ask the right question.
  • Risk calibration. Agents treat all tasks equally. Humans know which ones are load-bearing.
  • Narrative. Why this product exists, who it serves, what story it tells.

Do not try to build a zero-human company. That is not the goal. The goal is to make the human’s time absurdly leveraged. One hour of human direction should produce 10-20 hours of agent output. That ratio is already achievable today with good specs and a strong harness.

The fantasy of “agents run everything while I am on a beach” collapses for the same reason fully autonomous vehicles keep stalling. The long tail of edge cases is where all the value and all the danger lives. Agents handle the 80% of predictable work brilliantly. The 20% that requires judgment, context, and taste is where humans earn their keep.

The right mental model: you are the executive function. Agents are the workforce. An executive who disappears entirely gets a company that drifts. An executive who shows up for 2 focused hours a day and directs a tireless workforce gets disproportionate output.


Day-to-Day vs Holiday Mode

Agents should run differently depending on whether a human is in the loop or not. The difference is not capability. It is permission scope.

Day-to-Day (Human in the Loop)

You are actively directing work. Agents are extensions of your hands. This is where most of the value is created.

Activity | How It Works
Feature development | You write specs or PRDs. Agents implement, you review and merge. Tight feedback loop.
Bug fixes | You triage and prioritize. Agents draft fixes, run tests. You approve.
Refactoring | You define the target architecture. Agents execute the migration in branches.
Code review | Agents run swarms for security, performance, maintainability. You make the call.
Spec refinement | You iterate on specs with agents. Back-and-forth until acceptance criteria are precise.
Exploratory work | Agents research, prototype, summarize options. You pick the direction.

Day-to-day is high-trust, high-bandwidth. Agents can merge, deploy, and make changes because you are watching. The human provides judgment, taste, and strategy. The agents provide speed and thoroughness. This mode is where the real compounding happens, because human direction keeps the system on the right trajectory.

Holiday Mode (No Human in the Loop)

You are away. Agents operate inside a constrained policy. The goal is not “run the company.” The goal is to keep momentum, prevent chaos, and compound prepared work.

Activity | Allowed | Why
Triage backlog | Yes | Low risk, high value on return
Run test suites | Yes | Catches regressions early
Draft PRs (do not merge) | Yes | Work is prepared for review
Prepare refactors behind flags | Yes | Reversible, non-destructive
Write specs from known templates | Yes | Queues future work
Summarize emails, issues, commits | Yes | Reduces catch-up time
Produce daily ops digests | Yes | Surfaces anomalies
Dependency bumps (minor/patch) | Yes | Low risk with test gates
Merge to main | No | Irreversible without review
Deploy to prod | No | Blast radius too high
Change billing/pricing | No | Financial commitment
Email customers automatically | No | Reputation risk
Change infra without approval | No | Hard to reverse
Publish content | No | Brand/accuracy risk
Delete data | No | Irreversible
Mutate core schemas | No | Cascading breakage
Architecture decisions | No | Strategic, requires taste

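The table above reduces to a small piece of policy code. This is a sketch with invented action names: autonomy becomes an allowlist question, where anything not explicitly permitted in holiday mode is queued for human review.

```typescript
// Actions agents may take with no human in the loop (illustrative names).
const HOLIDAY_ALLOWED = new Set([
  "triage_backlog",
  "run_tests",
  "draft_pr",
  "prepare_flagged_refactor",
  "write_spec_from_template",
  "summarize_activity",
  "bump_minor_dependency",
]);

type Verdict = { allowed: boolean; reason: string };

// The gate is about permission scope, not capability: the same agent
// gets a wider scope only because a human is watching.
function gate(action: string, humanPresent: boolean): Verdict {
  if (humanPresent) {
    return { allowed: true, reason: "day-to-day mode: human is reviewing" };
  }
  return HOLIDAY_ALLOWED.has(action)
    ? { allowed: true, reason: "holiday mode: in allowlist" }
    : { allowed: false, reason: `holiday mode: '${action}' queued for review` };
}
```

Note the default: an unknown action is denied, not allowed. That single design choice is most of what separates “constrained operations” from chaos.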
Escalation triggers (stop autonomous action, queue for human):

  • Ambiguous requirement with no clear spec
  • Failing integration test that was previously green
  • Conflicting specs or contradictory acceptance criteria
  • Increased error rate in monitoring
  • Unbounded retries or cost spike
  • Customer-facing impact
  • Any auth, billing, or data model touch
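The triggers above can likewise be encoded as a pure predicate over observed signals (field names and thresholds here are hypothetical). Any single trigger halts autonomous action and queues the task for the returning human.

```typescript
// Signals an agent observes before acting autonomously.
type Signals = {
  specIsAmbiguous: boolean;
  previouslyGreenTestFailing: boolean;
  errorRateDelta: number;        // fractional increase vs baseline
  costSpikeUsd: number;          // spend above the daily budget
  customerFacingImpact: boolean;
  touchesSensitiveArea: boolean; // auth, billing, or data model
};

// One true trigger is enough: stop and escalate rather than guess.
function shouldEscalate(s: Signals): boolean {
  return (
    s.specIsAmbiguous ||
    s.previouslyGreenTestFailing ||
    s.errorRateDelta > 0.1 ||
    s.costSpikeUsd > 0 ||
    s.customerFacingImpact ||
    s.touchesSensitiveArea
  );
}
```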

The Return

The measure of a good holiday mode system is what you come back to. An elite return looks like:

  • 8-12 reviewable PRs ready for merge
  • Cleaned and prioritized issue queue
  • Updated specs with open questions flagged
  • Ranked risks with supporting evidence
  • Daily digests summarizing what happened
  • A few blocked escalations with clear context

That is not “autonomous company.” That is high-trust constrained operations. The agents kept the factory warm. You come back and immediately ship.


The Deeper Principle

You are not building an “agent system.” You are building a programming language for agents to operate inside.

Your boilerplate defines:

  • Grammar (what is allowed)
  • Syntax (how things are structured)
  • Semantics (what things mean)
  • Type system (what is valid)
  • Runtime (how things execute)

LLMs are just probabilistic interpreters of that language.
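The “language” framing can be made literal. In this sketch (all names invented), the harness defines a closed grammar of actions and a tiny interpreter over it; anything outside the grammar cannot even be expressed, so a noisy generator is steered toward valid programs.

```typescript
// Grammar: the only sentences the agent's "language" can form.
// Dangerous actions (e.g. "drop_database") are unrepresentable by construction.
type Action =
  | { kind: "create_file"; path: string }
  | { kind: "run_tests" }
  | { kind: "open_pr"; title: string };

// Semantics: what each sentence means, as a deterministic interpreter.
// The exhaustive switch is the type system enforcing the grammar.
function interpret(actions: Action[]): string[] {
  return actions.map(a => {
    switch (a.kind) {
      case "create_file": return `created ${a.path}`;
      case "run_tests": return "tests passed";
      case "open_pr": return `PR opened: ${a.title}`;
    }
  });
}
```

An LLM emitting `Action` values is a probabilistic interpreter of this language; the narrower the grammar, the less room there is to be wrong.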

The agents that seem magical are not smarter. They are operating in environments where the boilerplate and specs have collapsed the search space so aggressively that correct output is the path of least resistance.

Orchestration scales agents. But boilerplate and specs make them correct. The first is a commodity; the other two are where the real engineering lives.

