Use autonomy where it earns its keep. Use a deterministic workflow where the path is known.
Why it matters
Most production agent systems are not one open-ended agent. They are a mix of workflows, specialised agents, deterministic code, retrieval, tool calls, and human checkpoints. Orchestration decides who does what, in what order, with what state, and how recovery works when a step fails.
Build this
- A task graph or state machine that names each step, owner, timeout, retry policy, and completion condition.
- Durable checkpoints so long-running work can resume without rerunning successful steps.
- Clear boundaries between workflow steps, agent loops, subagents, and deterministic validators.
- Cancellation, compensation, and escalation paths for partial failure.
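As a rough illustration of the first two items, the sketch below names each step with its owner, timeout, retry limit, and completion condition, and writes a checkpoint after every successful step so that a rerun skips finished work. The names `Step`, `run_step`, and `run_workflow`, and the JSON checkpoint file, are assumptions made for this example, not a prescribed design. The timeout is only recorded here, not enforced, and state must be JSON-serialisable.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Step:
    """One named node in the workflow (hypothetical structure)."""
    name: str
    owner: str                            # e.g. "workflow", "agent", "validator"
    timeout_s: float                      # recorded only; enforcement is omitted here
    max_attempts: int                     # retry policy, reduced to a simple cap
    run: Callable[[dict], dict]           # takes current state, returns updates
    is_complete: Callable[[dict], bool]   # completion condition


def run_step(step: Step, state: dict) -> dict:
    """Run one step up to max_attempts times; return the new state."""
    last_error = None
    for _ in range(step.max_attempts):
        try:
            updates = step.run(state)
        except Exception as exc:
            last_error = exc
            continue
        candidate = {**state, **updates}
        if step.is_complete(candidate):
            return candidate
    raise RuntimeError(f"step {step.name!r} did not complete") from last_error


def run_workflow(steps: list, checkpoint: Path) -> dict:
    """Run steps in order, skipping any already recorded in the checkpoint."""
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"done": [], "state": {}}
    for step in steps:
        if step.name in saved["done"]:
            continue  # finished in an earlier run; do not repeat it
        saved["state"] = run_step(step, saved["state"])
        saved["done"].append(step.name)
        checkpoint.write_text(json.dumps(saved))  # durable after every step
    return saved["state"]
```

If a later step fails, calling `run_workflow` again with the same checkpoint path resumes from the first unfinished step. A production version would more likely use a database or a workflow engine than a local file.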
Watch for
- One general agent asked to manage every step without state or deadlines.
- Retries that repeat non-idempotent actions.
- Subagents with overlapping responsibilities and no conflict resolution.
- Failures hidden inside natural-language summaries instead of surfaced as explicit states.
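One way to avoid the last two pitfalls is to have every step return a typed status and route on that, never on the summary text. The sketch below is a minimal example under that assumption; `Status`, `StepResult`, `next_action`, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    SUCCEEDED = "succeeded"
    RETRYABLE = "retryable"      # transient failure, e.g. a tool timeout
    FAILED = "failed"            # permanent failure
    NEEDS_HUMAN = "needs_human"


@dataclass(frozen=True)
class StepResult:
    status: Status
    summary: str                       # free text for people; never used for routing
    error_code: Optional[str] = None


def next_action(result: StepResult, attempts: int, max_attempts: int,
                idempotent: bool) -> str:
    """Pick the orchestrator's next move from the typed status alone."""
    if result.status is Status.SUCCEEDED:
        return "continue"
    if result.status is Status.NEEDS_HUMAN:
        return "escalate"
    if result.status is Status.RETRYABLE:
        # Retry only when repeating the action is safe and budget remains.
        if idempotent and attempts < max_attempts:
            return "retry"
        return "escalate"
    return "compensate"  # permanent failure: undo partial work
```

Because the decision never reads `summary`, a cheerful-sounding message from an agent cannot mask a failed step.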
Proof it works
- A task can be paused, resumed, cancelled, and inspected from durable state.
- Each step has an owner, input contract, output contract, and timeout.
- A known tool failure triggers a recoverable path instead of an endless loop.
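The first check can be exercised against a small lifecycle state machine. The sketch below assumes a hypothetical `Task` class with an explicit transition table and a JSON form that stands in for durable state; the state names are illustrative, not a standard.

```python
import json
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    CANCELLED = "cancelled"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions; terminal states have no outgoing edges.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {TaskState.PAUSED, TaskState.CANCELLED,
                        TaskState.SUCCEEDED, TaskState.FAILED},
    TaskState.PAUSED: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.CANCELLED: set(),
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
}


class Task:
    def __init__(self, task_id, state=TaskState.PENDING, history=None):
        self.task_id = task_id
        self.state = state
        self.history = history if history is not None else []

    def transition(self, new_state: TaskState) -> None:
        """Move to new_state, rejecting anything the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.history.append([self.state.value, new_state.value])
        self.state = new_state

    def to_json(self) -> str:
        """Serialise everything needed to inspect or resume the task."""
        return json.dumps({"task_id": self.task_id,
                           "state": self.state.value,
                           "history": self.history})

    @classmethod
    def from_json(cls, raw: str) -> "Task":
        data = json.loads(raw)
        return cls(data["task_id"], TaskState(data["state"]), data["history"])
```

A test for this check would pause a task, reload it from its serialised form, resume it, cancel it, and confirm that a cancelled task refuses to run again.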
Implementation checklist
- Draw the workflow first, then decide where agent judgement is actually needed.
- Make side effects idempotent or guarded by explicit operation IDs.
- Persist enough state to explain progress without reading raw logs.
- Use specialist agents only when specialisation reduces context or risk.
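Guarding a side effect with an operation ID can be sketched as follows. The ID is derived from the task and step, so a retry reuses it and the action runs at most once per ID. `OperationLog`, `run_once`, and `operation_id` are hypothetical names, and the in-memory dictionary stands in for a durable store.

```python
from typing import Any, Callable


def operation_id(task_id: str, step_name: str) -> str:
    """Deterministic ID: the same task and step always map to the same ID."""
    return f"{task_id}:{step_name}"


class OperationLog:
    """Remembers the result of each operation ID (in memory for this sketch)."""

    def __init__(self):
        self._results = {}

    def run_once(self, op_id: str, action: Callable[[], Any]) -> Any:
        """Run action only if op_id has no recorded result; else replay it."""
        if op_id in self._results:
            return self._results[op_id]
        result = action()
        self._results[op_id] = result
        return result


# Usage: a retried step does not send the email twice.
sent = []
log = OperationLog()
op = operation_id("task-42", "send_welcome_email")
for _ in range(3):  # simulate the orchestrator retrying the step
    log.run_once(op, lambda: sent.append("welcome") or "sent")
```

This sketch does not cover a crash between performing the action and recording its result. Closing that gap usually means passing the operation ID to the downstream service so that it can deduplicate on its side.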