The LLM should decide what to do. Everything else should happen somewhere else.
Author: James Phoenix | Date: March 2026
The Three Systems
Every AI agent system operates across three layers. Each layer has different properties: cost, determinism, and latency. Understanding these layers is the first step to building agents that scale.
┌─────────────────────────────────────────────────────────┐
│ SYSTEM 1: The LLM │
│ Stochastic. Expensive per token. Reasons over language. │
│ Good at: decisions, judgment, ambiguity, synthesis │
│ Bad at: repetition, precision, data fetching, math │
├─────────────────────────────────────────────────────────┤
│ SYSTEM 2: Tools, Scripts, CLIs │
│ Deterministic. Cheap per execution. Runs locally. │
│ Good at: transformation, validation, file ops, compute │
│ Bad at: judgment, novel situations, ambiguous inputs │
├─────────────────────────────────────────────────────────┤
│ SYSTEM 3: APIs, Data Markets, External Services │
│ Deterministic. Priced per call. Requires network. │
│ Good at: data retrieval, specialized compute, storage │
│ Bad at: reasoning, context, anything beyond its contract │
└─────────────────────────────────────────────────────────┘
| Property | System 1 (LLM) | System 2 (Tools) | System 3 (APIs) |
|---|---|---|---|
| Determinism | Low | High | High |
| Cost model | Per token | Per execution (near zero) | Per call |
| Latency | High | Low | Variable |
| Judgment | Excellent | None | None |
| Precision | Variable | Exact | Exact |
| State | Context window | Filesystem, memory | External databases |
The insight is simple: most agent systems keep too much work in System 1. Every token the LLM spends on work that System 2 or System 3 could handle is wasted cost, wasted latency, and wasted context window.
The Problem: Context Bloat and Determinism Erosion
When the LLM orchestrates everything, two things happen.
Context bloat. The LLM accumulates intermediate results, raw API responses, file contents, and error logs in its context window. At 200k tokens, Claude can hold a lot, but attention degrades as context grows. The signal-to-noise ratio drops. The model starts ignoring instructions, repeating work, or hallucinating tool calls. This is not a model limitation. It is information theory: as noise increases, mutual information with the task decreases.
Determinism erosion. Every decision routed through the LLM is probabilistic. If you ask the LLM to parse a JSON response, format a date, validate an email, or sort a list, it will probably get it right. Probably. Over a 50-step workflow, “probably” compounds into “sometimes.” Over a 200-step workflow, it compounds into “rarely.”
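The compounding can be made concrete with a one-line calculation. This is a back-of-the-envelope sketch that assumes each step fails independently with the same probability, which understates correlated failures but shows the trend:

```typescript
// End-to-end success probability when every step routes through a
// probabilistic layer. Independence is an illustrative simplification.
function workflowReliability(perStepSuccess: number, steps: number): number {
  return Math.pow(perStepSuccess, steps);
}

// 99% per-step accuracy sounds safe, but it compounds:
console.log(workflowReliability(0.99, 50));  // ≈ 0.6
console.log(workflowReliability(0.99, 200)); // ≈ 0.13
```

A deterministic System 2 step, by contrast, contributes a factor of 1.0 to this product no matter how many times it runs.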
Both problems have the same solution: move work to the layer where it belongs.
The Principle: Progressive Downward Orchestration
BEFORE: LLM does everything
┌──────────────────────────────┐
│ System 1 (LLM) │
│ - Decides what to do │
│ - Calls API │
│ - Parses response │
│ - Validates data │
│ - Transforms format │
│ - Writes to file │
│ - Checks result │
│ - Decides next step │
│ Context: 8000 tokens │
│ Determinism: ~70% │
└──────────────────────────────┘
AFTER: Each layer does its job
┌──────────────────────────────┐
│ System 1 (LLM) │
│ - Decides what to do │
│ - Decides next step │
│ Context: 800 tokens │
│ Determinism: ~95% │
├──────────────────────────────┤
│ System 2 (Tools) │
│ - Parses response │
│ - Validates data │
│ - Transforms format │
│ - Writes to file │
│ - Checks result │
│ Determinism: 100% │
├──────────────────────────────┤
│ System 3 (APIs) │
│ - Retrieves data │
│ - Stores results │
│ Determinism: 100% │
└──────────────────────────────┘
The LLM’s job is routing and judgment. Everything else should be pushed down.
This is not about making agents dumber. It is about making them focused. A general who micromanages every soldier’s footsteps loses the battle. A general who gives clear orders and trusts execution wins.
Five Concrete Transitions
Transition 1: Raw API Calls to Tool-Wrapped Services
System 1 doing System 3’s job:
The LLM constructs a curl command, reads the raw JSON, extracts the fields it needs, handles errors in natural language, and stuffs the entire response into context.
LLM thinks: "I need to get the user's billing info"
LLM generates: curl -H "Authorization: Bearer ..." https://api.stripe.com/v1/customers/cus_123
LLM receives: 2KB of raw JSON
LLM parses: "The customer's plan is 'pro' and they have 3 invoices"
Context cost: ~3000 tokens
Pushed down to System 2 + System 3:
The LLM calls a typed tool. The tool handles authentication, calls the API, parses the response, extracts relevant fields, and returns a compact summary.
LLM thinks: "I need the user's billing info"
LLM calls: get_customer_billing({ customer_id: "cus_123" })
Tool returns: { plan: "pro", invoices: 3, next_billing: "2026-04-01" }
Context cost: ~200 tokens
Savings: 93% fewer tokens. 100% deterministic parsing. No auth tokens in context.
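A tool like `get_customer_billing` might look like the following sketch. The endpoint URL, payload shape, field names, and environment variable are illustrative placeholders, not the real Stripe API surface:

```typescript
interface BillingSummary {
  plan: string;
  invoices: number;
  next_billing: string;
}

// System 2 compression step: raw payload in, decision-ready summary out.
// The raw shape here is a hypothetical example, not Stripe's actual schema.
function summarizeBilling(raw: any): BillingSummary {
  return {
    plan: raw.plan,
    invoices: Array.isArray(raw.invoices) ? raw.invoices.length : 0,
    next_billing: raw.next_billing_date,
  };
}

// System 2 tool wrapper: auth, network, and parsing never touch the LLM.
async function getCustomerBilling(customerId: string): Promise<BillingSummary> {
  const res = await fetch(`https://api.example.com/v1/customers/${customerId}`, {
    headers: { Authorization: `Bearer ${process.env.BILLING_API_KEY}` },
  });
  if (!res.ok) throw new Error(`billing lookup failed: ${res.status}`);
  return summarizeBilling(await res.json()); // ~2KB in, ~3 fields out
}
```

The key design choice is the separation: `summarizeBilling` is a pure function that can be unit-tested without the network, and the auth token lives only inside the wrapper.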
Transition 2: Multi-Step Data Processing to Scripts
System 1 doing System 2’s job:
The LLM reads a CSV, iterates through rows, applies transformations, validates each record, and generates output. Each step adds to context. The LLM has to “remember” what it already processed.
Pushed down to System 2:
The LLM decides “process this CSV with the clean-and-validate script” and calls a single tool. The script handles iteration, transformation, validation, and outputs a summary.
// System 2: deterministic script
function processCSV(input: string, rules: ValidationRules): ProcessResult {
  const rows = parse(input);
  const validated = rows.map(row => validate(row, rules));
  const errors = validated.filter(r => !r.valid);
  return {
    processed: validated.length,
    errors: errors.length,
    errorSummary: errors.slice(0, 5).map(e => e.reason),
  };
}

The LLM sees a 5-line summary instead of processing 500 rows. Its context stays clean for the decision that matters: what to do about the errors.
Transition 3: LLM-Driven Validation to Type-Checked Contracts
System 1 doing System 2’s job:
The LLM generates a JSON payload, then re-reads it to check if it matches the expected schema. If it finds an error, it fixes it and re-checks. This loop burns tokens and is not reliable.
Pushed down to System 2:
// System 2: Zod schema validates deterministically
const PaymentSchema = z.object({
  amount: z.number().positive(),
  currency: z.enum(["USD", "EUR", "GBP"]),
  customer_id: z.string().startsWith("cus_"),
});

// LLM outputs structured data, System 2 validates
const result = PaymentSchema.safeParse(llmOutput);
if (!result.success) {
  // Feed compact error back to LLM for one correction
  return { error: result.error.flatten() };
}
Validation is not a judgment call. It is a mechanical check. Mechanical checks belong in System 2.
Transition 4: Agent-Managed State to External Persistence
System 1 doing System 3’s job:
The LLM keeps track of a workflow’s state in its context window. “We’ve completed steps 1-3, step 4 failed with error X, we need to retry.” As the workflow grows, the LLM spends more tokens re-reading its own history than doing useful work.
Pushed down to System 2 + System 3:
// System 3: database holds state
interface WorkflowState {
  id: string;
  currentStep: number;
  completedSteps: StepResult[];
  status: "running" | "paused" | "failed" | "completed";
}

// System 2: deterministic state machine
function nextAction(state: WorkflowState): Action {
  if (state.status === "failed") return { type: "retry", step: state.currentStep };
  if (state.currentStep >= TOTAL_STEPS) return { type: "complete" };
  return { type: "execute", step: state.currentStep + 1 };
}

// System 1: LLM only handles the ambiguous parts
// "Step 4 failed because the API returned a 429. Should I retry with backoff or skip?"
The LLM sees the current state (20 tokens) and the decision point (50 tokens), not the entire history (5000 tokens).
Transition 5: LLM Orchestration to Deterministic Pipelines
System 1 orchestrating other System 1 instances:
An orchestrator LLM reads a task, decides which sub-agent to call, constructs the prompt, reads the result, decides the next sub-agent, and so on. The orchestrator’s context grows with every delegation.
Pushed down to System 2:
// System 2: deterministic pipeline. The accumulated context is keyed by
// agent name, and each stage reads only the prior results it needs.
const pipeline: Stage[] = [
  { agent: "planner",  input: (task) => task.description },
  { agent: "coder",    input: (task, prev) => prev.planner.plan },
  { agent: "tester",   input: (task, prev) => prev.coder.code },
  { agent: "reviewer", input: (task, prev) => ({ code: prev.coder.code, tests: prev.tester.testResults }) },
];

async function runPipeline(task: Task): Promise<Result> {
  let context: Record<string, any> = {};
  for (const stage of pipeline) {
    const input = stage.input(task, context);
    const result = await callAgent(stage.agent, input);
    context = { ...context, [stage.agent]: result };
  }
  return context as Result;
}
The pipeline itself is deterministic code. Each agent receives only its relevant input, not the accumulated context of every prior agent. The orchestration logic lives in System 2, where it can be tested, versioned, and debugged without prompt engineering.
The Decision Rule
For every piece of work in an agent workflow, ask:
1. Does this require judgment, reasoning, or handling ambiguity?
→ System 1 (LLM)
2. Does this require deterministic transformation, validation, or computation?
→ System 2 (Tools/Scripts)
3. Does this require data retrieval, storage, or calling an external service?
→ System 3 (APIs)
If the answer is 2 or 3, the LLM should emit a tool call, not do the work itself. If the answer is 1, the LLM should receive the minimum context it needs to make the decision, not the raw data it would need to do everything.
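The rule is mechanical enough to encode directly. A minimal sketch, with category and layer names that are illustrative rather than standard:

```typescript
// Hypothetical classification of a unit of work; the labels mirror the
// three questions above.
type WorkKind = "judgment" | "mechanical" | "external";

type Layer = "system1_llm" | "system2_tool" | "system3_api";

function route(kind: WorkKind): Layer {
  switch (kind) {
    case "judgment":   return "system1_llm";  // reasoning, ambiguity, synthesis
    case "mechanical": return "system2_tool"; // deterministic transform, validation
    case "external":   return "system3_api";  // retrieval, storage, external compute
  }
}
```

The exhaustive switch is the point: every unit of work gets an explicit home, and the compiler complains if a new kind of work has no assigned layer.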
Measuring the Shift
Track three metrics to know if orchestration is at the right layer:
Context Efficiency Ratio
Context Efficiency = tokens_for_decisions / total_tokens_in_context
If the LLM’s context is 80% raw data and 20% decision-relevant information, you have work to push down. Target: >70% decision-relevant tokens.
Determinism Score
Determinism = deterministic_steps / total_steps
Count the steps in a workflow. How many are deterministic (always produce the same output for the same input)? The higher this ratio, the more reliable the system. Target: >80% deterministic steps.
Token Cost per Decision
Cost per Decision = total_tokens / number_of_LLM_decisions
If the LLM makes 5 decisions but consumes 50,000 tokens, you are paying 10,000 tokens per decision. That means 90%+ of your token spend is on data shuffling, not intelligence. Push the data shuffling down.
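All three metrics fall out of a per-step workflow trace. The trace shape below is a hypothetical example, not a standard format:

```typescript
// One entry per workflow step, tagged during an audit.
interface StepTrace {
  deterministic: boolean;    // same input always produces the same output?
  tokens: number;            // tokens this step added to the LLM's context
  decisionRelevant: boolean; // did an LLM decision actually need this content?
  isLlmDecision: boolean;    // did the LLM make a routing/judgment call here?
}

function auditTrace(trace: StepTrace[]) {
  const totalTokens = trace.reduce((sum, t) => sum + t.tokens, 0);
  const decisionTokens = trace
    .filter(t => t.decisionRelevant)
    .reduce((sum, t) => sum + t.tokens, 0);
  const decisions = trace.filter(t => t.isLlmDecision).length;
  return {
    contextEfficiency: decisionTokens / totalTokens,                          // target > 0.7
    determinismScore: trace.filter(t => t.deterministic).length / trace.length, // target > 0.8
    tokensPerDecision: totalTokens / decisions,
  };
}
```

Run the audit on a representative workflow before and after pushing work down; the before/after delta is more informative than any single absolute number.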
The Architecture That Emerges
When you consistently push orchestration down the stack, a clear architecture emerges:
┌────────────────────────────────────────────────────────────┐
│ System 1: LLM │
│ │
│ Receives: compact summaries, decision points, error flags │
│ Emits: structured tool calls, routing decisions │
│ Context: small, focused, high signal-to-noise │
├────────────────────────────────────────────────────────────┤
│ System 2: Tools & Scripts │
│ │
│ Receives: structured inputs from System 1 │
│ Does: parsing, validation, transformation, orchestration │
│ Emits: compact results to System 1, calls to System 3 │
│ Properties: deterministic, testable, versioned, fast │
├────────────────────────────────────────────────────────────┤
│ System 3: APIs & Services │
│ │
│ Receives: structured requests from System 2 │
│ Does: data retrieval, storage, external computation │
│ Emits: raw responses (consumed by System 2, not System 1) │
│ Properties: deterministic contracts, priced per call │
└────────────────────────────────────────────────────────────┘
Notice the data flow. System 3 responses flow to System 2 for processing, not directly to System 1. System 2 compresses, validates, and summarizes before passing results up. The LLM never sees raw API responses, unfiltered database dumps, or unstructured file contents. It sees processed, decision-ready information.
This is the same principle as Building the Harness: the LLM is a noisy channel, and every layer of harness increases the signal-to-noise ratio. The three-system model makes this structural rather than ad-hoc.
Connection to Kahneman’s Dual-Process Theory
The naming is deliberate. In cognitive psychology, System 1 is fast, intuitive, and pattern-matching. System 2 is slow, deliberate, and analytical. The irony is that in the AI agent stack, the mapping is reversed.
The LLM (System 1 in our model) is the slow, expensive reasoner. The tools and scripts (System 2) are the fast, cheap executors. But the principle holds: route work to the system best suited for it. Do not use your expensive reasoner for tasks that require mechanical precision. Do not use your mechanical executor for tasks that require judgment.
Kahneman’s key insight was that System 1 errors come from applying intuitive shortcuts where deliberate analysis is needed. In agent systems, the equivalent error is using the LLM where deterministic code is needed. The fix is the same: know which system to engage for which type of problem.
Anti-Patterns
Anti-Pattern 1: LLM as Data Pipeline
The LLM reads 50 records, transforms each one, and accumulates results. This is System 2 work wearing a System 1 costume.
Fix: Write a transformation function. Have the LLM call it once.
Anti-Pattern 2: LLM as API Client
The LLM constructs HTTP requests, manages headers, handles retries, and parses responses. This is System 3 coordination wearing a System 1 costume.
Fix: Wrap API interactions in typed tools. The LLM calls get_customer(id), not curl -H "Authorization: ...".
Anti-Pattern 3: LLM as State Machine
The LLM tracks which step it’s on, what’s been completed, and what’s next. This is System 2 state management wearing a System 1 costume.
Fix: External state persistence. The LLM receives the current state and decides the next action. It does not maintain the state.
Anti-Pattern 4: LLM Orchestrating LLMs
An orchestrator LLM manages sub-agents by holding their outputs in its context and routing between them. This is System 2 pipeline logic wearing a System 1 costume.
Fix: Deterministic pipeline code. System 2 manages the flow. System 1 instances handle their individual decisions.
When to Break the Rule
Not everything should be pushed down. Keep work in System 1 when:
- The input is ambiguous. “This customer seems unhappy” requires judgment. A sentiment score does not capture the nuance.
- The decision space is open-ended. “What should we do about this bug?” has too many valid answers for a switch statement.
- The task is novel. If you have never seen this type of request before, the LLM’s generalization is your best tool. Once you see it three times, push it to System 2. (Ad-hoc to Scripts)
- The context is small. If the total workflow is 10 steps and 500 tokens, the overhead of building tools exceeds the benefit. Optimization has diminishing returns.
The three-time rule: if the LLM does the same type of work three times, it should have been a tool after the first.
Practical Implementation Path
Phase 1: Audit
List every step in your agent workflows. Tag each step as System 1 (judgment), System 2 (mechanical), or System 3 (external data). Count the ratio.
Phase 2: Extract the Obvious
Move all validation, parsing, formatting, and data transformation into tools. This is usually 40-60% of an agent’s work. Ad-hoc to Scripts covers the conversion process.
Phase 3: Wrap External Services
Build typed tool wrappers around every API call. The tool handles auth, retries, error mapping, and response compression. The LLM calls search_customers({ query: "acme" }) and gets back 3 lines, not 3 pages.
Phase 4: Externalize State
Move workflow state out of the context window and into a database or state machine. The LLM receives a compact state snapshot and a decision prompt, not a conversation history.
Phase 5: Deterministic Orchestration
Replace LLM-driven multi-agent routing with typed pipeline code. Effect provides the algebra for composing these pipelines with typed errors, dependencies, and retry semantics.
Summary
The three-system model is a lens for diagnosing and fixing agent architectures:
- System 1 (LLM): Judgment, reasoning, ambiguity resolution. Stochastic. Expensive.
- System 2 (Tools/Scripts): Transformation, validation, orchestration. Deterministic. Cheap.
- System 3 (APIs/Services): Data retrieval, external compute, storage. Deterministic. Priced.
The principle: push orchestration down to the lowest layer that can handle it. Every piece of work running in System 1 that could run in System 2 is wasted cost, wasted context, and wasted determinism. Every raw API response flowing directly into System 1 instead of through a System 2 compressor is noise polluting the reasoning channel.
LLMs are not expensive because they are bad. They are expensive because they are general. Use them for what requires generality. Use deterministic systems for everything else.
Related
- Building the Harness – The three-layer harness model this framework extends
- Why Effect Fits LLM Orchestration – Typed orchestration for System 2
- Ad-hoc to Scripts – Converting System 1 work to System 2
- 12 Factor Agents – Factor 1 (NL to tool calls) and Factor 4 (tools as structured outputs)
- Orchestration Patterns – Coordinator, Swarm, Pipeline at the System 2 layer
- Sub-Agent Context Hierarchy – Context isolation across System 1 instances
- Long-Running Agent Patterns – Skills and compaction for System 1 efficiency
- Information Theory for Coding Agents – The signal-to-noise framework
- Systems Thinking – Observability across all three layers

