Push Orchestration Down the Stack: A Three-System Model for AI Agents

James Phoenix

The LLM should decide what to do. Everything else should happen somewhere else.

Author: James Phoenix | Date: March 2026


The Three Systems

Every AI agent system operates across three layers. Each layer has different properties: cost, determinism, and latency. Understanding these layers is the first step to building agents that scale.

┌─────────────────────────────────────────────────────────┐
│  SYSTEM 1: The LLM                                      │
│  Stochastic. Expensive per token. Reasons over language. │
│  Good at: decisions, judgment, ambiguity, synthesis      │
│  Bad at: repetition, precision, data fetching, math      │
├─────────────────────────────────────────────────────────┤
│  SYSTEM 2: Tools, Scripts, CLIs                          │
│  Deterministic. Cheap per execution. Runs locally.       │
│  Good at: transformation, validation, file ops, compute  │
│  Bad at: judgment, novel situations, ambiguous inputs     │
├─────────────────────────────────────────────────────────┤
│  SYSTEM 3: APIs, Data Markets, External Services         │
│  Deterministic. Priced per call. Requires network.       │
│  Good at: data retrieval, specialized compute, storage   │
│  Bad at: reasoning, context, anything beyond its contract │
└─────────────────────────────────────────────────────────┘
Property      System 1 (LLM)    System 2 (Tools)             System 3 (APIs)
Determinism   Low               High                         High
Cost model    Per token         Per execution (near zero)    Per call
Latency       High              Low                          Variable
Judgment      Excellent         None                         None
Precision     Variable          Exact                        Exact
State         Context window    Filesystem, memory           External databases

The insight is simple: most agent systems keep too much work in System 1. Every token the LLM spends on work that System 2 or System 3 could handle is wasted cost, wasted latency, and wasted context window.


The Problem: Context Bloat and Determinism Erosion

When the LLM orchestrates everything, two things happen.

Context bloat. The LLM accumulates intermediate results, raw API responses, file contents, and error logs in its context window. At 200k tokens, Claude can hold a lot, but attention degrades as context grows. The signal-to-noise ratio drops. The model starts ignoring instructions, repeating work, or hallucinating tool calls. This is not a model limitation. It is information theory: as noise increases, mutual information with the task decreases.

Determinism erosion. Every decision routed through the LLM is probabilistic. If you ask the LLM to parse a JSON response, format a date, validate an email, or sort a list, it will probably get it right. Probably. Over a 50-step workflow, “probably” compounds into “sometimes.” Over a 200-step workflow, it compounds into “rarely.”

Both problems have the same solution: move work to the layer where it belongs.


The Principle: Progressive Downward Orchestration

BEFORE: LLM does everything
┌──────────────────────────────┐
│         System 1 (LLM)       │
│  - Decides what to do        │
│  - Calls API                 │
│  - Parses response           │
│  - Validates data            │
│  - Transforms format         │
│  - Writes to file            │
│  - Checks result             │
│  - Decides next step         │
│  Context: 8000 tokens        │
│  Determinism: ~70%           │
└──────────────────────────────┘

AFTER: Each layer does its job
┌──────────────────────────────┐
│         System 1 (LLM)       │
│  - Decides what to do        │
│  - Decides next step         │
│  Context: 800 tokens         │
│  Determinism: ~95%           │
├──────────────────────────────┤
│       System 2 (Tools)       │
│  - Parses response           │
│  - Validates data            │
│  - Transforms format         │
│  - Writes to file            │
│  - Checks result             │
│  Determinism: 100%           │
├──────────────────────────────┤
│       System 3 (APIs)        │
│  - Retrieves data            │
│  - Stores results            │
│  Determinism: 100%           │
└──────────────────────────────┘

The LLM’s job is routing and judgment. Everything else should be pushed down.

This is not about making agents dumber. It is about making them focused. A general who micromanages every soldier’s footsteps loses the battle. A general who gives clear orders and trusts execution wins.


Five Concrete Transitions

Transition 1: Raw API Calls to Tool-Wrapped Services

System 1 doing System 3’s job:

The LLM constructs a curl command, reads the raw JSON, extracts the fields it needs, handles errors in natural language, and stuffs the entire response into context.

LLM thinks: "I need to get the user's billing info"
LLM generates: curl -H "Authorization: Bearer ..." https://api.stripe.com/v1/customers/cus_123
LLM receives: 2KB of raw JSON
LLM parses: "The customer's plan is 'pro' and they have 3 invoices"
Context cost: ~3000 tokens

Pushed down to System 2 + System 3:

The LLM calls a typed tool. The tool handles authentication, calls the API, parses the response, extracts relevant fields, and returns a compact summary.

LLM thinks: "I need the user's billing info"
LLM calls: get_customer_billing({ customer_id: "cus_123" })
Tool returns: { plan: "pro", invoices: 3, next_billing: "2026-04-01" }
Context cost: ~200 tokens

Savings: 93% fewer tokens. 100% deterministic parsing. No auth tokens in context.
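The compression step in this transition can be sketched in System 2 code. This is a minimal TypeScript sketch assuming a Stripe-like response shape; `summarizeBilling` and the field paths are illustrative, not the real Stripe SDK.

```typescript
// Hypothetical System 2 wrapper: compresses a raw Stripe-like customer
// response into the compact shape the LLM actually needs.
interface BillingSummary {
  plan: string;
  invoices: number;
  next_billing: string;
}

// `raw` stands in for the parsed JSON body of a System 3 API response.
function summarizeBilling(raw: {
  subscription?: { plan?: { nickname?: string }; current_period_end?: number };
  invoices?: { total_count?: number };
}): BillingSummary {
  const periodEnd = raw.subscription?.current_period_end;
  return {
    plan: raw.subscription?.plan?.nickname ?? "none",
    invoices: raw.invoices?.total_count ?? 0,
    // Stripe-style epoch seconds -> ISO date; "unknown" if absent
    next_billing: periodEnd
      ? new Date(periodEnd * 1000).toISOString().slice(0, 10)
      : "unknown",
  };
}
```

The LLM sees only the three-field summary; auth, retries, and the 2KB of raw JSON stay below the line.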

Transition 2: Multi-Step Data Processing to Scripts

System 1 doing System 2’s job:

The LLM reads a CSV, iterates through rows, applies transformations, validates each record, and generates output. Each step adds to context. The LLM has to “remember” what it already processed.

Pushed down to System 2:

The LLM decides “process this CSV with the clean-and-validate script” and calls a single tool. The script handles iteration, transformation, validation, and outputs a summary.

// System 2: deterministic script.
// parse() and validate() stand in for project helpers
// (e.g. a CSV parser plus a rules-based validator).
function processCSV(input: string, rules: ValidationRules): ProcessResult {
  const rows = parse(input);
  const validated = rows.map(row => validate(row, rules));
  const errors = validated.filter(r => !r.valid);

  return {
    processed: validated.length,
    errors: errors.length,
    errorSummary: errors.slice(0, 5).map(e => e.reason), // cap at 5 to keep the LLM's context small
  };
}

The LLM sees a 5-line summary instead of processing 500 rows. Its context stays clean for the decision that matters: what to do about the errors.

Transition 3: LLM-Driven Validation to Type-Checked Contracts

System 1 doing System 2’s job:

The LLM generates a JSON payload, then re-reads it to check if it matches the expected schema. If it finds an error, it fixes it and re-checks. This loop burns tokens and is not reliable.

Pushed down to System 2:

// System 2: Zod schema validates deterministically
const PaymentSchema = z.object({
  amount: z.number().positive(),
  currency: z.enum(["USD", "EUR", "GBP"]),
  customer_id: z.string().startsWith("cus_"),
});

// LLM outputs structured data, System 2 validates
const result = PaymentSchema.safeParse(llmOutput);
if (!result.success) {
  // Feed compact error back to LLM for one correction
  return { error: result.error.flatten() };
}

Validation is not a judgment call. It is a mechanical check. Mechanical checks belong in System 2.

Transition 4: Agent-Managed State to External Persistence

System 1 doing System 3’s job:

The LLM keeps track of a workflow’s state in its context window. “We’ve completed steps 1-3, step 4 failed with error X, we need to retry.” As the workflow grows, the LLM spends more tokens re-reading its own history than doing useful work.

Pushed down to System 2 + System 3:

// System 3: database holds state
interface WorkflowState {
  id: string;
  currentStep: number;
  completedSteps: StepResult[];
  status: "running" | "paused" | "failed" | "completed";
}

// System 2: deterministic state machine
function nextAction(state: WorkflowState): Action {
  if (state.status === "failed") return { type: "retry", step: state.currentStep };
  if (state.currentStep >= TOTAL_STEPS) return { type: "complete" };
  return { type: "execute", step: state.currentStep + 1 };
}

// System 1: LLM only handles the ambiguous parts
// "Step 4 failed because the API returned a 429. Should I retry with backoff or skip?"

The LLM sees the current state (20 tokens) and the decision point (50 tokens), not the entire history (5000 tokens).

Transition 5: LLM Orchestration to Deterministic Pipelines

System 1 orchestrating other System 1 instances:

An orchestrator LLM reads a task, decides which sub-agent to call, constructs the prompt, reads the result, decides the next sub-agent, and so on. The orchestrator’s context grows with every delegation.

Pushed down to System 2:

// System 2: deterministic pipeline.
// Each stage reads only what it needs from the accumulated context,
// which runPipeline keys by agent name.
const pipeline: Stage[] = [
  { agent: "planner", input: (task, ctx) => task.description },
  { agent: "coder", input: (task, ctx) => ctx.planner.plan },
  { agent: "tester", input: (task, ctx) => ctx.coder.code },
  { agent: "reviewer", input: (task, ctx) => ({ code: ctx.coder.code, tests: ctx.tester.results }) },
];

async function runPipeline(task: Task): Promise<Record<string, unknown>> {
  let context: Record<string, unknown> = {};
  for (const stage of pipeline) {
    const input = stage.input(task, context);
    const result = await callAgent(stage.agent, input);
    context = { ...context, [stage.agent]: result };
  }
  return context;
}

The pipeline itself is deterministic code. Each agent receives only its relevant input, not the accumulated context of every prior agent. The orchestration logic lives in System 2, where it can be tested, versioned, and debugged without prompt engineering.


The Decision Rule

For every piece of work in an agent workflow, ask:

1. Does this require judgment, reasoning, or handling ambiguity?
   → System 1 (LLM)

2. Does this require deterministic transformation, validation, or computation?
   → System 2 (Tools/Scripts)

3. Does this require data retrieval, storage, or calling an external service?
   → System 3 (APIs)

If the answer is 2 or 3, the LLM should emit a tool call, not do the work itself. If the answer is 1, the LLM should receive the minimum context it needs to make the decision, not the raw data it would need to do everything.
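The three questions above can be expressed as a routing function. This is an illustrative TypeScript sketch; the flag names and the `Layer` type are hypothetical, not from any real framework.

```typescript
type Layer = "system1_llm" | "system2_tool" | "system3_api";

// Mirrors the three-question decision rule. Judgment trumps everything;
// external data trumps local computation; unknown cases fall back to
// the general reasoner.
function routeWork(work: {
  needsJudgment: boolean;
  isDeterministic: boolean;
  needsExternalData: boolean;
}): Layer {
  if (work.needsJudgment) return "system1_llm";
  if (work.needsExternalData) return "system3_api";
  if (work.isDeterministic) return "system2_tool";
  return "system1_llm"; // ambiguous or novel work stays with the LLM
}
```

In practice this check lives in your head during design review, but encoding it makes the routing decision auditable.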


Measuring the Shift

Track three metrics to know if orchestration is at the right layer:

Context Efficiency Ratio

Context Efficiency = tokens_for_decisions / total_tokens_in_context

If the LLM’s context is 80% raw data and 20% decision-relevant information, you have work to push down. Target: >70% decision-relevant tokens.

Determinism Score

Determinism = deterministic_steps / total_steps

Count the steps in a workflow. How many are deterministic (always produce the same output for the same input)? The higher this ratio, the more reliable the system. Target: >80% deterministic steps.

Token Cost per Decision

Cost per Decision = total_tokens / number_of_LLM_decisions

If the LLM makes 5 decisions but consumes 50,000 tokens, you are paying 10,000 tokens per decision. That means 90%+ of your token spend is on data shuffling, not intelligence. Push the data shuffling down.
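All three metrics can be computed from a logged workflow run. A minimal TypeScript sketch; the `StepLog` shape and field names are assumptions about what your own logging captures, not a standard format.

```typescript
// One entry per workflow step, as captured by your agent's logging.
interface StepLog {
  stepKind: "llm_decision" | "tool" | "api";
  tokens: number; // tokens this step added to the LLM context (0 for pushed-down work)
  deterministic: boolean;
}

function orchestrationMetrics(steps: StepLog[]) {
  const totalTokens = steps.reduce((sum, s) => sum + s.tokens, 0);
  const decisionSteps = steps.filter((s) => s.stepKind === "llm_decision");
  const decisionTokens = decisionSteps.reduce((sum, s) => sum + s.tokens, 0);
  return {
    // Context Efficiency: share of context spent on decisions
    contextEfficiency: totalTokens ? decisionTokens / totalTokens : 1,
    // Determinism Score: share of steps that are deterministic
    determinism: steps.filter((s) => s.deterministic).length / steps.length,
    // Token Cost per Decision
    tokensPerDecision: decisionSteps.length
      ? totalTokens / decisionSteps.length
      : Infinity,
  };
}
```

Run it weekly over sampled traces; a falling context-efficiency ratio is the earliest signal that raw data is leaking back into System 1.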


The Architecture That Emerges

When you consistently push orchestration down the stack, a clear architecture emerges:

┌────────────────────────────────────────────────────────────┐
│                      System 1: LLM                          │
│                                                              │
│  Receives: compact summaries, decision points, error flags   │
│  Emits: structured tool calls, routing decisions             │
│  Context: small, focused, high signal-to-noise               │
├────────────────────────────────────────────────────────────┤
│                   System 2: Tools & Scripts                   │
│                                                              │
│  Receives: structured inputs from System 1                   │
│  Does: parsing, validation, transformation, orchestration    │
│  Emits: compact results to System 1, calls to System 3      │
│  Properties: deterministic, testable, versioned, fast        │
├────────────────────────────────────────────────────────────┤
│                   System 3: APIs & Services                   │
│                                                              │
│  Receives: structured requests from System 2                 │
│  Does: data retrieval, storage, external computation         │
│  Emits: raw responses (consumed by System 2, not System 1)  │
│  Properties: deterministic contracts, priced per call        │
└────────────────────────────────────────────────────────────┘

Notice the data flow. System 3 responses flow to System 2 for processing, not directly to System 1. System 2 compresses, validates, and summarizes before passing results up. The LLM never sees raw API responses, unfiltered database dumps, or unstructured file contents. It sees processed, decision-ready information.

This is the same principle as Building the Harness: the LLM is a noisy channel, and every layer of harness increases the signal-to-noise ratio. The three-system model makes this structural rather than ad-hoc.


Connection to Kahneman’s Dual-Process Theory

The naming is deliberate. In cognitive psychology, System 1 is fast, intuitive, and pattern-matching. System 2 is slow, deliberate, and analytical. The irony is that in the AI agent stack, the mapping is reversed.

The LLM (System 1 in our model) is the slow, expensive reasoner. The tools and scripts (System 2) are the fast, cheap executors. But the principle holds: route work to the system best suited for it. Do not use your expensive reasoner for tasks that require mechanical precision. Do not use your mechanical executor for tasks that require judgment.

Kahneman’s key insight was that System 1 errors come from applying intuitive shortcuts where deliberate analysis is needed. In agent systems, the equivalent error is using the LLM where deterministic code is needed. The fix is the same: know which system to engage for which type of problem.


Anti-Patterns

Anti-Pattern 1: LLM as Data Pipeline

The LLM reads 50 records, transforms each one, and accumulates results. This is System 2 work wearing a System 1 costume.

Fix: Write a transformation function. Have the LLM call it once.

Anti-Pattern 2: LLM as API Client

The LLM constructs HTTP requests, manages headers, handles retries, and parses responses. This is System 3 coordination wearing a System 1 costume.

Fix: Wrap API interactions in typed tools. The LLM calls get_customer(id), not curl -H "Authorization: ...".

Anti-Pattern 3: LLM as State Machine

The LLM tracks which step it’s on, what’s been completed, and what’s next. This is System 2 state management wearing a System 1 costume.

Fix: External state persistence. The LLM receives the current state and decides the next action. It does not maintain the state.

Anti-Pattern 4: LLM Orchestrating LLMs

An orchestrator LLM manages sub-agents by holding their outputs in its context and routing between them. This is System 2 pipeline logic wearing a System 1 costume.

Fix: Deterministic pipeline code. System 2 manages the flow. System 1 instances handle their individual decisions.


When to Break the Rule

Not everything should be pushed down. Keep work in System 1 when:

  • The input is ambiguous. “This customer seems unhappy” requires judgment. A sentiment score does not capture the nuance.
  • The decision space is open-ended. “What should we do about this bug?” has too many valid answers for a switch statement.
  • The task is novel. If you have never seen this type of request before, the LLM’s generalization is your best tool. Once you see it three times, push it to System 2. (Ad-hoc to Scripts)
  • The context is small. If the total workflow is 10 steps and 500 tokens, the overhead of building tools exceeds the benefit. Optimization has diminishing returns.

The three-time rule: if the LLM does the same type of work three times, it should have been a tool after the first.


Practical Implementation Path

Phase 1: Audit

List every step in your agent workflows. Tag each step as System 1 (judgment), System 2 (mechanical), or System 3 (external data). Count the ratio.

Phase 2: Extract the Obvious

Move all validation, parsing, formatting, and data transformation into tools. This is usually 40-60% of an agent’s work. Ad-hoc to Scripts covers the conversion process.

Phase 3: Wrap External Services

Build typed tool wrappers around every API call. The tool handles auth, retries, error mapping, and response compression. The LLM calls search_customers({ query: "acme" }) and gets back 3 lines, not 3 pages.

Phase 4: Externalize State

Move workflow state out of the context window and into a database or state machine. The LLM receives a compact state snapshot and a decision prompt, not a conversation history.

Phase 5: Deterministic Orchestration

Replace LLM-driven multi-agent routing with typed pipeline code. Effect provides the algebra for composing these pipelines with typed errors, dependencies, and retry semantics.


Summary

The three-system model is a lens for diagnosing and fixing agent architectures:

  • System 1 (LLM): Judgment, reasoning, ambiguity resolution. Stochastic. Expensive.
  • System 2 (Tools/Scripts): Transformation, validation, orchestration. Deterministic. Cheap.
  • System 3 (APIs/Services): Data retrieval, external compute, storage. Deterministic. Priced.

The principle: push orchestration down to the lowest layer that can handle it. Every piece of work running in System 1 that could run in System 2 is wasted cost, wasted context, and wasted determinism. Every raw API response flowing directly into System 1 instead of through a System 2 compressor is noise polluting the reasoning channel.

LLMs are not expensive because they are bad. They are expensive because they are general. Use them for what requires generality. Use deterministic systems for everything else.

