Formal Verification for Agent Orchestration

James Phoenix

Your LLM is non-deterministic. Your orchestrator is not. Verify the part you control.


Definition

Formal verification uses mathematical tools (TLA+ for state machine exploration, Z3 for constraint solving) to prove properties of the deterministic orchestration layer around AI agents, catching bugs that testing cannot: deadlocks, infinite loops, budget overruns, permission bypasses, and race conditions.

Mental Model

An agentic system has two layers:

┌─────────────────────────────────────────┐
│  Orchestration Layer (deterministic)    │  ← Provable
│  State machine, routing, guards, budget │
├─────────────────────────────────────────┤
│  LLM Layer (non-deterministic)          │  ← Testable (evals)
│  Content generation, reasoning          │
└─────────────────────────────────────────┘

You cannot prove what an LLM will say. You can prove that your orchestrator always terminates, never exceeds budget, never lets an unprivileged agent call a dangerous tool, and never deadlocks when two agents fail simultaneously.

Formal verification wraps the deterministic shell; evals cover the stochastic core.

The Two Tools

TLA+: “Can my system reach a bad state through some sequence of events?”

TLA+ (Temporal Logic of Actions) models systems as state machines, then exhaustively explores every reachable state. It was created by Leslie Lamport (who also created LaTeX). Amazon used it to find critical bugs in DynamoDB, S3, and EBS that no amount of conventional testing had caught.


You define:

  • Variables (the state)
  • Actions (transitions that change state)
  • Invariants (what must always be true)
  • Temporal properties (what must eventually happen)

TLC (the model checker) tries every possible interleaving of actions and either proves your properties hold or gives you the exact step-by-step trace that breaks them.

Install: VS Code extension “TLA+ Nightly” or download the TLA+ Toolbox. Source code and releases on GitHub.

Z3: “Do values exist that break this formula?”

Z3 is a SAT/SMT solver from Microsoft Research. You give it logical formulas and it either finds values that satisfy them (SAT), proves no such values exist (UNSAT), or says it cannot decide (UNKNOWN). See the official documentation and Python API reference.

Think of it as an impossibly powerful equation solver that works on booleans, integers, arrays, bitvectors, and strings.

Install: pip install z3-solver

The One-Liner Rule

  • TLA+: “For all possible executions, does this property hold?” (dynamic, over time)
  • Z3: “For all possible inputs, does this formula hold?” (static, one point in time)

TLA+ Syntax Essentials

VARIABLE count             \* declare state variable
Init == count = 0          \* initial state
count' = count + 1         \* primed = next-state value
/\                         \* logical AND
\/                         \* logical OR
\A x \in S : P(x)         \* for all x in set S
\E x \in S : P(x)         \* there exists x in set S
[x \in S |-> expr]        \* function (map)
[f EXCEPT ![k] = v]       \* update function at key k
UNCHANGED <<a, b>>        \* a and b stay the same
<>(condition)             \* "eventually" (temporal)
[](condition)             \* "always" (temporal)

Code Example: Verifying an Orchestrator

TLA+: Proving the orchestrator always terminates

This models a hierarchical/supervisory multi-agent system. A supervisor delegates tasks to specialist agents, handles failures with retries, and escalates to a human when retries are exhausted.

---- MODULE Orchestrator ----
EXTENDS Integers, FiniteSets

CONSTANTS Agents, MaxDelegations, MaxRetries

VARIABLES
    agent_status,    \* per-agent: "idle", "working", "failed", "done"
    agent_retries,   \* per-agent retry count
    delegations,     \* total delegations made
    orch_state       \* "planning", "delegating", "aggregating",
                     \*  "escalating", "done"

Init ==
    /\ agent_status = [a \in Agents |-> "idle"]
    /\ agent_retries = [a \in Agents |-> 0]
    /\ delegations = 0
    /\ orch_state = "planning"

Delegate(agent) ==
    /\ orch_state = "delegating"
    /\ agent_status[agent] = "idle"
    /\ delegations < MaxDelegations
    /\ agent_status' = [agent_status EXCEPT ![agent] = "working"]
    /\ delegations' = delegations + 1
    /\ UNCHANGED <<agent_retries, orch_state>>

AgentSuccess(agent) ==
    /\ agent_status[agent] = "working"
    /\ agent_status' = [agent_status EXCEPT ![agent] = "done"]
    /\ UNCHANGED <<agent_retries, delegations, orch_state>>

AgentFailure(agent) ==
    /\ agent_status[agent] = "working"
    /\ agent_status' = [agent_status EXCEPT ![agent] = "failed"]
    /\ UNCHANGED <<agent_retries, delegations, orch_state>>

Retry(agent) ==
    /\ agent_status[agent] = "failed"
    /\ agent_retries[agent] < MaxRetries
    /\ delegations < MaxDelegations
    /\ agent_status' = [agent_status EXCEPT ![agent] = "working"]
    /\ agent_retries' = [agent_retries EXCEPT ![agent] = @ + 1]
    /\ delegations' = delegations + 1
    /\ UNCHANGED orch_state

Escalate(agent) ==
    /\ agent_status[agent] = "failed"
    /\ agent_retries[agent] >= MaxRetries
    /\ orch_state' = "escalating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

StartDelegating ==
    /\ orch_state = "planning"
    /\ orch_state' = "delegating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Aggregate ==
    /\ orch_state = "delegating"
    /\ \A a \in Agents : agent_status[a] \in {"done", "failed"}
    /\ \E a \in Agents : agent_status[a] = "done"
    /\ orch_state' = "aggregating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Finish ==
    /\ orch_state = "aggregating"
    /\ orch_state' = "done"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Next ==
    \/ StartDelegating
    \/ \E a \in Agents :
        Delegate(a) \/ AgentSuccess(a) \/ AgentFailure(a)
        \/ Retry(a) \/ Escalate(a)
    \/ Aggregate
    \/ Finish

Spec == Init /\ [][Next]_<<agent_status, agent_retries,
                           delegations, orch_state>>

\* ── INVARIANTS (must hold in every reachable state) ──
BudgetRespected == delegations <= MaxDelegations
RetriesBounded == \A a \in Agents : agent_retries[a] <= MaxRetries

\* ── TEMPORAL PROPERTIES (must hold across all executions) ──
AlwaysTerminates == <>(orch_state \in {"done", "escalating"})
FailuresHandled ==
    \A a \in Agents :
        [](agent_status[a] = "failed" =>
            <>(agent_status[a] \in {"working", "done"}
               \/ orch_state = "escalating"))
====

TLC explores every possible interleaving of agent successes, failures, and retries. If there is a sequence where two agents fail simultaneously and the retry logic deadlocks, TLC finds it and gives you the exact trace.
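
The spec only pays off if the running code enforces the same guards. A minimal runtime counterpart, hand-written here as a sketch rather than generated from the spec (constants and names mirror the TLA+ variables), might look like:

```python
from dataclasses import dataclass, field

MAX_DELEGATIONS = 10
MAX_RETRIES = 2

@dataclass
class Orchestrator:
    agent_status: dict = field(default_factory=dict)   # agent -> status
    agent_retries: dict = field(default_factory=dict)  # agent -> retry count
    delegations: int = 0
    orch_state: str = "planning"

    def delegate(self, agent: str) -> bool:
        """Same preconditions as the TLA+ `Delegate` action."""
        if (self.orch_state == "delegating"
                and self.agent_status.get(agent, "idle") == "idle"
                and self.delegations < MAX_DELEGATIONS):
            self.agent_status[agent] = "working"
            self.delegations += 1
            return True
        return False

    def retry(self, agent: str) -> bool:
        """Same preconditions as the TLA+ `Retry` action."""
        if (self.agent_status.get(agent) == "failed"
                and self.agent_retries.get(agent, 0) < MAX_RETRIES
                and self.delegations < MAX_DELEGATIONS):
            self.agent_status[agent] = "working"
            self.agent_retries[agent] = self.agent_retries.get(agent, 0) + 1
            self.delegations += 1
            return True
        return False
```

Keeping the guard conditions textually close to the spec's actions makes drift between model and code easier to spot in review; tools like tla-precheck (below) automate that correspondence.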

Z3: Proving routing logic has no gaps

from z3 import *

# ── Routing: prove every query type reaches the right agent ──
query_type = Int('query_type')
complexity = Int('complexity')
has_pii = Bool('has_pii')

# 0=research, 1=creative, 2=copywriting, 3=media, 4=senior_analyst
routed_to = If(
    has_pii,
    4,  # PII always goes to senior analyst (human-in-loop)
    If(query_type == 0, 0,
    If(query_type == 1, 1,
    If(query_type == 2, 2,
    If(query_type == 3, 3,
    If(complexity >= 8, 4,
       0))))))

s = Solver()
s.add(query_type >= 0, query_type <= 4)
s.add(complexity >= 1, complexity <= 10)

# Prove: PII never goes to a non-senior agent
s.push()
s.add(has_pii)
s.add(routed_to != 4)
print("PII bypass possible?", s.check())  # unsat = proven impossible
s.pop()

# Prove: every query type is handled
s.push()
s.add(routed_to < 0)
print("Invalid routing?", s.check())  # unsat = every query handled
s.pop()

Z3: Proving permissions cannot be bypassed

from z3 import *

role = Int('role')       # 0=basic, 1=elevated, 2=admin
tool = Int('tool')       # 0=search, 1=read, 2=write, 3=execute, 4=delete
user_approved = Bool('user_approved')

allowed = Or(
    tool == 0,                                    # search: anyone
    And(tool == 1, role >= 1),                    # read: elevated+
    And(tool == 2, role >= 1, user_approved),     # write: elevated + approval
    And(tool == 3, role >= 2),                    # execute: admin only
    And(tool == 4, role >= 2, user_approved),     # delete: admin + approval
)

s = Solver()
s.add(role >= 0, role <= 2)
s.add(tool >= 0, tool <= 4)

# Can a basic agent ever execute code?
s.push()
s.add(role == 0, tool == 3, allowed)
print("Basic agent execute?", s.check())  # unsat = proven impossible
s.pop()

# Can ANY agent delete without approval?
s.push()
s.add(tool == 4, Not(user_approved), allowed)
print("Delete without approval?", s.check())  # unsat = proven impossible
s.pop()

Z3: Finding a check-then-act race condition

from z3 import *

# Classic check-then-act race: two processes read the balance, both
# pass the `balance >= withdraw` check against the same stale read,
# then both writes land.
balance = Int('balance')
withdraw = Int('withdraw')

s = Solver()
s.add(balance >= 0, withdraw > 0)
s.add(balance >= withdraw)                # both checks pass (stale read)
s.add(balance - withdraw - withdraw < 0)  # can the final balance go negative?

print(s.check())   # sat = yes, the race condition exists
print(s.model())   # a concrete witness (exact values vary by solver run)
Mapping to Agent Architecture Patterns

| Architecture pattern | TLA+ verifies | Z3 verifies |
| --- | --- | --- |
| Single agent | Termination, retry bounds, state machine correctness | Tool permissions, input validation |
| Hierarchical/supervisor | Delegation deadlocks, all tasks eventually handled, no orphaned agents | Routing logic completeness, permission escalation, budget allocation |
| Sequential workflow | Each stage completes, no stage skipped, pipeline terminates | Stage transition guards, data validation between stages |
| Parallel workflow | Fan-out/fan-in completes, no lost results, aggregation always reached | Conflict resolution logic, result merging correctness |
| Evaluator-optimizer | Loop terminates (does not refine forever), quality threshold met | Evaluation criteria completeness, scoring function properties |
| Collaborative/swarm | No infinite message loops, consensus reached, no deadlock | Role assignment completeness, shared state conflict resolution |

What NOT to Verify Formally

Formal methods cover the deterministic shell. The LLM core stays in eval/testing territory.

| High value (do this) | Low value (skip this) |
| --- | --- |
| TLA+: orchestrator state machine | TLA+: modeling LLM output quality |
| TLA+: multi-agent deadlocks | TLA+: prompt effectiveness |
| Z3: routing completeness | Z3: "will the LLM hallucinate" |
| Z3: permission guards | Z3: content quality scoring |
| Z3: budget constraint proofs | Z3: natural language parsing |

When to Use Which

| Scenario | Tool |
| --- | --- |
| Can my system reach state X through some sequence of events? | TLA+ |
| Can inputs exist that make this function return a wrong value? | Z3 |
| Is my concurrent/distributed protocol correct? | TLA+ |
| Is my validation logic airtight? No bypasses? | Z3 |
| What happens if step 3 crashes and step 4 retries? | TLA+ |
| Find me a configuration that satisfies these 20 constraints | Z3 |
| Do all possible execution orderings preserve my invariant? | TLA+ |
| Prove this arithmetic/bitwise operation can never overflow | Z3 |
| Model a workflow with timeouts, retries, and failures | TLA+ |
| Generate test inputs that hit edge cases | Z3 |

When NOT to Use Formal Verification

  • Simple CRUD. If your state transitions are trivial, it is overkill.
  • Performance testing. These tools check correctness, not speed.
  • UI logic. Usually not complex enough to warrant it.
  • Pure LLM quality. Use evals, RLHF, and prompt engineering instead.

When You Absolutely Should

  • Multiple agents can modify the same state concurrently.
  • Your orchestrator has more than 5 states with complex guard conditions.
  • Failure modes matter (what if a step crashes halfway through?).
  • The cost of a bug is high (financial transactions, data corruption, security).
  • You need to prove to auditors that certain states are unreachable.

Practical Playbook

For a content-driven agentic pipeline, apply verification at the right layer:

Your agentic pipeline
├── Orchestration layer          ← TLA+ (state machines, concurrency, termination)
│   ├── Agent state machine
│   ├── Multi-agent coordination
│   ├── Retry/failure logic
│   └── Human-in-the-loop gates
├── Guard/routing layer          ← Z3 (constraints, permissions, routing)
│   ├── Tool permissions
│   ├── Budget enforcement
│   ├── Query routing
│   └── Rate limiting logic
├── LLM interaction layer        ← Evals + testing
│   ├── Prompt quality
│   ├── Output parsing
│   └── Content quality
└── Output validation layer      ← JSON Schema, Pydantic, Zod
    ├── Structured output format
    └── Business rule compliance

Related Tools

  • tla-precheck (github.com/kingbootoshi/tla-precheck): Compiles a single TypeScript DSL into TLA+ specs, a TypeScript runtime interpreter, and Postgres constraints. Proves the TLA+ spec and TS interpreter produce identical state graphs.
  • Alloy: Lightweight formal modeling language, good for data structure invariants.
  • Dafny: Verification-aware programming language from Microsoft. Proofs live alongside code.

Gotchas

  • TLA+ models a simplified version of your system. If the model diverges from the code, the proof means nothing. Tools like tla-precheck address this by generating both from the same source.
  • Z3 proofs are only as good as your formula encoding. If you forgot a constraint, the proof is vacuously true.
  • State space explosion is real. TLA+ with large domains can run for hours. Use small domains for CI (proof tiers), larger domains for nightly runs.
  • Neither tool replaces integration tests. They prove properties of your design, not your deployment.

