Your LLM is non-deterministic. Your orchestrator is not. Verify the part you control.
Definition
Formal verification uses mathematical tools (TLA+ for state-machine exploration, Z3 for constraint solving) to prove properties of the deterministic orchestration layer around AI agents. It catches bugs that testing cannot: deadlocks, infinite loops, budget overruns, permission bypasses, and race conditions.
Mental Model
An agentic system has two layers:
┌─────────────────────────────────────────┐
│ Orchestration Layer (deterministic) │ ← Provable
│ State machine, routing, guards, budget │
├─────────────────────────────────────────┤
│ LLM Layer (non-deterministic) │ ← Testable (evals)
│ Content generation, reasoning │
└─────────────────────────────────────────┘
You cannot prove what an LLM will say. You can prove that your orchestrator always terminates, never exceeds budget, never lets an unprivileged agent call a dangerous tool, and never deadlocks when two agents fail simultaneously.
Formal verification wraps the deterministic shell; evals cover the stochastic core.
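As a minimal sketch of the split (assuming a hypothetical `call_model` stub in place of a real LLM client):

```python
# A minimal sketch of the two layers. `call_model` is a hypothetical
# stand-in for a real LLM call; the budget cap and tool allowlist are
# the deterministic parts you can prove.

ALLOWED_TOOLS = {"search", "read"}   # deterministic guard: provable
MAX_CALLS = 3                        # deterministic budget: provable

def call_model(prompt: str) -> str:
    return "stub response"           # stochastic core: covered by evals

def run(prompts: list[str], tool: str) -> list[str]:
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} not allowed")
    # Slicing to MAX_CALLS means the budget can never be exceeded,
    # regardless of what the model returns.
    return [call_model(p) for p in prompts[:MAX_CALLS]]
```

Whatever `call_model` returns, the guard and the budget behave the same way, which is exactly what makes them provable.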
The Two Tools
TLA+: “Can my system reach a bad state through some sequence of events?”
TLA+ (Temporal Logic of Actions) models systems as state machines, then exhaustively explores every reachable state. It was created by Leslie Lamport (who also created LaTeX). Amazon used it to find critical bugs in DynamoDB, S3, and EBS that no amount of conventional testing had caught.
You define:
- Variables (the state)
- Actions (transitions that change state)
- Invariants (what must always be true)
- Temporal properties (what must eventually happen)
TLC (the model checker) tries every possible interleaving of actions and either proves your properties hold or gives you the exact step-by-step trace that breaks them.
Install: VS Code extension “TLA+ Nightly” or download the TLA+ Toolbox. Source code and releases on GitHub.
Z3: “Do values exist that break this formula?”
Z3 is an SMT (satisfiability modulo theories) solver from Microsoft Research. You give it logical formulas and it either finds values that satisfy them (SAT), proves no such values exist (UNSAT), or says it cannot decide (UNKNOWN). See the official documentation and Python API reference.
Think of it as an impossibly powerful equation solver that works on booleans, integers, arrays, bitvectors, and strings.
Install: pip install z3-solver
The One-Liner Rule
- TLA+: “For all possible executions, does this property hold?” (dynamic, over time)
- Z3: “For all possible inputs, does this formula hold?” (static, one point in time)
TLA+ Syntax Essentials
VARIABLE count \* declare state variable
Init == count = 0 \* initial state
count' = count + 1 \* primed = next-state value
/\ \* logical AND
\/ \* logical OR
\A x \in S : P(x) \* for all x in set S
\E x \in S : P(x) \* there exists x in set S
[x \in S |-> expr] \* function (map)
[f EXCEPT ![k] = v] \* update function at key k
UNCHANGED <<a, b>> \* a and b stay the same
<>(condition) \* "eventually" (temporal)
[](condition) \* "always" (temporal)
Code Example: Verifying an Orchestrator
TLA+: Proving the orchestrator always terminates
This models a hierarchical/supervisory multi-agent system. A supervisor delegates tasks to specialist agents, handles failures with retries, and escalates to a human when retries are exhausted.
---- MODULE Orchestrator ----
EXTENDS Integers, FiniteSets

CONSTANTS Agents, MaxDelegations, MaxRetries

VARIABLES
    agent_status,   \* per-agent: "idle", "working", "failed", "done"
    agent_retries,  \* per-agent retry count
    delegations,    \* total delegations made
    orch_state      \* "planning", "delegating", "aggregating",
                    \* "escalating", "done"

Init ==
    /\ agent_status = [a \in Agents |-> "idle"]
    /\ agent_retries = [a \in Agents |-> 0]
    /\ delegations = 0
    /\ orch_state = "planning"

Delegate(agent) ==
    /\ orch_state = "delegating"
    /\ agent_status[agent] = "idle"
    /\ delegations < MaxDelegations
    /\ agent_status' = [agent_status EXCEPT ![agent] = "working"]
    /\ delegations' = delegations + 1
    /\ UNCHANGED <<agent_retries, orch_state>>

AgentSuccess(agent) ==
    /\ agent_status[agent] = "working"
    /\ agent_status' = [agent_status EXCEPT ![agent] = "done"]
    /\ UNCHANGED <<agent_retries, delegations, orch_state>>

AgentFailure(agent) ==
    /\ agent_status[agent] = "working"
    /\ agent_status' = [agent_status EXCEPT ![agent] = "failed"]
    /\ UNCHANGED <<agent_retries, delegations, orch_state>>

Retry(agent) ==
    /\ agent_status[agent] = "failed"
    /\ agent_retries[agent] < MaxRetries
    /\ delegations < MaxDelegations
    /\ agent_status' = [agent_status EXCEPT ![agent] = "working"]
    /\ agent_retries' = [agent_retries EXCEPT ![agent] = @ + 1]
    /\ delegations' = delegations + 1
    /\ UNCHANGED orch_state

Escalate(agent) ==
    /\ agent_status[agent] = "failed"
    /\ agent_retries[agent] >= MaxRetries
    /\ orch_state' = "escalating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

StartDelegating ==
    /\ orch_state = "planning"
    /\ orch_state' = "delegating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Aggregate ==
    /\ orch_state = "delegating"
    /\ \A a \in Agents : agent_status[a] \in {"done", "failed"}
    /\ \E a \in Agents : agent_status[a] = "done"
    /\ orch_state' = "aggregating"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Finish ==
    /\ orch_state = "aggregating"
    /\ orch_state' = "done"
    /\ UNCHANGED <<agent_status, agent_retries, delegations>>

Next ==
    \/ StartDelegating
    \/ \E a \in Agents :
          \/ Delegate(a) \/ AgentSuccess(a) \/ AgentFailure(a)
          \/ Retry(a) \/ Escalate(a)
    \/ Aggregate
    \/ Finish
vars == <<agent_status, agent_retries, delegations, orch_state>>

\* Weak fairness is needed for the liveness properties below:
\* without WF_vars(Next), an execution may stutter forever and
\* "eventually" claims fail trivially.
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
\* ── INVARIANTS (must hold in every reachable state) ──
BudgetRespected == delegations <= MaxDelegations
RetriesBounded == \A a \in Agents : agent_retries[a] <= MaxRetries
\* ── TEMPORAL PROPERTIES (must hold across all executions) ──
AlwaysTerminates == <>(orch_state \in {"done", "escalating"})
FailuresHandled ==
    \A a \in Agents :
        [](agent_status[a] = "failed" =>
            <>(agent_status[a] \in {"working", "done"}
               \/ orch_state = "escalating"))
====
TLC explores every possible interleaving of agent successes, failures, and retries. If there is a sequence where two agents fail simultaneously and the retry logic deadlocks, TLC finds it and gives you the exact trace.
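TLC takes a model configuration file alongside the module. A plausible `Orchestrator.cfg` for this spec, with deliberately small constant domains to keep the state space tractable (the concrete values here are illustrative):

```
SPECIFICATION Spec
CONSTANTS
    Agents = {a1, a2}
    MaxDelegations = 4
    MaxRetries = 2
INVARIANTS
    BudgetRespected
    RetriesBounded
PROPERTIES
    AlwaysTerminates
    FailuresHandled
```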
Z3: Proving routing logic has no gaps
from z3 import *
# ── Routing: prove every query type reaches the right agent ──
query_type = Int('query_type')
complexity = Int('complexity')
has_pii = Bool('has_pii')
# 0=research, 1=creative, 2=copywriting, 3=media, 4=senior_analyst
routed_to = If(has_pii,
               4,   # PII always goes to senior analyst (human-in-loop)
               If(query_type == 0, 0,
               If(query_type == 1, 1,
               If(query_type == 2, 2,
               If(query_type == 3, 3,
               If(complexity >= 8, 4,
                  0))))))
s = Solver()
s.add(query_type >= 0, query_type <= 4)
s.add(complexity >= 1, complexity <= 10)
# Prove: PII never goes to a non-senior agent
s.push()
s.add(has_pii)
s.add(routed_to != 4)
print("PII bypass possible?", s.check()) # unsat = proven impossible
s.pop()
# Prove: every query maps into the valid agent range 0..4
s.push()
s.add(Or(routed_to < 0, routed_to > 4))
print("Invalid routing?", s.check())  # unsat = every query handled
s.pop()
Z3: Proving permissions cannot be bypassed
from z3 import *
role = Int('role') # 0=basic, 1=elevated, 2=admin
tool = Int('tool') # 0=search, 1=read, 2=write, 3=execute, 4=delete
user_approved = Bool('user_approved')
allowed = Or(
    tool == 0,                                 # search: anyone
    And(tool == 1, role >= 1),                 # read: elevated+
    And(tool == 2, role >= 1, user_approved),  # write: elevated + approval
    And(tool == 3, role >= 2),                 # execute: admin only
    And(tool == 4, role >= 2, user_approved),  # delete: admin + approval
)
s = Solver()
s.add(role >= 0, role <= 2)
s.add(tool >= 0, tool <= 4)
# Can a basic agent ever execute code?
s.push()
s.add(role == 0, tool == 3, allowed)
print("Basic agent execute?", s.check()) # unsat = proven impossible
s.pop()
# Can ANY agent delete without approval?
s.push()
s.add(tool == 4, Not(user_approved), allowed)
print("Delete without approval?", s.check()) # unsat = proven impossible
s.pop()
Z3: Finding concurrency bugs, TLA+ style, for a single function
from z3 import *
# Classic check-then-act race: two processes each read the balance,
# each sees enough funds, and each withdraws. Model the stale reads
# symbolically and ask Z3 whether both checks can pass while the
# final balance still goes negative.
balance = Int('balance')
amount = Int('amount')
s = Solver()
s.add(balance >= 0, amount > 0)
s.add(balance >= amount)   # process 1's check passes
s.add(balance >= amount)   # process 2's check passes on the same stale read
s.add(balance - amount - amount < 0)   # both write: can we overdraft?
print(s.check())   # sat = yes, the race condition exists
print(s.model())   # a concrete witness, e.g. balance = 60, amount = 60
Mapping to Agent Architecture Patterns
| Architecture Pattern | TLA+ verifies | Z3 verifies |
|---|---|---|
| Single agent | Termination, retry bounds, state machine correctness | Tool permissions, input validation |
| Hierarchical/supervisor | Delegation deadlocks, all tasks eventually handled, no orphaned agents | Routing logic completeness, permission escalation, budget allocation |
| Sequential workflow | Each stage completes, no stage skipped, pipeline terminates | Stage transition guards, data validation between stages |
| Parallel workflow | Fan-out/fan-in completes, no lost results, aggregation always reached | Conflict resolution logic, result merging correctness |
| Evaluator-optimizer | Loop terminates (does not refine forever), quality threshold met | Evaluation criteria completeness, scoring function properties |
| Collaborative/swarm | No infinite message loops, consensus reached, no deadlock | Role assignment completeness, shared state conflict resolution |
What NOT to Verify Formally
Formal methods cover the deterministic shell. The LLM core stays in eval/testing territory.
| High value (do this) | Low value (skip this) |
|---|---|
| TLA+: orchestrator state machine | TLA+: modeling LLM output quality |
| TLA+: multi-agent deadlocks | TLA+: prompt effectiveness |
| Z3: routing completeness | Z3: “will the LLM hallucinate” |
| Z3: permission guards | Z3: content quality scoring |
| Z3: budget constraint proofs | Z3: natural language parsing |
When to Use Which
| Scenario | Tool |
|---|---|
| Can my system reach state X through some sequence of events? | TLA+ |
| Can inputs exist that make this function return a wrong value? | Z3 |
| Is my concurrent/distributed protocol correct? | TLA+ |
| Is my validation logic airtight? No bypasses? | Z3 |
| What happens if step 3 crashes and step 4 retries? | TLA+ |
| Find me a configuration that satisfies these 20 constraints | Z3 |
| Do all possible execution orderings preserve my invariant? | TLA+ |
| Prove this arithmetic/bitwise operation can never overflow | Z3 |
| Model a workflow with timeouts, retries, and failures | TLA+ |
| Generate test inputs that hit edge cases | Z3 |
When NOT to Use Formal Verification
- Simple CRUD. If your state transitions are trivial, it is overkill.
- Performance testing. These tools check correctness, not speed.
- UI logic. Usually not complex enough to warrant it.
- Pure LLM quality. Use evals, RLHF, and prompt engineering instead.
When You Absolutely Should
- Multiple agents can modify the same state concurrently.
- Your orchestrator has more than 5 states with complex guard conditions.
- Failure modes matter (what if a step crashes halfway through?).
- The cost of a bug is high (financial transactions, data corruption, security).
- You need to prove to auditors that certain states are unreachable.
Practical Playbook
For a content-driven agentic pipeline, apply verification at the right layer:
Your agentic pipeline
├── Orchestration layer ← TLA+ (state machines, concurrency, termination)
│ ├── Agent state machine
│ ├── Multi-agent coordination
│ ├── Retry/failure logic
│ └── Human-in-the-loop gates
├── Guard/routing layer ← Z3 (constraints, permissions, routing)
│ ├── Tool permissions
│ ├── Budget enforcement
│ ├── Query routing
│ └── Rate limiting logic
├── LLM interaction layer ← Evals + testing
│ ├── Prompt quality
│ ├── Output parsing
│ └── Content quality
└── Output validation layer ← JSON Schema, Pydantic, Zod
├── Structured output format
└── Business rule compliance
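The bottom layer needs no solver. A minimal stdlib sketch of the kind of contract a Pydantic model or JSON Schema would enforce there; the field names and limits are invented for illustration:

```python
import json

# Hypothetical output contract for a content pipeline.
REQUIRED = {"title": str, "body": str, "word_count": int}

def validate_output(raw: str) -> dict:
    data = json.loads(raw)                      # structured output format
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    if data["word_count"] > 2000:               # business rule compliance
        raise ValueError("word_count exceeds limit")
    return data
```

In practice you would reach for Pydantic or Zod here; the point is that this layer validates concrete LLM outputs at runtime, while TLA+ and Z3 validate the design around them.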
Related Tools
- tla-precheck (github.com/kingbootoshi/tla-precheck): Compiles a single TypeScript DSL into TLA+ specs, a TypeScript runtime interpreter, and Postgres constraints. Proves the TLA+ spec and TS interpreter produce identical state graphs.
- Alloy: Lightweight formal modeling language, good for data structure invariants.
- Dafny: Verification-aware programming language from Microsoft. Proofs live alongside code.
Gotchas
- TLA+ models a simplified version of your system. If the model diverges from the code, the proof means nothing. Tools like tla-precheck address this by generating both from the same source.
- Z3 proofs are only as good as your formula encoding. If your formulas omit a constraint or behavior the real system has, UNSAT proves a property of a different system than the one you run.
- State space explosion is real. TLA+ with large domains can run for hours. Use small domains for CI (proof tiers), larger domains for nightly runs.
- Neither tool replaces integration tests. They prove properties of your design, not your deployment.
Sources
- Leslie Lamport, Specifying Systems (TLA+ reference book, free PDF)
- Leslie Lamport, The TLA+ Home Page
- Leslie Lamport, Learn TLA+ video course
- Amazon, How Amazon Web Services Uses Formal Methods (CACM 2015)
- Microsoft Research, Z3 Theorem Prover (GitHub)
- Microsoft Research, Z3 Guide (interactive tutorial)
- Hillel Wayne, Learn TLA+ (practical tutorial)
- Anthropic, Building Effective AI Agents (2025)
- kingbootoshi/tla-precheck on GitHub
