Watch the Ralph

James Phoenix

What happens when you point the RALPH loop at a non-trivial Rust project and let it run for a full day? The raw output, the surprises, and the patterns that emerged.

Project: flowdiff (semantic diff analyzer in Rust) | Date: 2026 | Related: The RALPH Loop


The Problem

When AI agents modify 50-100 files in a single PR, existing diff tools (VS Code, GitHub, Beyond Compare) present changes as a flat file list. I mentally reconstruct data flow, architectural impact, and causal ordering. That is the most cognitively expensive part of code review.

My core insight: diff review is a graph problem, not a set problem. The right primitive is not “file A changed” but “request enters here, transforms here, validates here, persists here, emits here, renders here.”

flowdiff transforms flat file diffs into ranked, semantically grouped review flows.


What I Started With

A goal: “build a semantic diff viewer.” No existing code. I spent about an hour refining the spec with Claude before launching the loop, using bidirectional prompting to nail down the architecture, the 8-layer pipeline, the IR approach, and the phase breakdown. By the time I hit go, the spec was already detailed. Then a prompt short enough to quote in full:

study specs/readme.md
study specs/diff-analyzer.md and pick the most important thing to do

IMPORTANT:
- author property-based tests or unit tests (whichever is best)
- after making the changes to the files run the tests
- update the implementation plan when the task is done
- when tests pass, commit and push to deploy the changes

That is the entire prompt.md. The loop reads it, reads the spec, picks the next unchecked task, implements it, tests it, commits, marks it done. Fresh context. Repeat.

By the end of one day:

| Metric | Result |
| --- | --- |
| Tests | 791+ (unit, property-based, integration, live API, VCR replay) |
| Phases completed | 2 full phases, 2 partially complete |
| Analysis engine | 8-layer pipeline: git, AST, graph, flow, cluster, rank, LLM, export |
| Framework coverage | 30+ frameworks auto-detected from import patterns |
| LLM providers | 3 (Anthropic, OpenAI, Gemini) with structured outputs |
| Eval suite | 5 synthetic codebases, 0.89 average score |
| Spec growth | “Build a diff viewer” evolved into 8 phases with 60+ tasks |

The Spec as Living Document

I spent the first hour before the loop ever ran doing bidirectional prompting with Claude. I described the problem, Claude asked clarifying questions, I asked it questions back. We went back and forth until the spec had the 8-layer architecture, the graph-based clustering approach, the ranking formula, and the phase breakdown. I read every line and signed off on it. That hour was the most important hour of the entire day.

But the spec did not stay frozen. The spec is not a static input. It is the primary output of my work. The loop reads it each iteration. I edit it between iterations. The spec captures architectural decisions as they happen. By the end of the day, the spec was the most valuable artifact, not the code.

Hour -1:  Bidirectional prompting. Refine spec with Claude.
Hour 0:   Loop starts. Spec has 8 phases, architecture defined.
Hour 4:   IR refactor added mid-stream, .scm query engine designed
Hour 8:   Eval suite phase added after reading two reference articles
Hour 12:  60+ tasks, detailed JSON schemas, 791+ tests

The spec grew because I kept making design decisions and encoding them. The loop kept executing. Each checked-off task in the spec included a detailed description of what was built, what was tested, and how many tests passed. The spec became a living audit trail.


The Architecture That Emerged

┌─────────────────────────────────────────────────┐
│                   flowdiff CLI                   │
│                  (Rust binary)                   │
├─────────────────────────────────────────────────┤
│  Git Layer     │ Diff extraction (git2)          │
│  AST Layer     │ Tree-sitter + .scm queries      │
│  IR Layer      │ Language-agnostic types          │
│  Graph Layer   │ Symbol graph (petgraph)          │
│  Flow Layer    │ Data flow tracing + heuristics   │
│  Cluster Layer │ Semantic grouping                │
│  Rank Layer    │ Review ordering + scoring        │
│  LLM Layer     │ Anthropic + OpenAI + Gemini      │
│  Export Layer  │ JSON output + Mermaid            │
├─────────────────────────────────────────────────┤
│  IPC: JSON over stdin/stdout or Tauri commands   │
├──────────────────────┬──────────────────────────┤
│   Tauri App          │   VS Code Extension      │
│   (Three-panel UI)   │   (Thin shell over CLI)  │
└──────────────────────┴──────────────────────────┘

Two modes of operation:

  • Deterministic (free, fast): static analysis only. Graph construction, flow grouping, ranking.
  • LLM-annotated (paid, optional): reasoning models narrate over the deterministic graph. BYOK (bring your own key).

The ranking formula: score(group) = 0.35*risk + 0.25*centrality + 0.20*surface_area + 0.20*uncertainty
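As a sketch of how that formula computes in practice (only the weights come from the formula above; the struct and field semantics are illustrative, not flowdiff's actual API):

```rust
/// Illustrative signal bundle for one review group. All inputs are
/// assumed normalized to [0, 1]; field meanings are guesses for this sketch.
struct GroupSignals {
    risk: f64,         // e.g. touches auth, migrations, error paths
    centrality: f64,   // position in the symbol graph
    surface_area: f64, // files / public symbols touched
    uncertainty: f64,  // the analyzer's own confidence gap
}

/// score(group) = 0.35*risk + 0.25*centrality + 0.20*surface_area + 0.20*uncertainty
fn score(g: &GroupSignals) -> f64 {
    0.35 * g.risk + 0.25 * g.centrality + 0.20 * g.surface_area + 0.20 * g.uncertainty
}

fn main() {
    let hot = GroupSignals { risk: 1.0, centrality: 0.5, surface_area: 0.5, uncertainty: 0.0 };
    let cold = GroupSignals { risk: 0.1, centrality: 0.1, surface_area: 0.1, uncertainty: 0.1 };
    // Risky, central groups rank ahead of trivial ones.
    assert!(score(&hot) > score(&cold));
}
```

Because the weights sum to 1.0 and every signal is in [0, 1], the score stays in [0, 1] too, which keeps groups comparable across diffs of very different sizes.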


What Actually Happened, Phase by Phase

Phase 1: Core Pipeline (Complete)

The foundation. The loop built the git layer (git2, not shelling out), tree-sitter AST parsing for TS/JS and Python, symbol graph construction with petgraph, basic entrypoint detection, semantic clustering via forward reachability, review ranking with composite scores, JSON output with Mermaid diagram generation, and a CLI with clap.

Tests: 15 e2e integration tests against programmatic git repos (Express app, FastAPI, branch comparison, empty diff, JSON schema compliance, cross-cutting refactor, multiple entrypoints, mixed language, determinism, new-files-only, risk scoring, 20-file diff performance, Mermaid generation, commit range, entrypoint detection). Plus 25 AST tests, 25 graph tests, 6 property-based graph tests, 75 entrypoint tests (including 34 Effect.ts-specific), 16 clustering tests, 6 property-based clustering tests, 26 ranking tests, 11 property-based ranking tests.

This phase was straightforward. The loop handled it without intervention.

Phase 2: Language Intelligence (Complete)

This is where it got interesting. The loop built:

  • Heuristic inference in flow.rs: DB writes/reads, event emission/handling, config reads, HTTP calls, logging detection with confidence scoring and false positive guards (64 tests + 6 property-based)
  • Framework detection: auto-detect Express, Next.js, React, FastAPI, Flask, Django, Prisma, Effect.ts, and 30+ frameworks from import patterns and file structure conventions (12 tests)
  • Full data flow tracing: variable assignment tracking, call argument extraction, cross-file data flow edges (30 tests)
  • Shared IR (ir.rs): IrFile, IrFunctionDef, IrTypeDef, IrImport/IrExport, IrCallExpression, IrAssignment with IrPattern (Identifier, ObjectDestructure, ArrayDestructure, TupleDestructure) and IrExpression. Covers simple assignments, destructuring, Effect.ts yield* patterns, spread/rest, nested destructuring, defaults (72 unit + 12 property tests)
  • Declarative .scm query files: TypeScript/JS and Python each get imports.scm, exports.scm, definitions.scm, calls.scm, assignments.scm. Adding a new language = writing .scm files, zero Rust code
  • Generic query engine (query_engine.rs): loads .scm files, maps @capture names to IR types. Language-agnostic containing-function resolution via parent traversal (53 tests)
  • Capture-based matching refactor: replaced fragile pattern_index with has_capture()/get_capture(). Pattern ordering in .scm files is now irrelevant
  • Refactored graph.rs, entrypoint.rs, flow.rs, pipeline.rs to consume IR types instead of raw tree-sitter (51 parity tests)
  • Restructured tests to Rust convention: integration tests in tests/, unit tests co-located, shared RepoBuilder + assertion helpers

The IR refactor was my decision, made mid-stream. The original plan had the pipeline consuming raw tree-sitter nodes. I pushed for a shared intermediate representation, which made the whole architecture cleaner and made adding new languages a matter of writing .scm files instead of Rust code.

Key observation: The loop executed the refactor cleanly because I updated the spec first. Each fresh context read the updated spec and understood the new direction. No context rot from the old approach lingering in the conversation.

Phase 4: LLM Integration (Partial)

Three providers wired up with a unified LlmProvider trait and create_provider() factory. Two-pass architecture: Pass 1 (overview) scans all groups, Pass 2 (deep analysis) dives into a specific group with full file diffs.

The VCR caching layer wraps any LlmProvider as a decorator. Three modes: Record, Replay, Auto (cache-through). Cache keyed by SHA-256 of (provider, model, request JSON, prompt template hash). Automatic invalidation when system prompt templates change. 29 unit tests + 6 integration tests.
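The mode logic of such a decorator fits in a few lines. In this sketch a plain String key stands in for the SHA-256 cache key, and a closure stands in for the wrapped LlmProvider call; the type names are illustrative, not flowdiff's actual API:

```rust
use std::collections::HashMap;

enum VcrMode {
    Record, // always call live, save the response
    Replay, // cache only; a miss is an error
    Auto,   // cache-through: hit the cache, fall back to live
}

struct Vcr {
    mode: VcrMode,
    cache: HashMap<String, String>, // key stands in for the SHA-256 digest
}

impl Vcr {
    /// `call_live` stands in for the wrapped provider's request.
    fn complete(&mut self, key: &str, call_live: impl Fn() -> String) -> Result<String, String> {
        match self.mode {
            VcrMode::Replay => self
                .cache
                .get(key)
                .cloned()
                .ok_or_else(|| format!("no cassette recorded for {key}")),
            VcrMode::Record => {
                let resp = call_live();
                self.cache.insert(key.to_string(), resp.clone());
                Ok(resp)
            }
            VcrMode::Auto => {
                if let Some(hit) = self.cache.get(key) {
                    return Ok(hit.clone());
                }
                let resp = call_live();
                self.cache.insert(key.to_string(), resp.clone());
                Ok(resp)
            }
        }
    }
}

fn main() {
    use std::cell::Cell;
    let live_calls = Cell::new(0u32);
    let call = || {
        live_calls.set(live_calls.get() + 1);
        "cached response".to_string()
    };
    let mut vcr = Vcr { mode: VcrMode::Auto, cache: HashMap::new() };
    let first = vcr.complete("provider/model/request-hash", &call).unwrap();
    let second = vcr.complete("provider/model/request-hash", &call).unwrap();
    assert_eq!(first, second);
    assert_eq!(live_calls.get(), 1); // Auto mode served the second call from cache
}
```

Keying on the prompt template hash is what makes the invalidation automatic: change the system prompt and every stale cassette simply stops matching.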

164 LLM module tests total. Live integration tests against all three providers, gated behind FLOWDIFF_RUN_LIVE_LLM_TESTS=1.

Remaining: LLM refinement pass (evaluator-optimizer loop that improves grouping), structured outputs migration to provider-native APIs, Tauri rendering.

Phase 7: Eval Suite (Mostly Complete)

5 synthetic fixture codebases: (1) TypeScript Express with services + DB + events, (2) Python FastAPI with SQLAlchemy + Celery, (3) Next.js fullstack with React + API routes + Prisma, (4) Rust CLI with modules, (5) multi-language monorepo.

Expected baselines define ground truth: which files should group together, which entrypoints should be detected, review ordering constraints, score bounds. 6 per-criterion scorers (group_coherence, entrypoint_accuracy, review_ordering, risk_reasonableness, language_detection, file_accounting) all producing [0.0, 1.0].
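To make the scorer contract concrete, here is one hedged guess at a per-criterion scorer: set overlap between a predicted group and its expected baseline. The actual group_coherence formula lives in flowdiff's eval suite; this only illustrates the shared shape (ground truth in, a score in [0.0, 1.0] out):

```rust
use std::collections::HashSet;

/// Illustrative scorer: Jaccard overlap between predicted and expected
/// file groups. Not flowdiff's actual group_coherence definition, just
/// the unit-interval contract every scorer shares.
fn group_coherence(predicted: &HashSet<&str>, expected: &HashSet<&str>) -> f64 {
    let inter = predicted.intersection(expected).count() as f64;
    let union = predicted.union(expected).count() as f64;
    if union == 0.0 { 1.0 } else { inter / union }
}

fn main() {
    let expected: HashSet<_> = ["routes.ts", "service.ts", "db.ts"].into();
    let perfect = expected.clone();
    let partial: HashSet<_> = ["routes.ts", "service.ts"].into();
    assert_eq!(group_coherence(&perfect, &expected), 1.0);
    assert!(group_coherence(&partial, &expected) < 1.0);
}
```

Keeping every criterion on the same [0.0, 1.0] scale is what lets the suite report a single 0.89 average across six very different checks.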

LLM-as-judge evaluator (judge.rs) with 5 criteria scored 1-5. evaluate_quality on the LlmProvider trait, implemented in all 3 providers. 22 unit tests + 10 integration tests + 7 VCR judge tests.

16 eval tests total. Current average score: 0.89.

Remaining: eval harness CLI (flowdiff eval), HTML dashboard.


The Compounding Effect in Practice

This is what the RALPH loop articles describe in theory. Here is what it looks like in practice.

Iterations 1-10: Slow. The loop is building foundational code. Each iteration produces one module, one test file. Progress feels linear.

Iterations 10-30: Acceleration. The loop starts reusing patterns from earlier iterations. Tests run faster because the test harness exists. New modules plug into existing infrastructure. The shared RepoBuilder and assertion helpers mean each new integration test is a few lines, not a hundred.

Iterations 30+: The loop is now producing complex, interconnected features (eval suite, VCR caching, LLM-as-judge) that build on everything before. A single iteration produces more value than the first ten combined. The VCR layer wraps the LlmProvider trait that was built in iteration 20-something. The eval suite uses the RepoBuilder from iteration 5.

The spec captures this compounding. Each completed task makes the next task easier because the loop reads what already exists and builds on it.


Mid-Stream Architecture Decisions

My job was not writing code. It was making architecture decisions and updating the spec. Four decisions shaped the project:

1. Diff Review as Graph Problem

The foundational insight I encoded in the spec: the right primitive is not “file A changed” but directed paths through a dependency graph. This shaped everything. Petgraph for the symbol graph. Forward reachability for clustering. Composite scoring for ranking.
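Forward reachability itself is a plain graph traversal. A dependency-free sketch (flowdiff uses petgraph; here a HashMap adjacency list and invented node names stand in):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Every symbol reachable from a changed entrypoint joins that
/// entrypoint's review group: BFS over an adjacency map.
fn reachable(graph: &HashMap<&str, Vec<&str>>, start: &str) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([start.to_string()]);
    while let Some(node) = queue.pop_front() {
        if seen.insert(node.clone()) {
            for next in graph.get(node.as_str()).into_iter().flatten() {
                queue.push_back(next.to_string());
            }
        }
    }
    seen
}

fn main() {
    // Hypothetical flow: request enters, transforms, persists.
    let mut graph = HashMap::new();
    graph.insert("POST /orders", vec!["create_order"]);
    graph.insert("create_order", vec!["validate", "save_order"]);
    graph.insert("save_order", vec!["db.insert"]);
    let group = reachable(&graph, "POST /orders");
    assert!(group.contains("db.insert")); // persistence lands in the same review flow
    assert!(!group.contains("unrelated_module"));
}
```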

2. Shared IR Instead of Raw Tree-Sitter

Original plan: each module queries tree-sitter directly. Problem: every new language means new Rust code in every module.

Decision: define language-agnostic IR types (IrAssignment, IrPattern with destructuring variants, IrCallExpression, IrImport, IrFunctionDef). Pipeline operates only on IR. Languages provide .scm query files that the generic engine maps to IR.

This one decision meant adding Python support was writing .scm files, not touching Rust.
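The flavor of those IR types, abridged: the variant names mirror the ones listed above, but the payloads and the helper function are simplified guesses for this sketch, not flowdiff's actual definitions.

```rust
/// Abridged language-agnostic pattern type.
enum IrPattern {
    Identifier(String),
    ObjectDestructure(Vec<IrPattern>),
    ArrayDestructure(Vec<IrPattern>),
    TupleDestructure(Vec<IrPattern>),
}

/// Every binding a pattern introduces, however deeply nested. Flow
/// tracing needs exactly this kind of query, and it works identically
/// for TypeScript destructuring and Python tuple unpacking.
fn bound_names(p: &IrPattern) -> Vec<String> {
    match p {
        IrPattern::Identifier(name) => vec![name.clone()],
        IrPattern::ObjectDestructure(ps)
        | IrPattern::ArrayDestructure(ps)
        | IrPattern::TupleDestructure(ps) => ps.iter().flat_map(bound_names).collect(),
    }
}

fn main() {
    // const { user, roles: [admin] } = ...  -- as IR, language-agnostic
    let pat = IrPattern::ObjectDestructure(vec![
        IrPattern::Identifier("user".into()),
        IrPattern::ArrayDestructure(vec![IrPattern::Identifier("admin".into())]),
    ]);
    assert_eq!(bound_names(&pat), vec!["user", "admin"]);
}
```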

3. Capture-Based Matching Over Pattern Index

The .scm query engine initially used pattern_index to decide which IR type to construct. Problem: reordering patterns in the .scm file breaks Rust code silently.

Decision: match on which @capture names are present instead. Renamed captures to be distinct per kind (@fn_name/@fn_node, @class_name/@class_node). Order-independent, self-documenting, prevents silent breakage.
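A minimal sketch of the idea, using the capture names quoted above (the enum and helper are illustrative, not the real query engine):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum IrKind {
    Function,
    Class,
    Unknown,
}

/// Decide which IR node to build from which @capture names the
/// tree-sitter match carries, not from its position in the .scm file.
fn classify(captures: &HashMap<&str, &str>) -> IrKind {
    if captures.contains_key("fn_name") {
        IrKind::Function
    } else if captures.contains_key("class_name") {
        IrKind::Class
    } else {
        IrKind::Unknown
    }
}

fn main() {
    let m = HashMap::from([("fn_name", "handle_request"), ("fn_node", "<node>")]);
    assert_eq!(classify(&m), IrKind::Function);
    // Reordering patterns inside the .scm file cannot change this answer.
}
```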

4. Evaluator-Optimizer for LLM Refinement

Instead of trusting the LLM refinement pass blindly, I decided to score both v1 (deterministic) and v2 (LLM-refined) and keep whichever scores higher. The eval suite provides the scoring function. This means the LLM can only help, never hurt.
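That guarantee fits in one function; here the scoring closure stands in for the eval suite's composite scorer:

```rust
/// Keep whichever candidate scores higher. Ties go to the deterministic
/// v1, so the LLM pass can only improve the result, never degrade it.
fn keep_best<T>(v1: T, v2: T, score: impl Fn(&T) -> f64) -> T {
    if score(&v2) > score(&v1) { v2 } else { v1 }
}

fn main() {
    // Toy stand-in: the candidate's "quality" is the value itself.
    assert_eq!(keep_best(0.80, 0.92, |x| *x), 0.92); // refinement helped, keep it
    assert_eq!(keep_best(0.80, 0.61, |x| *x), 0.80); // refinement hurt, discard it
}
```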


The Prompt That Drove It

The entire prompt.md is 13 lines. It tells the loop to:

  1. Read the spec index (specs/readme.md)
  2. Read the spec and pick the most important unchecked task
  3. Write property-based or unit tests (whichever fits)
  4. Run the tests
  5. Update the implementation plan when done
  6. Commit and push when tests pass

No elaborate chain-of-thought instructions. No persona. No examples. Just: read the spec, do the work, test it, mark it done. The spec does the heavy lifting. The prompt is lightweight because the spec is comprehensive.

The one non-obvious addition: a pointer to a .env file path for LLM integration tests, and two article URLs to read before implementing the eval suite. Context injection at the prompt level, not the spec level.


The Spec Is the Task Manager

Here is the irony. I built a full task management CLI in TypeScript (tx) with ticket IDs, dependency graphs, status transitions, assignees, the works. For flowdiff, I ditched all of that. The task manager is a markdown file with checkboxes.

- [x] Git diff extraction via git2
- [x] Tree-sitter AST parsing (TS/JS + Python grammars first)
- [ ] Tauri project setup with React frontend
- [ ] Three-panel layout shell

The LLM reads the spec, sees which boxes are unchecked, understands dependencies from context (“Tauri needs the core API surface to exist first”), and picks the right next task. No ticket system required. The model is the scheduler.

This works because the spec carries enough context for the LLM to reason about priority. It knows that Phase 3 depends on Phase 1-2 being done because the architecture section describes the dependency. It knows which task is highest-leverage because the spec describes what each task enables. The checkboxes are the state. The spec prose is the dependency graph. The LLM is the scheduler that reads both.

The outer loop script does not even need to parse the checkboxes itself. But you could. A couple of grep invocations tell you how many tasks remain and whether any are left:

# Count remaining tasks
grep -c "^\- \[ \]" specs/diff-analyzer.md

# Check if all done
if ! grep -q "^\- \[ \]" specs/diff-analyzer.md; then
  echo "All tasks complete"
  exit 0
fi

Wire that into the RALPH loop’s outer script and it terminates automatically when the spec has no unchecked boxes. The spec becomes the task queue, the progress tracker, and the termination condition. No database, no API, no dashboard. Just markdown.

A dedicated task manager still makes sense for multi-person teams, cross-project coordination, and anything that needs audit trails beyond git history. But for a single-agent RALPH loop, the spec is the task manager. Adding anything else is overhead.


What Worked

The loop harness. Claude stalls sometimes. Auto-restart means no wasted time. A stalled iteration is just a lost minute, not a lost hour.

Spec-driven development. The spec is the single source of truth. The loop reads it each iteration and picks the next task. No ambiguity about what to build. The spec also records what was built, so the loop never repeats work.

Property-based testing from the start. Proptest caught edge cases that unit tests would have missed. The IR types had 12 property-based tests that found boundary conditions in destructuring patterns. The ranking module had 11. The graph module had 6. These are the tests that catch the bugs you would not think to write unit tests for.

Fresh context per iteration. The IR refactor would have been painful in a single long conversation. With fresh context, each iteration just read the updated spec and implemented against the new types. No confusion from the old approach.

Detailed task completion records. Each completed task in the spec includes exactly what was built and how many tests passed. This serves two purposes: the loop knows what exists (so it does not rebuild), and I can audit quality without reading every file.


What to Watch For

Spec size matters. The spec grew to 700+ lines. If it had grown much larger, context rot during each iteration’s spec-reading phase would have degraded quality. Keep specs as brief as possible while still capturing decisions. Consider splitting into per-phase spec files if the main spec exceeds ~1000 lines.

My bottleneck was design, not code. The loop executed faster than I could make decisions. Multiple times the loop completed a batch of tasks and was waiting for my next architectural direction. Have your design decisions ready.

Test quality compounds. A bad test early on poisons every future iteration that touches that module. The loop trusts tests. If a test is wrong, the loop builds around the wrong behavior. The eval suite’s 0.89 average score is only meaningful if the baselines are correct.

The first few iterations need close watching. Once the foundation is solid (Phase 1 done, test harness working, spec proven accurate), you can let the loop run unsupervised. Before that, watch every iteration.

Phase 8 is the reality check. The spec includes a hardening phase with parallel sub-agents auditing every layer: Rust core, query engine, LLM providers, Tauri app, VS Code extension, cross-layer integration, security. This is where the loop’s bugs get found. Plan for it.


The Division of Labor

This pattern (spec + loop + structured prompt) is a poor man’s autonomous agent with me as the architect steering.

| Role | Who |
| --- | --- |
| Problem framing (“graph problem, not set problem”) | Me |
| Architecture decisions (IR, .scm queries, VCR caching) | Me |
| Spec writing and updating | Me |
| Mid-stream course corrections | Me |
| Reading reference articles and adding new phases | Me |
| Implementation of every module | Loop |
| Testing (unit, property-based, integration, live) | Loop |
| Commit discipline | Loop |
| Task selection within a phase | Loop |

I made design decisions. The loop did the implementation grunt work. That is probably the right division of labor for now.


The Numbers

| Category | Count |
| --- | --- |
| Total tests (all passing) | 791+ |
| Unit tests (co-located #[cfg(test)]) | ~500 |
| Property-based tests (proptest) | ~80 |
| Integration tests (tests/ directory) | ~150 |
| Live API tests (gated) | ~30 |
| VCR replay tests | ~40 |
| Eval tests | 16 |
| Rust source modules | 15+ |
| .scm query files | 9 (5 TS, 4 Python) |
| Languages supported | 2 (TypeScript/JS, Python) |
| Frameworks detected | 30+ |
| LLM providers | 3 (Anthropic, OpenAI, Gemini) |
| Entrypoint types detected | 11 (HTTP, CLI, queue, cron, React, test, event, plus Effect.ts variants) |
| Eval fixture codebases | 5 |
| Eval average score | 0.89 |
| Phases completed | 2 full, 2 partial |
| Phases remaining | 4 (Tauri UI, VS Code, polish, hardening) |

What Remains

| Phase | Tasks Done | Tasks Left | Status |
| --- | --- | --- | --- |
| Phase 1: Core Engine | 16/16 | 0 | Complete |
| Phase 2: Data Flow Depth | 17/17 | 0 | Complete |
| Phase 3: Tauri App | 0/8 | 8 | Not started |
| Phase 4: LLM Integration | 14/19 | 5 | 74% |
| Phase 5: VS Code Extension | 0/6 | 6 | Not started |
| Phase 6: Polish | 0/7 | 7 | Not started |
| Phase 7: Eval Suite | 6/8 | 2 | 75% |
| Phase 8: Hardening | 0/8 | 8 | Not started |

The entire backend analysis engine is done. What remains: two UIs to build (Tauri + VS Code), LLM refinement, and hardening. The loop should target Phase 3 (Tauri) next since Phases 5, 6, and 8 depend on the UI existing.


Key Insight

I am not coding. I am programming the system that writes the code. The spec is my program. The loop is my runtime. My job is to be the architect, not the typist.

One day. 791+ tests. A full 8-layer analysis engine. The RALPH loop works when I feed it a good spec and make design decisions faster than it can execute.

