How I’m Optimising Flowdiff for Semantic Grouping

James Phoenix

Flowdiff exists because raw git diff is the wrong abstraction for reviewing large AI-generated pull requests. Git shows me which files changed. What I actually need is a higher-level answer: which changes belong to the same behavior, where execution starts, how the change fans out, and what should be reviewed first.

The optimisation work in Flowdiff is centered on that gap. I am trying to group semantically similar changes without collapsing into fuzzy “these files kind of look related” clustering. The approach is deliberately hybrid:

  1. A deterministic structural pass builds the strongest possible program-level model from the diff.
  2. An LLM pass refines only the ambiguous cases that the structural pass cannot settle cleanly.
  3. An eval loop measures whether each change actually improves grouping quality on real repositories.

The Core Thesis

I do not want the LLM to invent review groups from scratch.

If grouping starts as a pure prompt over raw diff text, the result is expensive, hard to debug, and hard to reproduce. It also tends to over-index on naming similarity and miss the actual execution path through the codebase.

So the main optimisation strategy in Flowdiff is:

  • push as much semantic signal as possible into the AST, IR, and graph layers
  • make the deterministic grouping engine do the heavy lifting
  • use the LLM as a constrained patch layer on top of that baseline

That keeps the system fast and reproducible while still giving me a way to recover from edge cases like scattered refactors, weak entrypoint signals, and misleading infrastructure buckets.

AST to IR to Graph

The structural side starts with tree-sitter parsing and declarative query extraction. The parser layer reads changed files, identifies language-specific syntax, and extracts definitions, imports, exports, calls, and assignment patterns. The important design choice is that this does not stop at per-language AST traversal. The parser normalises everything into a shared intermediate representation so the rest of the engine can reason about TypeScript, Python, Go, Rust, and other languages through the same conceptual model.

That shared IR is the real optimisation surface.


Instead of saying “this is a TypeScript route file” or “this is a Python controller” in ten different hand-written ways downstream, I want one language-agnostic layer that captures:

  • functions and exported symbols
  • types and structural definitions
  • imports and re-exports
  • call expressions and data-flow-adjacent assignments
  • binding patterns such as destructuring and tuple unpacking

This is what that looks like in the IR layer:

pub enum IrPattern {
    Identifier(String),
    ObjectDestructure {
        properties: Vec<DestructureProperty>,
        rest: Option<String>,
    },
    ArrayDestructure {
        elements: Vec<Option<IrPattern>>,
        rest: Option<String>,
    },
    TupleDestructure { elements: Vec<IrPattern> },
}

impl IrPattern {
    pub fn bound_names(&self) -> Vec<String> {
        let mut names = Vec::new();
        self.collect_names(&mut names);
        names
    }
}

That example is high signal because it shows the real job of the IR. Flowdiff is not just storing syntax nodes. It is normalising different binding shapes into one representation that downstream stages can reason about uniformly. A JavaScript destructure, a Python tuple unpack, and a Rust tuple binding all collapse into the same conceptual surface.
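The `collect_names` helper is referenced but not shown above. As a compilable sketch of how that recursion might look (the `DestructureProperty` shape and field names here are my assumptions, not Flowdiff's actual definitions), the point is that every binding shape flattens to the same list of bound names:

```rust
// Hypothetical sketch — field names are assumptions, not Flowdiff's real API.
#[derive(Debug)]
pub struct DestructureProperty {
    pub key: String,
    pub binding: String, // the local name the property binds to
}

pub enum IrPattern {
    Identifier(String),
    ObjectDestructure {
        properties: Vec<DestructureProperty>,
        rest: Option<String>,
    },
    ArrayDestructure {
        elements: Vec<Option<IrPattern>>,
        rest: Option<String>,
    },
    TupleDestructure { elements: Vec<IrPattern> },
}

impl IrPattern {
    pub fn bound_names(&self) -> Vec<String> {
        let mut names = Vec::new();
        self.collect_names(&mut names);
        names
    }

    // Recursively flatten any binding shape into the names it introduces.
    fn collect_names(&self, names: &mut Vec<String>) {
        match self {
            IrPattern::Identifier(name) => names.push(name.clone()),
            IrPattern::ObjectDestructure { properties, rest } => {
                for p in properties {
                    names.push(p.binding.clone());
                }
                if let Some(r) = rest {
                    names.push(r.clone());
                }
            }
            IrPattern::ArrayDestructure { elements, rest } => {
                for e in elements.iter().flatten() {
                    e.collect_names(names);
                }
                if let Some(r) = rest {
                    names.push(r.clone());
                }
            }
            IrPattern::TupleDestructure { elements } => {
                for e in elements {
                    e.collect_names(names);
                }
            }
        }
    }
}

fn main() {
    // JS: const { id, name } = user;  →  ObjectDestructure
    let js = IrPattern::ObjectDestructure {
        properties: vec![
            DestructureProperty { key: "id".into(), binding: "id".into() },
            DestructureProperty { key: "name".into(), binding: "name".into() },
        ],
        rest: None,
    };
    // Python: (id, name) = row  →  TupleDestructure
    let py = IrPattern::TupleDestructure {
        elements: vec![
            IrPattern::Identifier("id".into()),
            IrPattern::Identifier("name".into()),
        ],
    };
    // Different surface syntax, identical bound-name surface downstream.
    assert_eq!(js.bound_names(), py.bound_names());
}
```

Downstream stages never need to know which language the binding came from; they only see the bound names.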

Once the code is normalised into IR, Flowdiff builds a symbol graph and enriches it with flow information. That graph is what lets grouping move beyond path proximity. Files are no longer related just because they sit in neighboring directories. They are related because they share entrypoints, imports, calls, and execution-adjacent edges.

This matters for semantically similar content because many large diffs are not physically local. A single behavior might span a route file, a service, a repository, a schema, and a migration. Those files can live far apart in the tree while still belonging to the same review packet. The AST and IR layers are what give the grouping algorithm enough structure to recover that shape.

The entrypoint layer is equally important. Grouping quality depends heavily on whether the system can identify where execution likely starts, and Flowdiff handles that with explicit heuristics rather than vague prompting:

fn detect_file_entrypoints(file: &ParsedFile, out: &mut Vec<Entrypoint>) {
    detect_test_file(file, out);
    detect_http_routes(file, out);
    detect_path_based_http_routes(file, out);
    detect_cli_commands(file, out);
    detect_path_based_cli_commands(file, out);
    detect_queue_consumers(file, out);
    detect_cron_jobs(file, out);
    detect_react_pages(file, out);
    detect_event_handlers(file, out);
    detect_effect_ts(file, out);
    detect_rust_modules(file, out);
}

That small function explains a lot about the system design. Flowdiff is not trying to derive all semantics from one generic graph traversal. It has concrete priors about tests, routes, commands, jobs, React pages, Effect code, and Rust module layouts. Those priors are what let the graph pass start from meaningful anchors.
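The individual detectors aren't shown in the excerpt. To illustrate the convention-based style, here is a hypothetical route detector over deliberately simplified IR types; every type, field, and constant below is an assumption for illustration, not Flowdiff's real implementation:

```rust
// Hypothetical, simplified IR types — stand-ins for Flowdiff's real ones.
#[derive(Debug)]
struct IrCall {
    receiver: Option<String>,      // e.g. "app" in app.get(...)
    method: String,                // e.g. "get"
    first_arg_str: Option<String>, // e.g. "/users/:id"
}

struct ParsedFile {
    path: String,
    calls: Vec<IrCall>,
}

#[derive(Debug)]
struct Entrypoint {
    file: String,
    kind: String,
    label: String,
}

const HTTP_METHODS: &[&str] = &["get", "post", "put", "patch", "delete"];

// A convention-based prior: calls shaped like `app.get("/path", …)` or
// `router.post("/path", …)` are treated as HTTP route entrypoints.
fn detect_http_routes(file: &ParsedFile, out: &mut Vec<Entrypoint>) {
    for call in &file.calls {
        let is_router = matches!(call.receiver.as_deref(), Some("app") | Some("router"));
        if is_router && HTTP_METHODS.contains(&call.method.as_str()) {
            if let Some(path) = &call.first_arg_str {
                out.push(Entrypoint {
                    file: file.path.clone(),
                    kind: "http_route".into(),
                    label: format!("{} {}", call.method.to_uppercase(), path),
                });
            }
        }
    }
}

fn main() {
    let file = ParsedFile {
        path: "src/routes/users.ts".into(),
        calls: vec![IrCall {
            receiver: Some("router".into()),
            method: "get".into(),
            first_arg_str: Some("/users/:id".into()),
        }],
    };
    let mut out = Vec::new();
    detect_http_routes(&file, &mut out);
    assert_eq!(out[0].label, "GET /users/:id");
}
```

The value of this style is that each prior is a few lines of cheap pattern matching over the IR, so adding a new framework convention is a local, testable change rather than a prompt tweak.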

Deterministic Grouping as the Main Engine

The current deterministic grouping pass works by detecting likely entrypoints, tracing reachability through the symbol graph, and assigning changed files to the nearest meaningful flow. That gives Flowdiff its basic unit: the flow group.

The optimisation work here is about reducing the two main failure modes:

  1. group explosion, where one logical change shatters into too many tiny groups
  2. infrastructure collapse, where too many files end up in a catch-all bucket

The fixes are structural, not cosmetic:

  • stronger entrypoint detection so real route, command, test, and worker files are recognised earlier
  • better IR and graph coverage so more files are connected to a meaningful execution path
  • bidirectional reachability so files that depend on an entrypoint-adjacent flow are not automatically discarded as unrelated
  • convention-based infrastructure classification so docs, scripts, schemas, migrations, generated code, and true infra are not all treated as the same leftover category
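The bidirectional reachability point is worth making concrete. A minimal sketch, assuming a file-level graph where edges are import/call relationships (the function names and graph encoding here are illustrative, not Flowdiff's internals): files related to an entrypoint are those reachable forward from it, unioned with those from which the entrypoint itself is reachable.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Plain BFS over an adjacency map.
fn bfs(adj: &HashMap<&str, Vec<&str>>, start: &str) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::from([start.to_string()]);
    while let Some(node) = queue.pop_front() {
        if !seen.insert(node.clone()) {
            continue;
        }
        for next in adj.get(node.as_str()).into_iter().flatten() {
            queue.push_back(next.to_string());
        }
    }
    seen
}

// Files related to an entrypoint = forward reachable ∪ backward reachable.
fn bidirectional_reach(edges: &[(&str, &str)], entry: &str) -> HashSet<String> {
    let mut fwd: HashMap<&str, Vec<&str>> = HashMap::new();
    let mut rev: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(a, b) in edges {
        fwd.entry(a).or_default().push(b);
        rev.entry(b).or_default().push(a);
    }
    let mut reach = bfs(&fwd, entry);
    reach.extend(bfs(&rev, entry));
    reach
}

fn main() {
    let edges = [
        ("route.ts", "service.ts"),
        ("service.ts", "repo.ts"),
        ("route.test.ts", "route.ts"), // depends on the entrypoint file
    ];
    let reach = bidirectional_reach(&edges, "route.ts");
    // A forward-only traversal would discard route.test.ts as unrelated.
    assert!(reach.contains("route.test.ts"));
    assert!(reach.contains("repo.ts"));
}
```

The backward direction is what rescues files like tests and callers that point *into* a flow without being reachable *from* its entrypoint.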

This is the key idea: the closer I can get the deterministic pass to “mostly right,” the more useful the LLM becomes. If the baseline groups already reflect real program structure, the LLM does not need to perform full semantic reconstruction. It only needs to repair the residual mistakes.

The clustering code makes that philosophy explicit:

let reachability: Vec<HashMap<String, usize>> = entrypoints
    .iter()
    .map(|ep| {
        let mut reach = compute_file_reachability(graph, &ep.file, &ep.symbol);
        reach.entry(ep.file.clone()).or_insert(0);
        reach
    })
    .collect();

for file in &changed_set {
    let mut best: Option<(usize, usize)> = None;

    for (ep_idx, reach) in reachability.iter().enumerate() {
        if let Some(&dist) = reach.get(file.as_str()) {
            match best {
                None => best = Some((ep_idx, dist)),
                Some((best_ep, best_dist)) => {
                    if dist < best_dist || (dist == best_dist && ep_idx < best_ep) {
                        best = Some((ep_idx, dist));
                    }
                }
            }
        }
    }

    // `best` now holds the nearest entrypoint for this file; assigning the
    // file into that entrypoint's flow group is elided in this excerpt.
}

That is not a fuzzy similarity algorithm. It is a nearest-entrypoint assignment over graph reachability with deterministic tie-breaking. It is cheap to run, easy to reason about, and debuggable when a repo scores badly.

The other useful detail is that the deterministic pass now includes a consolidation layer for tiny groups:

const SMALL_GROUP_THRESHOLD: usize = 3;

fn consolidate_small_groups(mut groups: Vec<FlowGroup>) -> Vec<FlowGroup> {
    for depth in (1..=4).rev() {
        groups = merge_at_depth(groups, depth);
    }
    groups
}
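The `merge_at_depth` helper isn't shown above. A minimal sketch of what directory-based consolidation could look like, assuming small groups are bucketed by a shared path prefix (the bucketing rule and struct fields here are my guesses, not the real implementation):

```rust
use std::collections::HashMap;

const SMALL_GROUP_THRESHOLD: usize = 3;

#[derive(Debug)]
struct FlowGroup {
    name: String,
    files: Vec<String>,
}

// Take the first `depth` path segments: "src/api/users.ts" at depth 2 → "src/api".
fn dir_prefix(path: &str, depth: usize) -> String {
    path.split('/').take(depth).collect::<Vec<_>>().join("/")
}

// Merge small groups whose files share a directory prefix at `depth`;
// groups at or above the threshold pass through untouched.
fn merge_at_depth(groups: Vec<FlowGroup>, depth: usize) -> Vec<FlowGroup> {
    let (small, mut keep): (Vec<_>, Vec<_>) = groups
        .into_iter()
        .partition(|g| g.files.len() < SMALL_GROUP_THRESHOLD);

    let mut buckets: HashMap<String, FlowGroup> = HashMap::new();
    for g in small {
        // Use the first file's prefix as the bucket key for the whole group.
        let key = dir_prefix(&g.files[0], depth);
        buckets
            .entry(key.clone())
            .or_insert_with(|| FlowGroup { name: key, files: Vec::new() })
            .files
            .extend(g.files);
    }
    keep.extend(buckets.into_values());
    keep
}

fn main() {
    let groups = vec![
        FlowGroup { name: "a".into(), files: vec!["src/api/users.ts".into()] },
        FlowGroup { name: "b".into(), files: vec!["src/api/posts.ts".into()] },
    ];
    // Two singletons under src/api collapse into one group at depth 2.
    let merged = merge_at_depth(groups, 2);
    assert_eq!(merged.len(), 1);
    assert_eq!(merged[0].files.len(), 2);
}
```

Iterating from depth 4 down to 1, as `consolidate_small_groups` does, merges at the most specific shared directory first and only falls back to coarser prefixes for the stragglers.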

That looks minor, but it turned out to matter a lot in practice. The experiment log shows why. On March 26, 2026, experiment #5 added directory-based consolidation for small groups and improved avg_overall to 0.9294, with the note “Massive win. Singleton ratio ->0 across most repos.” The very next experiment, #6, tried skipping the shallowest merge depth and regressed to 0.9226, so it was discarded.

This is exactly the kind of optimisation work I want Flowdiff to support. The algorithm is simple enough to change locally, and the effect is measurable against a corpus instead of judged by vibe.

The LLM Pass as a Structured Patch Layer

The LLM pass is intentionally narrow.

Instead of asking the model to re-cluster the entire diff from raw text, Flowdiff gives it a structured view of the current grouping state and asks for specific operations:

  • split a group
  • merge groups
  • re-rank review order
  • reclassify misplaced files
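The key property is that the operation space is closed. As a toy sketch of that idea (the enum variants, field names, and line-based format below are illustrative assumptions, not Flowdiff's actual schema or wire format), model output either parses into a well-typed operation or is rejected:

```rust
// Hypothetical sketch: refinement as a closed set of typed operations.
#[derive(Debug, Clone, PartialEq)]
enum RefinementOp {
    Split { source_group_id: String, partitions: Vec<Vec<String>> },
    Merge { group_ids: Vec<String> },
    Rerank { ordered_group_ids: Vec<String> },
    Reclassify { file: String, target_group_id: String },
}

// Because the op space is closed, parsing a model response either yields
// a well-typed operation or fails loudly — there is no "interpret the
// prose" path. Toy line format: "merge g1 g2", "reclassify file g1", …
fn parse_op(raw: &str) -> Option<RefinementOp> {
    let mut parts = raw.split_whitespace();
    match parts.next()? {
        "merge" => Some(RefinementOp::Merge {
            group_ids: parts.map(String::from).collect(),
        }),
        "reclassify" => Some(RefinementOp::Reclassify {
            file: parts.next()?.to_string(),
            target_group_id: parts.next()?.to_string(),
        }),
        _ => None, // unknown ops are rejected, not guessed at
    }
}

fn main() {
    let op = parse_op("merge auth-flow billing-flow");
    assert_eq!(
        op,
        Some(RefinementOp::Merge {
            group_ids: vec!["auth-flow".into(), "billing-flow".into()]
        })
    );
    // Free-form instructions have nowhere to land.
    assert_eq!(parse_op("summarize everything"), None);
}
```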

This is a much better fit for the problem.

The deterministic pass is very good at extracting hard signals: imports, exported handlers, graph reachability, file roles, and basic execution order. The LLM is much better at the softer semantic questions:

  • are these two files part of the same refactor even if the graph is weak?
  • is this group actually two different reviewer tasks mixed together?
  • should the schema be reviewed before the handler even if the graph shape is shallow?
  • is this file “infrastructure” or is it actually part of the behavior change?

So the optimisation goal is not “replace the deterministic engine with AI.” It is “give the model a strong structural prior, then let it make bounded semantic corrections.”

The implementation reflects that constraint directly:

pub fn validate_refinement(
    response: &RefinementResponse,
    groups: &[FlowGroup],
    infrastructure: Option<&InfrastructureGroup>,
) -> Result<(), RefinementError> {
    let group_ids: HashSet<&str> = groups.iter().map(|g| g.id.as_str()).collect();

    for split in &response.splits {
        if !group_ids.contains(split.source_group_id.as_str()) {
            return Err(RefinementError::UnknownSplitSource(
                split.source_group_id.clone(),
            ));
        }
    }

    Ok(())
}

And then the patch application happens in a fixed order:

pub fn apply_refinement(
    groups: &[FlowGroup],
    infrastructure: Option<&InfrastructureGroup>,
    response: &RefinementResponse,
) -> Result<(Vec<FlowGroup>, Option<InfrastructureGroup>), RefinementError> {
    validate_refinement(response, groups, infrastructure)?;

    // Applied in a fixed order (bodies elided from this excerpt):
    // 1. reclassifications
    // 2. splits
    // 3. merges
    // 4. re-ranks
}

This is the real reason I describe the LLM as a patch layer. The model is not allowed to hand back a blob of prose and hope the system interprets it correctly. It has to propose operations against an existing grouping state, and those operations are validated before application. That makes the refinement stage legible.

Why This Looks Like Autoresearch

This work maps naturally onto the idea behind karpathy/autoresearch: treat research itself as an executable loop.

In this repo, the connection is not metaphorical. The “research” manifest is literal:

include_dir = "repos"

[defaults]
max_groups = 200
max_infra_ratio = 0.50
max_singleton_ratio = 0.60

And the experiment runner is literal too:

cargo run -p flowdiff-cli -- eval \
  --manifest eval/repositories.research.toml \
  --format json 2>/dev/null > /tmp/fd-eval-result.json

experiments/program.md then defines the loop in plain English:

  1. read the current experiment log
  2. pick one phase
  3. form one hypothesis
  4. change one variable
  5. run eval
  6. record the outcome
  7. keep or discard the change

The important part is that the objective function is not training loss. My target is grouping quality:

  • do related files land together?
  • do unrelated files stay separate?
  • does infrastructure stay under control?
  • does the group count stay usable?
  • does review order match human intuition?
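Several of those questions map directly onto the manifest thresholds shown earlier. A hedged sketch of how such a budget check might work (the struct names and the exact ratio definitions are assumptions; only the threshold values come from the manifest above):

```rust
// Hypothetical metric shapes — names mirror the TOML keys, details assumed.
struct GroupingMetrics {
    group_count: usize,
    singleton_count: usize, // groups containing exactly one file
    infra_files: usize,     // files landing in the infrastructure bucket
    total_files: usize,
}

struct Thresholds {
    max_groups: usize,
    max_infra_ratio: f64,
    max_singleton_ratio: f64,
}

// A grouping passes the budget when it stays under every threshold.
fn within_budget(m: &GroupingMetrics, t: &Thresholds) -> bool {
    let singleton_ratio = m.singleton_count as f64 / m.group_count as f64;
    let infra_ratio = m.infra_files as f64 / m.total_files as f64;
    m.group_count <= t.max_groups
        && infra_ratio <= t.max_infra_ratio
        && singleton_ratio <= t.max_singleton_ratio
}

fn main() {
    // Values from eval/repositories.research.toml's [defaults] section.
    let t = Thresholds { max_groups: 200, max_infra_ratio: 0.50, max_singleton_ratio: 0.60 };
    let m = GroupingMetrics { group_count: 12, singleton_count: 2, infra_files: 5, total_files: 40 };
    assert!(within_budget(&m, &t));
}
```

Hard budgets like these are what make a regression such as experiment #6's score drop an objective signal rather than a matter of taste.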

That means Flowdiff’s equivalent of autoresearch is an evaluation-driven semantic clustering loop. The artefacts are different, but the rhythm is the same: formulate a hypothesis, run the system, score the outcome, and iterate quickly.

The End State

The end state I am aiming for is a semantic review engine where the deterministic AST and IR pipeline does most of the intellectual work and the LLM acts as an optimiser over a constrained search space.

That is why the connection between Flowdiff’s public product page and autoresearch matters. The product promise is “turn raw diffs into review flows.” The research loop is how I keep improving that promise without turning the system into prompt soup.

In short, I am not trying to make Flowdiff “more AI.” I am trying to make its structural understanding strong enough that AI only has to solve the last mile.

Topics
AST · Code Review · Flowdiff · LLM Refinement · Semantic Diff
