Vague specs do not converge. Scalar loss functions do. If you can hand the agent a number that says “you are 0.66 wrong,” it will close the gap on its own.
Author: James Phoenix | Date: May 2026
The Moment I Spotted It
I was porting an 11+ web application from a standalone prototype folder into apps/web. The prototype was a mocked React app, no auth, no data layer, just the visual surface I wanted the real app to match. The job was to take each route, lift the components into the real app shell, wire them to real state, and end up with screens that looked identical to the prototype.
That sounds easy and it never is. The agent ports a route, claims parity, and the screen is “close” but the spacing is off, the avatar stack is shifted, the badge colour is one shade too cool. I push back, the agent guesses, we cycle. The bottleneck is not the agent’s capability. The bottleneck is the spec. “Match the prototype” is not a spec. It is a vibe.
I spent a whole day watching Codex and Claude go around in circles before I noticed what was actually breaking. The setup was a /goal brief: port these routes, match the prototype, come back when done. With no per-route ledger the agents would fix one screen, drift to another, regress the first while fixing the second, and resurface claiming “good progress” against a moving target. They were not lazy. They had no state. They were re-judging every screen from scratch every cycle against a vague memory of the prototype, and playwright-cli was happy to keep cycling on random components forever.
Then I noticed one of the agents run this, unprompted, in the middle of a port:
magick compare -metric AE \
apps/web/prototype-parity-artifacts/tutor-routes/students-prototype.png \
apps/web/prototype-parity-artifacts/tutor-routes/students.png \
apps/web/prototype-parity-artifacts/tutor-routes/students-diff.png
It printed 613106 (0.665263) to stderr and wrote a diff PNG to disk. Then it kept working. I asked what that was. It explained the metric, opened the diff PNG to see which regions were off, and adjusted the component. The next run printed a smaller number. Then smaller. Then it was done.
That is the whole post. I want to make sure every agent I work with does this, every time it ports a screen.
What the Command Actually Does
magick compare -metric AE runs a pixel-by-pixel comparison between two images using the Absolute Error metric. AE is the count of differing pixels. Two outputs come back:
- An integer count of pixels that differ (613106)
- A normalised fraction in parentheses (0.665263), which is the integer divided by the total pixel count
It also writes a third image to the path you give it. That image highlights the differing regions in red against a faded version of the source. You can open it and see exactly where the divergence lives.
It is a one-shot deterministic comparison. No model. No judgment. No flake beyond anti-aliasing and font rendering. The two images are either pixel-identical or they are not, and if they are not you get a number plus a map.
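If you want to gate on that output in a script rather than eyeball the terminal, the only fiddly part is that compare prints the metric to stderr and exits non-zero when the images differ. A minimal TypeScript sketch of a wrapper, assuming Node and ImageMagick 7 on the PATH; the pixelDiff name and return shape are mine, not anything the tool defines:
import { spawnSync } from "node:child_process";
interface PixelDiff {
  count: number;    // AE: absolute number of differing pixels
  fraction: number; // AE divided by the total pixel count
}
// Runs `magick compare -metric AE` and parses "613106 (0.665263)" off stderr.
// compare exits 1 when the images differ, so a non-zero exit is not an error here.
function pixelDiff(prototypePng: string, appPng: string, diffPng: string): PixelDiff {
  const result = spawnSync(
    "magick",
    ["compare", "-metric", "AE", prototypePng, appPng, diffPng],
    { encoding: "utf8" },
  );
  const match = (result.stderr ?? "").trim().match(/([\d.e+]+)\s*\(([\d.e-]+)\)/);
  if (!match) throw new Error(`unexpected compare output: ${result.stderr}`);
  return { count: Number(match[1]), fraction: Number(match[2]) };
}
A merge gate is then a single comparison, something like pixelDiff(proto, app, diff).fraction <= 0.02.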
Why This Shape Is Correct for Agents
Most “does this match” tasks I give agents look like this. I describe what I want, the agent produces something, I look at it and say “closer, but the header is too tall.” The agent guesses what I meant. We loop. Each cycle costs my attention, and the convergence rate is awful because the agent is optimising against my next message, not against the actual target.
Pixel diff inverts that loop. The target is the prototype PNG. The candidate is the app PNG. The loss is AE. The agent reads the loss, looks at the diff image to localise the error, edits the component, re-screenshots, reads the new loss. No human in the loop. The agent supervises itself because there is a scalar to minimise and a visual map of where to look.
This is the same dynamic that makes test suites valuable. A failing assertion is a scalar loss with a localiser. “Test X failed at line Y” is a finite target, so the agent converges. “Make it look like the prototype” is unbounded, so it wanders.
The Absolute Error metric is unusually well suited here because it is harsh and unforgiving. It does not normalise away small differences. Every off pixel is counted. That sounds like a disadvantage. It is the opposite. Harsh metrics are how you force the agent to actually fix the thing rather than rationalise about how it is “basically right.”
The Workflow I Now Insist On
Every prototype-to-app port now follows this loop:
- Snapshot the prototype. Run the prototype in a controlled Playwright session, navigate to the route, take a full-page screenshot, save it as <route>-prototype.png. Lock viewport, fonts, and any animation timing. This file is the spec, committed to the repo.
- Snapshot the app. Same harness, same viewport, same wait conditions, but pointed at the real apps/web route. Save as <route>.png. (A sketch of these two snapshot steps sits just after this list.)
- Diff. Run magick compare -metric AE on the pair, write the diff PNG, capture the metric value, and write the result back into a per-project ledger. In this project the raw value lives both as a <route>-diff.metric.txt sibling to the screenshots and as a note on the route entry inside apps/web/prototype-parity.progress.json.
- Threshold. Decide what “done” means. For a real app rendering on a real backend, pixel-identical is impossible. I set a normalised threshold (typically 0.02, sometimes tighter for marketing surfaces). Anything above threshold blocks merge.
- Loop. The agent reads the metric, opens the diff, edits the component, re-runs. The exit condition is mechanical, not a vibe check.
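For the two snapshot steps, the harness is not much more than this. A sketch rather than the project's actual script; the ports and paths are placeholders, and the settle step from the Caveats section below belongs between goto and screenshot:
import { chromium } from "playwright";
const VIEWPORT = { width: 1440, height: 900 }; // locked for every snapshot
// One camera for both subjects: fixed viewport, settled network, full-page shot.
async function snapshot(baseUrl: string, route: string, outPath: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: VIEWPORT });
  await page.goto(new URL(route, baseUrl).toString(), { waitUntil: "networkidle" });
  await page.screenshot({ path: outPath, fullPage: true });
  await browser.close();
}
// The prototype shot is the spec, the app shot is the candidate.
await snapshot("http://localhost:4000", "/students", "apps/web/prototype-parity-artifacts/tutor-routes/students-prototype.png");
await snapshot("http://localhost:3000", "/students", "apps/web/prototype-parity-artifacts/tutor-routes/students.png");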
The artefacts live in apps/web/prototype-parity-artifacts/<area>/. Same folder, same naming, every route. The discoverability matters: when the agent comes back to a route a week later, the prototype, the candidate, and the diff are all sitting in one place. No archaeology.
State Tracking Is Half the Trick
The scalar loss is one half of what makes this work. The other half is a single committed JSON file, apps/web/prototype-parity.progress.json, that holds the per-route ledger for the whole port. In this project it tracks 36 routes across three viewports each (mobile, tablet, desktop), with the current status histogram sitting at 21 migrating, 13 playwright-failed, and 2 tests-added.
Each entry looks roughly like this:
{
"route": {
"/": {
"appRoute": "/trace-learn",
"feature": "apps/web/features/student-dashboard",
"status": "playwright-failed",
"playwright": {
"viewports": ["mobile", "tablet", "desktop"],
"screenshots": [
{
"viewport": "desktop",
"prototype": ".../student-dashboard/desktop-prototype.png",
"app": ".../student-dashboard/desktop-app.png",
"diff": ".../student-dashboard/desktop-diff.png",
"result": "fail",
"notes": ["diff metric 405890 (0.256244)."]
},
{ "viewport": "tablet", "result": "fail", "notes": ["diff metric 317617 (0.403871)."] },
{ "viewport": "mobile", "result": "fail", "notes": ["diff metric ..."] }
]
},
"knownGaps": ["..."],
"implementedChanges": ["..."]
}
}
}
This is what I had been missing on day one. The agents under a /goal brief had no way to track which routes had passed and which had not, so they picked whichever screen most recently caught their attention, worked on that, and resurfaced claiming progress. Sometimes they fixed a route that was already passing. Sometimes they regressed a passing route while fixing a failing one. Sometimes they forgot a failing route existed entirely. playwright-cli will happily re-screenshot anything you point it at, so without a ledger there was no anchor that said “this one is done, leave it alone, go work on that one.” The bottleneck was state, not capability.
The JSON ledger fixes that. Before doing any work the agent reads prototype-parity.progress.json. That gives a full state vector for every route and every viewport: current status, latest AE per viewport, known gaps still outstanding, files touched, tests added. The agent sorts by AE descending, picks the worst-failing viewport, works on it, re-runs the snapshot, writes the new metric back into the ledger, and moves on. The list is the queue. The threshold is the exit condition. The orchestration loop becomes mechanical and the random component churn stops.
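The selection step the agent runs before touching anything is just as small. A sketch against the ledger shape shown above; the note-parsing helper is hypothetical and only works because the agent writes the metric back in a predictable format:
import { readFileSync } from "node:fs";
const ledger = JSON.parse(readFileSync("apps/web/prototype-parity.progress.json", "utf8"));
// Pull the normalised fraction out of a note like "diff metric 405890 (0.256244)."
function latestAe(notes: string[] = []): number | null {
  const m = notes.join(" ").match(/\(([\d.]+)\)/);
  return m ? Number(m[1]) : null;
}
// Flatten to (route, viewport, ae), drop anything already under threshold, sort worst-first.
const THRESHOLD = 0.02;
const queue = Object.entries(ledger.route as Record<string, any>)
  .flatMap(([route, entry]) =>
    (entry.playwright?.screenshots ?? []).map((shot: any) => ({
      route,
      viewport: shot.viewport,
      ae: latestAe(shot.notes),
    })),
  )
  .filter((item) => item.ae !== null && item.ae > THRESHOLD)
  .sort((a, b) => (b.ae as number) - (a.ae as number));
console.log(queue[0]); // the worst-failing viewport is the next piece of work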
This is the same idea as a test report. A test report is not just pass or fail for one test, it is a ledger over the whole suite that lets you target work and detect regressions. prototype-parity.progress.json is a test report for visual parity. Once that ledger exists, the agent stops wandering.
Three Layers, One Loop
The tactic is not the ImageMagick call on its own. It is the stack of three layers that the harness already has sitting on the shelf.
Playwright CLI is layer one. It is what gets you two comparable PNGs in the first place. Same viewport, same fonts, same wait conditions, same headless Chromium. Without Playwright the pair of images is not actually a fair comparison, and any metric you compute on top is noise. Playwright is the camera. Both screenshots have to come out of the same camera or nothing downstream works.
ImageMagick is layer two. magick compare -metric AE turns the two PNGs into a scalar plus a localiser. The scalar is the optimisation target. The localiser is the diff PNG. ImageMagick is the cheap, deterministic, local subprocess that gives the loop something to minimise. No model, no flake, no API bill.
The harness’s native image read is layer three, and this is the one people forget. Codex and Claude Code can both read PNGs directly. Once the diff is on disk, the agent opens it and sees the red regions. It can say “the divergence is concentrated in the right rail, the avatar stack is too far down, the badge is the wrong colour” because it is looking at the picture, not guessing from the number. The scalar says how wrong. The image says where and how.
Any one of these layers alone is weak. Playwright on its own gives you screenshots and no opinion about them. ImageMagick on its own gives you a number with no visual context. Native image read on its own gives you the vague vision-model failure mode covered in the next section. Stacked, they form a closed loop: take the picture, compute the loss, look at the diff, edit, repeat. The agent runs the whole thing without me.
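Stitched together, one cycle of that loop is a handful of lines. A sketch reusing the snapshot and pixelDiff helpers from the earlier sketches; the agent's component edit happens between cycles, and the diff PNG is what it opens whenever the answer comes back false:
// One parity cycle for a single route: re-shoot the app, recompute the loss,
// and report whether the route is under threshold. Repeat until true.
async function parityCycle(route: string, slug: string, dir: string, threshold = 0.02): Promise<boolean> {
  await snapshot("http://localhost:3000", route, `${dir}/${slug}.png`);
  const { count, fraction } = pixelDiff(
    `${dir}/${slug}-prototype.png`,
    `${dir}/${slug}.png`,
    `${dir}/${slug}-diff.png`,
  );
  console.log(`AE ${count} (${fraction}) for ${route}`);
  return fraction <= threshold;
}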
This is the broader pattern. New tactics in agentic coding are rarely a single new tool. They are usually a fresh combination of capabilities that were already in the harness, where each layer covers another layer’s blind spot.
Why Pixel Diffs Beat LLM Visual Judgment
The obvious alternative is to hand both screenshots to a vision model and ask “do these match.” I have tried it. It is worse for this job for three reasons.
It is non-deterministic. The same pair of images can return “matches” on one run and “differs in the header” on the next. That is the opposite of what you want when the metric is supposed to drive convergence.
It is not localised. The model says “the spacing is off” without telling you which spacing. The agent then guesses. With ImageMagick the diff PNG lights up the exact pixels that disagree.
It is slow and expensive relative to the value. A vision model call is hundreds of milliseconds and a few cents. ImageMagick is a local subprocess, free, and runs in tens of milliseconds. For a loop that the agent runs dozens of times per route, the cost compounds.
The vision model is the right tool for “does this image contain a cat.” It is the wrong tool for “does layout A match layout B,” because the question already has a precise mathematical answer.
Caveats
Pixel diff is brittle in some specific ways and you should know them.
Font rendering across machines differs. macOS, Linux, and headless Chromium can each anti-alias slightly differently. Generate both screenshots in the same Playwright container or your AE will never approach zero.
Animations and transitions need to be frozen. Disable CSS transitions, set animation duration to zero, or wait until specific elements have settled. Otherwise you are diffing two random moments of motion.
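In the Playwright harness that means a settle step between navigation and screenshot. A sketch of one; the CSS override is the blunt instrument, reduced-motion emulation the polite one, and the fonts wait covers the other common source of churn:
import type { Page } from "playwright";
// Freeze motion and wait for web fonts so two shots of the same state are pixel-stable.
async function settle(page: Page): Promise<void> {
  await page.emulateMedia({ reducedMotion: "reduce" });
  await page.addStyleTag({
    content:
      "*, *::before, *::after { animation: none !important; transition: none !important; caret-color: transparent !important; }",
  });
  await page.evaluate(() => document.fonts.ready.then(() => undefined));
}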
Data-driven content needs to be deterministic. Avatar initials, timestamps, random IDs. The prototype is usually static, the app usually reads from a fixture. Match the fixtures to the prototype’s content.
AE is not the only metric. ImageMagick also offers PAE, MAE, RMSE, and others. AE has the property I want most for this workflow: it is interpretable as a count of bad pixels. For perceptual closeness you might prefer SSIM via another tool. For prototype parity, AE is the right hammer.
Where I Want to Take This Next
The next extension I am thinking about is a small React tree parser I would run inside Claude Code or Codex to widen the unit of capture. Right now the ledger keys on route plus viewport, which works for screens whose entire visual surface is determined by the URL. It does not work for screens with state hiding behind the URL: a modal open or closed, an accordion expanded, an error toast visible, a form on step three rather than step one.
A tree parser would walk the React tree (either at runtime via a dev probe, or statically via AST against the source) and enumerate the reachable states. Each state becomes another key on the ledger entry. The Playwright run drives the app into the state, takes the snapshot, and compares against the prototype’s matching state.
The shape I have in mind looks roughly like this:
{
"route": {
"/dashboard": {
"states": {
"default": { "result": "pass", "ae": "1024 (0.0007)" },
"modal:create-task=open": { "result": "fail", "ae": "88412 (0.0612)" },
"panel:games=expanded": { "result": "fail", "ae": "41200 (0.0285)" },
"form:onboarding-step=3": { "result": "pass", "ae": "2100 (0.0014)" },
"modal:create-task=open + panel:games=expanded": { "result": "untested", "ae": null },
"modal:create-task=open + form:onboarding-step=3": { "result": "untested", "ae": null }
}
}
}
}
The state key is a deterministic descriptor of which toggles are on (modal:create-task=open, panel:games=expanded, form:onboarding-step=3), and combinations are explicit entries rather than implicit. Anything the parser enumerates but the run does not visit lands as "untested", which is the next paragraph’s point.
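None of this is built, but the Playwright half is easy to picture. A purely hypothetical sketch of driving the app into one of those state keys before the snapshot, assuming the components expose stable data attributes for the parser to target:
import type { Page } from "playwright";
// Hypothetical: take a ledger state key like
// "modal:create-task=open + panel:games=expanded" and drive the page into it.
async function applyState(page: Page, stateKey: string): Promise<void> {
  if (stateKey === "default") return;
  for (const part of stateKey.split(" + ")) {
    const [toggle, value] = part.split("="); // e.g. "modal:create-task" and "open"
    // Assumes the app renders hooks like data-state-toggle="modal:create-task=open".
    await page.click(`[data-state-toggle="${toggle}=${value}"]`);
  }
}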
That multiplies the snapshot count by the state cardinality, which is real cost. A route with five modals and three loading states goes from one screenshot to twenty per viewport. Storage, runtime, and flake all scale. So it is probably overkill for most routes. But for the surfaces that carry the most product weight, capturing state-aware parity is the difference between “the page loads” and “the page works.”
The same parser doubles as a test-coverage map. React is a tree, so walking it gives a structured enumeration of every reachable rendering state. Map the existing snapshots, component tests, and e2e tests onto that enumeration and the uncovered cells fall out by subtraction. The combinatoric explosion that makes exhaustive snapshotting expensive is the same explosion that hides bugs in untested state combinations. A modal open with an error toast visible and the form on step three is a state the test suite almost certainly never touches. The parser would tell you that, and it would tell you for every route at once.
I am flagging this as the next addition to the party, not something I have already built. If I do build it, the loop above stays the same. The parser just widens the ledger and makes the test gaps fall out for free.
The Conclusion
Give the agent a scalar loss anywhere you can. Build the snapshot. Capture the metric. Commit both to the repo. Make the loop mechanical. The agent will close the gap.
A note on the hierarchy. True scalars (pixel counts, distances, latencies, anything a deterministic computation produces) are what I actually want, and I love them. LLM-assisted scalars (numeric ratings from a judge model) I avoid: they look quantitative but the judge’s underlying decision is binary and the numbers just add noise on top. LLM-assisted binary verdicts (true/false) are acceptable as a fallback when nothing deterministic exists, but only as a fallback. Order of preference: true scalar, then LLM binary, then go fix the spec.
When I cannot give the agent a scalar, or at least a good LLM-as-judge assisted binary metric (true/false), I treat that as a spec smell, not a tooling limitation. There is almost always a metric hiding inside “does this match.” Find it, automate it, and the convergence problem dissolves.

