Parallel Worktrees Leak Dev Servers. Reap Them Idempotently.

James Phoenix

I ran my dev loop across a dozen git worktrees for months and never thought about what happened to a worktree’s servers when the worktree went away. The answer, it turned out, was nothing. They kept running. By the time I looked, I had more than sixty orphaned API servers and a handful of zombie Mintlify processes all pinned at 100% CPU, all polling the same Postgres and Temporal. The fix was not a smarter kill command. It was binding cleanup to the worktree lifecycle and making it idempotent at both ends.

Author: James Phoenix | Date: June 2026


The Fans Told Me First

The symptom was physical. My MacBook fans would spin up to a roar within twenty minutes of opening a terminal, before I had run anything heavy. The Docker VM sat at a load it had no business being at. I assumed it was the integration suite or some runaway build, so I ignored it for longer than I should have.

Eventually I ran the obvious thing: count the node processes. I expected a handful, one API server and one worker per worktree I was actively using, maybe four or five total. The machine had more than sixty node API servers running. Alongside them sat three or four mint processes, the Mintlify dev server for my API-reference docs, each one spinning at roughly 100% CPU and fighting over port 3802.

None of these belonged to anything I had open. They were ghosts. Each one was still holding its port, still opening connections to the shared local Postgres, still polling Temporal for work that would never come. Sixty idle-but-not-idle servers is enough to load a Docker VM into the ground, which is exactly what the fans had been telling me.

How A Worktree Orphans Its Servers

I run parallel development out of git worktrees, one per feature, each with its own deterministic port offset so the servers never collide. My pnpm dev spawns an API server, a worker, a web server, and sometimes a docs server, all with their working directory inside that worktree. On a clean exit, the dev script’s EXIT trap tears them down.

The phrase doing the damage there is “on a clean exit.” A worktree dies in plenty of ways that are not clean:

  • git worktree remove deletes the directory but knows nothing about the node processes that were running inside it.
  • git worktree prune reaps the metadata for a directory I already rm -rf’d, again touching nothing that was running.
  • A kill -9 on the terminal, or a crash, or closing the lid at the wrong moment, skips the EXIT trap entirely.

In every one of those cases the worktree’s directory vanishes and its servers keep running. They are now orphans: processes whose working directory no longer exists on disk. Multiply that by a few weeks of creating and removing feature worktrees, and you get sixty. The Mintlify case was its own flavour of the same disease. Mint loses the race for port 3802, fails to bind, but does not exit. It just spins. Start a fresh docs server a few times and you stack up zombies that each burn a core.
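The gap is easy to reproduce in isolation. A minimal sketch, assuming bash: dev.sh and the file names here are hypothetical stand-ins for a dev script, and sleep plays the part of the server. Killing the script with SIGKILL skips its EXIT trap, and the child it started keeps running.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: dev.sh and the file names are illustrative,
# and `sleep` stands in for an API server.
workdir="$(mktemp -d)"

cat > "$workdir/dev.sh" <<'EOF'
#!/usr/bin/env bash
dir="$1"
sleep 30 &                                   # the "server"
server=$!
# Teardown that only runs if the shell gets a chance to exit normally.
trap 'kill "$server" 2>/dev/null; touch "$dir/trap.ran"' EXIT
echo "$server" > "$dir/server.pid"
wait "$server"
EOF

bash "$workdir/dev.sh" "$workdir" >/dev/null 2>&1 &
dev=$!

# Wait until the script has installed its trap and published the server pid.
while [ ! -s "$workdir/server.pid" ]; do sleep 0.1; done
server="$(cat "$workdir/server.pid")"

kill -9 "$dev"                 # unclean exit: SIGKILL cannot be trapped
wait "$dev" 2>/dev/null

if [ -e "$workdir/trap.ran" ]; then trap_state="ran"; else trap_state="skipped"; fi
if kill -0 "$server" 2>/dev/null; then server_state="still running"; else server_state="stopped"; fi
echo "after kill -9: trap $trap_state, server $server_state"

kill "$server" 2>/dev/null     # clean up the orphan this demo created
rm -rf "$workdir"
```

The script reports the trap as skipped and the server as still running, which is the orphaning mechanism in miniature.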

Clean exit fires the EXIT trap and tears servers down; an unclean exit (kill -9, crash, worktree remove, prune) skips it and leaves orphans, which accumulate into 60+ servers still polling shared Postgres and Temporal

The Fix Is Idempotent Cleanup Bound To The Lifecycle

The instinct is to write a one-off “kill everything” script and run it when the fans get loud. That is a chore I would forget, and a blunt instrument that could kill a worktree I actually cared about. The better move is to attach cleanup to the two moments in a worktree’s life where it is provably safe, and make each one idempotent so running it twice is identical to running it once.

Three cleanup guards along a worktree’s lifecycle: sweep orphans on setup via reap-orphan-servers.sh, guard the port on each service start via kill-stale-service.sh, and reap the worktree’s own servers on teardown via manage.sh

At teardown, reap the worktree’s own servers before deleting it. My scripts/worktree/manage.sh remove now calls a reap_worktree_processes step before git worktree remove. It finds every dev runtime whose working directory is inside that worktree and stops it, SIGTERM then SIGKILL for survivors. The worktree is about to cease to exist, so killing its servers cannot hurt anyone. This is the clean path, and it means a normally-removed worktree never becomes the problem in the first place.
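The “SIGTERM then SIGKILL for survivors” step can be sketched as follows. This is not the real reap_worktree_processes: stop_pids and its grace-period argument are hypothetical, and the sketch assumes the pids of the worktree’s servers have already been collected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop_pids <grace-seconds> <pid>...
# Asks each process to exit with SIGTERM, waits up to the grace period,
# then sends SIGKILL to whatever is still alive.
stop_pids() {
  local grace="$1"; shift
  local pid alive waited=0

  for pid in "$@"; do
    kill -TERM "$pid" 2>/dev/null || true    # already-dead pids are fine
  done

  while [ "$waited" -lt "$grace" ]; do
    alive=0
    for pid in "$@"; do
      kill -0 "$pid" 2>/dev/null && alive=1  # signal 0 only probes existence
    done
    [ "$alive" -eq 0 ] && return 0           # everyone exited politely
    sleep 1
    waited=$((waited + 1))
  done

  for pid in "$@"; do
    kill -KILL "$pid" 2>/dev/null || true    # survivors get no second chance
  done
}

# Example (hypothetical pids): stop_pids 5 12345 12346
```

Every kill tolerates a pid that is already gone, so calling the helper a second time with the same pids changes nothing.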

At setup, sweep the orphans that escaped. Teardown only helps the worktrees I remove the sanctioned way. The kill -9s and crashes still leak. So pnpm dev runs a preflight, scripts/dev/reap-orphan-servers.sh, that sweeps any dev server whose worktree directory is gone:

# Only octospark worktree checkouts.
case "$cwd" in
  */.worktrees/*|*/.claude/worktrees/*) ;;
  *) continue ;;
esac
# An orphan is one whose worktree directory is gone. A live
# worktree's cwd still resolves, so it is skipped.
[ -d "$cwd" ] && continue

That [ -d "$cwd" ] && continue is the whole safety argument. A live worktree’s directory still resolves, so its servers are skipped. The only things this reaps are processes pointing at a directory that no longer exists, which by definition cannot be in use by anyone. That is why it is safe to run on every single pnpm dev, and why it ships with a --dry-run flag so I can look before it acts.
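Those two checks sit inside a loop over process working directories. A minimal sketch of such a loop, assuming the input is lsof -d cwd -Fpn field output (a p<pid> line per process, followed by an n<path> line for its working directory). find_orphans is a hypothetical name, and the sketch only lists candidates the way a dry run would; the runtime filter and the actual kill are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lsof field output on stdin and prints
# "pid<TAB>cwd" for every process whose worktree directory is gone.
find_orphans() {
  local line pid="" cwd
  while IFS= read -r line; do
    case "$line" in
      p*) pid="${line#p}" ;;                 # start of a new process record
      n*)
        cwd="${line#n}"
        # Only worktree checkouts are candidates.
        case "$cwd" in
          */.worktrees/*|*/.claude/worktrees/*) ;;
          *) continue ;;
        esac
        # A live worktree's cwd still resolves, so it is skipped.
        [ -d "$cwd" ] && continue
        printf '%s\t%s\n' "$pid" "$cwd"
        ;;
    esac                                     # other field lines are ignored
  done
}

# Dry-run style usage: list orphans without touching them.
#   lsof -d cwd -Fpn 2>/dev/null | find_orphans
```

Because the function only reads text on stdin, it can be exercised with hand-written input before it is ever pointed at real processes.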

Per service, kill the stale holder before starting a new one. The docs server gets one more guard. The dev script for it now runs kill-stale-service.sh code-sdk-docs first, which catches both the process holding port 3802 and any zombie mint that lost the race and is spinning:

code-sdk-docs)
  collect_listeners "${DOCS_SDK_PORT:-3802}"
  collect_command_contains "@mintlify/cli/bin/start.js"
  ;;

Start the docs server twice and the second start cleans up after the first. Idempotent.
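One way the two collectors could work, as a sketch under assumptions: only the helper names and the case arm come from the excerpt above. The bodies, the PIDS array, and kill_stale_service are hypothetical. The point is that the port holder and a spinning process that never bound the port are found by different means and stopped together.

```shell
#!/usr/bin/env bash
# Assumed implementations: only the two helper names and the case arm come
# from the excerpt above. PIDS and kill_stale_service are hypothetical.
PIDS=()

# Add every pid listening on the given TCP port.
collect_listeners() {
  local port="$1" pid
  while read -r pid; do
    [ -n "$pid" ] && PIDS+=("$pid")
  done < <(lsof -nP -iTCP:"$port" -sTCP:LISTEN -t 2>/dev/null)
}

# Add every pid whose command line contains the literal string, whether or
# not it ever bound a port. Skips this shell itself.
collect_command_contains() {
  local needle="$1" pid args
  while read -r pid args; do
    [ "$pid" = "$$" ] && continue
    case "$args" in
      *"$needle"*) PIDS+=("$pid") ;;
    esac
  done < <(ps -A -o pid= -o args=)
}

kill_stale_service() {
  local service="$1" pid
  PIDS=()
  case "$service" in
    code-sdk-docs)
      collect_listeners "${DOCS_SDK_PORT:-3802}"
      collect_command_contains "@mintlify/cli/bin/start.js"
      ;;
  esac
  for pid in "${PIDS[@]}"; do
    kill -TERM "$pid" 2>/dev/null || true    # duplicates and dead pids are harmless
  done
}

# Example: kill_stale_service code-sdk-docs
```

When nothing is stale, both collectors come back empty and the function does nothing, which is what makes it safe to run before every start.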

Working Directory Is The Right Identity

The detail I am most pleased with is that none of this matches on port numbers or process names from a hardcoded list. It matches on the working directory the process was launched from, read in a single lsof -d cwd -Fpn pass. A port is ambiguous: it gets reused. A process name is ambiguous: my editor and my git hooks are also node. But the working directory ties a process back to the exact worktree that spawned it, and “is that directory still on disk?” is an unambiguous, cheap question with a safe answer.

So the kill decision becomes a pure function of filesystem state. A runtime filter (*node*|*tsx*|*@mintlify*|*next*|*esbuild*) keeps it from touching a shell or an editor that happens to sit in the same tree. Beyond that, the logic never guesses. It asks the filesystem who is an orphan and trusts the answer.
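Written out, the decision is small enough to read at a glance. A minimal sketch, assuming the command line and working directory of each process are already known; is_dev_runtime and should_reap are hypothetical names, and only the filter pattern is taken from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. The pattern list is the runtime filter quoted above.
is_dev_runtime() {
  case "$1" in
    *node*|*tsx*|*@mintlify*|*next*|*esbuild*) return 0 ;;
    *) return 1 ;;
  esac
}

# should_reap <command> <cwd>
# Succeeds only for a dev runtime whose working directory no longer exists.
# The answer depends on nothing but the arguments and the filesystem.
should_reap() {
  local command="$1" cwd="$2"
  is_dev_runtime "$command" || return 1      # shells and editors are never touched
  [ -d "$cwd" ] && return 1                  # live worktree: skip
  return 0                                   # directory is gone: orphan
}

# Example (hypothetical path):
#   should_reap "node dist/server.js" "/repo/.worktrees/old-feature" && echo orphan
```

Given the same filesystem state, the function always returns the same answer, so a repeated sweep cannot reach a different verdict about the same process.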

After filtering to dev runtimes, each process is judged by one question: is its working directory still on disk? If gone, it is an orphan and safe to reap; if it resolves, it is a live worktree and is skipped

The Lesson: Cleanup Belongs To Events, Not To Memory

The real bug was never the sixty servers. It was that I had made teardown depend on my discipline and on a clean exit, two things that fail constantly. Discipline is not a mechanism. The moment cleanup relies on a human remembering, or on a process always exiting politely, it has already failed; you just have not counted the corpses yet.

The fix is the same shape as everything else I have pushed down into the floor of the repo. Don’t ask the operator to clean up. Bind cleanup to the lifecycle events that already exist, removing a worktree and starting a dev server, and make each one idempotent so it is always safe to run again. The reaper sweeps on every pnpm dev. The teardown reaps before every removal. There is no state to track, no count to watch, no fans to listen for. Run it a hundred times and the end state is identical: zero servers that belong to a worktree that is gone.

That is the property I want from every piece of dev infrastructure: not “it cleans up,” but “it is always safe to run, and running it always leaves the system correct.” Idempotence is what lets you stop thinking about a problem entirely.

