When Parallelism Makes Tests Slower

James Phoenix

A thousand integration tests, sixteen cores, and not much faster. The fix was the easy part. The lasting lesson is that your test suite is the feedback loop your coding agents iterate against, and scaling agentic development is mostly scaling that loop.

I crossed a thousand integration tests on Octospark, turned on Vitest parallelism, and watched the suite get worse. Sixteen cores pinned, fans audible, and the wall clock barely better than running them in a line. The reflex in that moment is to reach for more: more workers, a faster DOM, a bigger machine. Every one of those would have been wrong. The suite was not slow because of the tests. It was slow because of everything that happens around each test, and parallelism does not remove that cost. It copies it onto every core.

This is the part of test scaling nobody warns you about. A parallel runner is a tax-collection machine. Whatever fixed cost sits at the top of each file, the runner pays it once per file, on whichever worker picks that file up, with many files paying at the same moment. Point sixteen cores at a two-second import and you have not removed a single import. You have built a very efficient way to burn CPU on them. And there is a quieter, worse cost underneath that one: past a point, turning the dial up does not just waste CPU, it trades wall-clock for flake.

Here is what actually happened, what I measured, and the small number of changes that moved the needle.

One thing before the numbers, because it changes why any of this matters. I did not spend days on this because a slow suite is annoying. I spent it because most of my development now runs through coding agents, and an agent is only as good as the loop it iterates against: how fast it gets a signal, how far it can trust that signal, and whether what it learns survives to the next task. The test suite is that loop. Make it slow and flaky and you have throttled every agent you point at the codebase. Make it fast and trustworthy and they compound. The front half of this piece is the unglamorous engineering that makes the loop fast. The back half is how I made it trustworthy enough to hand to a fleet.

Measure the phases before you touch the worker count

Vitest already tells you where the time goes. Every run prints a line like:

Duration 92s (transform 42s, setup 258s, collect/import 154s, tests 329s, environment 45s)

The first number is wall-clock. The phases in parentheses are sums across workers, which is exactly what you want: they are attributable. Wall-clock on a busy machine is noise (I watched load average hit 12.6 on a 16-core box during these runs); the phase sums are signal.

Summed on the integration suite (180 files, 1,203 tests, roughly 10 workers), the breakdown was blunt:

tests: 329s
setup: 258s
import: 154s
environment (jsdom): 45s
transform: 42s

Phase sums across workers for 180 files: tests are about 40% of the work, while setup and import are the fixed per-file cost.

Setup and import dwarf the tests. Parallelism runs that fixed cost concurrently; it cannot remove it.

The tests themselves were about 40% of the summed work (329s of 828s). Setup and import together came to 412s, roughly half of it: the fixed cost of standing each file up before a single assertion ran. No number of extra workers shrinks that. They just run it concurrently and call it progress.
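To make the breakdown repeatable, here is a minimal sketch that parses the parenthesised phases out of a Duration line and ranks them by share of the summed work. The function name `phaseShares` and the parsing rules are my own assumptions about the line format shown above, not part of Vitest's API.

```typescript
// Hypothetical helper: turn a Vitest "Duration" summary line into ranked phase shares.
// Only the parenthesised phases are read; the leading wall-clock number is ignored.
type PhaseShare = { phase: string; seconds: number; share: number }

export function phaseShares(durationLine: string): PhaseShare[] {
  const inside = durationLine.match(/\(([^)]*)\)/)
  if (!inside) return []

  const phases = inside[1]
    .split(',')
    // Each part looks like "setup 258s" or "transform 420ms".
    .map((part) => part.trim().match(/^(.+?)\s+([\d.]+)(ms|s)$/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => ({
      phase: m[1],
      seconds: m[3] === 'ms' ? Number(m[2]) / 1000 : Number(m[2]),
    }))

  const total = phases.reduce((sum, p) => sum + p.seconds, 0)

  // Largest phase first, each with its fraction of the summed work.
  return phases
    .map((p) => ({ ...p, share: total === 0 ? 0 : p.seconds / total }))
    .sort((a, b) => b.seconds - a.seconds)
}
```

Fed the line from this run, it puts tests first at roughly 40% of the summed time, with setup and import together at roughly 50%. If the top entry is not tests, the worker count is the wrong dial.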

That one breakdown reframed the entire problem. I stopped trying to parallelise harder and started trying to make each file cheaper to start.

The barrel import was the churn

Octospark’s web integration project runs pool: 'forks', isolate: true. That isolation is deliberate: every test file gets a clean module graph, which kills a whole class of cross-test contamination bugs. It also means every file re-evaluates its imports from scratch. Whatever you import at the top, you pay for, per file, 71 times over.

I instrumented the cost and the culprit was a single line. The test context imported the project’s @tx-agent-kit/testkit barrel just to pull one helper. That barrel eagerly drags in the redis client, the http-vcr and msw mocking stack, and every fixture in the package. Measured cost: about 2,023ms per file. The shared-API health check I had assumed was the villain was 9ms. It was never the tests. It was the front door.

The fix was not clever. It was narrow:

// before: imports redis, msw, every fixture, per file, 71 times
import { createDbAuthContext, type ApiFactoryContext } from '@tx-agent-kit/testkit'

// after: import only the seam you actually use
import { createDbAuthContext } from '@tx-agent-kit/testkit/dist/db-auth-context.js'
import type { ApiFactoryContext } from '@tx-agent-kit/testkit/dist/api-factories.js'

That moved the summed setup phase from 199.4s to 104s, a 48% cut, about 1,150ms saved on every one of those 71 files. The same disease lived in the worker and API suites, where the testkit barrel transitively pulled the entire core graph (dominated by the social-provider HTTP adapters, ~0.9s) at module load. Converting those reaches to import type plus a lazy import() inside the functions that needed them halved the barrel’s cold import, 1,224ms to 610ms, and took the API import phase from 36.1s to 29.4s.

Before and after: importing the narrow seam instead of the testkit barrel cut the web setup phase from 199s to 104s.

One import changed. The barrel was the front door the whole suite paid to walk through, 71 times over.

Barrels are a developer-ergonomics feature that quietly becomes a performance tax the moment you put one at the top of an isolated, parallelised file. The bigger your suite, the more cores you have, the more that tax compounds.
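The lazy import() move mentioned above can be sketched as a load-once wrapper. This is an illustration under assumptions, not the code from the repo: `lazyOnce`, `publish`, and the `Adapters` shape are hypothetical, and the loader here is a stand-in so the example runs on its own. In a real package the loader would be a dynamic `import()` of the heavy module, paired with `import type` at the top of the file so the types stay free.

```typescript
// Hypothetical helper: defer a heavy module until first use, and load it only once.
export function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined
  return () => {
    if (cached === undefined) cached = load()
    return cached
  }
}

// Stand-in for a heavy module graph. A real loader would look like
// `() => import('./social-adapters.js')` (a hypothetical path).
type Adapters = { post: (text: string) => string }

export let loadCount = 0 // counts how many times the "module" is evaluated

const loadAdapters = lazyOnce<Adapters>(async () => {
  loadCount += 1
  return { post: (text) => `posted:${text}` }
})

// Importing this file costs nothing. Only the first call to publish() pays
// for the load, and later calls reuse the cached promise.
export async function publish(text: string): Promise<string> {
  const adapters = await loadAdapters()
  return adapters.post(text)
}
```

The design choice is that test files which never call `publish` never evaluate the heavy graph, which is the property that matters under per-file isolation.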

Your “parallel” suite might be a serial loop in a costume

The unit suite told a different and more embarrassing story. It ran in about 65 seconds and burned roughly 181 seconds of CPU doing it. On a 16-core machine, that is around 3x utilisation when it should have been closer to 16x. The cores were idle and the clock was slow at the same time, which is the signature of a serialised orchestrator.

The cause was a shell loop. The “parallel” script walked 26 packages and invoked the build tool once per package, in sequence:

# before: 26 sequential invocations, each a fresh, blocking call
for pkg in $(list_packages); do
  turbo run test --filter="$pkg"
done

Turbo can parallelise a dependency graph beautifully. It cannot do anything with a graph you feed it one node at a time. Collapsing that into a single invocation let it schedule the whole thing:

# after: one graph, turbo schedules it across cores
turbo run test $(list_packages | sed 's/^/--filter=/') --output-logs=errors-only

Sixty-five seconds became about eighteen, a 3.3x win, with no change to which tests ran. The identical pattern was hiding in the CI lint step, where 23 independent invariant checks were chained with &&, paying 23 sequential Node startups. Running them concurrently took that step from ~2,850ms to ~930ms.

The lesson stings a little: I had “parallel” in the script name and a serial loop in the body. Before you add workers to the runner, check that the thing calling the runner is actually parallel.
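The utilisation check that exposed the serial loop is simple arithmetic, sketched here as a helper. The name `coreUtilisation` is hypothetical; the inputs are the CPU seconds, wall-clock seconds, and core count quoted above.

```typescript
// Hypothetical helper: how many cores were busy on average, and what fraction
// of the machine that represents.
export function coreUtilisation(
  cpuSeconds: number,
  wallSeconds: number,
  cores: number,
): { busyCores: number; fraction: number } {
  const busyCores = cpuSeconds / wallSeconds
  return { busyCores, fraction: busyCores / cores }
}

// The unit suite before the fix: 181 CPU-seconds over 65 wall-clock seconds on 16 cores.
export const before = coreUtilisation(181, 65, 16)
```

For the unit suite that works out to a little under three busy cores on average, well below a fifth of the machine. A low fraction alongside a slow clock points at the orchestrator rather than the tests.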

More workers cannot beat a shared-server ceiling

Here is the one most people get wrong, and the one I would build the whole talk around.

Every web integration test in Octospark talks to a single shared API server, on a single Node event loop. You can have ten Vitest workers firing requests, but they all funnel through that one process. And that process had a bottleneck I had no idea existed: its Effect PgClient pool was silently capped at 10 connections, because the code that built it never passed maxConnections. The DB_POOL_MAX=150 env var and Postgres’s max_connections=1000 were both irrelevant. Every request queued behind ten slots.

I only found it by running a single-variable benchmark, the change-one-thing-and-measure loop you would run for any optimisation. At a concurrency of 160, a pool of 10 sustained 71.7% success; a pool of 25 hit 100% and roughly 3.7x the throughput, while Postgres itself sat nearly idle. The ceiling was never the database. It was a default in our own client.

This is the structural point. Past the concurrency ceiling of whatever shared resource your tests funnel through, the connection pool, the event loop, the single server, adding test workers does nothing but deepen the queue. Worse, this was a real production throttle, not a test artifact. The same pool served live traffic. The flaky, slow suite was the symptom that surfaced a prod bug.
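A toy model shows why a small pool turns extra concurrency into failures rather than throughput. It is deliberately crude and every number in it is invented for illustration. It does not reproduce the benchmark above. It assumes all requests arrive at once, each holds a connection for a fixed time, and a request counts as failed if it finishes after a fixed deadline.

```typescript
// Toy queueing model (hypothetical, not the benchmark from the article).
// Requests are served in waves of `poolSize`; each wave takes `serviceMs`.
export function simulatePool(opts: {
  concurrency: number
  poolSize: number
  serviceMs: number
  deadlineMs: number
}): { successRate: number; lastFinishMs: number } {
  const { concurrency, poolSize, serviceMs, deadlineMs } = opts
  let succeeded = 0
  let lastFinishMs = 0

  for (let i = 0; i < concurrency; i++) {
    // Request i waits for every earlier wave to release its connections.
    const wave = Math.floor(i / poolSize) + 1
    const finishMs = wave * serviceMs
    if (finishMs <= deadlineMs) succeeded += 1
    lastFinishMs = Math.max(lastFinishMs, finishMs)
  }

  return { successRate: succeeded / concurrency, lastFinishMs }
}

// Same load, same deadline, only the pool size changes.
export const smallPool = simulatePool({ concurrency: 160, poolSize: 10, serviceMs: 100, deadlineMs: 1000 })
export const largerPool = simulatePool({ concurrency: 160, poolSize: 25, serviceMs: 100, deadlineMs: 1000 })
```

In this model the pool of 10 needs sixteen waves and the later ones miss the deadline, while the pool of 25 finishes everything in seven. Adding callers changes nothing except how many requests sit in the late waves.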

If your integration tests share a server, find that server’s ceiling before you scale the runner. The runner is almost never the thing that is saturated.

Wall-clock and flakiness pull in opposite directions

There is a second cost to turning the worker count up, and it is not CPU. It is reliability. Wall-clock time and flakiness sit at opposite ends of the same dial. Crank parallelism and you buy a faster clock by selling reliability, because the thing that makes a suite fast under load is exactly the thing that makes async tests miss their deadlines.

A two-by-two of speed against reliability: cranking workers past the shared-server ceiling slides you from fast-and-reliable into fast-and-flaky.

The worker count is a reliability dial too. Past the ceiling, it trades the top-right for the bottom-right.

Any test that awaits a real round-trip, a mutation, a refetch, a render that depends on a server response, carries an implicit deadline. Comfortably within budget when the machine is quiet, it now has to share one saturated API and event loop with every other worker. The awaited thing takes longer, the deadline does not move, and the test fails. It is not broken. It lost a race for a contended resource.

The signature is unmistakable once you have seen it: the failing test rotates between siblings on every run. In this codebase the heaviest web file, with its real seed, invite, and refetch round-trips, was the canary. Under load its post-mutation refetch did not surface the new row inside the 15-second budget, and whichever sibling happened to be mid-flight lost the dice roll. Same code, a different victim each run. That is contention, not a bug in any one test.

You do not fix it by lengthening timeouts, which only hides it and slows the green path. You fix it by reducing the contention and by waiting on the right signal:

Move tests down the pyramid. Every case you can answer with a unit test is one fewer file funneling through the shared server. I pulled redundant route and billing cases that exhaustive unit tests already covered, and demoted zero-seed component tests off the integration project entirely. Fewer integration tests is not just faster, it is less contention, which is less flake.
Wait on a signal, not a duration. await findByText(email) polls until the element appears and resolves as soon as it does. waitFor(() => getByText(email)) does the same job more verbosely (findBy* is getBy* wrapped in waitFor), while fixed setTimeout sleeps wait on slack you guessed at. The prefer-find-by rule in eslint-plugin-testing-library flags the waitFor(() => getBy...) form so async waits have one idiom, and banning fixed sleeps encodes the intent: wait for the thing to happen, not for a number to elapse. Correct waits recover real wall-clock and stop compounding the contention.
Raise the ceiling so the awaited thing stays fast under load. The pool fix above is also a flake fix: a request that no longer queues behind ten slots returns inside the budget.
Tune workers to the ceiling, not to the core count. There is an optimum where the shared resource is busy but not saturated. Past it you are paying reliability for wall-clock you are not even getting back.

The trap is treating “max out the workers” as free. It is a trade, and on a suite full of real async it is often a bad one.

The slow that looks like code but is really infrastructure

Some of the worst time sinks were not in the suite at all.

The same commit, run alone on the same machine, took 69.7s against a freshly cleaned database and 119s against a bloated one, with the tests phase climbing from 236s to 298s on byte-identical code. Nothing in the repo had changed. The CI Postgres had quietly accumulated 2,217 leaked per-test-file schemas from killed runs, and a bloated catalog slows planning for every query on the server. Garbage-collecting it back to 97 schemas restored the baseline. The per-job cleanup only dropped schemas older than 24 hours, which is far too lax when a bad night creates hundreds in an hour.
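The age cutoff is the whole policy, so it is worth seeing how much it changes the outcome. This is a minimal sketch under assumptions: the `test_` prefix, the `SchemaInfo` shape, and the idea that a creation time is available per schema are all hypothetical, and the function only selects names. Actually dropping the schemas is left to whatever talks to Postgres.

```typescript
// Hypothetical shape: one row per schema, with the time the test run created it.
type SchemaInfo = { name: string; createdAt: Date }

// Select leaked per-test schemas older than the cutoff. Schemas that do not
// carry the test prefix (an assumed naming convention) are never touched.
export function schemasToDrop(
  schemas: SchemaInfo[],
  now: Date,
  maxAgeHours: number,
  prefix = 'test_',
): string[] {
  const cutoffMs = now.getTime() - maxAgeHours * 60 * 60 * 1000
  return schemas
    .filter((s) => s.name.startsWith(prefix) && s.createdAt.getTime() < cutoffMs)
    .map((s) => s.name)
}
```

With a 24-hour cutoff, a schema leaked three hours ago survives the next job and every job after it until tomorrow. With a much shorter cutoff it is gone on the next run, as long as the cutoff stays longer than the slowest legitimate run.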

Then there was the optimisation that was a pessimisation in disguise. To cut CI setup time I cached the pnpm store and the turbo cache through the standard hosted cache action. On a hosted runner that is free money. On our self-hosted Mac Studio it was a disaster: the store already lives on local disk, so shipping it out to the cache service and back made the setup step take five minutes where a clean local install plus build took thirty-six seconds. The derived estimate said it would help. The real topology said the opposite. I reverted it. An optimisation is just a pessimisation that has not met your hardware yet.

And the contention multiplier: four CI runs from four branches all landed on the one self-hosted machine and fought over the single shared Postgres, because the workflow’s concurrency group was keyed per-branch and never serialised across them. I caused some of that myself by opening several test-infrastructure PRs at once, each triggering its own integration run. On shared infrastructure, your own throughput becomes your own noise.

So, happy-dom?

The obvious lever I did not pull was swapping jsdom for happy-dom. It is the standard advice, and it came up mid-run. But the jsdom environment phase was only 45 seconds, and 74 of 75 web files genuinely render components, so the environment was already doing real and necessary work. Swapping the DOM engine to shave a number I had not first proven was the bottleneck is how you trade a known cost for an unknown regression. Measure, then swap if the phase breakdown actually points there. It did not.

Zoom out: the suite is the agent’s feedback loop

Everything above was in service of one thing: a loop an agent can iterate against without lying to itself. Once you run a fleet of coding agents instead of typing every change yourself, five properties of that loop matter more than the model.

Fast where you iterate, exhaustive where you commit. An agent editing one package does not need 1,200 tests to know whether it broke something; it needs the tests for what it touched, in seconds. So local iteration runs affected-only. The runner asks turbo’s native affected graph which packages changed since the base ref (turbo run test:integration --filter=...[<ref>] --dry-run=json), maps those packages to the integration projects they belong to, and runs only those. A single-package edit drops from a ninety-second full run to a few seconds. An agent does dozens of those per task, so the saving is the task.

Terminal: the affected runner prints its plan (api, web) before running, and notes that CI still runs the full suite.

--plan shows the agent exactly what its local loop will cover before it runs. The full suite still runs in CI.

The trap is letting that fast proxy become the gate. It must not. CI always runs the full suite. The asymmetry is the entire design: local is a latency-optimised proxy, CI is the truth, and you never collapse the two. An agent will happily declare victory on a narrow green and merge a break two packages over. The full CI run is what stops it.

Make the fast loop fail safe, because an agent will exploit any gap. The affected runner falls back to the full suite on any ambiguity at all: no base ref, a detached HEAD, a turbo or JSON failure, or a change to any sentinel path (the lockfile, the test harness, shared config). It always includes the core api project so a narrow change can never run with zero core coverage. And a --plan flag prints the decision (full, or the exact affected project list) without running anything, so the agent can see what its loop will cover before it trusts it. The rule is simple: the loop is allowed to be fast, never allowed to be wrong. A human reads the runner once and trusts it. An agent runs it a thousand times and will find every case where fast quietly meant incomplete, so those cases all have to resolve to “run everything.” Letting turbo compute the affected graph rather than parsing git ourselves is the same lesson in miniature: an earlier attempt hand-rolled a git-porcelain parser and it was subtly wrong. Change detection is a solved problem the build tool already does correctly.
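The fail-safe rules above reduce to a small pure decision function. This is a sketch of the shape, not the actual runner: the sentinel list, the package-to-project map, and the names `planRun` and `Plan` are hypothetical, and treating an unmapped package as a reason to run everything is my own assumption about what "any ambiguity" should cover.

```typescript
type Plan = { mode: 'full'; reason: string } | { mode: 'affected'; projects: string[] }

type Inputs = {
  baseRef: string | null
  detachedHead: boolean
  affectedPackages: string[] | null // null when turbo or its JSON output failed
  changedFiles: string[]
}

// Hypothetical values for illustration only.
const SENTINEL_PATHS = ['pnpm-lock.yaml', 'vitest.config.ts', 'packages/testkit/']
const PACKAGE_TO_PROJECT: Record<string, string> = {
  '@app/api': 'api',
  '@app/web': 'web',
  '@app/worker': 'worker',
}
const CORE_PROJECT = 'api'

export function planRun(input: Inputs): Plan {
  const { baseRef, detachedHead, affectedPackages, changedFiles } = input

  // Every ambiguous case resolves to the full suite.
  if (!baseRef) return { mode: 'full', reason: 'no base ref' }
  if (detachedHead) return { mode: 'full', reason: 'detached HEAD' }
  if (affectedPackages === null) return { mode: 'full', reason: 'turbo or JSON failure' }

  const sentinel = changedFiles.find((file) =>
    SENTINEL_PATHS.some((s) => file === s || (s.endsWith('/') && file.startsWith(s))),
  )
  if (sentinel) return { mode: 'full', reason: `sentinel path changed: ${sentinel}` }

  // The core project is always included, so a narrow run never has zero core coverage.
  const projects = new Set<string>([CORE_PROJECT])
  for (const pkg of affectedPackages) {
    const project = PACKAGE_TO_PROJECT[pkg]
    if (!project) return { mode: 'full', reason: `unmapped package: ${pkg}` }
    projects.add(project)
  }
  return { mode: 'affected', projects: Array.from(projects).sort() }
}
```

Keeping the decision pure means a plan flag can print the result without running anything, and the fallback cases can be unit tested one by one.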

The agent’s local loop runs only the affected tests in seconds; CI always runs the full suite, and the affected runner fails safe to a full run on any ambiguity.

The fast loop is a proxy, the full CI run is the truth, and the affected runner is built to fall back to full whenever it is unsure.

Turn the painful investigations into skills, so the fleet stops repeating them. The flake hunt in this codebase cost a night of chasing wrong hypotheses. That cost should be paid once. So it is now a skill, fix-test-flake, that any agent reads before it touches a flaky test: reproduce under load not in isolation, instrument the real state instead of guessing, fix by signal not by duration, validate across five runs, and remember that local-green is not CI-green. It carries a root-cause catalog (the user.type keystroke drop, the requestAnimationFrame autofocus race, the react-select portal timing, the shared-singleton bleed, the auth-clear-on-network-error) and an infra catalog (leaked schemas, the pool cap, bcrypt on the event loop). Its companion, speed-up-test-suite, encodes the levers from the front half of this article. Without these, every agent re-derives the same diagnosis from zero. With them, it starts where the last investigation ended. That is the actual compounding in compound engineering: a one-time discovery becomes a permanent capability every future agent inherits.

Encode the rules as lint, not lore, because an agent has no memory of the fight. An agent that just fixed a flake will, on the very next task, cheerfully write the anti-pattern that caused it, because nothing in its context remembers the lesson. A lint rule remembers. So the hard-won fixes became guardrails scoped to test files: eslint-plugin-testing-library’s prefer-find-by flags the verbose waitFor(() => getBy...) form so async waits have one idiom, the async-query rules force you to await, and a no-restricted-syntax ban kills fixed setTimeout sleeps in tests. The contention-robust dialog interactions became shared helpers instead of copy-paste. “We learned not to do X” is worthless to a stateless agent. “The lint gate rejects X” is the version that holds.
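Those three guardrails might look like this as a rules block. Treat it as a sketch: it assumes eslint-plugin-testing-library is already registered under the `testing-library` namespace in the surrounding config, the file globs are guesses, and the exact name of the async-query rule depends on the plugin version (`await-async-queries` in recent majors).

```typescript
// Rules intended for test files only. The plugin itself must be registered
// elsewhere in the ESLint config; this object only carries the rule settings.
export const testFileRules = {
  // Prefer findBy* over the verbose waitFor(() => getBy*) form.
  'testing-library/prefer-find-by': 'error',
  // Async queries must be awaited or otherwise handled.
  'testing-library/await-async-queries': 'error',
  // Ban fixed sleeps: any direct setTimeout(...) call in a test file is an error.
  'no-restricted-syntax': [
    'error',
    {
      selector: "CallExpression[callee.name='setTimeout']",
      message: 'Wait on a signal (findBy*, waitFor) instead of a fixed sleep.',
    },
  ],
}

// Flat-config entry scoping the rules to test files (globs are an assumption).
export const testFileOverride = {
  files: ['**/*.test.ts', '**/*.test.tsx'],
  rules: testFileRules,
}
```

The selector is blunt on purpose. It also catches `new Promise((r) => setTimeout(r, 500))`, which is the usual disguise for a fixed sleep, and any test that truly needs a timer can say so with an explicit disable comment.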

A one-time flake investigation is banked once as a skill and a lint gate, so every future agent starts ahead and cannot reintroduce the bug.

Pay the investigation once, then bank it twice: a skill the next agent reads, and a gate it cannot quietly undo.

Treat your dev infrastructure as the thing under load, because a fleet is a load test you did not write. A single developer runs one suite at a time and cleans up after themselves. A swarm of agents runs many in parallel, abandons runs mid-flight, spawns worktrees, and opens five PRs at once. That stresses shared infrastructure in ways no human workflow does. The symptoms were concrete: 2,217 leaked Postgres schemas from abandoned runs bloated the catalog and slowed everything; four CI runs from four branches fought over one shared Postgres because concurrency was keyed per-branch; I caused my own pile-up by opening several test-infra PRs at once. The fixes were infrastructural, not code: drop the leaked-schema garbage-collection age from twenty-four hours to two, isolate the database per run, add a flake-scan helper that reproduces under load on demand. When you go from one engineer to a fleet, your infrastructure’s contention behaviour stops being hygiene and becomes your throughput ceiling. The connection-pool cap from earlier was the same story: invisible to one developer, throttling to ten parallel workers, and a live production bug underneath.

The thread through all five is trust. An agent’s feedback loop has to be fast enough that it iterates often and trustworthy enough that fast-and-green actually means correct. You build the speed with the techniques in the first half of this piece. You build the trust with the full-CI floor, the skills, the lint gates, and infrastructure that does not buckle under parallelism. Get both and the agents compound. Get only the speed and you have built a very efficient way to merge broken code.

What I would tell anyone past a thousand tests

Parallelism is a multiplier, and it multiplies whatever you aim it at. Aim it at fixed per-file overhead and you get a faster way to waste a CPU. The wins here were not from more cores. They were from reading the phase breakdown, cutting the per-file import tax, making the orchestrator actually parallel, finding the shared-server ceiling, and ruling out infrastructure before blaming the code.

A short checklist I now run before touching a worker count:

Read the phase breakdown first. If tests are not the largest phase, more workers will not help.
Treat per-file imports as the enemy under isolate: true. Import seams, not barrels. Lazy-load heavy graphs.
Confirm the orchestrator is parallel, not just the runner.
Find the ceiling of any shared server (connection pool, event loop) before adding workers.
Treat the worker count as a reliability dial, not only a speed dial. Past the ceiling, more workers buy flake. Cut contention by pushing cases down to unit tests, and wait on signals (findBy, no fixed sleeps) rather than slack.
Rule out infrastructure: catalog bloat, cache topology, run concurrency.
Swap test environments only when the number tells you to.
Split the loop: affected-only locally, the full suite in CI. Fast where you iterate, exhaustive where you commit, and never let the fast proxy become the merge gate.
Make the fast path fail safe: fall back to the full suite on any ambiguity, always include a core smoke, and let the agent see the plan before it trusts it.
Capture each hard investigation as a skill and each hard rule as a lint gate, so the fleet starts where the last fix ended instead of relearning it.

The suite is fast again, and chasing its slowness turned up a production connection-pool throttle I would otherwise never have found. But the lasting result is not the speed. It is that the loop my agents iterate against is now fast enough to run constantly and trustworthy enough to believe, with the lessons baked into skills and lint so they are not relearned every task. That loop, not the model, is the thing you are really scaling. A slow, flaky suite throttles every agent you own; a fast, honest one lets them compound. It is the most honest load test you have, and increasingly it is the substrate your whole development process runs on.