Fast Playwright E2E Without the Bloat

James Phoenix

Most end-to-end suites die the same way. They start fast, then they bloat. A run creeps from forty seconds to four minutes. A few specs go flaky, so someone bumps the retries. The retries hide the flake, the flake hides a real bug, and within a quarter nobody believes a red run means anything. The suite gets quarantined, then deleted.

The fix is not “write better tests.” It is three structural decisions that make the suite physically incapable of rotting: pre-build the server, put a hard wall-clock budget on the run and enforce it with lint, and seed everything through factories.

1. Pre-build the server, then test the thing you ship

The biggest speed win was the least glamorous. I stopped running Playwright against the Next.js dev server.

On its first run, the suite failed seven of eight specs. The obvious move is to crank timeouts and retries until it goes green, and that is the move that kills suites. So I bisected the cause instead, and it was the dev server. Next’s dev mode ships a roughly fourteen-megabyte eval-source-map bundle and lazy-compiles each route the first time it is hit. Run specs in parallel and those cold compiles queue up behind one another, so the first test to touch each route eats the compile cost and blows the per-test cap. The tests were not slow. The server was compiling itself while the clock ran.

The answer is to build once and serve the artifact: next build, then next start against a real API process, and only then does Playwright run. No lazy compilation, a small hashed bundle, every route warm. The same suite went from seven of eight failing to twelve of twelve passing first try, at zero retries, in about forty-five seconds.

Two scripts and a webServer block do it. The build runs once up front. In CI, Playwright just attaches to the already-running production server; locally, it boots that server on demand:

// apps/web/package.json
"e2e:build": "NODE_ENV=production next build",
"e2e:serve": "NODE_ENV=production next start -p ${WEB_PORT:-3651}"

// apps/web/e2e/playwright.config.ts
webServer: {
  command: 'pnpm e2e:serve',   // serves the prebuilt artifact, never `next dev`
  url: baseURL,
  reuseExistingServer: true,   // CI builds + starts it; locally it boots on demand
  timeout: 60_000              // server-start ceiling, NOT the test budget
}

There is a second reason this matters: the production build is the thing your users actually load. Testing the dev server means testing an artifact you never ship. Pre-building exercises the real one.

The key nuance: the build is a prerequisite, not part of the budget. I time the test run, not the compile.

2. A 120-second wall-clock budget, enforced by lint

My integration tests have a hard time budget and I trust them because of it. I wanted the same contract for E2E: it must be impossible for the run to quietly get slower. Not discouraged. Impossible.

The budget is 120 seconds, in three runtime layers. First, globalTimeout in the Playwright config:

// apps/web/e2e/playwright.config.ts
export default defineConfig({
  globalTimeout: 120_000,   // whole run fails if it overruns
  timeout: 30_000,          // per test
  expect: { timeout: 10_000 },
  retries: process.env.CI ? 1 : 0
})

Second, a portable kill-wrapper around the test command. It sends SIGTERM at 120 seconds and SIGKILL ten seconds later, and falls back across timeout, gtimeout, then a tiny Node watchdog so it behaves the same on my laptop and on CI:

// apps/web/package.json  (exit 124 on overrun, like GNU timeout)
"test:e2e": "BUDGET=120; if command -v timeout >/dev/null; then timeout --kill-after=10 $BUDGET pnpm exec playwright test; elif command -v gtimeout >/dev/null; then gtimeout --kill-after=10 $BUDGET pnpm exec playwright test; else node ../../scripts/e2e/wall-clock-watchdog.mjs $BUDGET pnpm exec playwright test; fi"
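The Node fallback in that script is not shown, so here is a minimal sketch of what its core could look like. The function name `runWithBudget`, its parameters, and the exit code 127 for a command that fails to start are assumptions for illustration, not the contents of the real wall-clock-watchdog.mjs. It only signals the direct child process; cleaning up grandchildren such as browser processes is left out.

```typescript
import { spawn } from 'node:child_process'

// Hypothetical core of a wall-clock watchdog: run a command, send SIGTERM when
// the budget expires, send SIGKILL after a grace period, and report 124 on
// overrun so the result reads like GNU timeout.
export const runWithBudget = (
  budgetSeconds: number,
  command: string,
  args: string[],
  killAfterSeconds = 10
): Promise<number> =>
  new Promise((resolve) => {
    const child = spawn(command, args, { stdio: 'inherit' })
    let overran = false
    let killTimer: ReturnType<typeof setTimeout> | undefined

    // First deadline: ask the child to stop, then arm the hard kill.
    const termTimer = setTimeout(() => {
      overran = true
      child.kill('SIGTERM')
      killTimer = setTimeout(() => child.kill('SIGKILL'), killAfterSeconds * 1000)
    }, budgetSeconds * 1000)

    const clearTimers = (): void => {
      clearTimeout(termTimer)
      if (killTimer) clearTimeout(killTimer)
    }

    // The command could not be started at all (assumed convention: 127).
    child.on('error', () => {
      clearTimers()
      resolve(127)
    })

    // Normal exit passes the child's code through; an overrun always maps to 124.
    child.on('exit', (code) => {
      clearTimers()
      resolve(overran ? 124 : code ?? 1)
    })
  })
```

A CLI entry point would read the budget and command from `process.argv`, await `runWithBudget`, and pass the result to `process.exit`.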

Third, a cap on the CI step itself, so the run dies even if the wrapper is somehow bypassed:

# .github/workflows/integration-tests.yml
- name: Run E2E suite
  timeout-minutes: 2
  run: pnpm --filter @app/web run test:e2e

The layer that actually preserves the suite is the fourth one: a lint rule. I wrote enforce-e2e-budget.mjs and wired it into pnpm lint. It reads the three files above and fails the build if any budget layer is missing or widened:

// scripts/lint/enforce-e2e-budget.mjs
const GLOBAL_TIMEOUT_CAP_MS = 120_000
const CI_STEP_CAP_MINUTES   = 2

const globalTimeout = readNumber(playwrightConfig, /globalTimeout\s*:\s*([0-9_]+)/)
if (!globalTimeout || globalTimeout > GLOBAL_TIMEOUT_CAP_MS)
  fail(`globalTimeout missing or > ${GLOBAL_TIMEOUT_CAP_MS}`)

if (!/\b(?:g?timeout|wall-clock-watchdog\.mjs)\b/.test(testScript) || !/\b120\b/.test(testScript))
  fail('test:e2e must wrap playwright in a 120s kill-wrapper')

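// a missing timeout-minutes must fail too, not only a widened one
const ciMinutes = ciStepTimeoutMinutes(workflow)
if (!ciMinutes || ciMinutes > CI_STEP_CAP_MINUTES)
  fail(`CI e2e step timeout-minutes missing or > ${CI_STEP_CAP_MINUTES}`)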

if (errors.length) { errors.forEach((e) => console.error(`e2e-budget: ${e}`)); process.exit(1) }
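The excerpt calls `readNumber` and `ciStepTimeoutMinutes` without showing them. The sketch below is one plausible shape for both, written as assumptions rather than the real helpers: the `stepName` default and the line-based scan are illustrative, and the scan is not a YAML parser, so it expects `timeout-minutes` to follow the step's `name` line.

```typescript
// Hypothetical helpers; the real ones in enforce-e2e-budget.mjs may differ.

// Returns the first capture group as a number, accepting numeric separators
// such as 120_000. Returns undefined when the pattern does not match.
// Pass a pattern without the `g` flag, since exec() is stateful with it.
export const readNumber = (source: string, pattern: RegExp): number | undefined => {
  const match = pattern.exec(source)
  if (!match || match[1] === undefined) return undefined
  const value = Number(match[1].replace(/_/g, ''))
  return Number.isFinite(value) ? value : undefined
}

// Finds the named workflow step and reads its timeout-minutes value.
// Stops at the next list item so another step's timeout is never picked up.
export const ciStepTimeoutMinutes = (
  workflow: string,
  stepName = 'Run E2E suite'
): number | undefined => {
  const lines = workflow.split('\n')
  const start = lines.findIndex((line) => line.includes(`name: ${stepName}`))
  if (start === -1) return undefined
  for (const line of lines.slice(start + 1)) {
    if (/^\s*-\s/.test(line)) break // the next step begins here
    const value = readNumber(line, /^\s*timeout-minutes\s*:\s*([0-9]+)/)
    if (value !== undefined) return value
  }
  return undefined
}
```

Because both helpers return undefined when nothing is found, a caller can treat a missing layer the same way as a widened one.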

Bump globalTimeout to 300000 and lint goes red. Strip the kill-wrapper out and lint goes red.

This is the part people skip and the part that matters most. A number in a config is a suggestion, and the path of least resistance under deadline pressure is always to widen it. Putting the budget behind lint inverts that: widening it now means editing the guard that protects it and writing down why, which is exactly the friction you want at the moment someone is tempted to paper over a slow test.

3. Factories everywhere, so tests never collide

Speed gets you a fast suite. Factories get you a suite that stays green for the right reasons.

The slow death of a shared-database suite is collision. One test seeds a fixed id, another asserts on a global table count, and the moment two specs share a schema, or a schema accumulates rows across runs, they step on each other. The discipline that kills this: every test seeds through a factory that generates unique ids, and every assertion is scoped to the rows that test created. Never a fixed id, never a global count.

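It starts with one helper. Uniqueness comes from four parts: a prefix, a timestamp, an in-process sequence, and a slice of a UUID. The sequence separates values minted in the same millisecond within one process, and the UUID slice separates parallel workers in different processes, so a collision is not a realistic concern even within the same millisecond: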

import { randomUUID } from 'node:crypto'

export const generateId = (): string => randomUUID()

export const generateUniqueValue = (prefix: string): string =>
  `${prefix}-${Date.now()}-${nextSequence()}-${randomUUID().slice(0, 8)}`

export const generateEmail = (prefix = 'user'): string =>
  `${prefix}.${Date.now()}.${nextSequence()}.${randomUUID().slice(0, 8)}@example.com`
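The helpers above call `nextSequence` without showing it. A minimal sketch, assuming it is nothing more than a module-level counter, is enough to see why values minted in the same millisecond still differ. `countDuplicates` is a hypothetical check added for illustration, and `generateUniqueValue` is repeated here only so the block runs on its own.

```typescript
import { randomUUID } from 'node:crypto'

// Assumed implementation: a counter that lives for the lifetime of the process.
let sequence = 0
export const nextSequence = (): number => ++sequence

// Same shape as the helper above, repeated so this block is self-contained.
export const generateUniqueValue = (prefix: string): string =>
  `${prefix}-${Date.now()}-${nextSequence()}-${randomUUID().slice(0, 8)}`

// Hypothetical check: mint many values in a tight loop, where Date.now()
// repeats for long stretches, and count how many were duplicates.
export const countDuplicates = (total: number): number => {
  const seen = new Set<string>()
  for (let i = 0; i < total; i += 1) seen.add(generateUniqueValue('user'))
  return total - seen.size
}
```

Within one process the counter alone guarantees distinct values. Across processes the counters can match, which is where the random UUID slice does the work.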

Every row builder leans on those defaults, and every field stays overridable for the rare case a test needs a specific value:

export const createUserFactory = (options: CreateUserOptions = {}): UserInsert => ({
  id:           options.id           ?? generateId(),
  email:        options.email        ?? generateEmail('user'),
  name:         options.name         ?? generateUniqueValue('User'),
  passwordHash: options.passwordHash ?? generateUniqueValue('password-hash'),
  createdAt:    options.createdAt    ?? generateTimestamp()
})

Bulk seeders take a per-run scope token, tag every row with it, and return the ids so the spec asserts on what it made and nothing else:

const scope = `e2e_${randomUUID().slice(0, 8)}`
const { questionIds } = await seedPublishedBank({ scope, perSubject: 8 })
// rows land as tlseed_<scope>_<subject>_<n>

const served = await fetchServedQuestions()
// scoped: assert membership of what THIS run created
expect(served.map((q) => q.id)).toEqual(expect.arrayContaining(questionIds))
// never: expect(served).toHaveLength(totalRowsInTable)  <-- collides
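To make the scoping idea concrete without a database, here is a small in-memory stand-in. Everything in it is an assumption for illustration: the `table` array plays the shared schema, the subject names are made up, and `seedPublishedBank` only mimics the call shape used above rather than the real seeder.

```typescript
import { randomUUID } from 'node:crypto'

type Question = { id: string; subject: string }

// Stand-in for a shared table that already holds rows from earlier runs.
const table: Question[] = [
  { id: 'tlseed_old_maths_0', subject: 'maths' },
  { id: 'tlseed_old_physics_0', subject: 'physics' }
]

// Hypothetical seeder: tags every row id with the scope and returns the ids.
export const seedPublishedBank = async ({
  scope,
  perSubject,
  subjects = ['maths', 'physics']
}: {
  scope: string
  perSubject: number
  subjects?: string[]
}): Promise<{ questionIds: string[] }> => {
  const questionIds: string[] = []
  for (const subject of subjects) {
    for (let n = 0; n < perSubject; n += 1) {
      const id = `tlseed_${scope}_${subject}_${n}`
      table.push({ id, subject })
      questionIds.push(id)
    }
  }
  return { questionIds }
}

// Returns every row, including leftovers from other runs.
export const fetchServedQuestions = async (): Promise<Question[]> => [...table]

// Scoped check: every id this run created is served, whatever else is there.
export const servesAll = (served: Question[], ids: string[]): boolean => {
  const servedIds = new Set(served.map((q) => q.id))
  return ids.every((id) => servedIds.has(id))
}

// Fresh scope token per run, in the same format as the spec above.
export const newScope = (): string => `e2e_${randomUUID().slice(0, 8)}`
```

Seeding twice with two scopes grows the table both times, yet `servesAll` holds for each run's ids. A length assertion would have to know about every other row in the table to pass.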

The proof that this works is the test I am most happy with. I ran the full suite twice back to back against a deliberately dirty schema, one already holding thousands of accumulated rows, and watched the counts grow between runs. Both runs stayed green. That is the real bar: not “passes on a clean database,” which any suite can do, but “passes on a database filling up underneath it.”

Why bother at all

Because it drives a real browser against the real production build, this layer exercises the glue that API and unit tests skip, and that glue is where the embarrassing bugs live. It caught a login that never installed its auth token, so every protected route silently bounced the user back to sign-in (the API tests injected the token themselves, so only a browser walking through login could see it). It caught a reward that never unlocked because of a wrong id, and a diagnostic score that was computed but never saved because the submit call was missing. Every one is a thing a real user hits on day one, and none shows up in a unit test.

A slow, flaky E2E suite is worse than none, because it teaches you to ignore the one layer that tells you whether the product actually works. Pre-build the server, put the budget behind lint, seed with factories, and you keep the layer that catches the real bugs.


I am merging these improvements into tx-agent-kit as the reference setup, so every project I spin up from it inherits a fast, budgeted, collision-proof E2E suite from day one.
