Why determinism, schema isolation, and enforced layering are the real unlocks for agentic coding.
When I talk about agentic coding, the part that actually matters is not the model. It is the harness around it. If an agent cannot spin up its own isolated environment in milliseconds, run the full integration suite deterministically, and report a clean green or red without flakiness, then the loop dies. You spend all day babysitting.
This post walks through the harness I built for my current product. It is DDD plus Effect on top of Temporal, with Postgres, git worktrees, and Vitest doing the heavy lifting.
Schema isolation, not container isolation
The most important design choice is that each test run gets its own Postgres schema inside a single shared database. Not its own Docker container. Not its own database. A schema.
In packages/testkit/src/sql-context.ts, the test run ID becomes a schema name:
export const compactTestRunId = (testRunId: string): string =>
  testRunId.toLowerCase().replaceAll(/[^a-z0-9]/g, '')

export const buildSchemaName = (testRunId: string, prefix = 'test'): string => {
  const sanitizedRunId = compactTestRunId(testRunId)
  return `${prefix}_${sanitizedRunId}`.slice(0, 63)
}
Every test uses withSchemaClient, which runs SET search_path to pin queries to that schema, applies any missing migrations, and hands the client to the test body. Cleanup between tests is a TRUNCATE TABLE ... RESTART IDENTITY CASCADE, not a drop and recreate.
Why this matters. The startup cost of spinning up a Postgres container is 20 to 30 seconds. The startup cost of creating a schema is about two milliseconds. Multiply that across 150 integration test files and the difference is most of an hour of agent wall clock per run. Agents cannot iterate against a harness that costs an hour of their time every loop. They can iterate against one that costs under a minute.
Deterministic worktree ports via MD5
Agents work out of git worktrees. Multiple agents means multiple worktrees, which means multiple servers trying to claim the same ports. The naive fix is a port registry with a lease protocol. The better fix is to make the port a pure function of the worktree name.
From scripts/worktree/lib/ports.sh:
calculate_port_offset() {
  local worktree_name="$1"
  local hash_output
  hash_output="$(printf '%s' "$worktree_name" | md5sum)"
  local hash="${hash_output:0:4}"
  local hash_decimal=$((16#$hash))
  local range=$((PORT_OFFSET_MAX - PORT_OFFSET_MIN + 1))
  echo $(((hash_decimal % range) + PORT_OFFSET_MIN))
}
Four hex characters, modulo 1000, offset into a safe range. The consequence is that feat/assets-design always gets the same port offset, on every machine, for every worker. No coordination protocol. No registry. No lease. An agent can cold start a worktree and know, before allocating anything, exactly what ports it is going to land on.
Determinism is the feature here, not the hash function. I could swap MD5 for anything stable. The important thing is that a worktree name fully determines its network footprint.
Shared services, one shot spawn
The next move is to stop spawning servers per test file. vitest.integration.workspace.ts registers exactly one global setup that boots the API and the Temporal worker once, and every integration test file connects to those same instances.
From scripts/test/vitest-global-setup.ts:
const apiProcess = spawn(process.execPath, ['--import', 'tsx', apiServerEntryPath], {
  env: {
    ...process.env,
    NODE_ENV: 'test',
    API_PORT: String(integrationApiPort),
    AUTH_SECRET: sharedApiAuthSecret,
    DB_POOL_MAX: '20'
  }
})
The Temporal worker boots the same way, sharing the same auth secret, and both processes wait on health checks before tests start. This is the difference between a suite that takes 140 seconds and one that takes 55. The trick is that schema isolation removes the reason you thought you needed per test servers in the first place. Once database state is isolated per test run, one shared API is safe.
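The health-check wait is not shown in the setup snippet; a minimal sketch of the idea, assuming a plain HTTP health endpoint and Node 18+ `fetch` (the helper name `waitForHealthy` is illustrative, not the repo's actual API):

```typescript
// Hypothetical helper: poll a health URL until it answers 2xx, or give up.
export const waitForHealthy = async (url: string, timeoutMs = 30_000): Promise<void> => {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url)
      if (res.ok) return // service is up; tests may start
    } catch {
      // connection refused: the process is still booting, so retry
    }
    await new Promise((resolve) => setTimeout(resolve, 250))
  }
  throw new Error(`service at ${url} not healthy after ${timeoutMs}ms`)
}
```

Both spawned processes get the same treatment, so by the time Vitest runs its first test, a failure means a real bug rather than a race against startup.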
Effect plus DDD with enforced boundaries
The domain layer uses Either and Option to describe success and failure. The application layer bridges into Effect so errors surface as typed failures. UploadService in packages/core/src/domains/assets/application/upload-service.ts looks like this:
export class UploadService extends Context.Tag('UploadService')<UploadService, {
  requestUpload: (input: UploadInput) => Effect.Effect<
    UploadResult,
    CoreError,
    MediaAssetStorePort | StorageAdapterPort | StorageMeteringService | TeamLookupPort
  >
}>() {}
The dependencies live in the type parameter. There is no DI container. There is no runtime wiring I have to remember. If I forget to provide TeamLookupPort, the program does not compile.
Validation stays in the domain and returns an Either. The application layer lifts left into Effect.fail:
const validated = validateAssetUpload(input)
if (Either.isLeft(validated)) {
  return yield* Effect.fail(badRequest(validated.left.code))
}
Layer discipline is mechanical. packages/tooling/eslint-config/domain-structure-plugin.js defines the allowed imports:
const allowedLayerImports = {
  domain: new Set(['domain']),
  ports: new Set(['domain', 'ports']),
  application: new Set(['domain', 'ports', 'application']),
  adapters: new Set(['domain', 'ports', 'adapters']),
}
If an adapter tries to import from application, ESLint rejects it. This matters for agentic coding because models love to take shortcuts. When the rule is enforced by the linter, the shortcut is simply not available. The model rewrites the code the right way on the second try, not the tenth.
Auth as just another domain, because you cannot close a harness around a black box
This was the section of the harness where I was most wrong for the longest time. I used the Supabase local stack for over a year because “auth is solved” and I did not want to reimplement it. What I actually had was the opposite of solved. I had a 30 second container boot, a stack of Go services I could not step into, a migration story that lived in a different tool than the rest of my migrations, and a fault surface I did not understand. Every time something flaked, I had to guess whether it was my code, GoTrue, Postgrest, Kong, Realtime, or the Docker daemon itself. Guessing is the opposite of a tight feedback loop.
The harder I pushed on the harness, the more I noticed that Supabase was the place where control stopped. Schemas were deterministic. Ports were deterministic. Migrations were deterministic. Auth was a black box sitting in the middle of all of it, and the black box was where almost every flaky test came from. I was trying to close a feedback loop around a component I did not own.
The worst symptom of this was specific and concrete. Supabase does auth in its own bespoke way, in its own auth schema, managed by GoTrue, with its own migration history, its own trigger functions, and its own opinions about what a user row looks like. That schema is not something you can duplicate per worktree the way you can duplicate an application schema. GoTrue expects exactly one auth schema per database and owns it completely. So across every git worktree on my machine, I had this one global auth schema that every test run, every agent, and every branch had to share. My whole isolation story had an asterisk on it, and the asterisk was the single most sensitive part of the system.
The workarounds I tried were all hacks. I looked at spinning up a Supabase stack per worktree, which would have added minutes of boot time and gigabytes of RAM for something I did not want to own. I looked at namespacing users by email prefix so that tests in different worktrees would not collide, which is the kind of fix that works until the day it silently does not. I looked at truncating the auth schema between runs, which broke GoTrue in ways I did not want to debug. Every option ended with me reaching into internals that were not mine, to make them fit a harness shape they were never designed for.
That was the moment I stopped trying to patch it. If the isolation model has a single global exception, the exception becomes the shape of every flaky test you will ever write. So I rewrote auth as a domain.
packages/core/src/domains/auth/ follows the exact same four layer shape as every other domain in the codebase. The domain layer defines SignUpCommand, AuthSession, and the error types. The ports layer declares PasswordHasherPort, SessionTokenPort, and RefreshTokenPort. The application layer is an Effect service that composes them. The adapters layer binds the ports to node:crypto and a small shared helper in @tx-agent-kit/auth for password hashing and session signing. Password reset tokens are opaque strings hashed with SHA256 and stored on the user row. Sessions are signed with HMAC. Refresh tokens rotate on use. That is the entire surface area.
The change in my agent loop was dramatic, and it was not just about speed. Three things got strictly better at once.
First, the boot cost went from 30 seconds to zero. Auth is not a separate process any more. It is a module that runs inside the same Node process as the rest of the API, against the same schema isolated Postgres the rest of the suite uses. The Docker stack for local development lost its reason to exist.
Second, the fault surface collapsed. There is no longer a network hop between my code and my auth logic. There is no container that can be unhealthy. There is no version skew between my client library and the service it talks to. When an auth test fails, the stack trace goes straight into a function in my repo. An agent can read it, find the bug, and fix it in one pass. Before, an agent would read a GoTrue error code, get stuck, and ask me to intervene.
Third, and this is the one I underestimated the most, the code became legible to the model. Supabase auth is configured through environment variables, SQL policies, and JWT claim conventions spread across three layers. A model reading my repo could see the client call but could not see what happened after it. Every auth flow ended at an opaque boundary. Now every flow ends at a function the model can open. Sign up is a file. Password reset is a file. Session refresh is a file. The model can reason about the whole loop because the whole loop is in the repo.
The principle I took away from this is simple. The harness gets easier to close the more of the primitives you own. Every external service you depend on is a place where determinism leaks, where the fault surface widens, and where the model has to guess instead of read. There are good reasons to take that trade sometimes. Payments is one. Email delivery is one. But auth is not a good trade. The primitives are small, well understood, and less than a week of work to own end to end. In exchange you get a component that boots instantly, fails legibly, and lives inside the same type system as the rest of your code.
I should have done this a year earlier. The thing that stopped me was the feeling that auth is dangerous territory. It is not, if you keep the surface area small and let the linter and the type system enforce the boundaries. The dangerous thing is leaving a black box in the middle of a harness you are trying to make deterministic. Every other part of the system can be tightened to nothing, and you will still be bottlenecked on the component you did not build.
Why this combination matters
Any one of these choices is a nice optimisation. Together they are a different category of thing. Schema isolation makes parallelism cheap. Deterministic ports make parallelism safe. Shared services make the loop fast. Effect plus ESLint makes the code coherent as agents pile changes into it. Native auth removes the last piece of infrastructure that required babysitting.
The measure I care about is not test speed. It is whether an agent can open a worktree, make a change, and get a trustworthy signal back in under a minute, without any human setup. That number used to be ten minutes with false positives mixed in. Now it is under one, and the false positive rate is effectively zero.
That is the real unlock. The model does not get smarter. The feedback loop gets tight enough that the model can use the intelligence it already has.