The Environment Leads The Agent

James Phoenix

For a long time I tried to lead my coding agents with better and better prompts, and they kept drifting. What finally worked was the opposite move. As I optimised the boilerplate of the repository I was building, I kept pushing each lesson I learned down into the floor of the repo: hermetic environments, typed contracts, mechanical lint, integration-first tests, queryable telemetry. Somewhere along the way the repo itself became the thing steering the agent, and it asks me what to do far less than it used to. This is the journey that got me there, and what it taught me.

Author: James Phoenix | Date: June 2026


Stop Steering The Agent. Build The Grooves It Falls Into.

It took me a while to accept that the highest-leverage thing I could do for an autonomous agent was not writing a better prompt. It was building an environment where every wrong move produces an immediate, named, mechanical signal. Prompts are advisory; the environment is coercive. An agent reads instructions and may drift, but it cannot drift past pnpm lint failing with Cross-domain import detected: ... expected-domain=team actual-domain=social. That error is not a suggestion. It is a wall.

The whole stack is one coercive surface. None of these signals come from me. They come from the repo:

  • Write new Promise() and effect-consistency.js blocks the commit, because a rejection there becomes a silent defect that bypasses the log and Sentry boundary.
  • Let a Temporal workflow file cross 1200 lines and enforceWorkerTemporalSourceFileSize fails the build, with the remedy in the message: split the implementation.
  • Reach for a secret and source-env.sh resolves it from op:// through op inject, while ESLint forbids reading process.env anywhere but the typed env module.
  • Try to slip a slow suite through by bumping INTEGRATION_TIMEOUT_SECONDS and block-integration-timeout-bump.sh denies it outright, with no escape hatch.

The agent never has to be told any of this. It reads the wall and turns.

This is what surprised me most: the environment leads the agent more than I do. A swarm in parallel worktrees never collides because WORKTREE_PORT_OFFSET is a deterministic hash. A new domain event cannot ship half-wired because six separate gates each fail pnpm lint with the exact file and the exact fix. I steered the setup once (the task queues, the timeouts, the error strategy, the coupling caps); after that, the environment steered every move. The rest of this article walks down those layers (hermeticity, parity, mechanical enforcement, integration-first tests, observability) and how each one carves a groove the agent simply falls into.

Lint, types, tests, telemetry, and parity form a channel that steers the agent forward

The Stack Is The First Constraint

I used to think of a stack as plumbing, interchangeable pipes you bolt together to move bytes. Building this repository changed my mind. I stopped picking tools for their feature lists and started picking each one for the specific class of mistake it makes impossible, or at least makes loud. The stack became the first layer of enforcement, before a single lint rule or test of mine runs. It fails fast and fails loud by design, and that is what lets an agent run without me watching.

Start with Temporal. The work that actually breaks in production is the long-running, multi-step, retriable kind: publishing a post, charging a card, collecting metrics across third-party APIs that fail at random. Every time I hand-rolled the retries, timeouts, idempotency, and crash recovery for that work myself, it was where my worst bugs lived. A durable workflow engine moves all of it into the framework, so a process can survive a deploy or a crash and resume exactly where it stopped. What sold me, though, is that the operational policy lives at the call site instead of in a runbook nobody opens. In apps/worker/src/workflows/activity-proxies.ts, every activity is a typed proxy carrying its own retry policy: publishScheduledSocialPost declares a 10-minute start-to-close timeout with 3 attempts and a 1s initial interval, because media upload stalls; collectMetricsActivity gets 30 seconds because it is pure I/O. An agent writing a new workflow does not invent retry semantics; it sees them as typed constants at the point of call. The proxy file even embeds the invariant inline:

// @spec INV-BILLING-010, billing consumers are idempotent. The activity
// proxy retries with backoff; each activity is implemented so that a second
// execution of the same event produces the same terminal state.

The comment links the retry config to a spec the agent must go read in specs/design/. Discipline is not advice here, it is the shape of the API.
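A minimal sketch of what "policy as typed constants at the call site" can look like. It deliberately uses plain data types rather than the Temporal SDK, and the ActivityPolicy type, the policy table, and the retry values for collectMetricsActivity are assumptions for illustration; only the publishScheduledSocialPost numbers and the 30-second timeout come from the text above.

```typescript
// Sketch only: plain data types, not the Temporal SDK's own option types.
interface RetryPolicy {
  maximumAttempts: number;
  initialIntervalMs: number;
}

interface ActivityPolicy {
  startToCloseTimeoutMs: number;
  retry: RetryPolicy;
}

// Activity names are from the article; the union type itself is an assumption.
type ActivityName = "publishScheduledSocialPost" | "collectMetricsActivity";

// Record<ActivityName, ...> turns a missing policy into a compile error.
const activityPolicies: Record<ActivityName, ActivityPolicy> = {
  // 10 minutes, 3 attempts, 1s initial interval (from the text).
  publishScheduledSocialPost: {
    startToCloseTimeoutMs: 10 * 60 * 1000,
    retry: { maximumAttempts: 3, initialIntervalMs: 1000 },
  },
  // 30 seconds is from the text; the retry values are placeholders.
  collectMetricsActivity: {
    startToCloseTimeoutMs: 30 * 1000,
    retry: { maximumAttempts: 3, initialIntervalMs: 1000 },
  },
};

function policyFor(name: ActivityName): ActivityPolicy {
  return activityPolicies[name];
}
```

Adding a new name to ActivityName without adding its policy fails type-checking, which is the property the paragraph describes: the agent cannot forget to set retry semantics.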

Temporal: retries and timeouts baked into the call site, an agent inherits the retry semantics it cannot forget to set

Effect was the hardest sell to myself, and the choice I am now most glad I made. The failure I fear most from an autonomous agent is the silent one: a swallowed exception, an unhandled rejection, an error logged at the wrong level and never seen again. Effect puts failures into the type system, so a function’s errors sit in its signature and cannot be quietly dropped. That single property, errors as values the compiler tracks rather than exceptions that vanish, is what made the learning curve worth it. In this repo, CoreError is the only error allowed across a domain boundary, every port returns Effect<Result, CoreError>, and effect-consistency.js bans both new Promise() and Effect.promise() outright. The rule message is the teacher: “a promise rejection under Effect.promise becomes a silent defect that bypasses the log+Sentry boundary. tryPromise keeps it in the observable error channel.” There is no quiet way to drop an exception. A rejected promise either lives in the typed error channel where mapCoreError logs and Sentry-captures it, or the lint run fails on commit. Silent defects do not exist in this system because the type and the linter conspire to make them un-writable.
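To show the shape of "errors as values" without pulling in the library, here is a hand-rolled sketch. It is not Effect's API: the Result type, toCoreError, tryPromise, and summarize below are stand-ins I wrote for illustration, and the CoreError fields are assumed.

```typescript
// Hand-rolled stand-in for a typed error channel; this is NOT Effect's API.
type CoreError = {
  readonly _tag: "CoreError";
  readonly code: string;
  readonly cause: unknown;
};

type Result<A> =
  | { readonly ok: true; readonly value: A }
  | { readonly ok: false; readonly error: CoreError };

function toCoreError(code: string, cause: unknown): CoreError {
  return { _tag: "CoreError", code, cause };
}

// Analogue of a tryPromise-style wrapper: a rejection becomes a value
// in the return type instead of an exception that can vanish.
async function tryPromise<A>(
  code: string,
  thunk: () => Promise<A>
): Promise<Result<A>> {
  try {
    return { ok: true, value: await thunk() };
  } catch (cause) {
    return { ok: false, error: toCoreError(code, cause) };
  }
}

// Callers cannot read `.value` without first narrowing on `ok`.
function summarize<A>(result: Result<A>): string {
  return result.ok ? "ok" : `failed: ${result.error.code}`;
}
```

The design point is the return type: because failure is part of Result, the compiler forces every caller to look at it, which is the property the paragraph attributes to Effect's signatures.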

Effect: a raw promise rejection is blocked at commit while a typed CoreError flows into the log and Sentry

The rest of the stack is deliberately boring, composable blocks I did not want to invent: native auth, Redis, Postgres, and OTEL. I picked each because it is a standard primitive an agent already understands and that I can run byte-identically on my laptop and in prod, not a bespoke service I would have to teach it. Auth is typed sessions in the API rather than a third-party SDK, Redis and Postgres come up from the same Docker images in every environment, and OTEL is the single telemetry seam the whole system speaks through. The fewer novel moving parts there are, the fewer places the agent can be surprised.

1Password is the simplest of these, and it earns its place by removing a temptation rather than adding a feature. Plaintext .env files are the most reliable way I know to leak a credential, or to watch an agent paste one straight into a commit. Resolving secrets at the point of use, through a reference instead of a value, takes the plaintext out of the picture entirely. .env.example is the contract, and every credential line is an op:// reference (STRIPE_SECRET_KEY=op://app-services/dev/STRIPE_SECRET_KEY). scripts/lib/source-env.sh resolves these via op inject into a tempfile that is sourced and immediately deleted, while ESLint’s no-restricted-syntax rule forbids process.env outside the typed env modules. An agent cannot hardcode a secret because there is no place for one to live, and cannot read env scatter-shot because the boundary rejects it. If the op CLI is missing, the literal op:// string flows through and the service refuses to boot rather than running on garbage.
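The "refuses to boot rather than running on garbage" behaviour can be sketched as a small guard. This is an assumption about how such a check might look, not the repo's code: findUnresolvedSecrets and assertSecretsResolved are hypothetical names, and the env is passed in as a plain record rather than read from the process.

```typescript
// Hypothetical boot-time guard: refuse to start on unresolved secret references.
type Env = Record<string, string | undefined>;

function findUnresolvedSecrets(env: Env, requiredKeys: readonly string[]): string[] {
  const problems: string[] = [];
  for (const key of requiredKeys) {
    const value = env[key];
    if (value === undefined || value === "") {
      problems.push(`${key} is missing`);
    } else if (value.startsWith("op://")) {
      // The literal reference leaked through, e.g. because the op CLI was absent.
      problems.push(`${key} is still an unresolved op:// reference`);
    }
  }
  return problems;
}

function assertSecretsResolved(env: Env, requiredKeys: readonly string[]): void {
  const problems = findUnresolvedSecrets(env, requiredKeys);
  if (problems.length > 0) {
    throw new Error(`Refusing to boot: ${problems.join("; ")}`);
  }
}
```

Calling assertSecretsResolved with STRIPE_SECRET_KEY still set to its op:// reference throws before any handler could use the value.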

Temporal, Effect, and 1Password each turn a class of mistake into a boot-time, build-time, or lint-time failure, and the boring blocks under them (auth, Redis, Postgres, OTEL) give the agent a substrate it already understands. The agent does not need a human telling it to set retries, route errors, or handle secrets. There is a second reason each one earned its place: an agent can drive it directly. Temporal has a CLI, 1Password has a CLI, Postgres and Redis come up under the Docker CLI, and Effect’s OTEL integration lands traces in Jaeger that the agent reads back over the OTEL REST API. Every component either ships a CLI or is otherwise trivially traversable, and this is by design: an agent can only operate what it can interface with. Owning auth compounds it, because when sessions are typed code in my own API rather than a vendor black box, infra and auth tests are just more integration tests. The stack already decided, and it raises its objection the instant the agent steps off the path. That is the first constraint, and everything else is built on top of it.

Parity Is Not A Nice-To-Have, It Is The Substrate

Here is the thing I underestimated for too long: an agent’s local feedback loop is only worth trusting if local actually predicts prod. If the Redis you run on your laptop has different flags than the Redis in GKE, then a green test suite is a lie, and the agent is learning the wrong grooves. So before any of the clever stuff (the lint rules, the typed contracts, the observability), the repository spends its enforcement budget on one boring thing: making dev, staging, and prod the same shape, mechanically, with no human discipline involved.

Start with the compose files. docker-compose.staging.yml and docker-compose.prod.yml carry identical service structure: the same redis:8.6.0-alpine3.23, the same otel/opentelemetry-collector-contrib:0.146.1, the same curlimages/curl:8.12.1 healthcheck sidecar probing port 13133, the same stop_grace_period: 30s, the same Redis flags (--appendonly, --maxmemory, --maxmemory-policy allkeys-lru). The ONLY differences are port mappings (32080/32090 staging, 32081/32091 prod) and NODE_ENV, and even those come from deploy/env/*.env.template, not from the compose file. The local docker-compose.yml stays infra-only on purpose: api and worker run as hot-reload processes, but they talk to the exact same Redis and OTEL collector images as production does.

That parity is not maintained by anyone remembering to maintain it. scripts/lint/enforce-compose-runtime-contracts.mjs runs on every pnpm lint with 80+ structural fragment assertions. Bump the Redis version in the Kubernetes chart but not in compose, and you get Deployment compose … must preserve runtime fragment. Add an OTEL endpoint in the compose file but forget to mirror it into the env templates, and the PR fails before it merges. The linter also runs compose config --quiet as live validation, so invalid YAML is structurally impossible to land. The linter, not human vigilance, is the keeper of the invariant. That is the lesson I keep coming back to: drift is not discouraged, it is rejected.
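A structural fragment assertion is simple enough to sketch in a few lines. The shapes and the failure wording below are my own illustration, not the linter's actual assertions or messages, and files are passed in as strings so the sketch stays self-contained.

```typescript
// Hypothetical shape; the real linter's assertions and messages are not reproduced here.
interface FragmentAssertion {
  file: string;     // logical file name, e.g. "docker-compose.prod.yml"
  fragment: string; // text that must appear verbatim in that file
}

// Parity: the same fragments are asserted against every environment's file.
function parityAssertions(
  filesToCheck: readonly string[],
  fragments: readonly string[]
): FragmentAssertion[] {
  const assertions: FragmentAssertion[] = [];
  for (const file of filesToCheck) {
    for (const fragment of fragments) {
      assertions.push({ file, fragment });
    }
  }
  return assertions;
}

function checkFragments(
  files: Record<string, string>,
  assertions: readonly FragmentAssertion[]
): string[] {
  const failures: string[] = [];
  for (const { file, fragment } of assertions) {
    const content = files[file];
    if (content === undefined) {
      failures.push(`${file}: file not found`);
    } else if (!content.includes(fragment)) {
      failures.push(`${file}: missing required fragment "${fragment}"`);
    }
  }
  return failures;
}
```

If staging pins redis:8.6.0-alpine3.23 and prod drifts to another tag, checkFragments returns exactly one failure, naming the prod file and the fragment it lost.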

The OTEL collector parity is the sharpest example. There are three configs (otel-collector-config.yaml local, otel-collector.oss.yaml self-hosted, otel-collector.gcp.yaml GCP) that are structurally identical except for backend exporters. The same linter forbids the deprecated implicit ${OTEL_SERVICE_NAMESPACE} syntax and demands the explicit ${env:OTEL_SERVICE_NAMESPACE} provider fragment, failing with Production OTEL collector config must use explicit env provider fragment. So an agent literally cannot hard-code an environment-specific value into the telemetry pipeline. The deployment contract stays visible and auditable.
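The explicit-provider rule reduces to a pattern match. A minimal sketch, assuming a regex-based scan (the article does not say how the real linter detects the implicit form); findImplicitEnvRefs is a hypothetical name.

```typescript
// Finds ${NAME} references that lack the explicit env: provider prefix.
// ${env:NAME} does not match because ":" is outside the identifier class.
function findImplicitEnvRefs(configText: string): string[] {
  const implicit = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = implicit.exec(configText)) !== null) {
    found.push(match[1]);
  }
  return found;
}
```

A collector config containing the implicit form yields the variable name to report; one that only uses the explicit env: form yields an empty list.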

Then the env shape itself converges through three tiers: .env.example documents it with op:// references, deploy/env/{staging,prod}.env.template mirrors it with op://${OP_VAULT}/${OP_ENV}/ tokens, and apps/api/src/config/env.ts declares the canonical requiredApiEnvShape as an Effect Schema. One shape, validated at boot. And boot is where the floor drops out from under misconfiguration. main() in server-lib.ts calls getApiEnv() first, which decodes through Schema.decodeUnknownSync and then runs assertApiEnvInvariants() with 15+ assertions: weak AUTH_SECRET in production, missing GOOGLE_OIDC_*, absent Stripe webhook secret, wildcard CORS. The process exits before any handler runs. The same energy guards Temporal: TEMPORAL_API_KEY is required when TEMPORAL_RUNTIME_MODE=cloud, refused at process-startup, not at first publish.
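A sketch of the boot-time invariant idea, covering three of the checks named above. It does not use Effect Schema, and the ApiEnvSketch fields, the CORS_ORIGIN name, the "local" mode, the 32-character threshold, and applying the CORS check only in production are all assumptions for illustration.

```typescript
// Illustrative subset; names and thresholds beyond those in the text are assumptions.
interface ApiEnvSketch {
  NODE_ENV: "development" | "staging" | "production";
  AUTH_SECRET: string;
  CORS_ORIGIN: string;
  TEMPORAL_RUNTIME_MODE: "local" | "cloud";
  TEMPORAL_API_KEY?: string;
}

const MIN_AUTH_SECRET_LENGTH = 32; // assumed threshold for "weak"

function collectEnvViolations(env: ApiEnvSketch): string[] {
  const violations: string[] = [];
  const isProd = env.NODE_ENV === "production";
  if (isProd && env.AUTH_SECRET.length < MIN_AUTH_SECRET_LENGTH) {
    violations.push("AUTH_SECRET is too weak for production");
  }
  if (isProd && env.CORS_ORIGIN.trim() === "*") {
    violations.push("wildcard CORS is not allowed in production");
  }
  if (env.TEMPORAL_RUNTIME_MODE === "cloud" && !env.TEMPORAL_API_KEY) {
    violations.push("TEMPORAL_API_KEY is required when TEMPORAL_RUNTIME_MODE=cloud");
  }
  return violations;
}

// Called first at startup, so the process stops before any handler runs.
function assertEnvInvariantsSketch(env: ApiEnvSketch): void {
  const violations = collectEnvViolations(env);
  if (violations.length > 0) {
    throw new Error(`Invalid environment: ${violations.join("; ")}`);
  }
}
```

Collecting every violation before throwing means one failed boot reports all the misconfigurations at once, rather than one per restart.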

This is the precondition for everything else in this essay. The floor-grooves (the lint rules, the error boundaries, the typed event contracts) only generalize because the floor under them is the same floor in every environment. The parity linter’s 80+ assertions and the boot-time invariant checks are what let the agent believe its own green checkmark. Take parity away and the agent is being led by a hallucination. Keep it, and local truth becomes prod truth, which is the only condition under which the environment can lead at all.

Mechanical Enforcement Is The Spec

Open packages/tooling/eslint-config/domain-invariants.js and you will find 1,579 lines and 62 rule blocks. That file is not a style guide. It is the actual specification of the repository’s architecture, written in a language the build can execute. When an agent drops a Date.now() into a Temporal workflow under apps/worker/src/workflows/**, ESLint does not shrug. It fires: “Temporal workflows must stay deterministic. Read wall-clock time in an activity, not directly in workflows.” The fix direction is embedded in the error. The agent does not interpret intent from a Slack thread or a half-remembered convention. It reads the violation and the remedy in one line, and it cannot ship until the line is gone.
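For readers who have not written this kind of rule, here is roughly how a Date.now() ban can be expressed with ESLint's built-in no-restricted-syntax rule. The object below is a sketch in flat-config style; the actual rule block in domain-invariants.js is not shown in this article, so the selector and layout are my assumptions, while the glob and the message are quoted from above.

```typescript
// A flat-config style entry; the exact shape used in the repo is an assumption.
const temporalDeterminismRule = {
  files: ["apps/worker/src/workflows/**"],
  rules: {
    "no-restricted-syntax": [
      "error",
      {
        // Matches any call of the form Date.now(...)
        selector:
          "CallExpression[callee.object.name='Date'][callee.property.name='now']",
        message:
          "Temporal workflows must stay deterministic. Read wall-clock time in an activity, not directly in workflows.",
      },
    ],
  },
};
```

The message field is what carries the remedy to the agent: the violation and the fix direction arrive in the same line of lint output.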

This is the part that changed how I work. Human code review is fuzzy and conflicting: one reviewer wants the helper extracted, another wants it inlined, a third never gets to the PR. The linter gives instant, deterministic, named feedback. enforce-domain-event-contracts.mjs says “Event payload field drift for ‘organization.created’: TypeScript interface and effect-Schema have different fields.” It names the event, the two artifacts that disagree, and the two files to edit so the field sets match. There is no ambiguity to negotiate. The error IS the spec.

Mechanical enforcement: the agent hits a named lint wall that states the file, the rule, and the fix

Walk the gauntlet. Topology comes first: eslint-plugin-boundaries v6.0.2 in boundaries.js declares four element types and the allow matrix { from: 'app', allow: ['package', 'app', 'temporal-activity'], disallow: ['temporal-workflow'] }. An app physically cannot import a workflow, which forces workflows to deploy as separate workers. Then effect-consistency.js bans new Promise() and Effect.promise() outright, because “a promise rejection under Effect.promise becomes a silent defect that bypasses the log+Sentry boundary.” The same file rejects a badRequest('Failed to read row') with “A ‘Failed to …’ message is an infra/port-call failure (a 500), not a 4xx client error. Use internalError().” That is a semantic contract, not syntax. It encodes the difference between a client’s fault and the system’s fault and refuses to let an agent mislabel a 500 as a 400, because a mislabeled error gets logged at warn and never reaches Sentry.

The coupling analyzer goes further than per-file lint. It reads contexts.json, walks every import with ts-morph, classifies each as read or write via verb-prefix regexes (WRITE_PREFIXES counts at writeCouplingMultiplier: 2), and gates on maxPortsPerPair: 5 with allowCycles: false. Adding a sixth symbol across the billing → notifications pair becomes mechanically impossible: pnpm coupling:check exits 1 and the PR does not merge. Cycles die under Tarjan’s SCC before review.
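The per-pair gate can be sketched without ts-morph by starting from already-extracted imports. Only the config keys and their values come from the text; the prefix list, the data shapes, and the decision to gate on distinct symbols while merely reporting the weighted score are assumptions, since the article does not spell out how the multiplier feeds the cap. Cycle detection is left out of this sketch.

```typescript
// Hypothetical prefixes and shapes; only the config keys and values come from the text.
const WRITE_PREFIXES = ["create", "update", "delete", "set", "insert"];
const couplingConfig = { writeCouplingMultiplier: 2, maxPortsPerPair: 5 };

interface CrossContextImport {
  from: string;   // importing context, e.g. "billing"
  to: string;     // imported context, e.g. "notifications"
  symbol: string; // imported symbol name
}

interface PairReport {
  pair: string;
  ports: number;         // distinct symbols crossing the pair
  weightedScore: number; // writes count double
  overCap: boolean;
}

function isWrite(symbol: string): boolean {
  return WRITE_PREFIXES.some((prefix) => symbol.startsWith(prefix));
}

function analyzeCoupling(imports: readonly CrossContextImport[]): PairReport[] {
  const byPair = new Map<string, Set<string>>();
  for (const imp of imports) {
    const key = `${imp.from} -> ${imp.to}`;
    const symbols = byPair.get(key) ?? new Set<string>();
    symbols.add(imp.symbol);
    byPair.set(key, symbols);
  }
  const reports: PairReport[] = [];
  byPair.forEach((symbols, pair) => {
    let weightedScore = 0;
    symbols.forEach((symbol) => {
      weightedScore += isWrite(symbol) ? couplingConfig.writeCouplingMultiplier : 1;
    });
    reports.push({
      pair,
      ports: symbols.size,
      weightedScore,
      overCap: symbols.size > couplingConfig.maxPortsPerPair,
    });
  });
  return reports;
}
```

A check script built on this would exit non-zero when any report has overCap set, which is the behaviour the paragraph describes for the sixth symbol.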

Centralization is mechanical too. no-bare-semantic-literals.js (433 lines) lexically scans for a bare ['tweet.read', 'tweet.write', 'users.read'] and rejects it with “Array literal duplicates the canonical X_OAUTH_SCOPES tuple from ‘@tx-agent-kit/contracts’. Replace with X_OAUTH_SCOPES.” That single rule exists because the scope tuple once appeared in three places with different element order (docs/social-types-duplication-audit.md, item #9), producing divergent token requests. The drift is now structurally unreachable.

And then the financial audit constraints, the ones I am most glad I added. enforce-domain-invariants.mjs rule [INV-BILLING-007] rejects any CASCADE foreign key on credit_ledger or usage_records: “Financial audit tables must use ON DELETE SET NULL to preserve 7-year tax compliance records.” The pgTAP suite 004_billing_immutability.pgtap.sql asserts the no_update_credit_ledger trigger exists, tagged [INV-BILLING-001], and no amount of TypeScript satisfies it until the database layer is correct.

The agent never guesses at architecture because violation is impossible to ship. The 62 blocks, the coupling caps, the FK tags: they are the language the environment uses to communicate structural intent back to whoever is typing, human or machine. The spec stopped being prose I hoped someone would read. It became the thing that fails the build.

DDD Boundaries The Agent Cannot Cross

Every domain in packages/core/src/domains/<domain>/ has the same fixed layout: five required folders, domain/ (pure logic), ports/ (abstract verbs), application/ (Effect use-cases), adapters/ (persistence), and runtime/ (DI), plus an optional ui/. The dependency arrow points inward only: domain ← ports ← application ← adapters ← runtime. This is not a convention an agent is asked to respect. It is a wall checked at three layers, and the agent that walks toward it gets stopped with a named error before any human sees the diff.

Two domains, each a stack of layers depending inward to a pure domain core, with events.ts as the only cross-domain door

The shape on disk is rigid and identical for every domain:

packages/core/src/domains/billing/
├── domain/        # pure logic, no I/O
├── ports/         # abstract verbs (interfaces)
├── application/   # Effect use-cases
├── adapters/      # persistence
├── runtime/       # dependency injection
└── ui/            # optional

# dependencies point inward only:
# domain  ←  ports  ←  application  ←  adapters  ←  runtime

The first wall is eslint-plugin-boundaries v6.0.2 in packages/tooling/eslint-config/boundaries.js, declaring four element types and a single allow-rule:

// packages/tooling/eslint-config/boundaries.js
{ from: 'app', allow: ['package', 'app', 'temporal-activity'], disallow: ['temporal-workflow'] }

Workflows physically cannot be imported from apps, which forces them out into separate workers. The second wall is domain-structure-plugin.js (2,721 lines), which scans every import at parse time. An agent that writes import { SocialService } from '../social/application/social-service' from inside the team domain gets this, instantly, on save:

Cross-domain import detected:
  source=packages/core/src/domains/team/application/team-service.ts
  import=../social/application/social-service
  expected-domain=team  actual-domain=social

The error names the offending file, the forbidden layer, and the only legal alternative. The agent does not deliberate. It reads the message and reaches for the one door that opens.

That door is events.ts at the domain root, the ONLY file a sibling domain may import. Cross-domain communication flows through typed event payloads, never direct service calls. The repo’s own discipline, pinned in packages/core/AGENTS.md, is stated as a machine-checkable law: events are public nouns, ports are public verbs, errors are private semantics. Domain errors like UploadValidationError with codes EMPTY_FILENAME, FILE_TOO_LARGE are never exported. They stay encapsulated, translated at the application seam into a normalized CoreError. Only the event contract leaks.

// billing/events.ts : the public surface, nouns only
export interface CreditsRefunded { eventId: string; orgId: string; amount: number }

// billing/domain/errors.ts : private semantics, never exported
class UploadValidationError { code: 'EMPTY_FILENAME' | 'FILE_TOO_LARGE' }
// at the application seam this becomes a normalized CoreError

The part that took me longest to get right is field parity. A single domain event must exist, with identical field names, in four places:

billing.credits_refunded must agree across all four:
  1. TS interface   packages/core/src/domains/billing/domain/*-events.ts
  2. Effect schema  packages/temporal-client/src/types/
  3. worker case    the dispatch switch in apps/worker
  4. registry       domainEventTypes in packages/contracts/src/literals.ts

enforce-domain-event-contracts.mjs runs 13 rules at pnpm lint. Add refundReason: string to the TS interface and forget the Effect schema, and the build dies with:

Event payload field drift for 'billing.credits_refunded':
  TypeScript interface and effect-Schema have different fields.
  Only in TS interface: refundReason.
  Update both packages/core/.../billing/domain/*-events.ts
  and packages/temporal-client/src/types/*

The lint also checks that the insert sits inside a db.transaction() (it scans 3000 characters backward for .transaction()), that every registered type has a worker case, and that executeChild carries event.id in the workflowId for idempotency. Six steps, six independent gates. Miss one and the branch will not push.
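The field-drift comparison at the heart of that error is a set difference. A minimal sketch, assuming the field names have already been extracted from both artifacts (the real lint reads them from source files, which is not shown here); fieldDrift is a hypothetical name and the message is modelled on the example above rather than copied from the linter.

```typescript
// Returns a drift message, or null when both artifacts declare the same fields.
function fieldDrift(
  eventType: string,
  tsFields: readonly string[],
  schemaFields: readonly string[]
): string | null {
  const onlyInTs = tsFields.filter((f) => !schemaFields.includes(f));
  const onlyInSchema = schemaFields.filter((f) => !tsFields.includes(f));
  if (onlyInTs.length === 0 && onlyInSchema.length === 0) {
    return null;
  }
  const parts = [`Event payload field drift for '${eventType}':`];
  if (onlyInTs.length > 0) {
    parts.push(`Only in TS interface: ${onlyInTs.join(", ")}.`);
  }
  if (onlyInSchema.length > 0) {
    parts.push(`Only in effect-Schema: ${onlyInSchema.join(", ")}.`);
  }
  return parts.join(" ");
}
```

Reporting both directions matters: a field added only to the schema is as much a drift as one added only to the interface.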

This is why the scaffold CLI matters. pnpm scaffold:crud --domain billing --entity invoice generates the only shape that survives the gauntlet:

$ pnpm scaffold:crud --domain billing --entity invoice
  created  domains/billing/domain/invoice.ts
  created  domains/billing/ports/invoice-repository.ts
  created  domains/billing/application/create-invoice.ts
  created  domains/billing/adapters/invoice-repository.pg.ts
  created  domains/billing/runtime/invoice.layer.ts
  created  domains/billing/application/create-invoice.service.test.ts
  wired    packages/core/src/index.ts (barrel exports)

Its coreFilePlan (in packages/tooling/scaffold/src/index.ts) emits domain, ports, adapters, application, runtime, and a colocated test, then auto-wires barrel exports into packages/core/src/index.ts. The agent does not invent structure. The agent runs the scaffold because the 29 invariants in enforce-domain-invariants.mjs reject repositories/ and services/ folders by name, demand ports declare Effect.Effect<T, CoreError> returns, and check schema-to-factory-to-repository parity. The golden path is not chosen out of discipline. It is the only path where pnpm lint exits zero. Deviation is not discouraged. It does not compile.

Integration-First, And The Budget Has No Escape Hatch

Real, not mocked: parallel workers hit one warmed API server with a schema per suite, beside a 120s budget the agent cannot widen

I inverted the test pyramid on purpose. The primary verification is not a wall of mocked unit tests, it is integration tests that hit a real API server backed by real Postgres, so a failure is a real 401 or Expected 0 rows in credit_ledger ..., got 1, never a fabricated mock. The contract lives in db-auth-context.ts: every test runs against one shared, warmed API server (started once by vitest-global-setup.ts) and gets its own Postgres schema. That isolation plus the warm server is what buys parallelism: 10 thread-pool workers took the suite from a 129s baseline to a 28-31s warm run, and the server respawns automatically when any source file is newer than the lockfile, so no agent ever tests stale code.

The speed is not a hope, it is a budget with no escape hatch. test-integration-quiet.sh enforces a 120s wall-clock ceiling (300 in CI): cross it and a monitor SIGKILLs vitest mid-run and names every test over 10s, ranked. An agent’s natural move is to raise the limit, so the PreToolUse hook block-integration-timeout-bump.sh intercepts any command that sets INTEGRATION_TIMEOUT_SECONDS= and refuses, with no opt-out. The only way to fit the budget is to make the slow test fast, which is what stops a 60s suite ratcheting into a 300s suite with --retry 3 over a quarter.
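The decision that hook makes can be sketched as a pure predicate over the command string. The real hook is a shell script whose logic is not shown in this article, so the regex and the HookDecision shape here are assumptions.

```typescript
interface HookDecision {
  allow: boolean;
  reason?: string;
}

// Denies any command that assigns INTEGRATION_TIMEOUT_SECONDS.
// \b keeps longer variable names that merely end in it from matching.
function checkTimeoutBump(command: string): HookDecision {
  if (/\bINTEGRATION_TIMEOUT_SECONDS=/.test(command)) {
    return {
      allow: false,
      reason:
        "Setting INTEGRATION_TIMEOUT_SECONDS is blocked. Make the slow test faster instead.",
    };
  }
  return { allow: true };
}
```

Because the predicate has no override parameter, there is nothing for a caller to pass that would widen the budget, which mirrors the "no opt-out" property described above.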

The same discipline extends outward. Warmed-server Playwright e2e (apps/web/e2e/playwright.config.ts) reuses a prebuilt next start under its own 120s budget, guarded by enforce-e2e-budget.mjs, and AI tests (http-vcr.ts) replay recorded cassettes by default, gating live recording behind three env vars CI never sets. Integration, e2e, AI: the agent writes fast, hermetic tests not because a doc asks nicely, but because the environment refuses to ship slow ones.

Observability You Can Query Locally And In Prod

Most teams treat telemetry as something you bolt on for the ops dashboard. In this repository it is part of the agent’s decision loop, and the difference is structural. When you run pnpm infra:ensure, Spotlight comes up on 8969, Jaeger on 16686, Prometheus on 9090, Loki on 3100, and an OTEL collector on 4319/4320 fanning traces into Jaeger, metrics into Prometheus, and logs into Loki. None of that is optional scaffolding. scripts/start-dev-services.sh runs check_prometheus and check_loki with an INFRA_READY_TIMEOUT_SECONDS=240 gate, and spotlight.sh curls http://localhost:8969 and fails fast with “Start it with: pnpm infra:ensure” if the sidecar is missing. The agent cannot proceed into a broken observability state, because the environment refuses to let it.

OTEL: one pipeline fans metrics, traces, and logs into Grafana, the same pipeline local and in prod

The critical move is exposing all of this through .mcp.json as servers the agent queries directly: jaeger-local, prometheus-local, spotlight-local. So when a test fails, the agent does not reread the code and guess. It calls mcp__jaeger_local__find_traces, pulls the trace by trace_id, and sees the exact span where the failure originated, then queries Loki by that same trace_id for the structured logs. The debug-from-telemetry SKILL.md encodes the triage order: logs first, then traces, then metrics, then Temporal state. Because every app layer initializes Sentry with the placeholder DSN https://spotlight@local/0 and tracesSampleRate: 1.0 when Spotlight is on, every request from web through API through the Temporal worker generates a correlated, queryable span. A symptom in the browser has a direct trace path to its source in an activity. Observability becomes the ground truth that turns a symptom into a fix, not a hunch.

The metric names are not loose strings, they are typed constants: tx_agent_kit_api_request_total, tx_agent_kit_outbox_unprocessed_age_seconds, tx_agent_kit_ai_generation_cost_usd, recorded through helpers like recordApiRequest() and recordAiGeneration(). Cardinality is bounded by enums (provider, status_class, error_category), and units are annotated with braces ({usd}, {request}) so the Prometheus exporter does not append a spurious _usd suffix. This matters because dashboards and alerts depend on these exact rendered names, and the names are validated mechanically. pnpm monitoring:validate runs two layers. Layer 1 extracts every PromQL expression from the GCP alert and dashboard JSON and feeds it to promtool check rules for syntax, then promtool test rules synthesizes breach and healthy fixtures so each alert must fire above threshold and stay silent below it, fully offline. Layer 2 emits real metrics through the actual OTEL SDK and collector, then queries Prometheus to confirm every series the dashboards reference actually exists under its rendered name, catching _milliseconds versus _seconds drift before it ships. An agent that renames a metric in packages/contracts but forgets the alert JSON fails the integration test with the missing series name, not a silent 3am page.

The same OTEL pipeline runs in GKE and GCP. The collector configs (otel-collector.gcp.yaml, otel-collector.oss.yaml, local) are structurally identical, differing only in exporters, and enforce-compose-runtime-contracts.mjs blocks any config that hard-codes a value instead of using ${env:OTEL_SERVICE_NAMESPACE}. The pattern I grew to rely on is critical-outbox-dispatcher-stale.json: a dead-man’s switch firing on absent_over_time(tx_agent_kit_outbox_unprocessed_age_seconds[5m]). The absence of a metric is itself the alert, because a dead dispatcher emits nothing. That is observability as a contract the agent cannot ignore: the signal, present or absent, drives the next action.

Worktrees, Surfaces, And The Autonomous Loop

Start with the most boring file in the system, scripts/worktree/lib/ports.sh, because it is where stateless autonomy actually becomes safe. Every worktree gets a WORKTREE_PORT_OFFSET derived from a stable hash of its name:

hash="${hash:0:4}"
local hash_decimal=$((16#$hash))
local range=$((PORT_OFFSET_MAX - PORT_OFFSET_MIN + 1))
local resolved_offset=$(((hash_decimal % range) + PORT_OFFSET_MIN))

feat/tenancy-team-members always resolves to offset 316. feat/billing-refund always lands on 701. That determinism is what makes stateless autonomy safe. There is no port broker, no lockfile dance, no agent asking another agent “are you using 4100?” The Vitest global setup just reads the env var and computes const integrationApiPort = 4100 + worktreePortOffset, the dev server takes 3000 + offset, the API takes 4000 + offset, and the Postgres schema becomes wt_<sanitized_branch>. Two agents in two worktrees collide on nothing because the arithmetic guarantees it, and the primary checkout is pinned to offset 0. Shared Postgres, Redis, and Docker stay single-instance (max_connections=1000), so agents share I/O but never lock the same row. And the only door into this is scripts/worktree/create.sh: the PreToolUse hook block-raw-git-worktree-add.sh denies raw git worktree add outright, so an agent that tries the obvious thing reads an error that names the script it must use instead. The constraint is the onboarding.
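The same arithmetic, sketched in TypeScript. Two things are assumed because the article does not state them: the hash function (FNV-1a stands in for whatever ports.sh uses) and the PORT_OFFSET_MIN/PORT_OFFSET_MAX bounds. The offsets this sketch produces will therefore not match the 316 and 701 quoted above.

```typescript
// Assumed bounds; the real PORT_OFFSET_MIN/MAX values are not given in the article.
const PORT_OFFSET_MIN = 1;
const PORT_OFFSET_MAX = 999;

// FNV-1a (32-bit) as a stand-in hash, rendered as 8 hex characters.
function fnv1aHex(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Mirrors the shell excerpt: first 4 hex chars, then modulo into the range.
function worktreePortOffset(worktreeName: string): number {
  const hashDecimal = parseInt(fnv1aHex(worktreeName).slice(0, 4), 16);
  const range = PORT_OFFSET_MAX - PORT_OFFSET_MIN + 1;
  return (hashDecimal % range) + PORT_OFFSET_MIN;
}

// Base ports are from the text: 3000 web, 4000 API, 4100 integration API.
function portsFor(worktreeName: string): { web: number; api: number; integrationApi: number } {
  const offset = worktreePortOffset(worktreeName);
  return { web: 3000 + offset, api: 4000 + offset, integrationApi: 4100 + offset };
}
```

What the sketch demonstrates is determinism: the same worktree name always yields the same ports with no coordination. A modulo over a hash can in principle map two different names to the same offset, so uniqueness depends on the range and the names in use.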

On top of that isolation sits the second pillar: the generated agent-surface contract. The entire API is declared as an Effect HttpApi, pnpm openapi:generate introspects it into openapi.json, and scripts/generate-agent-client-surfaces.ts derives packages/contracts/src/agent-client-surface.generated.ts. That contract is pure data: every operation carries mutating, requiresEstimate, requiresConfirmation. The enforcement is a single lint check:

if (operation.mutating && (!operation.requiresEstimate || !operation.requiresConfirmation)) {
  throw new Error(`Mutating agent surface ${operation.key} must require estimate and confirmation.`)
}

An agent cannot add a credit-spending POST endpoint without declaring both flags, and at runtime the CLI/MCP refuse the call without --dry-run or --yes, with the error naming the missing flag. The agent does not learn this from a doc. The compiler and the linter teach it. enforce-agent-client-surface-sync.mjs runs 92 checks and fails with “OpenAPI operation X is neither registered nor excluded,” so every endpoint is visible to agents until someone hides it on purpose with a written rationale.
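The runtime half of that contract can be sketched as a small authorization function. The operation fields are the three named above; the CallFlags shape, the authorizeCall name, the result type, and the reading that --dry-run yields an estimate while --yes executes are my assumptions about behaviour the article only summarizes.

```typescript
interface AgentOperation {
  key: string;
  mutating: boolean;
  requiresEstimate: boolean;
  requiresConfirmation: boolean;
}

// Hypothetical parsed CLI flags: --dry-run and --yes.
interface CallFlags {
  dryRun: boolean;
  yes: boolean;
}

type CallDecision =
  | { allowed: true; mode: "estimate" | "execute" }
  | { allowed: false; reason: string };

function authorizeCall(op: AgentOperation, flags: CallFlags): CallDecision {
  if (!op.mutating) {
    return { allowed: true, mode: "execute" };
  }
  if (flags.dryRun) {
    // An estimate never mutates, so it is always safe to run.
    return { allowed: true, mode: "estimate" };
  }
  if (flags.yes) {
    return { allowed: true, mode: "execute" };
  }
  return {
    allowed: false,
    reason: `${op.key} is mutating: pass --dry-run for an estimate or --yes to confirm.`,
  };
}
```

Checking dryRun before yes means that passing both flags produces an estimate rather than a mutation, which errs on the safe side.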

The third pillar is the ralph loop, and this is where the thesis closes. scripts/ralph.sh does not keep a session alive. It spawns a fresh agent per task with timeout "$TASK_TIMEOUT" claude --print --dangerously-skip-permissions "$PROMPT", injects context from CLAUDE.md pins, linked specs, and the tasks JSON, gives it 30 minutes, and only auto-commits if the agent ran tx done $TASK_ID. The .tx/tasks.db schema rejects malformed task IDs at the trigger level and gates status through a CHECK constraint. The watchdog reclaims orphaned tasks and trips a circuit breaker after consecutive failures.

Here is the strange part. That fresh agent has no persistent memory to drift. It cannot quietly accumulate a wrong assumption across hours of context, because there are no hours of context. Its only inheritance is the repo state: the hash-derived ports that make its worktree hermetic, the surface contract that makes every mutation declare its safety, the 333 invariants annotated as @spec INV-* tags, the 120-second test budget that block-integration-timeout-bump.sh will not let it widen. The invariants carry the knowledge forward, not the agent. When you make the environment deterministic enough to lead, you stop needing the agent to remember anything.

What The Repo Taught Me

When I watch an agent add a billing feature in this repository now, I no longer see it ask questions, because the repo has already answered them.

How do I structure a workflow? It doesn’t ask. It runs pnpm scaffold:crud --domain billing --entity invoice and gets twelve files in a fixed shape: domain/, ports/, application/, adapters/, runtime/, plus a colocated *.service.test.ts. The domain-structure-plugin.js forbids a repositories/ folder and rejects any cross-domain import that isn’t from events.ts. The scaffold isn’t a convenience; it’s the only path that survives pnpm lint. The shape is not a decision the agent makes; it is the floor it was dropped onto.

Did I handle the error? It doesn’t ask. If it writes badRequest('Failed to load invoice'), effect-consistency.js blocks the commit: “A ‘Failed to …’ message is an infra/port-call failure (a 500), not a 4xx.” If it reaches for new Promise() or Effect.promise(), the same config rejects it because a rejection there “becomes a silent defect that bypasses the log+Sentry boundary.” The agent learns, mechanically, that silent failures do not exist in this system. Every fault is a typed CoreError, routed through mapCoreError, and fingerprinted into Sentry. The no-swallowed-errors rule even checks that catch handlers reference the original error. You cannot drop a stack trace here.
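The “Failed to …” check is easy to picture as a plain function. The real rule lives in an ESLint config and works on the syntax tree; this sketch only shows the decision it makes. The checkClientErrorMessage function and the helper names other than badRequest are hypothetical.

```typescript
// Hypothetical sketch of the decision behind the rule, not the ESLint rule itself.
// Only badRequest appears in the text; the other helper names are assumed.
const CLIENT_ERROR_HELPERS = new Set<string>(["badRequest", "notFound", "conflict"])

// Returns a lint message when a 4xx helper carries an infra-style message,
// or null when the call is fine.
function checkClientErrorMessage(helper: string, message: string): string | null {
  const looksLikeInfraFailure = /^Failed to\b/.test(message)
  if (CLIENT_ERROR_HELPERS.has(helper) && looksLikeInfraFailure) {
    return `${helper}('${message}'): a 'Failed to …' message is an infra/port-call failure (a 500), not a 4xx.`
  }
  return null
}
```

A call like checkClientErrorMessage("badRequest", "Failed to load invoice") returns a message, while a genuine validation message such as "Invoice id is required" passes.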

Is my file too big? It doesn’t ask. enforceWorkerTemporalSourceFileSize fails the build at 1201 lines: “Split Temporal workflow implementations so no file exceeds 1200 lines.” There is no flag to raise. The only move is to decompose.
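A size gate like this needs very little machinery. The sketch below is an assumption about the shape of such a check, not the source of enforceWorkerTemporalSourceFileSize: the function names and the in-memory file map are hypothetical, while the 1200-line limit and the remedy text come from the message quoted above.

```typescript
// Hypothetical sketch of a file-size gate.
const MAX_WORKFLOW_LINES = 1200

// Counts lines without treating a trailing newline as an extra line.
function countLines(content: string): number {
  if (content === "") return 0
  const parts = content.split("\n").length
  return content.endsWith("\n") ? parts - 1 : parts
}

// Takes path -> file content and returns one failure message per oversized file.
function findOversizedFiles(
  files: Record<string, string>,
  maxLines: number = MAX_WORKFLOW_LINES
): string[] {
  const failures: string[] = []
  for (const [path, content] of Object.entries(files)) {
    const lines = countLines(content)
    if (lines > maxLines) {
      failures.push(
        `${path} has ${lines} lines. Split Temporal workflow implementations so no file exceeds ${maxLines} lines.`
      )
    }
  }
  return failures
}
```

A build step would exit non-zero whenever the returned list is not empty, so the message doubles as the instruction.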

Does this metric exist? It doesn’t ask. The Layer 2 check in emit-fake-metrics.integration.test.ts emits through the real OTEL SDK and queries Prometheus for the exact rendered series name, catching _milliseconds versus _seconds drift before it ships. validate-monitoring.mjs runs every alert’s PromQL through promtool with synthetic breach and healthy fixtures: the alert MUST fire on breach and stay silent on healthy. The agent doesn’t reason about whether monitoring works. The repo proves it offline in under a second.

Is this deploy safe? It doesn’t ask. deploy-compose.sh captures the previous image digest, rolls forward, runs internal smoke tests against the local API port, and auto-rolls-back on any failure. enforce-compose-runtime-contracts.mjs blocks the PR if staging and prod drift on Redis flags, OTEL health probes, or stop_grace_period. There is no human approval gate. The script is the gate.
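The roll-forward and roll-back sequence can be sketched with the side effects injected. This is an illustration of the control flow only: deploy-compose.sh is a shell script, and the DeploySteps interface, its method names, and the result shape here are hypothetical.

```typescript
// Hypothetical sketch of the control flow; the real steps run in shell.
interface DeploySteps {
  currentDigest: () => string // digest of the image running right now
  rollTo: (digest: string) => void // switch the service to this image
  smokeTest: () => boolean // internal checks against the local API port
}

type DeployResult =
  | { status: "deployed"; digest: string }
  | { status: "rolled_back"; digest: string; reason: string }

function deployWithRollback(newDigest: string, steps: DeploySteps): DeployResult {
  // Capture the way back before touching anything.
  const previousDigest = steps.currentDigest()
  try {
    steps.rollTo(newDigest)
    if (!steps.smokeTest()) {
      throw new Error("smoke test failed")
    }
    return { status: "deployed", digest: newDigest }
  } catch (error) {
    // Any failure after capture restores the previous image.
    steps.rollTo(previousDigest)
    const reason = error instanceof Error ? error.message : String(error)
    return { status: "rolled_back", digest: previousDigest, reason }
  }
}
```

Because the previous digest is captured first, there is no branch in which a failed deploy leaves the new image running without a recorded way back.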

And the invariant that matters most: change credit_ledger to ON DELETE CASCADE and enforce-domain-invariants.mjs fails with [INV-BILLING-007], naming the 7-year tax-compliance reason. The agent cannot break audit history even by accident, because the constraint is louder than its intent.

This is the part that compounds. Every constraint I just listed I carved once, after learning something the hard way: that scope tuples drift (no-bare-semantic-literals), that timeouts ratchet upward forever (the block-integration-timeout-bump.sh hook with no escape hatch), that a dead dispatcher emits no metrics so you need an absent_over_time() dead-man’s-switch on tx_agent_kit_outbox_unprocessed_age_seconds. Each groove I cut made the next agent more autonomous and made me less of a bottleneck. The 333 invariants, the 62 ESLint rule blocks, the 13 enforce scripts, the 120s test budget: these are not features for the agent to use. They are the floor it walks on.

Here is where the journey left me. I am no longer building an app for an agent to operate. I am building the ground beneath it: the error messages it reads, the tests that fail it, the traces it queries in Spotlight, the lint rules that block its commits. I carved those grooves one at a time, each after getting something wrong, and now I rarely have to lead the agent at all. The environment leads it.

The Code

The repository is open source at jamesaphoenix/tx-agent-kit. The pieces behind each section, if you want to read the real thing:

The stack

Mechanical enforcement

DDD and scaffolding

Integration-first testing

Observability

Worktrees and the loop
