Building the Harness Around Claude Code

James Phoenix

Claude Code is a harness around an LLM. Your job is to build a harness around Claude Code.

The Mental Model

Treat this as a signal processing problem.

The LLM is a noisy channel. Each layer of harness:

  • Increases signal-to-noise ratio
  • Constrains the solution space
  • Provides feedback loops for self-correction


The Three Layers

┌─────────────────────────────────────────────────────────────┐
│                    Meta Engineering                         │
│  - Claude Agent Scripts for specific workflows              │
│  - Tests for tests                                          │
│  - Tests for telemetry                                      │
│  - Bespoke infrastructure that speeds up features/tests     │
│  - Agent Swarm Tactics                                      │
│  ┌─────────────────────────────────────────────────────┐   │
│  │         Cultural - Repository Engineering            │   │
│  │  - Metrics/logs/traces (OTEL + Jaeger)              │   │
│  │  - Testing infrastructure                            │   │
│  │  - Prod/Dev parity: Dockerised setup                │   │
│  │  - Code/package structure (DDD)                      │   │
│  │  ┌─────────────────────────────────────────────┐    │   │
│  │  │        Coding Agent (Claude Code)            │    │   │
│  │  │  ┌────────────┬─────┬──────────────┐        │    │   │
│  │  │  │claude.md   │ LLM │ claude hooks │        │    │   │
│  │  │  │setup       │     │              │        │    │   │
│  │  │  └────────────┴─────┴──────────────┘        │    │   │
│  │  └─────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Layer 1: Coding Agent (Claude Code)

The innermost layer. Claude Code is already a harness around the raw LLM.

Component        Purpose
claude.md setup  Onboard the agent with WHAT/WHY/HOW
LLM              The raw model (noisy channel)
claude hooks     Pre/post processing, linting, formatting

Your job here: Configure claude.md and hooks to maximise signal.

See: Writing a Good CLAUDE.md


Layer 2: Cultural – Repository Engineering

The environment the agent operates in. Better environment = better signal.

Observability Stack

# Metrics, logs, traces with OTEL + Jaeger
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC

  otel-collector:
    image: otel/opentelemetry-collector:latest
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
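
On the application side, a Node service points its OTEL SDK at this stack. A minimal sketch, assuming @opentelemetry/sdk-node and the OTLP gRPC port exposed above (the service name is illustrative):

// tracing.ts: wire a service to the Jaeger/collector endpoint above
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  serviceName: "api", // illustrative service name
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4317", // OTLP gRPC port from the compose file
  }),
});

sdk.start();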

Testing Infrastructure

# Tests that give clear signal
scripts/
├── test.sh          # Context-efficient test runner
├── lint.sh          # Auto-fixing linters
└── typecheck.sh     # Type verification

See: Context-Efficient Backpressure

Prod/Dev Parity

# Dockerised setup for maximum parity
# Official Bun image (a node:20-slim base has no bun for the install below)
FROM oven/bun:1 AS base
WORKDIR /app

# Same environment everywhere
COPY package.json bun.lockb ./
RUN bun install --frozen-lockfile

Code Structure (DDD)

src/
├── domain/           # Business logic (pure)
│   ├── entities/
│   └── value-objects/
├── application/      # Use cases
│   └── services/
├── infrastructure/   # External concerns
│   ├── database/
│   └── api/
└── presentation/     # UI/API layer

DDD gives LLMs clear boundaries. Each layer has explicit responsibilities.
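
As a sketch of what those boundaries look like in TypeScript (the Order and CheckoutService names are illustrative, not from any particular repo):

// domain/entities/order.ts: pure business logic, no I/O
export class Order {
  constructor(
    public readonly id: string,
    private readonly items: { sku: string; price: number }[]
  ) {}

  total(): number {
    return this.items.reduce((sum, item) => sum + item.price, 0);
  }
}

// application/services/checkout-service.ts: a use case that depends on an
// interface, never on a concrete database client
export interface OrderRepository {
  save(order: Order): Promise<void>;
}

export class CheckoutService {
  constructor(private readonly orders: OrderRepository) {}

  async checkout(order: Order): Promise<number> {
    await this.orders.save(order); // infrastructure supplies the implementation
    return order.total();
  }
}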


Layer 3: Meta Engineering

The outer layer. Engineering the engineering process itself.

Claude Agent Scripts

# Workflow-specific agent invocations
.claude/
├── commands/
│   ├── implement-feature.md   # "Given this spec, implement..."
│   ├── fix-failing-test.md    # "This test is failing..."
│   ├── review-pr.md           # "Review this PR for..."
│   └── refactor-module.md     # "Refactor this to..."
└── hooks/
    ├── pre-commit.sh          # Lint + format before commit
    └── post-edit.sh           # Run tests after edits

Tests for Tests

// Meta-testing: verify your test infrastructure works
describe("test infrastructure", () => {
  it("run_silent captures failures correctly", async () => {
    const result = await runSilent("failing test", "exit 1");
    expect(result.success).toBe(false);
    expect(result.output).toContain("exit code");
  });

  it("backpressure compresses passing tests", async () => {
    const result = await runSilent("passing test", "exit 0");
    expect(result.output).toBe(""); // No noise on success
  });
});
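
These tests assume a runSilent helper that stays quiet on success and surfaces the exit code on failure. A minimal sketch using Node's child_process, with the interface inferred from the tests above:

import { exec } from "node:child_process";
import { promisify } from "node:util";

const sh = promisify(exec);

// Quiet on success, full detail (including the exit code) on failure
export async function runSilent(
  label: string,
  command: string
): Promise<{ success: boolean; output: string }> {
  try {
    await sh(command);
    return { success: true, output: "" }; // No noise on success
  } catch (err) {
    const e = err as { code?: number; stdout?: string; stderr?: string };
    return {
      success: false,
      output: `${label} failed with exit code ${e.code}\n${e.stdout ?? ""}${e.stderr ?? ""}`,
    };
  }
}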

Tests for Telemetry

// Verify observability works before you need it
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("telemetry-tests");

describe("telemetry", () => {
  it("traces propagate through services", async () => {
    const span = tracer.startSpan("test-span");
    const traceId = span.spanContext().traceId;

    await callDownstreamService({ traceId });
    span.end(); // End the span so it gets exported

    const traces = await jaeger.getTraces(traceId);
    expect(traces.spans.length).toBeGreaterThan(1);
  });
});
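
The jaeger helper is left abstract above. A minimal sketch against the Jaeger query API exposed on port 16686 in the compose file (response shape simplified):

// Query Jaeger's HTTP API for the spans belonging to a trace
async function getTraces(traceId: string): Promise<{ spans: unknown[] }> {
  const res = await fetch(`http://localhost:16686/api/traces/${traceId}`);
  if (!res.ok) throw new Error(`Jaeger query failed: ${res.status}`);

  // The query API wraps results as { data: [ { spans: [...] } ] }
  const body = (await res.json()) as { data?: { spans: unknown[] }[] };
  return { spans: body.data?.[0]?.spans ?? [] };
}

export const jaeger = { getTraces };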

Agent Swarm Tactics

// Parallel agent execution for large tasks
async function swarmImplementation(spec: Spec) {
  const tasks = breakdownSpec(spec);

  // Launch agents in parallel
  const results = await Promise.all(
    tasks.map((task) =>
      spawnAgent({
        prompt: task.prompt,
        scope: task.files,
        constraints: task.constraints,
      })
    )
  );

  // Merge and resolve conflicts
  return mergeResults(results);
}
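
breakdownSpec, spawnAgent and mergeResults are left abstract here. A hypothetical spawnAgent could shell out to Claude Code in non-interactive print mode (claude -p), roughly like this (the option types are assumptions):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// One task = one scoped, non-interactive Claude Code invocation
async function spawnAgent(opts: {
  prompt: string;
  scope: string[];
  constraints: string[];
}): Promise<string> {
  const prompt = [
    opts.prompt,
    `Only touch these files: ${opts.scope.join(", ")}`,
    `Constraints: ${opts.constraints.join("; ")}`,
  ].join("\n");

  // `claude -p` runs Claude Code in print mode and returns its output
  const { stdout } = await exec("claude", ["-p", prompt]);
  return stdout;
}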

Layer 4: Closed-Loop Optimization

The outermost layer, wrapping the three in the diagram above: systems that optimize themselves.

Telemetry as Control Input

Instead of passive monitoring, use telemetry as active feedback:

Service under load
        ↓
Telemetry captured (memory, latency, errors)
        ↓
Constraints evaluated
        ↓
Violations detected?
        ↓
Agent proposes fix → Apply → Re-test
        ↓
Loop until constraints satisfied

Constraint-Driven Optimization

# constraints.yaml
performance:
  memory_max_mb: 300
  p99_latency_ms: 100
  heap_growth_slope: 0  # No leaks

triggers:
  on_violation: spawn_optimizer_agent
  max_iterations: 5
  escalate_to_human: true

Self-Healing Pipeline

async function optimizationLoop(service: Service) {
  // Bounded by max_iterations from constraints.yaml above
  for (let i = 0; i < constraints.triggers.max_iterations; i++) {
    const metrics = await captureMetrics(service);
    const violations = evaluateConstraints(metrics, constraints);

    if (violations.length === 0) {
      return { status: 'healthy' };
    }

    const fix = await optimizerAgent.propose(violations);
    await applyFix(fix);
    await runTests();
  }

  // Iteration budget exhausted: escalate_to_human
  return { status: 'escalate' };
}
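
The helpers are left abstract above. As a sketch, evaluateConstraints might compare captured metrics against the parsed constraints.yaml like this (the metric field names are assumptions):

interface Metrics {
  memoryMb: number;
  p99LatencyMs: number;
  heapGrowthSlope: number;
}

interface Violation {
  constraint: string;
  limit: number;
  actual: number;
}

function evaluateConstraints(metrics: Metrics, constraints: any): Violation[] {
  const perf = constraints.performance;
  const violations: Violation[] = [];

  // Each check mirrors a key under `performance:` in constraints.yaml
  if (metrics.memoryMb > perf.memory_max_mb)
    violations.push({ constraint: "memory_max_mb", limit: perf.memory_max_mb, actual: metrics.memoryMb });
  if (metrics.p99LatencyMs > perf.p99_latency_ms)
    violations.push({ constraint: "p99_latency_ms", limit: perf.p99_latency_ms, actual: metrics.p99LatencyMs });
  if (metrics.heapGrowthSlope > perf.heap_growth_slope)
    violations.push({ constraint: "heap_growth_slope", limit: perf.heap_growth_slope, actual: metrics.heapGrowthSlope });

  return violations;
}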

This is control theory applied to software. The system continuously measures, evaluates, and corrects itself.

See: Closed-Loop Telemetry-Driven Optimization


The Four Layers Summary

Layer                      Focus                Signal Contribution
Layer 1: Claude Code       Agent configuration  Prompt quality
Layer 2: Repository        Environment setup    Environmental clarity
Layer 3: Meta Engineering  Process automation   Workflow efficiency
Layer 4: Closed-Loop       Self-optimization    Continuous improvement

Signal Processing View

Input (your intent)
    │
    ▼
┌─────────────────────┐
│  Meta Engineering   │  ← Shapes the problem space
│  (Scripts, Swarms)  │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Repository Layer   │  ← Provides environmental signal
│  (OTEL, Tests, DDD) │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Claude Code        │  ← Harness around LLM
│  (claude.md, hooks) │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│       LLM           │  ← Noisy channel
└─────────────────────┘
    │
    ▼
Output (code, decisions)

Each layer amplifies signal and attenuates noise.


Key Insight

The LLM is the least controllable part of the system. Everything else is engineering.

Build the harness. Control what you can control. Let the LLM do what it’s good at within a well-constrained environment.


Long-Running Agent Harnesses

Source: Anthropic Engineering

For agents that span multiple context windows, the harness needs additional infrastructure.

The Core Problem

“Each new session begins with no memory of what came before.”

This mirrors engineers working in shifts without handoff documentation—gaps in continuity cause repeated work and lost progress.

Two-Agent Architecture

Split long-running work into specialized agents:

Agent              Purpose
Initializer Agent  Establishes foundation: init.sh, progress files, feature lists, initial commits
Coding Agent       Works within constraints: one feature per session, reads progress, leaves clean state

Session Initialization Sequence

Each coding session follows a structured startup (a minimal script sketch follows the list):

1. Check working directory
2. Review progress files (claude-progress.txt)
3. Examine git history
4. Run basic health checks
5. Select next feature
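
A minimal sketch of that startup sequence as a script the harness could run (paths and commands are assumptions based on this post):

// session-init.ts: the session startup checklist as code
import { execSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";

function initSession() {
  // 1. Check working directory
  console.log("cwd:", process.cwd());

  // 2. Review progress files
  if (existsSync("claude-progress.txt")) {
    console.log(readFileSync("claude-progress.txt", "utf8"));
  }

  // 3. Examine git history
  console.log(execSync("git log --oneline -10", { encoding: "utf8" }));

  // 4. Run basic health checks (assumed repo script)
  execSync("./scripts/test.sh", { stdio: "inherit" });

  // 5. Select next feature (see the feature-list helper below)
}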

Feature Lists: JSON Over Markdown

Use JSON format for feature tracking—it resists inadvertent modifications better than markdown.

{
  "features": [
    {
      "id": "auth-001",
      "name": "User login flow",
      "status": "passing",
      "tests": ["login.spec.ts"]
    },
    {
      "id": "auth-002",
      "name": "Password reset",
      "status": "failing",
      "blockers": ["Email service not configured"]
    }
  ]
}

Why JSON: Agents are less likely to declare premature victory when status is explicitly tracked in structured data.
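
A small helper for selecting the next piece of work might look like this (the features.json path and Feature type are assumptions based on the JSON above):

// features.ts: pick the next feature that is not yet passing
import { readFileSync } from "node:fs";

interface Feature {
  id: string;
  name: string;
  status: "passing" | "failing" | "todo";
  tests?: string[];
  blockers?: string[];
}

function nextFeature(path = "features.json"): Feature | undefined {
  const { features } = JSON.parse(readFileSync(path, "utf8")) as {
    features: Feature[];
  };
  // Skip anything already passing; prefer unblocked work
  return features.find((f) => f.status !== "passing" && !f.blockers?.length);
}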

Browser Automation for Verification

Unit tests alone are insufficient. Add E2E verification with browser automation:

// Puppeteer MCP for realistic user-flow testing
import puppeteer from 'puppeteer';

async function verifyFeature(feature: Feature) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Test actual user flow, not just API responses
  await page.goto('http://localhost:3000/login'); // assumed local dev URL
  await page.type('#email', '[email protected]');
  await page.type('#password', 'password');
  await page.click('button[type="submit"]');

  // Verify real state change
  await page.waitForSelector('.dashboard', { visible: true });

  await browser.close();
}

Common Failure Modes

Problem                          Root Cause                 Solution
Premature victory declarations   Incomplete tracking        Structured JSON feature lists
Buggy handoffs between sessions  Undocumented progress      Git commits + progress files at session start
Incomplete feature marking       Insufficient verification  Browser automation for E2E testing
Environment setup delays         Missing scripts            Pre-written init.sh procedures

Progress File Pattern

# claude-progress.txt

## Session 2024-01-15 14:30

### Completed
- [x] auth-001: User login flow
- [x] auth-003: Session persistence

### In Progress
- [ ] auth-002: Password reset (blocked: email service)

### Learnings
- Rate limiting middleware must be added before auth routes
- Test user cleanup required after each E2E test

### Next Session Should
1. Configure email service mock
2. Complete password reset feature
3. Add rate limiting tests

The Compound Effect at Scale

“90% of traditional programming is becoming commoditised. The other 10%? It’s now worth 1000x more.” — Kent Beck

Real-world productivity data: At Every, 2 engineers produce output equivalent to a 15-person team using compound engineering practices.

Voice-to-Feature Pipeline

Advanced compound engineering workflows:

Voice input (feature idea)
        ↓
Research agents (analyze codebase + best practices)
        ↓
Planning agents (generate detailed GitHub issues)
        ↓
Execution agents (parallel terminals)
        ↓
Human review (architecture, not syntax)

The shift: System thinking and orchestration capability now outweigh raw coding syntax knowledge.

