CI/CD Patterns for AI Agents: GitHub Actions for Agent Verification

James Phoenix

Summary

CI/CD for AI agents requires different patterns than traditional software. Agents are non-deterministic, expensive to run, and require behavioral verification beyond unit tests. This article covers GitHub Actions patterns for running nightly RALPH loops, implementing verification gates before deployment, caching agent responses for speed, managing costs in CI, and building quality gates that catch regressions before they reach production.

The Problem

Traditional CI/CD assumes deterministic tests with fast feedback. AI agents break these assumptions.

Non-Determinism

The same prompt can produce different outputs:

# Traditional test: deterministic
test:
  runs-on: ubuntu-latest
  steps:
    - run: npm test  # Always same result

# Agent test: non-deterministic
test-agent:
  runs-on: ubuntu-latest
  steps:
    - run: bun test agents/  # Different output each run!

A test that passed yesterday might fail today because the model produced a different (but valid) response.
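One way to keep CI stable despite this variance (a sketch, not from any particular library): run the same scenario several times and gate on the pass rate rather than a single sample. The `runScenario` callback below is a hypothetical stand-in for any agent invocation plus its behavioral check.

```typescript
// Run a non-deterministic scenario several times and gate on the pass
// rate, so one unlucky sample does not fail the build.
export async function passRate(
  trials: number,
  runScenario: () => Promise<boolean>
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < trials; i++) {
    if (await runScenario()) passes++;
  }
  return passes / trials;
}

// Gate: e.g. require 4 of 5 runs to satisfy the behavioral check.
export async function assertPassRate(
  trials: number,
  threshold: number,
  runScenario: () => Promise<boolean>
): Promise<void> {
  const rate = await passRate(trials, runScenario);
  if (rate < threshold) {
    throw new Error(`Pass rate ${rate.toFixed(2)} below threshold ${threshold}`);
  }
}
```

The trade-off is cost: five trials means five API calls, so this belongs in the nightly tier, not on every PR.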

Cost Constraints

Running full agent flows is expensive:

Single agent task:
- Input tokens: ~5,000
- Output tokens: ~2,000
- Cost: $0.015-0.05

Full test suite (50 scenarios):
- Cost per run: $0.75-2.50
- Runs per day (on every PR): 20+
- Daily cost: $15-50
- Monthly: $450-1,500

Running agents on every PR burns budget without proportional value.
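These back-of-envelope figures can be reproduced with a small estimator; the token counts and per-million-token prices below are illustrative, matching the rough numbers above rather than any official price list.

```typescript
// Rough cost estimate for one run of an agent test suite.
// Prices are dollars per million tokens (illustrative values).
interface TaskProfile {
  inputTokens: number;
  outputTokens: number;
}

export function estimateSuiteCost(
  scenarios: number,
  task: TaskProfile,
  pricePerMTok: { input: number; output: number }
): number {
  const perTask =
    (task.inputTokens * pricePerMTok.input +
      task.outputTokens * pricePerMTok.output) / 1_000_000;
  return scenarios * perTask;
}

// 50 scenarios at ~5k input / ~2k output tokens, $3/$15 per MTok:
// (5000 * 3 + 2000 * 15) / 1e6 = $0.045 per task, so $2.25 per run.
```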

Behavioral Verification

Unit tests catch structural errors. Agents need behavioral verification:

// Unit test: does it compile?
test('parser handles JSON', () => {
  const result = parseJSON('{"key": "value"}');
  expect(result.key).toBe('value');
});

// Behavioral test: does the agent accomplish the task?
test('agent creates valid PR', async () => {
  const result = await codeReviewAgent.run({
    diff: sampleDiff,
    context: codebaseContext,
  });
  // How do you verify "good feedback"?
  expect(result.comments.length).toBeGreaterThan(0);  // Insufficient
  expect(result.approves).toBeDefined();  // Also insufficient
});

You need verification that the agent behavior meets quality standards, not just that it ran.

Regression Detection

Prompt changes can silently degrade performance:

- You are a helpful code reviewer. Provide specific, actionable feedback.
+ You are a thorough code reviewer. Analyze the code carefully.

This change might reduce feedback quality by 40% with no test failures.

The Solution

Build CI/CD pipelines that account for agent-specific challenges.

Core Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Agent CI/CD Pipeline                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐    ┌──────────────────┐                    │
│  │   PR Triggers   │    │ Scheduled Runs   │                    │
│  │ (lightweight)   │    │ (comprehensive)  │                    │
│  └────────┬────────┘    └────────┬─────────┘                    │
│           │                      │                               │
│           ▼                      ▼                               │
│  ┌─────────────────────────────────────────┐                    │
│  │            Gate Selection                │                    │
│  │   PR: Fast gates only (cached, cheap)   │                    │
│  │   Nightly: Full verification suite      │                    │
│  └─────────────────────────────────────────┘                    │
│           │                      │                               │
│           ▼                      ▼                               │
│  ┌───────────────┐      ┌───────────────────┐                   │
│  │  Type Checks  │      │  Full Agent Run   │                   │
│  │  Lint/Format  │      │  RALPH Loop       │                   │
│  │ Cached Tests  │      │  Behavioral Tests │                   │
│  └───────────────┘      └───────────────────┘                   │
│           │                      │                               │
│           ▼                      ▼                               │
│  ┌─────────────────────────────────────────┐                    │
│  │          Results & Metrics               │                   │
│  │   Cost tracking, quality scores, diffs  │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Patterns

Pattern 1: Tiered Verification

Run different gates based on trigger type:

# .github/workflows/agent-ci.yml
name: Agent CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    # Nightly at 2 AM UTC
    - cron: '0 2 * * *'
  workflow_dispatch:
    inputs:
      run_full_suite:
        description: 'Run full agent verification'
        type: boolean
        default: false

jobs:
  # Always run: fast, cheap, deterministic
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Type Check
        run: npx tsc --noEmit

      - name: Lint Prompts
        run: |
          # Fail if common AI filler phrases appear in prompts
          if grep -rn "delve\|crucial\|leverage" prompts/; then
            echo "Found banned phrases in prompts/"
            exit 1
          fi

      - name: Validate Schemas
        run: npx zod-to-json-schema --validate

  # PR: cached agent tests only
  cached-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    needs: static-analysis
    steps:
      - uses: actions/checkout@v4

      - name: Restore Response Cache
        uses: actions/cache@v4
        with:
          path: .agent-cache/
          key: agent-responses-${{ hashFiles('prompts/**') }}
          restore-keys: |
            agent-responses-

      - name: Run Cached Agent Tests
        run: |
          # Only runs tests that have cached responses
          bun test --cached-only agents/
        env:
          AGENT_CACHE_DIR: .agent-cache

  # Nightly/manual: full verification
  full-verification:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || github.event.inputs.run_full_suite == 'true'
    needs: static-analysis
    steps:
      - uses: actions/checkout@v4

      - name: Setup
        uses: oven-sh/setup-bun@v1

      - name: Run Full Agent Suite
        run: bun test agents/
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          AGENT_COST_LIMIT: '5.00'  # $5 max per run
        timeout-minutes: 60

      - name: Cache New Responses
        uses: actions/cache/save@v4
        with:
          path: .agent-cache/
          key: agent-responses-${{ hashFiles('prompts/**') }}-${{ github.run_id }}

Pattern 2: Nightly RALPH Loop

Run continuous development overnight:

# .github/workflows/nightly-ralph.yml
name: Nightly RALPH Loop

on:
  schedule:
    - cron: '0 22 * * *'  # 10 PM UTC
  workflow_dispatch:
    inputs:
      max_hours:
        description: 'Maximum runtime in hours'
        type: number
        default: 8
      focus_area:
        description: 'Focus area (e.g., chapter, feature)'
        type: string
        default: ''

jobs:
  ralph-loop:
    runs-on: ubuntu-latest
    timeout-minutes: 540  # 9 hours max

    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 0

      - name: Configure Git
        run: |
          git config user.name "RALPH Bot"
          git config user.email "ralph-bot@example.com"  # placeholder address

      - name: Setup Environment
        uses: oven-sh/setup-bun@v1

      - run: bun install

      - name: Run RALPH Loop
        run: |
          ./scripts/ralph.sh \
            --max-hours ${{ github.event.inputs.max_hours || 8 }} \
            --review-every 6
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

      - name: Push Changes
        run: |
          git push origin HEAD
        if: success()

      - name: Create Summary PR
        if: success()
        uses: peter-evans/create-pull-request@v5
        with:
          title: "[RALPH] Nightly progress - ${{ github.run_number }}"
          body: |
            ## Nightly RALPH Run Summary

            **Duration**: ${{ github.event.inputs.max_hours || 8 }} hours
            **Focus**: ${{ github.event.inputs.focus_area || 'General' }}

            See workflow logs for details.
          branch: ralph/nightly-${{ github.run_number }}
          base: main

      - name: Notify on Failure
        if: failure()
        run: |
          curl -X POST $SLACK_WEBHOOK \
            -H 'Content-Type: application/json' \
            -d '{"text": "RALPH nightly run failed. Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

Pattern 3: Response Caching for Speed

Cache agent responses to enable fast PR checks:

// lib/agent-cache.ts
import { createHash } from 'crypto';
import { readFile, writeFile, mkdir } from 'fs/promises';
import { existsSync } from 'fs';

interface CachedResponse {
  prompt: string;
  promptHash: string;
  response: string;
  model: string;
  timestamp: string;
  tokens: { input: number; output: number };
}

const CACHE_DIR = process.env.AGENT_CACHE_DIR || '.agent-cache';

function hashPrompt(prompt: string, systemPrompt?: string): string {
  const content = `${systemPrompt || ''}|||${prompt}`;
  return createHash('sha256').update(content).digest('hex').slice(0, 16);
}

export async function getCachedResponse(
  prompt: string,
  systemPrompt?: string
): Promise<string | null> {
  const hash = hashPrompt(prompt, systemPrompt);
  const cachePath = `${CACHE_DIR}/${hash}.json`;

  if (!existsSync(cachePath)) {
    return null;
  }

  try {
    const cached: CachedResponse = JSON.parse(await readFile(cachePath, 'utf-8'));
    return cached.response;
  } catch {
    return null;
  }
}

export async function cacheResponse(
  prompt: string,
  response: string,
  systemPrompt?: string,
  metadata?: { model: string; tokens: { input: number; output: number } }
): Promise<void> {
  const hash = hashPrompt(prompt, systemPrompt);

  if (!existsSync(CACHE_DIR)) {
    await mkdir(CACHE_DIR, { recursive: true });
  }

  const cached: CachedResponse = {
    prompt,
    promptHash: hash,
    response,
    model: metadata?.model || 'unknown',
    timestamp: new Date().toISOString(),
    tokens: metadata?.tokens || { input: 0, output: 0 },
  };

  await writeFile(`${CACHE_DIR}/${hash}.json`, JSON.stringify(cached, null, 2));
}

// Integration with agent runner
export async function runWithCache(
  prompt: string,
  systemPrompt: string,
  runner: (prompt: string, system: string) => Promise<string>
): Promise<{ response: string; cached: boolean }> {
  // Check cache first
  const cached = await getCachedResponse(prompt, systemPrompt);
  if (cached) {
    return { response: cached, cached: true };
  }

  // Run agent
  const response = await runner(prompt, systemPrompt);

  // Cache for future runs
  await cacheResponse(prompt, response, systemPrompt);

  return { response, cached: false };
}

Pattern 4: Cost Budget Enforcement

Prevent runaway costs in CI:

# .github/workflows/cost-controlled-tests.yml
name: Cost-Controlled Agent Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'agents/**'

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Calculate Cost Budget
        id: budget
        run: |
          # PRs get $2, main pushes get $10, nightly gets $50
          if [ "${{ github.event_name }}" == "pull_request" ]; then
            echo "budget=2.00" >> $GITHUB_OUTPUT
          elif [ "${{ github.event_name }}" == "schedule" ]; then
            echo "budget=50.00" >> $GITHUB_OUTPUT
          else
            echo "budget=10.00" >> $GITHUB_OUTPUT
          fi

      - name: Run Agent Tests with Budget
        run: |
          bun run test:agents --budget ${{ steps.budget.outputs.budget }}
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Report Costs
        if: always()
        run: |
          # Read cost log
          if [ -f .agent-costs.json ]; then
            TOTAL=$(jq '.total' .agent-costs.json)
            echo "### Agent Test Costs" >> $GITHUB_STEP_SUMMARY
            echo "**Total**: \$$TOTAL" >> $GITHUB_STEP_SUMMARY
            echo "**Budget**: \$${{ steps.budget.outputs.budget }}" >> $GITHUB_STEP_SUMMARY
            jq -r '.breakdown[] | "- \(.name): $\(.cost)"' .agent-costs.json >> $GITHUB_STEP_SUMMARY
          fi

The budget flag passed to the runner is enforced by a cost tracker:

// lib/cost-tracker.ts
interface CostEntry {
  name: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
}

class CostTracker {
  private entries: CostEntry[] = [];
  private budget: number;

  // Pricing per 1M tokens (approximate)
  private pricing: Record<string, { input: number; output: number }> = {
    'claude-3-haiku': { input: 0.25, output: 1.25 },
    'claude-sonnet-4-5-20250929': { input: 3.00, output: 15.00 },
    'claude-opus-4-5-20251101': { input: 15.00, output: 75.00 },
  };

  constructor(budget: number) {
    this.budget = budget;
  }

  track(name: string, model: string, inputTokens: number, outputTokens: number): void {
    const pricing = this.pricing[model] || this.pricing['claude-sonnet-4-5-20250929'];
    const cost = (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;

    this.entries.push({ name, model, inputTokens, outputTokens, cost });

    const total = this.getTotal();
    if (total > this.budget) {
      throw new Error(
        `Cost budget exceeded: $${total.toFixed(2)} > $${this.budget.toFixed(2)}. ` +
        `Stopping to prevent overspend.`
      );
    }
  }

  getTotal(): number {
    return this.entries.reduce((sum, e) => sum + e.cost, 0);
  }

  getReport(): { total: number; breakdown: CostEntry[] } {
    return {
      total: this.getTotal(),
      breakdown: this.entries,
    };
  }
}

// Global tracker
export const costTracker = new CostTracker(
  parseFloat(process.env.AGENT_COST_LIMIT || '10.00')
);
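The "Report Costs" step reads `.agent-costs.json` with `jq`, but nothing above writes that file. A sketch of the missing bridge, assuming the `{ total, breakdown }` shape the workflow's `jq` queries expect:

```typescript
import { writeFile } from 'fs/promises';

// Serialize a cost report into the shape the "Report Costs" workflow
// step queries with jq: .total plus .breakdown[] entries.
export function toCostReportJson(report: {
  total: number;
  breakdown: Array<{ name: string; cost: number }>;
}): string {
  return JSON.stringify(
    {
      total: Number(report.total.toFixed(4)),
      breakdown: report.breakdown.map(e => ({
        name: e.name,
        cost: Number(e.cost.toFixed(4)),
      })),
    },
    null,
    2
  );
}

export async function writeCostReport(
  report: { total: number; breakdown: Array<{ name: string; cost: number }> },
  path = '.agent-costs.json'
): Promise<void> {
  await writeFile(path, toCostReportJson(report));
}
```

Calling `writeCostReport(costTracker.getReport())` at the end of the test run, even when the budget check throws, keeps the CI summary accurate.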

Pattern 5: Behavioral Regression Tests

Detect quality degradation in agent outputs:

// tests/agent-regression.test.ts
import { describe, test, expect } from 'bun:test';
import { codeReviewAgent } from '../agents/code-review';
import { loadGoldenSet } from './fixtures/golden-set';

describe('Code Review Agent Regression', () => {
  const goldenSet = loadGoldenSet('code-review');

  for (const golden of goldenSet) {
    test(`${golden.name}: maintains quality baseline`, async () => {
      const result = await codeReviewAgent.run({
        diff: golden.input.diff,
        context: golden.input.context,
      });

      // Structural checks (fast, deterministic)
      expect(result.comments).toBeDefined();
      expect(result.comments.length).toBeGreaterThan(0);

      // Quality metrics (may vary but should meet thresholds)
      const metrics = evaluateReview(result, golden.expectedIssues);

      // Precision: of the issues found, how many are real?
      expect(metrics.precision).toBeGreaterThan(0.7);

      // Recall: of the known issues, how many were found?
      expect(metrics.recall).toBeGreaterThan(0.6);

      // No hallucinated file paths
      expect(metrics.validPaths).toBe(true);

      // Comments reference actual line numbers
      expect(metrics.validLineNumbers).toBe(true);
    });
  }
});

interface QualityMetrics {
  precision: number;
  recall: number;
  validPaths: boolean;
  validLineNumbers: boolean;
}

function evaluateReview(
  result: { comments: Array<{ file: string; line: number; issue: string }> },
  expectedIssues: Array<{ file: string; line: number; type: string }>
): QualityMetrics {
  let truePositives = 0;
  let falsePositives = 0;

  for (const comment of result.comments) {
    const matchesExpected = expectedIssues.some(
      exp => exp.file === comment.file &&
             Math.abs(exp.line - comment.line) <= 3  // Within 3 lines
    );

    if (matchesExpected) {
      truePositives++;
    } else {
      falsePositives++;
    }
  }

  const precision = result.comments.length > 0
    ? truePositives / result.comments.length
    : 0;
  const recall = expectedIssues.length > 0
    ? truePositives / expectedIssues.length
    : 1;

  // Validate paths: a full implementation should check each path against
  // the files actually present in the diff; this simplified version only
  // rejects empty paths.
  const validPaths = result.comments.every(c => c.file.length > 0);

  const validLineNumbers = result.comments.every(c =>
    c.line > 0 && c.line < 10000
  );

  return { precision, recall, validPaths, validLineNumbers };
}
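`loadGoldenSet` is assumed above but never shown. A minimal sketch, assuming one JSON fixture per golden case under `tests/fixtures/<suite>/` (the directory layout and `GoldenCase` shape are inferred from how the test uses them):

```typescript
import { readdirSync, readFileSync } from 'fs';
import { join } from 'path';

// Shape inferred from the regression test above.
interface GoldenCase {
  name: string;
  input: { diff: string; context: string };
  expectedIssues: Array<{ file: string; line: number; type: string }>;
}

// Parse one fixture file's contents into a golden case (pure, testable).
export function parseGoldenCase(filename: string, json: string): GoldenCase {
  return { name: filename.replace(/\.json$/, ''), ...JSON.parse(json) };
}

// Load every *.json fixture under <root>/<suite>/.
export function loadGoldenSet(
  suite: string,
  root = 'tests/fixtures'
): GoldenCase[] {
  const dir = join(root, suite);
  return readdirSync(dir)
    .filter(f => f.endsWith('.json'))
    .map(f => parseGoldenCase(f, readFileSync(join(dir, f), 'utf-8')));
}
```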

Pattern 6: Multi-Model Testing

Test agents against multiple models:

# .github/workflows/multi-model-test.yml
name: Multi-Model Agent Tests

on:
  schedule:
    - cron: '0 4 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  test-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model:
          - claude-3-haiku-20240307
          - claude-sonnet-4-5-20250929
          - claude-opus-4-5-20251101
      fail-fast: false

    steps:
      - uses: actions/checkout@v4

      - name: Setup
        uses: oven-sh/setup-bun@v1

      - name: Run Tests with ${{ matrix.model }}
        run: |
          bun test agents/ --model ${{ matrix.model }}
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          TEST_MODEL: ${{ matrix.model }}
        continue-on-error: true

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: test-results/

  compare-results:
    needs: test-matrix
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Download All Results
        uses: actions/download-artifact@v4
        with:
          path: all-results/

      - name: Generate Comparison Report
        run: |
          bun scripts/compare-model-results.ts all-results/

      - name: Post Summary
        run: |
          cat model-comparison.md >> $GITHUB_STEP_SUMMARY
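`compare-model-results.ts` is referenced but not shown. Its core is plausibly a pass-rate table across models; the `{ model, passed, total }` result shape is an assumption about what each matrix job uploads.

```typescript
// Aggregate per-model results into the markdown table posted to the
// step summary. Result shape is assumed, not defined by the workflow.
interface ModelResult {
  model: string;
  passed: number;
  total: number;
}

export function comparisonTable(results: ModelResult[]): string {
  const header = '| Model | Passed | Pass rate |\n| --- | --- | --- |';
  const rows = results.map(r => {
    const rate = r.total > 0 ? (100 * r.passed) / r.total : 0;
    return `| ${r.model} | ${r.passed}/${r.total} | ${rate.toFixed(1)}% |`;
  });
  return [header, ...rows].join('\n');
}
```

A cheaper model that passes, say, 90% of scenarios at a tenth of the cost is often a more useful finding than the absolute winner.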

Pattern 7: Prompt Regression Detection

Detect when prompt changes degrade quality:

# .github/workflows/prompt-regression.yml
name: Prompt Regression Check

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'agents/**/system-prompt.md'

jobs:
  check-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get Changed Prompts
        id: changed
        run: |
          CHANGED=$(git diff --name-only ${{ github.event.pull_request.base.sha }} -- 'prompts/**' 'agents/**/system-prompt.md')
          # Use heredoc syntax so multi-line file lists survive GITHUB_OUTPUT
          {
            echo "files<<EOF"
            echo "$CHANGED"
            echo "EOF"
          } >> $GITHUB_OUTPUT

      - name: Run Baseline (main branch)
        run: |
          git checkout ${{ github.event.pull_request.base.sha }}
          bun test:prompts --output baseline-results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Run Current (PR branch)
        run: |
          git checkout ${{ github.sha }}
          bun test:prompts --output current-results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Compare Results
        id: compare
        run: |
          bun scripts/compare-prompt-results.ts \
            baseline-results.json \
            current-results.json \
            --threshold 0.1  # 10% degradation threshold

      - name: Comment on PR
        if: steps.compare.outputs.degradation == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('regression-report.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: report
            });
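The comparison script's core decision can be sketched as: flag degradation when the current score drops more than the threshold fraction below baseline. The scalar quality score in [0, 1] is an assumption about what `test:prompts` emits.

```typescript
// Returns true when current results degraded relative to baseline by
// more than `threshold` (a fraction, e.g. 0.1 for the 10% gate above).
export function isDegraded(
  baselineScore: number,
  currentScore: number,
  threshold: number
): boolean {
  if (baselineScore <= 0) return false;  // nothing to regress from
  return (baselineScore - currentScore) / baselineScore > threshold;
}
```

The script would then write `degradation=true` to `$GITHUB_OUTPUT` so the PR-comment step fires.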

Advanced Patterns

Canary Deployments for Agents

Roll out agent changes gradually:

// lib/canary-router.ts
interface CanaryConfig {
  newVersion: string;
  oldVersion: string;
  percentage: number;  // 0-100
  metrics: {
    errorThreshold: number;
    latencyThreshold: number;
    qualityThreshold: number;
  };
}

class CanaryRouter {
  private config: CanaryConfig;
  private metrics = {
    new: { requests: 0, errors: 0, totalLatency: 0, qualitySum: 0 },
    old: { requests: 0, errors: 0, totalLatency: 0, qualitySum: 0 },
  };

  constructor(config: CanaryConfig) {
    this.config = config;
  }

  selectVersion(): 'new' | 'old' {
    return Math.random() * 100 < this.config.percentage ? 'new' : 'old';
  }

  recordResult(version: 'new' | 'old', result: {
    success: boolean;
    latencyMs: number;
    qualityScore: number;
  }): void {
    const m = this.metrics[version];
    m.requests++;
    if (!result.success) m.errors++;
    m.totalLatency += result.latencyMs;
    m.qualitySum += result.qualityScore;

    // Check if canary should be aborted
    if (version === 'new' && this.shouldAbort()) {
      this.rollback();
    }
  }

  private shouldAbort(): boolean {
    const m = this.metrics.new;
    if (m.requests < 10) return false;  // Need minimum sample

    const errorRate = m.errors / m.requests;
    const avgLatency = m.totalLatency / m.requests;
    const avgQuality = m.qualitySum / m.requests;

    return (
      errorRate > this.config.metrics.errorThreshold ||
      avgLatency > this.config.metrics.latencyThreshold ||
      avgQuality < this.config.metrics.qualityThreshold
    );
  }

  private rollback(): void {
    console.error('Canary aborted: metrics exceeded thresholds');
    this.config.percentage = 0;
    // Notify on-call
  }
}

Coverage Metrics for Agents

Track behavioral coverage:

// lib/agent-coverage.ts
interface BehaviorCoverage {
  scenarios: Map<string, {
    tested: boolean;
    lastTested: Date;
    passRate: number;
  }>;
  tools: Map<string, {
    called: boolean;
    callCount: number;
  }>;
  errorPaths: Map<string, boolean>;
}

const coverage: BehaviorCoverage = {
  scenarios: new Map(),
  tools: new Map(),
  errorPaths: new Map(),
};

// Define expected scenarios
const EXPECTED_SCENARIOS = [
  'happy-path-simple-request',
  'multi-turn-conversation',
  'error-recovery',
  'rate-limit-handling',
  'invalid-input-rejection',
  'tool-call-validation-failure',
  'context-overflow-handling',
];

// Track during tests
export function trackScenario(name: string, passed: boolean): void {
  const existing = coverage.scenarios.get(name) || {
    tested: false,
    lastTested: new Date(0),
    passRate: 0,
  };

  coverage.scenarios.set(name, {
    tested: true,
    lastTested: new Date(),
    passRate: (existing.passRate * 0.9) + (passed ? 0.1 : 0),  // EMA
  });
}

export function trackToolCall(toolName: string): void {
  const existing = coverage.tools.get(toolName) || { called: false, callCount: 0 };
  coverage.tools.set(toolName, {
    called: true,
    callCount: existing.callCount + 1,
  });
}

export function getCoverageReport(): {
  scenarioCoverage: number;
  toolCoverage: number;
  untestedScenarios: string[];
  unusedTools: string[];
} {
  const testedScenarios = [...coverage.scenarios.values()].filter(s => s.tested).length;
  const scenarioCoverage = testedScenarios / EXPECTED_SCENARIOS.length;

  const expectedTools = ['read_file', 'write_file', 'search', 'execute_command'];
  const calledTools = [...coverage.tools.entries()].filter(([_, v]) => v.called).length;
  const toolCoverage = calledTools / expectedTools.length;

  return {
    scenarioCoverage,
    toolCoverage,
    untestedScenarios: EXPECTED_SCENARIOS.filter(s => !coverage.scenarios.get(s)?.tested),
    unusedTools: expectedTools.filter(t => !coverage.tools.get(t)?.called),
  };
}
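In CI, the report can gate the build. A sketch of a threshold check over `getCoverageReport()`'s shape (the 80%/75% minimums are illustrative):

```typescript
// Fail the build when behavioral coverage drops below minimums.
// Shape mirrors getCoverageReport(); thresholds are illustrative.
interface CoverageReport {
  scenarioCoverage: number;
  toolCoverage: number;
  untestedScenarios: string[];
  unusedTools: string[];
}

export function assertCoverage(
  report: CoverageReport,
  min = { scenario: 0.8, tool: 0.75 }
): void {
  const problems: string[] = [];
  if (report.scenarioCoverage < min.scenario) {
    problems.push(
      `scenario coverage ${(report.scenarioCoverage * 100).toFixed(0)}%` +
        ` (untested: ${report.untestedScenarios.join(', ')})`
    );
  }
  if (report.toolCoverage < min.tool) {
    problems.push(`tool coverage ${(report.toolCoverage * 100).toFixed(0)}%`);
  }
  if (problems.length > 0) {
    throw new Error(`Coverage below threshold: ${problems.join('; ')}`);
  }
}
```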

A/B Testing Prompts in CI

Compare prompt variants:

# .github/workflows/prompt-ab-test.yml
name: Prompt A/B Test

on:
  workflow_dispatch:
    inputs:
      prompt_a:
        description: 'Path to prompt variant A'
        required: true
      prompt_b:
        description: 'Path to prompt variant B'
        required: true
      sample_size:
        description: 'Number of test cases per variant'
        default: '50'

jobs:
  ab-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run A/B Test
        run: |
          bun scripts/ab-test.ts \
            --prompt-a ${{ github.event.inputs.prompt_a }} \
            --prompt-b ${{ github.event.inputs.prompt_b }} \
            --sample-size ${{ github.event.inputs.sample_size }} \
            --output ab-results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Analyze Results
        run: |
          bun scripts/analyze-ab.ts ab-results.json --format markdown > ab-report.md
          cat ab-report.md >> $GITHUB_STEP_SUMMARY

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: ab-test-results
          path: |
            ab-results.json
            ab-report.md
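`analyze-ab.ts` is not shown either. A reasonable core is a mean-score comparison with a minimum-sample guard; the per-case score array is an assumption about what `ab-test.ts` writes, and proper significance testing is deliberately left out of this sketch.

```typescript
// Compare two prompt variants by mean quality score, refusing to call
// a winner on small samples. Score shape is assumed (one value per case).
interface VariantResults {
  name: string;
  scores: number[];
}

export function compareVariants(
  a: VariantResults,
  b: VariantResults,
  minSamples = 30
): { winner: string | null; meanA: number; meanB: number } {
  const mean = (xs: number[]) =>
    xs.length > 0 ? xs.reduce((s, x) => s + x, 0) / xs.length : 0;
  const meanA = mean(a.scores);
  const meanB = mean(b.scores);
  if (a.scores.length < minSamples || b.scores.length < minSamples) {
    return { winner: null, meanA, meanB };  // not enough evidence
  }
  return { winner: meanA >= meanB ? a.name : b.name, meanA, meanB };
}
```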

Common Pitfalls

Pitfall 1: Running Expensive Tests on Every PR

# Wrong: Full agent suite on every PR
on:
  pull_request:

jobs:
  test:
    steps:
      - run: bun test agents/  # $2-5 per run!

# Correct: Tiered approach
on:
  pull_request:

jobs:
  quick-tests:
    steps:
      - run: bun test agents/ --cached-only  # Free

  full-tests:
    if: contains(github.event.pull_request.labels.*.name, 'run-full-tests')
    steps:
      - run: bun test agents/

Pitfall 2: No Response Caching

// Wrong: Every test run hits the API
test('agent handles input', async () => {
  const result = await agent.run(input);  // API call
  expect(result).toBeDefined();
});

// Correct: Cache for fast iteration
test('agent handles input', async () => {
  const { response, cached } = await runWithCache(input, systemPrompt, agent.run);
  expect(response).toBeDefined();
  console.log(`Cache ${cached ? 'hit' : 'miss'}`);
});

Pitfall 3: Missing Timeouts

# Wrong: No timeout
jobs:
  agent-test:
    steps:
      - run: bun test agents/  # Could run forever

# Correct: Explicit timeouts at multiple levels
jobs:
  agent-test:
    timeout-minutes: 30
    steps:
      - run: bun test agents/ --timeout 60000  # 60s per test
        timeout-minutes: 25  # Step timeout

Pitfall 4: Not Tracking Costs

// Wrong: No cost visibility
const response = await anthropic.messages.create({ ... });

// Correct: Track every call
const response = await anthropic.messages.create({ ... });
costTracker.track(
  'code-review',
  response.model,
  response.usage.input_tokens,
  response.usage.output_tokens
);

Pitfall 5: Ignoring Non-Determinism

// Wrong: Exact match assertions
test('agent returns expected output', async () => {
  const result = await agent.run(input);
  expect(result.text).toBe('Expected exact response');  // Will flake!
});

// Correct: Structural and behavioral assertions
test('agent returns valid output', async () => {
  const result = await agent.run(input);
  expect(result.text.length).toBeGreaterThan(50);
  expect(result.text).toContain('code review');  // Key concept present
  expect(result.suggestions).toBeInstanceOf(Array);
});

Pitfall 6: No Baseline Comparisons

# Wrong: Only test current version
jobs:
  test:
    steps:
      - run: bun test agents/

# Correct: Compare against baseline
jobs:
  test:
    steps:
      - name: Run baseline
        run: |
          git checkout main
          bun test agents/ --output baseline.json
          git checkout -

      - name: Run current
        run: bun test agents/ --output current.json

      - name: Compare
        run: bun scripts/compare.ts baseline.json current.json

Benefits

1. Cost Control

Budget enforcement prevents surprise bills:

Without cost control:
- Nightly run enters an infinite loop
- 10,000 API calls overnight
- $500 surprise bill

With cost control:
- Budget of $5 per run
- Stops at limit with warning
- Maximum predictable spend

2. Fast Feedback

Caching enables quick PR checks:

Without caching:
- Every PR runs full agent suite
- 10 minutes, $2 per PR
- 20 PRs/day = 200 minutes, $40

With caching:
- PR runs cached tests only
- 30 seconds, $0 per PR
- Nightly run refreshes cache
- Daily cost: $5 for nightly only

3. Regression Detection

Behavioral tests catch quality degradation:

Without regression tests:
- Prompt change deployed
- Quality drops 30%
- Users complain
- Days to detect and fix

With regression tests:
- Prompt change fails CI
- Quality drop detected immediately
- Fix before merge
- No user impact

4. Continuous Progress

Nightly RALPH keeps projects moving:

Without automation:
- Human runs RALPH manually
- Inconsistent progress
- Weekends = zero progress

With nightly runs:
- 8 hours of work every night
- 56 hours/week of agent time
- Consistent, predictable progress
