Summary
CI/CD for AI agents requires different patterns than traditional software. Agents are non-deterministic and expensive to run, and they need behavioral verification beyond unit tests. This article covers GitHub Actions patterns for running nightly RALPH loops, implementing verification gates before deployment, caching agent responses for speed, managing costs in CI, and building quality gates that catch regressions before they reach production.
The Problem
Traditional CI/CD assumes deterministic tests with fast feedback. AI agents break these assumptions.
Non-Determinism
The same prompt can produce different outputs:
# Traditional test: deterministic
test:
runs-on: ubuntu-latest
steps:
- run: npm test # Always same result
# Agent test: non-deterministic
test-agent:
runs-on: ubuntu-latest
steps:
- run: bun test agents/ # Different output each run!
A test that passed yesterday might fail today because the model produced a different (but valid) response.
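One mitigation, which the patterns later in this article build on, is to treat agent tests statistically: run each scenario several times and assert a minimum pass rate instead of a single pass/fail. A minimal sketch (the `runScenario` helper in the usage comment is hypothetical):

// tests/helpers/pass-rate.ts (sketch; assumes scenarios report pass/fail as a boolean)
export async function passRate(
  runs: number,
  scenario: () => Promise<boolean>
): Promise<number> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (await scenario()) passed++;
  }
  return passed / runs;
}

// Usage in a test: require 4 of 5 runs to pass rather than all-or-nothing
// expect(await passRate(5, () => runScenario('create-pr'))).toBeGreaterThanOrEqual(0.8);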
Cost Constraints
Running full agent flows is expensive:
Single agent task:
- Input tokens: ~5,000
- Output tokens: ~2,000
- Cost: $0.015-0.05
Full test suite (50 scenarios):
- Cost per run: $0.75-2.50
- Runs per day (on every PR): 20+
- Daily cost: $15-50
- Monthly: $450-1,500
Running agents on every PR burns budget without proportional value.
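These estimates are worth encoding in a script so they stay in sync when pricing or test volume changes. A quick sketch using the Sonnet-class rates the numbers above assume:

// scripts/estimate-ci-cost.ts (illustrative; rates are assumptions, USD per 1M tokens)
const RATE = { input: 3.0, output: 15.0 };

function taskCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens * RATE.input + outputTokens * RATE.output) / 1_000_000;
}

const perTask = taskCost(5_000, 2_000); // $0.045
const perSuite = perTask * 50;          // $2.25 for 50 scenarios
const perDay = perSuite * 20;           // $45 at 20 PR runs/day
console.log({ perTask, perSuite, perDay, perMonth: perDay * 30 });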
Behavioral Verification
Unit tests catch structural errors. Agents need behavioral verification:
// Unit test: does it compile?
test('parser handles JSON', () => {
const result = parseJSON('{"key": "value"}');
expect(result.key).toBe('value');
});
// Behavioral test: does the agent accomplish the task?
test('agent creates valid PR', async () => {
const result = await codeReviewAgent.run({
diff: sampleDiff,
context: codebaseContext,
});
// How do you verify "good feedback"?
expect(result.comments.length).toBeGreaterThan(0); // Insufficient
expect(result.approves).toBeDefined(); // Also insufficient
});
You need to verify that the agent's behavior meets quality standards, not just that it ran.
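One partial answer is scoring outputs against an explicit rubric with cheap heuristics (a second model acting as judge is the heavier alternative). A sketch, assuming the comment shape shown above:

// lib/review-quality.ts (illustrative heuristics, not a full judge)
interface ReviewComment {
  file: string;
  line: number;
  issue: string;
}

// Proxy for "good feedback": comments that are anchored to a location
// and substantive enough to act on
export function scoreReview(comments: ReviewComment[]): number {
  if (comments.length === 0) return 0;
  const substantive = comments.filter(
    c => c.file.length > 0 && c.line > 0 && c.issue.trim().length > 20
  ).length;
  return substantive / comments.length; // 0..1, gate on a threshold in CI
}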
Regression Detection
Prompt changes can silently degrade performance:
- You are a helpful code reviewer. Provide specific, actionable feedback.
+ You are a thorough code reviewer. Analyze the code carefully.
This change might reduce feedback quality by 40% with no test failures.
The Solution
Build CI/CD pipelines that account for agent-specific challenges.
Core Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Agent CI/CD Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ PR Triggers │ │ Scheduled Runs │ │
│ │ (lightweight) │ │ (comprehensive) │ │
│ └────────┬────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Gate Selection │ │
│ │ PR: Fast gates only (cached, cheap) │ │
│ │ Nightly: Full verification suite │ │
│ └─────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────────┐ │
│ │ Type Checks │ │ Full Agent Run │ │
│ │ Lint/Format │ │ RALPH Loop │ │
│ │ Cached Tests │ │ Behavioral Tests │ │
│ └───────────────┘ └───────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Results & Metrics │ │
│ │ Cost tracking, quality scores, diffs │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Patterns
Pattern 1: Tiered Verification
Run different gates based on trigger type:
# .github/workflows/agent-ci.yml
name: Agent CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
schedule:
# Nightly at 2 AM UTC
- cron: '0 2 * * *'
workflow_dispatch:
inputs:
run_full_suite:
description: 'Run full agent verification'
type: boolean
default: false
jobs:
# Always run: fast, cheap, deterministic
static-analysis:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- run: npm ci
- name: Type Check
run: npx tsc --noEmit
- name: Lint Prompts
run: |
          # Fail if common prompt anti-pattern words appear
          ! grep -rE "delve|crucial|leverage" prompts/
      - name: Validate Schemas
        run: npm run validate:schemas # project script that checks zod schemas parse
# PR: cached agent tests only
cached-tests:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
needs: static-analysis
steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Restore Response Cache
uses: actions/cache@v4
with:
path: .agent-cache/
key: agent-responses-${{ hashFiles('prompts/**') }}
restore-keys: |
agent-responses-
- name: Run Cached Agent Tests
run: |
# Only runs tests that have cached responses
bun test --cached-only agents/
env:
AGENT_CACHE_DIR: .agent-cache
# Nightly/manual: full verification
full-verification:
runs-on: ubuntu-latest
if: github.event_name == 'schedule' || github.event.inputs.run_full_suite == 'true'
needs: static-analysis
steps:
- uses: actions/checkout@v4
- name: Setup
        uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Run Full Agent Suite
run: bun test agents/
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
AGENT_COST_LIMIT: '5.00' # $5 max per run
timeout-minutes: 60
- name: Cache New Responses
uses: actions/cache/save@v4
with:
path: .agent-cache/
key: agent-responses-${{ hashFiles('prompts/**') }}-${{ github.run_id }}
Pattern 2: Nightly RALPH Loop
Run continuous development overnight:
# .github/workflows/nightly-ralph.yml
name: Nightly RALPH Loop
on:
schedule:
- cron: '0 22 * * *' # 10 PM UTC
workflow_dispatch:
inputs:
max_hours:
description: 'Maximum runtime in hours'
type: number
default: 8
focus_area:
description: 'Focus area (e.g., chapter, feature)'
type: string
default: ''
jobs:
ralph-loop:
    runs-on: ubuntu-latest
    # GitHub-hosted runners cap a job at 6 hours; runs longer than that
    # need a self-hosted runner
    timeout-minutes: 540 # 9 hours max
steps:
- uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
fetch-depth: 0
- name: Configure Git
run: |
git config user.name "RALPH Bot"
git config user.email "[email protected]"
- name: Setup Environment
uses: oven-sh/setup-bun@v1
- run: bun install
- name: Run RALPH Loop
run: |
./scripts/ralph.sh \
--max-hours ${{ github.event.inputs.max_hours || 8 }} \
--review-every 6
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
      # No separate push step: create-pull-request pushes the workflow's
      # commits to the PR branch itself, so main stays protected
- name: Create Summary PR
if: success()
uses: peter-evans/create-pull-request@v5
with:
title: "[RALPH] Nightly progress - ${{ github.run_number }}"
body: |
## Nightly RALPH Run Summary
**Duration**: ${{ github.event.inputs.max_hours || 8 }} hours
**Focus**: ${{ github.event.inputs.focus_area || 'General' }}
See workflow logs for details.
branch: ralph/nightly-${{ github.run_number }}
base: main
- name: Notify on Failure
if: failure()
run: |
curl -X POST $SLACK_WEBHOOK \
-H 'Content-Type: application/json' \
-d '{"text": "RALPH nightly run failed. Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'
env:
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
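`scripts/ralph.sh` is project-specific and not shown here; the control flow it implements is a deadline loop with a periodic review pass. A sketch of that shape in TypeScript, where `runIteration` and `runReview` are stubs standing in for your agent invocations:

// scripts/ralph.ts (sketch of the loop shape; the two helpers are stubs)
async function runIteration(n: number): Promise<void> {
  // Invoke the agent for one plan/act/verify cycle
  console.log(`iteration ${n}`);
}

async function runReview(n: number): Promise<void> {
  // Periodic self-review: re-read goals, prune dead ends
  console.log(`review at iteration ${n}`);
}

export async function ralphLoop(maxHours: number, reviewEvery: number): Promise<void> {
  const deadline = Date.now() + maxHours * 60 * 60 * 1000;
  let iteration = 0;
  while (Date.now() < deadline) {
    iteration++;
    await runIteration(iteration);
    if (iteration % reviewEvery === 0) await runReview(iteration);
  }
  console.log(`Stopped after ${iteration} iterations (deadline reached)`);
}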
Pattern 3: Response Caching for Speed
Cache agent responses to enable fast PR checks:
// lib/agent-cache.ts
import { createHash } from 'crypto';
import { readFile, writeFile, mkdir } from 'fs/promises';
import { existsSync } from 'fs';
interface CachedResponse {
prompt: string;
promptHash: string;
response: string;
model: string;
timestamp: string;
tokens: { input: number; output: number };
}
const CACHE_DIR = process.env.AGENT_CACHE_DIR || '.agent-cache';
function hashPrompt(prompt: string, systemPrompt?: string): string {
const content = `${systemPrompt || ''}|||${prompt}`;
return createHash('sha256').update(content).digest('hex').slice(0, 16);
}
export async function getCachedResponse(
prompt: string,
systemPrompt?: string
): Promise<string | null> {
const hash = hashPrompt(prompt, systemPrompt);
const cachePath = `${CACHE_DIR}/${hash}.json`;
if (!existsSync(cachePath)) {
return null;
}
try {
const cached: CachedResponse = JSON.parse(await readFile(cachePath, 'utf-8'));
return cached.response;
} catch {
return null;
}
}
export async function cacheResponse(
prompt: string,
response: string,
systemPrompt?: string,
metadata?: { model: string; tokens: { input: number; output: number } }
): Promise<void> {
const hash = hashPrompt(prompt, systemPrompt);
if (!existsSync(CACHE_DIR)) {
await mkdir(CACHE_DIR, { recursive: true });
}
const cached: CachedResponse = {
prompt,
promptHash: hash,
response,
model: metadata?.model || 'unknown',
timestamp: new Date().toISOString(),
tokens: metadata?.tokens || { input: 0, output: 0 },
};
await writeFile(`${CACHE_DIR}/${hash}.json`, JSON.stringify(cached, null, 2));
}
// Integration with agent runner
export async function runWithCache(
prompt: string,
systemPrompt: string,
runner: (prompt: string, system: string) => Promise<string>
): Promise<{ response: string; cached: boolean }> {
// Check cache first
const cached = await getCachedResponse(prompt, systemPrompt);
if (cached) {
return { response: cached, cached: true };
}
// Run agent
const response = await runner(prompt, systemPrompt);
// Cache for future runs
await cacheResponse(prompt, response, systemPrompt);
return { response, cached: false };
}
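The `--cached-only` flag used in the PR job is assumed to be project tooling, not a built-in bun flag. One way to get that behavior is a wrapper that consults the cache and skips on a miss when an environment variable (here, the hypothetical AGENT_CACHED_ONLY) is set:

// lib/run-cached-only.ts (sketch building on the cache module above)
import { getCachedResponse } from './agent-cache';

const CACHED_ONLY = process.env.AGENT_CACHED_ONLY === '1';

// Returns null on a cache miss in cached-only mode, so tests can skip
// instead of spending API budget on a PR run
export async function runMaybeCached(
  prompt: string,
  systemPrompt: string,
  runner: (prompt: string, system: string) => Promise<string>
): Promise<string | null> {
  const cached = await getCachedResponse(prompt, systemPrompt);
  if (cached !== null) return cached;
  if (CACHED_ONLY) return null;
  return runner(prompt, systemPrompt);
}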
Pattern 4: Cost Budget Enforcement
Prevent runaway costs in CI:
# .github/workflows/cost-controlled-tests.yml
name: Cost-Controlled Agent Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'agents/**'
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'
jobs:
agent-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Calculate Cost Budget
id: budget
run: |
# PRs get $2, main pushes get $10, nightly gets $50
if [ "${{ github.event_name }}" == "pull_request" ]; then
echo "budget=2.00" >> $GITHUB_OUTPUT
elif [ "${{ github.event_name }}" == "schedule" ]; then
echo "budget=50.00" >> $GITHUB_OUTPUT
else
echo "budget=10.00" >> $GITHUB_OUTPUT
fi
- name: Run Agent Tests with Budget
run: |
bun run test:agents --budget ${{ steps.budget.outputs.budget }}
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Report Costs
if: always()
run: |
# Read cost log
if [ -f .agent-costs.json ]; then
TOTAL=$(jq '.total' .agent-costs.json)
echo "### Agent Test Costs" >> $GITHUB_STEP_SUMMARY
echo "**Total**: \$$TOTAL" >> $GITHUB_STEP_SUMMARY
echo "**Budget**: \$${{ steps.budget.outputs.budget }}" >> $GITHUB_STEP_SUMMARY
jq -r '.breakdown[] | "- \(.name): $\(.cost)"' .agent-costs.json >> $GITHUB_STEP_SUMMARY
fi
// lib/cost-tracker.ts
interface CostEntry {
name: string;
model: string;
inputTokens: number;
outputTokens: number;
cost: number;
}
class CostTracker {
private entries: CostEntry[] = [];
private budget: number;
// Pricing per 1M tokens (approximate)
private pricing: Record<string, { input: number; output: number }> = {
'claude-3-haiku': { input: 0.25, output: 1.25 },
'claude-sonnet-4-5-20250929': { input: 3.00, output: 15.00 },
'claude-opus-4-5-20251101': { input: 15.00, output: 75.00 },
};
constructor(budget: number) {
this.budget = budget;
}
  track(name: string, model: string, inputTokens: number, outputTokens: number): void {
    // Prefix match so dated model IDs (e.g. claude-3-haiku-20240307) resolve
    const pricing =
      Object.entries(this.pricing).find(([prefix]) => model.startsWith(prefix))?.[1] ??
      this.pricing['claude-sonnet-4-5-20250929'];
    const cost = (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;
this.entries.push({ name, model, inputTokens, outputTokens, cost });
const total = this.getTotal();
if (total > this.budget) {
throw new Error(
`Cost budget exceeded: $${total.toFixed(2)} > $${this.budget.toFixed(2)}. ` +
`Stopping to prevent overspend.`
);
}
}
getTotal(): number {
return this.entries.reduce((sum, e) => sum + e.cost, 0);
}
getReport(): { total: number; breakdown: CostEntry[] } {
return {
total: this.getTotal(),
breakdown: this.entries,
};
}
}
// Global tracker
export const costTracker = new CostTracker(
parseFloat(process.env.AGENT_COST_LIMIT || '10.00')
);
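The jq step in the workflow expects `.agent-costs.json` on disk, so the tracker needs to be flushed before the test process exits. A sketch of that hook:

// lib/cost-report.ts (sketch: persist the tracker for the CI summary step)
import { writeFileSync } from 'fs';
import { costTracker } from './cost-tracker';

process.on('beforeExit', () => {
  const report = costTracker.getReport();
  writeFileSync(
    '.agent-costs.json',
    JSON.stringify(
      {
        total: Number(report.total.toFixed(4)),
        breakdown: report.breakdown.map(e => ({
          name: e.name,
          cost: Number(e.cost.toFixed(4)),
        })),
      },
      null,
      2
    )
  );
});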
Pattern 5: Behavioral Regression Tests
Detect quality degradation in agent outputs:
// tests/agent-regression.test.ts
import { describe, test, expect } from 'bun:test';
import { codeReviewAgent } from '../agents/code-review';
import { loadGoldenSet } from './fixtures/golden-set';
describe('Code Review Agent Regression', () => {
const goldenSet = loadGoldenSet('code-review');
for (const golden of goldenSet) {
test(`${golden.name}: maintains quality baseline`, async () => {
const result = await codeReviewAgent.run({
diff: golden.input.diff,
context: golden.input.context,
});
// Structural checks (fast, deterministic)
expect(result.comments).toBeDefined();
expect(result.comments.length).toBeGreaterThan(0);
// Quality metrics (may vary but should meet thresholds)
      const metrics = evaluateReview(result, golden.expectedIssues, golden.input.diff);
// Precision: of the issues found, how many are real?
expect(metrics.precision).toBeGreaterThan(0.7);
// Recall: of the known issues, how many were found?
expect(metrics.recall).toBeGreaterThan(0.6);
// No hallucinated file paths
expect(metrics.validPaths).toBe(true);
// Comments reference actual line numbers
expect(metrics.validLineNumbers).toBe(true);
});
}
});
interface QualityMetrics {
precision: number;
recall: number;
validPaths: boolean;
validLineNumbers: boolean;
}
function evaluateReview(
  result: { comments: Array<{ file: string; line: number; issue: string }> },
  expectedIssues: Array<{ file: string; line: number; type: string }>,
  diff: string
): QualityMetrics {
let truePositives = 0;
let falsePositives = 0;
for (const comment of result.comments) {
const matchesExpected = expectedIssues.some(
exp => exp.file === comment.file &&
Math.abs(exp.line - comment.line) <= 3 // Within 3 lines
);
if (matchesExpected) {
truePositives++;
} else {
falsePositives++;
}
}
const precision = result.comments.length > 0
? truePositives / result.comments.length
: 0;
const recall = expectedIssues.length > 0
? truePositives / expectedIssues.length
: 1;
  // Validate that every commented path actually appears in the diff
  const validPaths = result.comments.every(c => diff.includes(c.file));
const validLineNumbers = result.comments.every(c =>
c.line > 0 && c.line < 10000
);
return { precision, recall, validPaths, validLineNumbers };
}
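`loadGoldenSet` is assumed above; the fixtures only need the inputs plus the issues a correct review is known to find. A minimal sketch of the loader and its shape:

// tests/fixtures/golden-set.ts (sketch of the assumed fixture format)
import { readFileSync } from 'fs';

export interface GoldenCase {
  name: string;
  input: { diff: string; context: string };
  expectedIssues: Array<{ file: string; line: number; type: string }>;
}

// One hand-curated JSON file per suite, built from real PRs with known issues
export function loadGoldenSet(suite: string): GoldenCase[] {
  const raw = readFileSync(`tests/fixtures/${suite}.golden.json`, 'utf-8');
  return JSON.parse(raw) as GoldenCase[];
}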
Pattern 6: Multi-Model Testing
Test agents against multiple models:
# .github/workflows/multi-model-test.yml
name: Multi-Model Agent Tests
on:
schedule:
- cron: '0 4 * * 0' # Weekly on Sunday
workflow_dispatch:
jobs:
test-matrix:
runs-on: ubuntu-latest
strategy:
matrix:
model:
- claude-3-haiku-20240307
- claude-sonnet-4-5-20250929
- claude-opus-4-5-20251101
fail-fast: false
steps:
- uses: actions/checkout@v4
- name: Setup
        uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Run Tests with ${{ matrix.model }}
run: |
bun test agents/ --model ${{ matrix.model }}
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
TEST_MODEL: ${{ matrix.model }}
continue-on-error: true
- name: Upload Results
uses: actions/upload-artifact@v4
with:
name: results-${{ matrix.model }}
path: test-results/
compare-results:
needs: test-matrix
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Download All Results
uses: actions/download-artifact@v4
with:
path: all-results/
- name: Generate Comparison Report
run: |
bun scripts/compare-model-results.ts all-results/
- name: Post Summary
run: |
cat model-comparison.md >> $GITHUB_STEP_SUMMARY
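`scripts/compare-model-results.ts` is referenced but not shown. A sketch that aggregates one results.json per model directory into the markdown report the workflow posts (the results-file shape is an assumption):

// scripts/compare-model-results.ts (sketch; assumes { passed, failed } per model)
import { readdirSync, readFileSync, writeFileSync } from 'fs';
import { join } from 'path';

const root = process.argv[2] ?? 'all-results';
const rows = ['| Model | Passed | Failed | Pass rate |', '|---|---|---|---|'];

for (const dir of readdirSync(root)) {
  const results = JSON.parse(readFileSync(join(root, dir, 'results.json'), 'utf-8'));
  const total = Math.max(1, results.passed + results.failed);
  const rate = ((results.passed / total) * 100).toFixed(1);
  rows.push(`| ${dir.replace('results-', '')} | ${results.passed} | ${results.failed} | ${rate}% |`);
}

writeFileSync('model-comparison.md', rows.join('\n') + '\n');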
Pattern 7: Prompt Regression Detection
Detect when prompt changes degrade quality:
# .github/workflows/prompt-regression.yml
name: Prompt Regression Check
on:
pull_request:
paths:
- 'prompts/**'
- 'agents/**/system-prompt.md'
jobs:
check-regression:
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Get Changed Prompts
id: changed
run: |
          CHANGED=$(git diff --name-only ${{ github.event.pull_request.base.sha }} -- 'prompts/**' 'agents/**/system-prompt.md')
          # Use a heredoc so multi-line file lists survive GITHUB_OUTPUT
          {
            echo "files<<EOF"
            echo "$CHANGED"
            echo "EOF"
          } >> $GITHUB_OUTPUT
- name: Run Baseline (main branch)
run: |
git checkout ${{ github.event.pull_request.base.sha }}
bun test:prompts --output baseline-results.json
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Run Current (PR branch)
run: |
git checkout ${{ github.sha }}
bun test:prompts --output current-results.json
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Compare Results
id: compare
run: |
bun scripts/compare-prompt-results.ts \
baseline-results.json \
current-results.json \
--threshold 0.1 # 10% degradation threshold
- name: Comment on PR
if: steps.compare.outputs.degradation == 'true'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const report = fs.readFileSync('regression-report.md', 'utf8');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: report
});
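A sketch of the comparison script: compute the relative quality drop and expose a `degradation` output for the gating step (the meanScore field is an assumed results shape):

// scripts/compare-prompt-results.ts (sketch; threshold matches the workflow flag)
import { appendFileSync, readFileSync, writeFileSync } from 'fs';

const [baselinePath, currentPath] = process.argv.slice(2);
const THRESHOLD = 0.1; // 10% relative degradation

const baseline = JSON.parse(readFileSync(baselinePath, 'utf-8')).meanScore as number;
const current = JSON.parse(readFileSync(currentPath, 'utf-8')).meanScore as number;
const drop = (baseline - current) / baseline;
const degraded = drop > THRESHOLD;

if (degraded) {
  writeFileSync(
    'regression-report.md',
    `## Prompt regression detected\n\nBaseline score ${baseline.toFixed(3)}, ` +
      `current ${current.toFixed(3)} (${(drop * 100).toFixed(1)}% drop).\n`
  );
}
// Surface the verdict to later steps via GITHUB_OUTPUT
appendFileSync(process.env.GITHUB_OUTPUT ?? '/dev/null', `degradation=${degraded}\n`);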
Advanced Patterns
Canary Deployments for Agents
Roll out agent changes gradually:
// lib/canary-router.ts
interface CanaryConfig {
newVersion: string;
oldVersion: string;
percentage: number; // 0-100
metrics: {
errorThreshold: number;
latencyThreshold: number;
qualityThreshold: number;
};
}
class CanaryRouter {
private config: CanaryConfig;
private metrics = {
new: { requests: 0, errors: 0, totalLatency: 0, qualitySum: 0 },
old: { requests: 0, errors: 0, totalLatency: 0, qualitySum: 0 },
};
constructor(config: CanaryConfig) {
this.config = config;
}
selectVersion(): 'new' | 'old' {
return Math.random() * 100 < this.config.percentage ? 'new' : 'old';
}
recordResult(version: 'new' | 'old', result: {
success: boolean;
latencyMs: number;
qualityScore: number;
}): void {
const m = this.metrics[version];
m.requests++;
if (!result.success) m.errors++;
m.totalLatency += result.latencyMs;
m.qualitySum += result.qualityScore;
// Check if canary should be aborted
if (version === 'new' && this.shouldAbort()) {
this.rollback();
}
}
private shouldAbort(): boolean {
const m = this.metrics.new;
if (m.requests < 10) return false; // Need minimum sample
const errorRate = m.errors / m.requests;
const avgLatency = m.totalLatency / m.requests;
const avgQuality = m.qualitySum / m.requests;
return (
errorRate > this.config.metrics.errorThreshold ||
avgLatency > this.config.metrics.latencyThreshold ||
avgQuality < this.config.metrics.qualityThreshold
);
}
private rollback(): void {
console.error('Canary aborted: metrics exceeded thresholds');
this.config.percentage = 0;
// Notify on-call
}
}
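Wiring it up looks like this (the thresholds are illustrative):

// Illustrative usage: send 10% of traffic to the new prompt version
const router = new CanaryRouter({
  newVersion: 'v2',
  oldVersion: 'v1',
  percentage: 10,
  metrics: { errorThreshold: 0.05, latencyThreshold: 30_000, qualityThreshold: 0.7 },
});

const version = router.selectVersion();
// ...run the selected agent version, then report what happened:
router.recordResult(version, { success: true, latencyMs: 4_200, qualityScore: 0.82 });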
Coverage Metrics for Agents
Track behavioral coverage:
// lib/agent-coverage.ts
interface BehaviorCoverage {
scenarios: Map<string, {
tested: boolean;
lastTested: Date;
passRate: number;
}>;
tools: Map<string, {
called: boolean;
callCount: number;
}>;
errorPaths: Map<string, boolean>;
}
const coverage: BehaviorCoverage = {
scenarios: new Map(),
tools: new Map(),
errorPaths: new Map(),
};
// Define expected scenarios
const EXPECTED_SCENARIOS = [
'happy-path-simple-request',
'multi-turn-conversation',
'error-recovery',
'rate-limit-handling',
'invalid-input-rejection',
'tool-call-validation-failure',
'context-overflow-handling',
];
// Track during tests
export function trackScenario(name: string, passed: boolean): void {
const existing = coverage.scenarios.get(name) || {
tested: false,
lastTested: new Date(0),
passRate: 0,
};
coverage.scenarios.set(name, {
tested: true,
lastTested: new Date(),
passRate: (existing.passRate * 0.9) + (passed ? 0.1 : 0), // EMA
});
}
export function trackToolCall(toolName: string): void {
const existing = coverage.tools.get(toolName) || { called: false, callCount: 0 };
coverage.tools.set(toolName, {
called: true,
callCount: existing.callCount + 1,
});
}
export function getCoverageReport(): {
scenarioCoverage: number;
toolCoverage: number;
untestedScenarios: string[];
unusedTools: string[];
} {
  const testedScenarios = EXPECTED_SCENARIOS.filter(s => coverage.scenarios.get(s)?.tested).length;
  const scenarioCoverage = testedScenarios / EXPECTED_SCENARIOS.length;
  const expectedTools = ['read_file', 'write_file', 'search', 'execute_command'];
  // Count only expected names so coverage stays within [0, 1]
  const calledTools = expectedTools.filter(t => coverage.tools.get(t)?.called).length;
  const toolCoverage = calledTools / expectedTools.length;
return {
scenarioCoverage,
toolCoverage,
untestedScenarios: EXPECTED_SCENARIOS.filter(s => !coverage.scenarios.get(s)?.tested),
unusedTools: expectedTools.filter(t => !coverage.tools.get(t)?.called),
};
}
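In CI the report can gate merges the same way line coverage does; a sketch with assumed thresholds:

// scripts/check-coverage.ts (sketch: fail the job below assumed thresholds)
import { getCoverageReport } from '../lib/agent-coverage';

const report = getCoverageReport();
if (report.scenarioCoverage < 0.8 || report.toolCoverage < 0.75) {
  console.error('Behavioral coverage below threshold');
  console.error('Untested scenarios:', report.untestedScenarios);
  console.error('Unused tools:', report.unusedTools);
  process.exit(1);
}
console.log(
  `Scenario coverage ${(report.scenarioCoverage * 100).toFixed(0)}%, ` +
    `tool coverage ${(report.toolCoverage * 100).toFixed(0)}%`
);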
A/B Testing Prompts in CI
Compare prompt variants:
# .github/workflows/prompt-ab-test.yml
name: Prompt A/B Test
on:
workflow_dispatch:
inputs:
prompt_a:
description: 'Path to prompt variant A'
required: true
prompt_b:
description: 'Path to prompt variant B'
required: true
sample_size:
description: 'Number of test cases per variant'
default: '50'
jobs:
ab-test:
runs-on: ubuntu-latest
steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
- name: Run A/B Test
run: |
bun scripts/ab-test.ts \
--prompt-a ${{ github.event.inputs.prompt_a }} \
--prompt-b ${{ github.event.inputs.prompt_b }} \
--sample-size ${{ github.event.inputs.sample_size }} \
--output ab-results.json
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Analyze Results
run: |
bun scripts/analyze-ab.ts ab-results.json --format markdown > ab-report.md
cat ab-report.md >> $GITHUB_STEP_SUMMARY
- name: Upload Results
uses: actions/upload-artifact@v4
with:
name: ab-test-results
path: |
ab-results.json
ab-report.md
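With ~50 samples per variant, raw win counts can mislead, so the analysis step should include a significance check. A sketch of a two-proportion z-test, one reasonable choice for `scripts/analyze-ab.ts`:

// Two-proportion z-test on pass rates (sketch)
function zTest(passA: number, nA: number, passB: number, nB: number): number {
  const pA = passA / nA;
  const pB = passB / nB;
  const pooled = (passA + passB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return (pA - pB) / se;
}

// |z| > 1.96 corresponds to p < 0.05, two-tailed
const z = zTest(41, 50, 33, 50);
console.log(`z = ${z.toFixed(2)}, significant: ${Math.abs(z) > 1.96}`);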
Common Pitfalls
Pitfall 1: Running Expensive Tests on Every PR
# Wrong: Full agent suite on every PR
on:
pull_request:
jobs:
test:
steps:
- run: bun test agents/ # $2-5 per run!
# Correct: Tiered approach
on:
pull_request:
jobs:
quick-tests:
steps:
- run: bun test agents/ --cached-only # Free
full-tests:
if: contains(github.event.pull_request.labels.*.name, 'run-full-tests')
steps:
- run: bun test agents/
Pitfall 2: No Response Caching
// Wrong: Every test run hits the API
test('agent handles input', async () => {
const result = await agent.run(input); // API call
expect(result).toBeDefined();
});
// Correct: Cache for fast iteration
test('agent handles input', async () => {
  // Wrap agent.run so its `this` binding is preserved
  const { response, cached } = await runWithCache(input, systemPrompt, (p, s) => agent.run(p, s));
expect(response).toBeDefined();
console.log(`Cache ${cached ? 'hit' : 'miss'}`);
});
Pitfall 3: Missing Timeouts
# Wrong: No timeout
jobs:
agent-test:
steps:
- run: bun test agents/ # Could run forever
# Correct: Explicit timeouts at multiple levels
jobs:
agent-test:
timeout-minutes: 30
steps:
- run: bun test agents/ --timeout 60000 # 60s per test
timeout-minutes: 25 # Step timeout
Pitfall 4: Not Tracking Costs
// Wrong: No cost visibility
const response = await anthropic.messages.create({ ... });
// Correct: Track every call
const response = await anthropic.messages.create({ ... });
costTracker.track(
'code-review',
response.model,
response.usage.input_tokens,
response.usage.output_tokens
);
Pitfall 5: Ignoring Non-Determinism
// Wrong: Exact match assertions
test('agent returns expected output', async () => {
const result = await agent.run(input);
expect(result.text).toBe('Expected exact response'); // Will flake!
});
// Correct: Structural and behavioral assertions
test('agent returns valid output', async () => {
const result = await agent.run(input);
expect(result.text.length).toBeGreaterThan(50);
expect(result.text).toContain('code review'); // Key concept present
expect(result.suggestions).toBeInstanceOf(Array);
});
Pitfall 6: No Baseline Comparisons
# Wrong: Only test current version
jobs:
test:
steps:
- run: bun test agents/
# Correct: Compare against baseline
jobs:
test:
steps:
- name: Run baseline
run: |
git checkout main
bun test agents/ --output baseline.json
git checkout -
- name: Run current
run: bun test agents/ --output current.json
- name: Compare
run: bun scripts/compare.ts baseline.json current.json
Benefits
1. Cost Control
Budget enforcement prevents surprise bills:
Without cost control:
- Nightly run gets stuck in an infinite loop
- 10,000 API calls overnight
- $500 surprise bill
With cost control:
- Budget of $5 per run
- Stops at limit with warning
- Maximum predictable spend
2. Fast Feedback
Caching enables quick PR checks:
Without caching:
- Every PR runs full agent suite
- 10 minutes, $2 per PR
- 20 PRs/day = 200 minutes, $40
With caching:
- PR runs cached tests only
- 30 seconds, $0 per PR
- Nightly run refreshes cache
- Daily cost: $5 for nightly only
3. Regression Detection
Behavioral tests catch quality degradation:
Without regression tests:
- Prompt change deployed
- Quality drops 30%
- Users complain
- Days to detect and fix
With regression tests:
- Prompt change fails CI
- Quality drop detected immediately
- Fix before merge
- No user impact
4. Continuous Progress
Nightly RALPH keeps projects moving:
Without automation:
- Human runs RALPH manually
- Inconsistent progress
- Weekends = zero progress
With nightly runs:
- 8 hours of work every night
- 56 hours/week of agent time
- Consistent, predictable progress
Related
- The RALPH Loop – The iteration pattern that nightly CI automates
- Quality Gates That Compound – Gates to enforce in CI
- AI Cost Protection with Timeouts – Budget enforcement patterns
- Agent Reliability Chasm – Why CI testing matters
- Test-Driven Prompting – Defining success criteria for agents
- Tool Call Validation – Validation to test in CI
- Clean Slate Trajectory Recovery – What nightly runs reset