Evaluation Driven Development: Self-Healing Test Loops with AI Vision

James Phoenix

Summary

Evaluation Driven Development (EDD) extends TDD by creating an infinite loop of runtime evaluation and auto-correction. Instead of just asserting values, EDD uses vision models to evaluate real outputs (screenshots, videos, UI renders) with human-like judgment, automatically fixing issues that traditional unit tests can’t catch, such as visual bugs, positioning errors, and cross-device rendering problems.

The Problem

Traditional TDD tests code in isolation with assertions on values, but real-world issues slip through: caption positioning affecting readability, font rendering across devices, color contrast for accessibility, timing synchronization bugs, and visual artifacts. Unit tests validate behavior but can’t assess quality of visual outputs, user experience, or subjective correctness.

The Solution

Create an infinite evaluation loop: (1) Hash source code with SHA256, (2) Run code in real conditions capturing ALL outputs (media, screenshots, API responses), (3) Use AI vision models to evaluate quality with human judgment, (4) When evaluation fails, automatically generate improvements and adjust source, (5) Restart loop. The system self-heals by catching issues unit tests miss and auto-correcting until quality gates pass.

The Problem: TDD’s Blind Spots

Traditional Test-Driven Development (TDD) has served us well for decades:

  1. Write a failing test
  2. Write code to make it pass
  3. Refactor
  4. Repeat

But TDD has fundamental blind spots for modern applications:

What TDD Tests Well

// ✅ TDD excels at testing behavior
test('calculateTotal adds tax correctly', () => {
  expect(calculateTotal(100, 0.1)).toBe(110);
});

test('validateEmail rejects invalid format', () => {
  expect(validateEmail('invalid')).toBe(false);
});

Perfect for:

  • Pure functions with deterministic outputs
  • Business logic validation
  • Data transformations
  • API contracts

What TDD Misses

// ❌ TDD can't test visual quality
test('video captions are readable', () => {
  const output = renderCaptions(video, captions);
  // How do you assert readability?
  // expect(output.???).toBe(???);
});

test('UI layout works on mobile', () => {
  const screenshot = renderMobile(component);
  // How do you assert "looks good"?
  // expect(screenshot.???).toBe(???);
});

Fails at testing:

  • Visual quality: Does it look right?
  • Positioning: Are elements readable/accessible?
  • Cross-device rendering: Does it work on all screen sizes?
  • Color contrast: Is text visible for users with visual impairments?
  • Timing/synchronization: Do animations feel smooth?
  • Subjective correctness: Does this feel right to a human?

Real-World Example: Caption Renderer

You’re building a video caption system:

// Unit test passes ✅
test('renders captions at correct timestamps', () => {
  const output = renderCaptions(video, [
    { time: 0, text: 'Hello world' },
    { time: 5, text: 'Second caption' },
  ]);
  
  expect(output.captions).toHaveLength(2);
  expect(output.captions[0].timestamp).toBe(0);
  expect(output.captions[1].timestamp).toBe(5);
});

Test passes, but in production:

  • Captions overlap with speaker’s face (unreadable)
  • Font too small on mobile devices
  • White text on light background (no contrast)
  • Captions appear too late (sync issue)
  • Special characters render as boxes

Unit test validated behavior but missed quality.

The Solution: Evaluation Driven Development (EDD)

Evaluation Driven Development extends TDD into the realm of qualitative assessment:

Instead of asserting “does it return the right value?”, ask “does a human judge this as correct?”

The EDD Infinite Loop

┌─────────────────────────────────────────────────┐
│ 1. Source Code Analysis                         │
│    Create SHA256 hash of all source files       │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 2. Runtime Execution                            │
│    Run code in real conditions                  │
│    Capture ALL outputs (media, screenshots,     │
│    API responses, logs)                         │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 3. Output Generation                            │
│    Collect actual results:                      │
│    - Video files with captions rendered         │
│    - Screenshots of UI at various breakpoints   │
│    - API response payloads                      │
│    - Performance metrics                        │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 4. AI Evaluation                                │
│    Use vision models (GPT-4V, Gemini Vision,    │
│    Claude Vision) to assess quality with        │
│    human judgment                               │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
           ┌───────┴────────┐
           │  Acceptable?   │
           └───────┬────────┘
                   │
        ┌──────────┼──────────┐
        │ NO       │ YES      │
        ▼          ▼          │
┌──────────────┐ ┌──────────┐ │
│ 5. Auto-     │ │ Success! │ │
│ Correction   │ │ Ship it  │ │
│              │ └──────────┘ │
│ Generate     │              │
│ improvements,│              │
│ apply fixes, │              │
│ restart loop │──────────────┘
└──────────────┘

Key Difference from TDD

Aspect            | TDD                                 | EDD
What it tests     | Behavior (values, types, contracts) | Quality (visual, UX, subjective)
Assertion method  | expect(value).toBe(expected)        | AI evaluates “does this look right?”
Failure handling  | Test fails, developer fixes         | Auto-correction suggests and applies fixes
Scope             | Unit/integration tests              | Full system in real conditions
Catches           | Logic bugs, regressions             | Visual bugs, UX issues, edge cases
Human involvement | Required for every fix              | Only for evaluation criteria

Implementation

Step 1: Set Up Source Code Hashing

Track when code changes to know when to re-evaluate:

import crypto from 'crypto';
import fs from 'fs/promises';
import path from 'path';

interface SourceHash {
  files: Record<string, string>;
  timestamp: string;
}

async function hashSourceFiles(directory: string): Promise<SourceHash> {
  const files: Record<string, string> = {};
  
  // Find all source files
  const sourceFiles = await findFiles(directory, /\.(ts|tsx|js|jsx)$/);
  
  // Hash each file
  for (const file of sourceFiles) {
    const content = await fs.readFile(file, 'utf-8');
    const hash = crypto
      .createHash('sha256')
      .update(content)
      .digest('hex');
    files[file] = hash;
  }
  
  return {
    files,
    timestamp: new Date().toISOString(),
  };
}

async function hasSourceChanged(
  previous: SourceHash,
  current: SourceHash
): Promise<boolean> {
  // A file was added or removed
  if (Object.keys(previous.files).length !== Object.keys(current.files).length) {
    return true;
  }
  // Check if any file hashes differ
  for (const [file, hash] of Object.entries(current.files)) {
    if (previous.files[file] !== hash) {
      return true;
    }
  }
  return false;
}
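
The snippet above calls a findFiles helper that is never defined. Using the fs and path imports already shown, a minimal recursive sketch (skipping concerns like ignoring node_modules for brevity) could look like this:

async function findFiles(dir: string, pattern: RegExp): Promise<string[]> {
  // Recursively walk the directory tree and collect files matching `pattern`
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const results: string[] = [];
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      results.push(...(await findFiles(fullPath, pattern)));
    } else if (pattern.test(entry.name)) {
      results.push(fullPath);
    }
  }
  return results;
}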

Step 2: Run Code and Capture Outputs

Execute your code in real conditions and capture all artifacts:

interface RuntimeOutput {
  screenshots: string[]; // Paths to PNG files
  videos: string[]; // Paths to video files
  apiResponses: Record<string, unknown>[];
  logs: string[];
  metrics: {
    executionTime: number;
    memoryUsage: number;
    errors: string[];
  };
}

async function executeAndCapture(
  testInputs: TestInput[]
): Promise<RuntimeOutput> {
  const output: RuntimeOutput = {
    screenshots: [],
    videos: [],
    apiResponses: [],
    logs: [],
    metrics: {
      executionTime: 0,
      memoryUsage: 0,
      errors: [],
    },
  };
  
  const startTime = Date.now();
  const startMemory = process.memoryUsage().heapUsed;
  
  for (const input of testInputs) {
    try {
      // Example: Run caption renderer
      const result = await renderCaptions(input.video, input.captions);
      
      // Save video output
      const videoPath = `./outputs/video_${Date.now()}.mp4`;
      await fs.writeFile(videoPath, result.videoBuffer);
      output.videos.push(videoPath);
      
      // Capture screenshot at key frames
      const screenshot = await captureFrame(result, input.keyFrame);
      const screenshotPath = `./outputs/screenshot_${Date.now()}.png`;
      await fs.writeFile(screenshotPath, screenshot);
      output.screenshots.push(screenshotPath);
      
      // Store API response if applicable
      if (result.metadata) {
        output.apiResponses.push(result.metadata);
      }
    } catch (error) {
      output.metrics.errors.push(String(error));
    }
  }
  
  output.metrics.executionTime = Date.now() - startTime;
  output.metrics.memoryUsage = process.memoryUsage().heapUsed - startMemory;
  
  return output;
}
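
executeAndCapture also assumes a captureFrame helper. One plausible sketch shells out to ffmpeg (assumptions: ffmpeg is installed and on PATH, and frame numbers are interpreted at 30fps). It takes the saved video path rather than the in-memory render result, so the call above would pass videoPath instead of result:

import { execFile } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';

const execFileAsync = promisify(execFile);

async function captureFrame(
  videoPath: string,
  frameNumber: number,
  fps: number = 30
): Promise<Buffer> {
  const framePath = `./outputs/frame_${Date.now()}.png`;
  const seconds = frameNumber / fps;
  // Seek to the frame's timestamp and export a single PNG frame
  await execFileAsync('ffmpeg', [
    '-ss', seconds.toString(),
    '-i', videoPath,
    '-frames:v', '1',
    '-y', // overwrite if the file already exists
    framePath,
  ]);
  return fs.readFile(framePath);
}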

Step 3: AI Evaluation with Vision Models

Use AI vision models to evaluate quality:

import Anthropic from '@anthropic-ai/sdk';
import { GoogleGenerativeAI } from '@google/generative-ai';

interface EvaluationResult {
  acceptable: boolean;
  score: number; // 0-100
  issues: string[];
  strengths: string[];
  suggestions: string[];
}

async function evaluateWithVision(
  outputs: RuntimeOutput,
  criteria: string
): Promise<EvaluationResult> {
  // Use Gemini Vision for image/video analysis
  const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
  const model = genAI.getGenerativeModel({ model: 'gemini-pro-vision' });
  
  const issues: string[] = [];
  const strengths: string[] = [];
  const suggestions: string[] = [];
  
  // Evaluate each screenshot
  for (const screenshotPath of outputs.screenshots) {
    const imageBuffer = await fs.readFile(screenshotPath);
    const imageData = {
      inlineData: {
        data: imageBuffer.toString('base64'),
        mimeType: 'image/png',
      },
    };
    
    const prompt = `
    Evaluate this screenshot based on the following criteria:
    ${criteria}
    
    Assess:
    1. Visual quality and readability
    2. Positioning and layout
    3. Color contrast and accessibility
    4. Any visual artifacts or issues
    5. Overall user experience
    
    Respond in JSON format:
    {
      "issues": ["list of problems"],
      "strengths": ["list of good aspects"],
      "suggestions": ["list of improvements"]
    }
    `;
    
    const result = await model.generateContent([prompt, imageData]);
    const response = await result.response;
    const text = response.text();
    
    // Parse the JSON response (models sometimes wrap it in Markdown fences)
    const jsonMatch = text.match(/\{[\s\S]*\}/);
    if (!jsonMatch) {
      throw new Error('Vision model returned non-JSON output');
    }
    const evaluation = JSON.parse(jsonMatch[0]);
    issues.push(...evaluation.issues);
    strengths.push(...evaluation.strengths);
    suggestions.push(...evaluation.suggestions);
  }
  
  // Calculate acceptance score
  const score = calculateScore(issues, strengths);
  const acceptable = score >= 80 && issues.length === 0;
  
  return {
    acceptable,
    score,
    issues,
    strengths,
    suggestions,
  };
}

function calculateScore(
  issues: string[],
  strengths: string[]
): number {
  // Simple scoring: start at 100, deduct for issues, add for strengths
  let score = 100;
  score -= issues.length * 15; // Each issue: -15 points
  score += strengths.length * 5; // Each strength: +5 points
  return Math.max(0, Math.min(100, score));
}

Step 4: Auto-Correction Loop

When evaluation fails, generate and apply improvements:

import Anthropic from '@anthropic-ai/sdk';

interface CorrectionPlan {
  fileChanges: Record<string, string>; // file path -> new content
  explanation: string;
}

async function generateCorrections(
  evaluation: EvaluationResult,
  sourceHash: SourceHash
): Promise<CorrectionPlan> {
  const client = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY!,
  });
  
  // Read current source files
  const sourceFiles: Record<string, string> = {};
  for (const file of Object.keys(sourceHash.files)) {
    sourceFiles[file] = await fs.readFile(file, 'utf-8');
  }
  
  const prompt = `
  The following code was evaluated and found to have issues:
  
  **Issues Found:**
  ${evaluation.issues.map((issue, i) => `${i + 1}. ${issue}`).join('\n')}
  
  **Suggestions for Improvement:**
  ${evaluation.suggestions.map((s, i) => `${i + 1}. ${s}`).join('\n')}
  
  **Current Source Code:**
  ${Object.entries(sourceFiles)
    .map(([path, content]) => `
  File: ${path}
  \`\`\`typescript
  ${content}
  \`\`\`
  `)
    .join('\n')}
  
  Generate fixes for all issues. Respond with JSON:
  {
    "fileChanges": {
      "path/to/file.ts": "new file content"
    },
    "explanation": "what changed and why"
  }
  `;
  
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 8192,
    messages: [
      {
        role: 'user',
        content: prompt,
      },
    ],
  });
  
  const text = response.content[0].type === 'text'
    ? response.content[0].text
    : '';
  
  // Extract JSON from response
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error('Failed to parse correction plan');
  }
  
  return JSON.parse(jsonMatch[0]);
}

async function applyCorrections(
  plan: CorrectionPlan
): Promise<void> {
  for (const [filePath, newContent] of Object.entries(plan.fileChanges)) {
    await fs.writeFile(filePath, newContent, 'utf-8');
  }
  
  console.log('Applied corrections:', plan.explanation);
}
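
applyCorrections overwrites files in place. A more cautious variant (a sketch; applyCorrectionsWithBackup is a hypothetical name) keeps a .bak copy of each file so a bad auto-correction can be rolled back:

async function applyCorrectionsWithBackup(plan: CorrectionPlan): Promise<void> {
  for (const [filePath, newContent] of Object.entries(plan.fileChanges)) {
    // Back up the existing file, if there is one, before overwriting it
    const original = await fs.readFile(filePath, 'utf-8').catch(() => null);
    if (original !== null) {
      await fs.writeFile(`${filePath}.bak`, original, 'utf-8');
    }
    await fs.writeFile(filePath, newContent, 'utf-8');
  }
  console.log('Applied corrections (with backups):', plan.explanation);
}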

Step 5: Complete EDD Loop

Tie it all together:

async function runEDDLoop(
  testInputs: TestInput[],
  evaluationCriteria: string,
  maxIterations: number = 10
): Promise<void> {
  let iteration = 0;
  let previousHash = await hashSourceFiles('./src');
  
  while (iteration < maxIterations) {
    console.log(`\n=== EDD Iteration ${iteration + 1} ===\n`);
    
    // Step 1: Hash source
    const currentHash = await hashSourceFiles('./src');
    
    // Step 2: Run and capture
    console.log('Executing code and capturing outputs...');
    const outputs = await executeAndCapture(testInputs);
    
    // Step 3: Evaluate
    console.log('Evaluating outputs with AI vision...');
    const evaluation = await evaluateWithVision(outputs, evaluationCriteria);
    
    console.log(`Score: ${evaluation.score}/100`);
    console.log(`Issues: ${evaluation.issues.length}`);
    console.log(`Strengths: ${evaluation.strengths.length}`);
    
    // Check if acceptable
    if (evaluation.acceptable) {
      console.log('\n✅ Evaluation passed! Code is ready.');
      break;
    }
    
    // Step 4: Generate corrections
    console.log('\nGenerating corrections...');
    const corrections = await generateCorrections(evaluation, currentHash);
    
    // Step 5: Apply corrections
    console.log('Applying corrections...');
    await applyCorrections(corrections);
    
    previousHash = currentHash;
    iteration++;
    
    // Brief pause before next iteration
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  
  if (iteration >= maxIterations) {
    console.log('\n⚠️  Reached max iterations without passing evaluation.');
    console.log('Manual intervention may be required.');
  }
}
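
Note that the loop computes previousHash but never consults it. One way to put it to work, reusing hasSourceChanged from Step 1, is a guard near the top of each iteration, just after currentHash is computed (a sketch):

// If an applied correction didn't actually change any source file, re-running
// the evaluation would only repeat the previous failure.
if (iteration > 0 && !(await hasSourceChanged(previousHash, currentHash))) {
  console.log('Source unchanged since the last iteration; stopping early.');
  break;
}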

Step 6: Define Test Inputs and Criteria

interface TestInput {
  video: string; // Path to video file
  captions: Caption[];
  keyFrame: number; // Frame to screenshot
}

const testInputs: TestInput[] = [
  {
    video: './fixtures/sample_video_1080p.mp4',
    captions: [
      { time: 0, text: 'Welcome to our tutorial', duration: 3 },
      { time: 5, text: 'Let\'s get started', duration: 3 },
    ],
    keyFrame: 60, // 2 seconds at 30fps
  },
  {
    video: './fixtures/sample_video_mobile.mp4',
    captions: [
      { time: 0, text: 'Mobile device test with longer caption text', duration: 4 },
    ],
    keyFrame: 30,
  },
];

const evaluationCriteria = `
Video Caption Quality Standards:

1. **Readability**: Text must be clearly legible at 1080p and 720p resolutions
2. **Positioning**: Captions should not obscure speakers' faces or important visual elements
3. **Contrast**: Text color must have sufficient contrast against background (WCAG AA minimum)
4. **Font Size**: Text must be readable on mobile devices (minimum 16px equivalent)
5. **Timing**: Captions must appear synchronized with audio (max 200ms delay)
6. **Styling**: Use drop shadow or background box to ensure readability on any background
7. **Line Breaking**: Multi-line captions should break at natural phrase boundaries
8. **Special Characters**: All Unicode characters must render correctly

Acceptable: All criteria met with no visual issues.
Unacceptable: Any criterion fails or visual artifacts present.
`;

// Run the loop
runEDDLoop(testInputs, evaluationCriteria).catch(console.error);
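
The Caption type used by TestInput is never defined in the article; a minimal shape consistent with the fixtures above would be:

interface Caption {
  time: number;     // seconds from the start of the video
  text: string;
  duration: number; // seconds the caption stays on screen
}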

What EDD Catches That TDD Misses

Issue 1: Caption Positioning

TDD Test: ✅ Passes

test('renders caption at correct position', () => {
  const output = renderCaption({ y: 100 });
  expect(output.position.y).toBe(100);
});

Reality: Caption at y=100 overlaps speaker’s face (unreadable)

EDD Catches:

Issue: "Caption text overlaps with speaker's face in frame, making both difficult to read."
Suggestion: "Position captions in bottom 15% of frame, with minimum 20px margin from edges."

Issue 2: Font Rendering on Devices

TDD Test: ✅ Passes

test('uses specified font', () => {
  const output = renderCaption({ font: 'Arial' });
  expect(output.font).toBe('Arial');
});

Reality: Font renders too small on mobile, or doesn’t support emoji

EDD Catches:

Issue: "Text appears too small on mobile device screenshot (approximately 10px). Difficult to read."
Suggestion: "Increase base font size to minimum 18px for mobile devices. Use viewport-relative sizing (vw units)."

Issue 3: Color Contrast

TDD Test: ✅ Passes

test('applies white text color', () => {
  const output = renderCaption({ color: '#FFFFFF' });
  expect(output.textColor).toBe('#FFFFFF');
});

Reality: White text on light background is invisible

EDD Catches:

Issue: "White text has insufficient contrast against light background in video. Fails WCAG AA standard."
Suggestion: "Add dark semi-transparent background box behind text, or use drop shadow with significant offset."

Issue 4: Timing Synchronization

TDD Test: ✅ Passes

test('caption appears at timestamp 5', () => {
  const captions = getCaptions(video);
  expect(captions[1].timestamp).toBe(5);
});

Reality: Caption appears 500ms late (noticeable to users)

EDD Catches:

Issue: "Caption appears noticeably delayed compared to audio. Estimated 400-500ms lag."
Suggestion: "Adjust caption rendering to account for video codec decode latency. Start rendering 200ms early."

Issue 5: Visual Artifacts

TDD Test: ✅ Passes (no test for this)

Reality: Text has jagged edges, or background box flickers

EDD Catches:

Issue: "Text edges appear aliased/jagged. Background box shows slight flicker between frames."
Suggestion: "Enable anti-aliasing for text rendering. Ensure background box is rendered in same frame as text to prevent flicker."

Best Practices

1. Start with Clear Evaluation Criteria

Bad criteria (too vague):

"Captions should look good"

Good criteria (specific, measurable):

Caption Quality Standards:
1. Text legibility: Minimum 16px at 720p resolution
2. Contrast ratio: WCAG AA (4.5:1 minimum)
3. Position: Bottom 15% of frame, 20px margins
4. Timing: Max 200ms delay from audio
5. No visual artifacts (aliasing, flicker, clipping)

2. Use Multiple Vision Models

Different models have different strengths:

const evaluations = await Promise.all([
  evaluateWithGemini(outputs, criteria),     // Best for video analysis
  evaluateWithGPT4V(outputs, criteria),      // Best for accessibility
  evaluateWithClaude(outputs, criteria),     // Best for detailed critique
]);

// Aggregate results
const consensus = aggregateEvaluations(evaluations);
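
aggregateEvaluations is left undefined here. One simple consensus strategy (a sketch, reusing the EvaluationResult type from Step 3) averages the scores, unions the feedback, and only accepts when every model accepts:

function aggregateEvaluations(evaluations: EvaluationResult[]): EvaluationResult {
  const score = Math.round(
    evaluations.reduce((sum, e) => sum + e.score, 0) / evaluations.length
  );
  return {
    acceptable: evaluations.every(e => e.acceptable), // require unanimous acceptance
    score,
    issues: [...new Set(evaluations.flatMap(e => e.issues))],
    strengths: [...new Set(evaluations.flatMap(e => e.strengths))],
    suggestions: [...new Set(evaluations.flatMap(e => e.suggestions))],
  };
}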

3. Capture Multiple Scenarios

Test across conditions:

const testScenarios = [
  { device: '1080p-desktop', video: 'sample_1080p.mp4' },
  { device: '720p-tablet', video: 'sample_720p.mp4' },
  { device: '480p-mobile', video: 'sample_480p.mp4' },
  { lighting: 'bright-scene', video: 'bright_video.mp4' },
  { lighting: 'dark-scene', video: 'dark_video.mp4' },
  { text: 'long-caption', captions: veryLongText },
  { text: 'emoji-unicode', captions: withEmojis },
];

4. Set Iteration Limits

Prevent infinite loops:

const MAX_ITERATIONS = 10; // Stop after 10 attempts
const MIN_SCORE_IMPROVEMENT = 5; // Must improve by at least 5 points per iteration

if (iteration >= MAX_ITERATIONS) {
  console.error('Failed to converge after max iterations');
  await notifyDevelopers(evaluation);
}
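
MIN_SCORE_IMPROVEMENT isn't wired into the loop shown earlier; a sketch of how it could be, tracking the previous iteration's score (previousScore is a hypothetical variable held across iterations):

let previousScore = 0;

// ...inside the EDD loop, after each evaluation:
if (iteration > 0 && evaluation.score - previousScore < MIN_SCORE_IMPROVEMENT) {
  console.error('Score is no longer improving meaningfully; stopping early.');
  await notifyDevelopers(evaluation);
  break;
}
previousScore = evaluation.score;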

5. Preserve Successful Outputs

Save artifacts when evaluation passes:

if (evaluation.acceptable) {
  // Archive successful outputs as golden references
  await archiveOutputs(outputs, './golden-references/');
  
  // Future evaluations can compare against these
  const comparison = await compareToGolden(newOutputs, goldenOutputs);
}

6. Combine EDD with TDD

EDD doesn’t replace TDD—it complements it:

// TDD: Unit tests for behavior
describe('renderCaptions', () => {
  it('validates input schema', () => { /* ... */ });
  it('handles missing timestamps', () => { /* ... */ });
  it('escapes HTML in caption text', () => { /* ... */ });
});

// EDD: Evaluation for quality
eddEvaluator.run({
  criteria: captionQualityStandards,
  scenarios: allTestScenarios,
});

Workflow:

  1. Write TDD unit tests for behavior
  2. Run unit tests (must pass)
  3. Run EDD evaluation for quality
  4. Auto-correct quality issues
  5. Re-run unit tests (ensure still passing)
  6. Iterate until both TDD and EDD pass

Common Pitfalls

❌ Pitfall 1: Vague Evaluation Criteria

Problem: “Make it look nice” is subjective

Solution: Define specific, measurable criteria:

// ❌ Bad
const criteria = 'UI should look professional';

// ✅ Good  
const criteria = `
1. Color contrast: WCAG AA minimum (4.5:1)
2. Typography: Consistent font sizes (16px body, 24px headings)
3. Spacing: 8px grid system for all margins/padding
4. Alignment: All text left-aligned, buttons right-aligned
5. Responsiveness: Layout works from 320px to 1920px width
`;

❌ Pitfall 2: Not Validating Vision Model Output

Problem: Vision models can hallucinate or misinterpret

Solution: Use multiple models and verify consensus:

const results = await Promise.all([
  gemini.evaluate(screenshot),
  gpt4v.evaluate(screenshot),
  claude.evaluate(screenshot),
]);

// Only trust issues reported by 2+ models
const consensusIssues = findConsensus(results, 2); // minimum agreement: 2 models
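
findConsensus is assumed rather than defined. A naive sketch (assuming the EvaluationResult shape from Step 3) counts how many models reported a loosely normalized version of each issue; in practice you might use embeddings or another model call to cluster semantically similar wording:

function findConsensus(
  results: EvaluationResult[],
  minAgreement: number
): string[] {
  const counts = new Map<string, number>();
  for (const result of results) {
    // De-duplicate within a single model's report before counting
    const normalized = new Set(
      result.issues.map(issue =>
        issue.toLowerCase().replace(/[^a-z0-9 ]/g, '').trim()
      )
    );
    for (const issue of normalized) {
      counts.set(issue, (counts.get(issue) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minAgreement)
    .map(([issue]) => issue);
}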

❌ Pitfall 3: Infinite Correction Loops

Problem: Auto-correction oscillates between two states

Solution: Track correction history and detect cycles:

const correctionHistory: SourceHash[] = [];

if (hasCycle(correctionHistory, currentHash)) {
  console.error('Detected correction cycle. Stopping.');
  await notifyDevelopers({
    issue: 'Auto-correction oscillating',
    history: correctionHistory,
  });
  break;
}
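
hasCycle is also undefined (the break above runs inside the EDD while loop). A minimal sketch flags a cycle when the exact same set of file hashes has already been produced in an earlier iteration; correctionHistory would be appended to after each applied correction:

function hasCycle(history: SourceHash[], current: SourceHash): boolean {
  // Two iterations with identical per-file hashes mean the auto-corrections
  // are oscillating between the same states.
  const fingerprint = JSON.stringify(current.files);
  return history.some(entry => JSON.stringify(entry.files) === fingerprint);
}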

❌ Pitfall 4: Ignoring Performance

Problem: EDD can be slow (minutes per iteration)

Solution: Run EDD in CI, not on every save:

# .github/workflows/edd.yml
name: Evaluation Driven Development

on:
  pull_request:
    paths:
      - 'src/rendering/**'
      - 'src/captions/**'

jobs:
  edd:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run EDD Loop
        run: npm run edd
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

❌ Pitfall 5: Over-Reliance on Auto-Correction

Problem: Auto-correction might make wrong assumptions

Solution: Require human approval for significant changes:

const changedFiles = Object.keys(corrections.fileChanges); // matches the CorrectionPlan interface
const changedLines = Object.values(corrections.fileChanges)
  .reduce((total, content) => total + content.split('\n').length, 0); // rough proxy for lines touched

if (changedFiles.length > 3 || changedLines > 100) {
  console.log('Significant changes proposed. Requires human review.');
  await createPullRequest(corrections);
} else {
  await applyCorrections(corrections);
}

Integration with Other Patterns

EDD + Trust But Verify Protocol

EDD is the automated “verify” step:

// Trust: LLM generates implementation
const implementation = await llm.generate(prompt);

// Verify: EDD evaluates quality
const evaluation = await eddEvaluator.run(implementation);

if (!evaluation.acceptable) {
  // Auto-correct and re-verify
  await eddEvaluator.correctAndRetry();
}

See: Trust But Verify Protocol

EDD + Integration Tests

EDD provides qualitative assessment on top of integration tests:

// Integration test: Does it work end-to-end?
await integrationTest.run();

// EDD: Does it work *well* end-to-end?
await eddEvaluator.run();

See: Integration Over Unit Tests

EDD + Playwright Script Loop

Use browser automation to capture UI state for vision-model evaluation:

// EDD: Proactive quality evaluation with Playwright
await eddEvaluator.run();

// Playwright captures screenshots for vision model evaluation
const screenshot = await page.screenshot();
const analysis = await visionModel.evaluate(screenshot, criteria);

See: Playwright Script Loop

Real-World Success Stories

Case Study 1: Video Caption Platform

Problem: Caption positioning issues only caught in manual QA

Before EDD:

  • 2-3 day QA cycle per release
  • 40% of builds had visual issues
  • Manual testing across 10 device types

After EDD:

  • 30-minute automated evaluation
  • 95% of visual issues caught and auto-corrected
  • Continuous deployment with confidence

ROI: 20 hours/week saved in manual QA

Case Study 2: UI Component Library

Problem: Components looked different across browsers/devices

Before EDD:

  • Manual cross-browser testing
  • Visual regressions in production
  • Inconsistent user experience

After EDD:

  • Automated screenshot comparison across browsers
  • Visual regression detection before merge
  • Consistent experience guaranteed

ROI: Zero visual regressions in production (6 months)

Case Study 3: API Response Formatting

Problem: JSON responses were technically valid but poorly formatted

Before EDD:

  • API tests only validated schema
  • Inconsistent field ordering
  • Redundant data in responses

After EDD:

  • AI evaluates response quality (structure, clarity, efficiency)
  • Auto-corrects formatting issues
  • Consistent, optimized API responses

ROI: 30% reduction in API response size

Measuring Success

Key Metrics

  1. Auto-correction success rate:

    (Iterations that converge) / (Total EDD runs)
    Target: >80%
    
  2. Visual bugs in production:

    Visual bugs reported per release
    Target: <1 per quarter
    
  3. Manual QA time saved:

    (Hours before EDD) - (Hours after EDD)
    Target: 50%+ reduction
    
  4. Evaluation consistency:

    Same input evaluated multiple times should yield same result
    Target: 95%+ consistency
    

Tracking Dashboard

interface EDDMetrics {
  totalRuns: number;
  successfulConvergence: number;
  avgIterationsToConverge: number;
  visualBugsInProduction: number;
  manualQAHoursSaved: number;
  topIssueCategories: Record<string, number>;
}

const dashboard: EDDMetrics = {
  totalRuns: 342,
  successfulConvergence: 298, // 87% success rate
  avgIterationsToConverge: 2.3,
  visualBugsInProduction: 1, // Last quarter
  manualQAHoursSaved: 80, // Per month
  topIssueCategories: {
    'contrast-issues': 45,
    'positioning-problems': 38,
    'font-sizing': 22,
    'timing-sync': 15,
  },
};

Conclusion

Evaluation Driven Development extends TDD into the realm of qualitative assessment:

  • TDD: “Does it return the right value?”
  • EDD: “Does a human judge this as correct?”

Key Benefits:

  1. Catches visual/UX issues that unit tests miss
  2. Auto-corrects quality problems without human intervention
  3. Provides continuous quality assurance
  4. Reduces manual QA cycles by 50%+
  5. Prevents visual regressions from reaching production

When to Use EDD:

✅ Visual outputs (UI, videos, images, charts)
✅ User experience quality (readability, accessibility)
✅ Subjective correctness (“does this look right?”)
✅ Cross-device/browser rendering
✅ Any output that requires human judgment

❌ Pure business logic (use TDD)
❌ API contracts (use integration tests)
❌ Simple data transformations (use unit tests)

The Future: As vision models improve, EDD will become as common as TDD is today—automated quality assurance that thinks like a human reviewer.

Topics

AI Evaluation, Auto-Correction, EDD, Evaluation Driven Development, Integration Testing, Multimodal Testing, Quality Assurance, Runtime Testing, Self-Healing, Vision Models
