Summary
Evaluation Driven Development (EDD) extends TDD by creating an infinite loop of runtime evaluation and auto-correction. Instead of just asserting values, EDD uses vision models to evaluate real outputs (screenshots, videos, UI renders) with human-like judgment, automatically fixing issues that traditional unit tests can’t catch, such as visual bugs, positioning errors, and cross-device rendering problems.
The Problem
Traditional TDD tests code in isolation with assertions on values, but real-world issues slip through: caption positioning affecting readability, font rendering across devices, color contrast for accessibility, timing synchronization bugs, and visual artifacts. Unit tests validate behavior but can’t assess quality of visual outputs, user experience, or subjective correctness.
The Solution
Create an infinite evaluation loop: (1) Hash the source code with SHA256, (2) Run the code in real conditions, capturing ALL outputs (media, screenshots, API responses), (3) Use AI vision models to evaluate quality with human judgment, (4) When evaluation fails, automatically generate improvements and adjust the source, (5) Restart the loop. The system self-heals by catching issues unit tests miss and auto-correcting until quality gates pass.
The Problem: TDD’s Blind Spots
Traditional Test-Driven Development (TDD) has served us well for decades:
- Write a failing test
- Write code to make it pass
- Refactor
- Repeat
But TDD has fundamental blind spots for modern applications:
What TDD Tests Well
// ✅ TDD excels at testing behavior
test('calculateTotal adds tax correctly', () => {
expect(calculateTotal(100, 0.1)).toBe(110);
});
test('validateEmail rejects invalid format', () => {
expect(validateEmail('invalid')).toBe(false);
});
Perfect for:
- Pure functions with deterministic outputs
- Business logic validation
- Data transformations
- API contracts
What TDD Misses
// ❌ TDD can't test visual quality
test('video captions are readable', () => {
const output = renderCaptions(video, captions);
// How do you assert readability?
// expect(output.???).toBe(???);
});
test('UI layout works on mobile', () => {
const screenshot = renderMobile(component);
// How do you assert "looks good"?
// expect(screenshot.???).toBe(???);
});
Fails at testing:
- Visual quality: Does it look right?
- Positioning: Are elements readable/accessible?
- Cross-device rendering: Does it work on all screen sizes?
- Color contrast: Is text visible for users with visual impairments?
- Timing/synchronization: Do animations feel smooth?
- Subjective correctness: Does this feel right to a human?
Real-World Example: Caption Renderer
You’re building a video caption system:
// Unit test passes ✅
test('renders captions at correct timestamps', () => {
const output = renderCaptions(video, [
{ time: 0, text: 'Hello world' },
{ time: 5, text: 'Second caption' },
]);
expect(output.captions).toHaveLength(2);
expect(output.captions[0].timestamp).toBe(0);
expect(output.captions[1].timestamp).toBe(5);
});
Test passes, but in production:
- Captions overlap with speaker’s face (unreadable)
- Font too small on mobile devices
- White text on light background (no contrast)
- Captions appear too late (sync issue)
- Special characters render as boxes
Unit test validated behavior but missed quality.
The Solution: Evaluation Driven Development (EDD)
Evaluation Driven Development extends TDD into the realm of qualitative assessment:
Instead of asserting “does it return the right value?”, ask “does a human judge this as correct?”
The EDD Infinite Loop
┌─────────────────────────────────────────────────┐
│ 1. Source Code Analysis                         │
│    Create SHA256 hash of all source files       │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 2. Runtime Execution                            │
│    Run code in real conditions                  │
│    Capture ALL outputs (media, screenshots,     │
│    API responses, logs)                         │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 3. Output Generation                            │
│    Collect actual results:                      │
│    - Video files with captions rendered         │
│    - Screenshots of UI at various breakpoints   │
│    - API response payloads                      │
│    - Performance metrics                        │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│ 4. AI Evaluation                                │
│    Use vision models (GPT-4V, Gemini Vision,    │
│    Claude Vision) to assess quality with        │
│    human judgment                               │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
           ┌───────┴────────┐
           │  Acceptable?   │
           └───────┬────────┘
                   │
        ┌──────────┼──────────┐
        │ NO       │ YES      │
        ▼          ▼          │
 ┌──────────────┐ ┌─────────┐ │
 │ 5. Auto-     │ │ Success!│ │
 │    Correction│ │ Ship it │ │
 │              │ └─────────┘ │
 │  Generate    │             │
 │  improvements│             │
 │  Apply fixes │             │
 │  Restart loop│─────────────┘
 └──────────────┘
Key Difference from TDD
| Aspect | TDD | EDD |
|---|---|---|
| What it tests | Behavior (values, types, contracts) | Quality (visual, UX, subjective) |
| Assertion method | expect(value).toBe(expected) | AI evaluates “does this look right?” |
| Failure handling | Test fails, developer fixes | Auto-correction suggests and applies fixes |
| Scope | Unit/integration tests | Full system in real conditions |
| Catches | Logic bugs, regressions | Visual bugs, UX issues, edge cases |
| Human involvement | Required for every fix | Only for evaluation criteria |
Implementation
Step 1: Set Up Source Code Hashing
Track when code changes to know when to re-evaluate:
import crypto from 'crypto';
import fs from 'fs/promises';
import path from 'path';
interface SourceHash {
files: Record<string, string>;
timestamp: string;
}
async function hashSourceFiles(directory: string): Promise<SourceHash> {
const files: Record<string, string> = {};
// Find all source files
const sourceFiles = await findFiles(directory, /\.(ts|tsx|js|jsx)$/);
// Hash each file
for (const file of sourceFiles) {
const content = await fs.readFile(file, 'utf-8');
const hash = crypto
.createHash('sha256')
.update(content)
.digest('hex');
files[file] = hash;
}
return {
files,
timestamp: new Date().toISOString(),
};
}
async function hasSourceChanged(
previous: SourceHash,
current: SourceHash
): Promise<boolean> {
// Check if any file hashes differ
for (const [file, hash] of Object.entries(current.files)) {
if (previous.files[file] !== hash) {
return true;
}
}
return false;
}
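The hashing code above calls a findFiles helper that is never defined. A minimal sketch, assuming a plain recursive walk that skips node_modules and reuses the fs and path imports from this step, could look like this:
async function findFiles(directory: string, pattern: RegExp): Promise<string[]> {
  const matches: string[] = [];
  const entries = await fs.readdir(directory, { withFileTypes: true });
  for (const entry of entries) {
    const fullPath = path.join(directory, entry.name);
    if (entry.isDirectory()) {
      // Skip dependency folders; recurse into everything else
      if (entry.name === 'node_modules') continue;
      matches.push(...(await findFiles(fullPath, pattern)));
    } else if (pattern.test(entry.name)) {
      matches.push(fullPath);
    }
  }
  return matches;
}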
Step 2: Run Code and Capture Outputs
Execute your code in real conditions and capture all artifacts:
interface RuntimeOutput {
screenshots: string[]; // Paths to PNG files
videos: string[]; // Paths to video files
apiResponses: Record<string, unknown>[];
logs: string[];
metrics: {
executionTime: number;
memoryUsage: number;
errors: string[];
};
}
async function executeAndCapture(
testInputs: TestInput[]
): Promise<RuntimeOutput> {
const output: RuntimeOutput = {
screenshots: [],
videos: [],
apiResponses: [],
logs: [],
metrics: {
executionTime: 0,
memoryUsage: 0,
errors: [],
},
};
const startTime = Date.now();
const startMemory = process.memoryUsage().heapUsed;
for (const input of testInputs) {
try {
// Example: Run caption renderer
const result = await renderCaptions(input.video, input.captions);
// Save video output
const videoPath = `./outputs/video_${Date.now()}.mp4`;
await fs.writeFile(videoPath, result.videoBuffer);
output.videos.push(videoPath);
// Capture screenshot at key frames
const screenshot = await captureFrame(result, input.keyFrame);
const screenshotPath = `./outputs/screenshot_${Date.now()}.png`;
await fs.writeFile(screenshotPath, screenshot);
output.screenshots.push(screenshotPath);
// Store API response if applicable
if (result.metadata) {
output.apiResponses.push(result.metadata);
}
} catch (error) {
output.metrics.errors.push(String(error));
}
}
output.metrics.executionTime = Date.now() - startTime;
output.metrics.memoryUsage = process.memoryUsage().heapUsed - startMemory;
return output;
}
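The executor above also assumes a captureFrame helper. One possible sketch, shelling out to the ffmpeg CLI (assumptions: ffmpeg is installed on PATH, the render result exposes a videoBuffer, and the video is roughly 30 fps):
import { execFile } from 'child_process';
import { promisify } from 'util';
const execFileAsync = promisify(execFile);
// Hypothetical helper: write the rendered buffer to a temp file, then ask
// ffmpeg for a single frame at the requested frame number as a PNG buffer.
async function captureFrame(
  result: { videoBuffer: Buffer },
  frameNumber: number,
  fps = 30
): Promise<Buffer> {
  const tmpVideo = `./outputs/tmp_${Date.now()}.mp4`;
  const framePath = `./outputs/frame_${Date.now()}.png`;
  await fs.writeFile(tmpVideo, result.videoBuffer);
  await execFileAsync('ffmpeg', [
    '-ss', String(frameNumber / fps), // seek to the key frame's timestamp
    '-i', tmpVideo,
    '-frames:v', '1',                 // extract exactly one frame
    '-y',                             // overwrite existing output
    framePath,
  ]);
  return fs.readFile(framePath);
}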
Step 3: AI Evaluation with Vision Models
Use AI vision models to evaluate quality:
import { GoogleGenerativeAI } from '@google/generative-ai';
interface EvaluationResult {
acceptable: boolean;
score: number; // 0-100
issues: string[];
strengths: string[];
suggestions: string[];
}
async function evaluateWithVision(
outputs: RuntimeOutput,
criteria: string
): Promise<EvaluationResult> {
// Use Gemini Vision for image/video analysis
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-pro-vision' });
const issues: string[] = [];
const strengths: string[] = [];
const suggestions: string[] = [];
// Evaluate each screenshot
for (const screenshotPath of outputs.screenshots) {
const imageBuffer = await fs.readFile(screenshotPath);
const imageData = {
inlineData: {
data: imageBuffer.toString('base64'),
mimeType: 'image/png',
},
};
const prompt = `
Evaluate this screenshot based on the following criteria:
${criteria}
Assess:
1. Visual quality and readability
2. Positioning and layout
3. Color contrast and accessibility
4. Any visual artifacts or issues
5. Overall user experience
Respond in JSON format:
{
"issues": ["list of problems"],
"strengths": ["list of good aspects"],
"suggestions": ["list of improvements"]
}
`;
const result = await model.generateContent([prompt, imageData]);
const response = await result.response;
const text = response.text();
// Parse JSON response
const evaluation = JSON.parse(text);
issues.push(...evaluation.issues);
strengths.push(...evaluation.strengths);
suggestions.push(...evaluation.suggestions);
}
// Calculate acceptance score
const score = calculateScore(issues, strengths);
const acceptable = score >= 80 && issues.length === 0;
return {
acceptable,
score,
issues,
strengths,
suggestions,
};
}
function calculateScore(
issues: string[],
strengths: string[]
): number {
// Simple scoring: start at 100, deduct for issues, add for strengths
let score = 100;
score -= issues.length * 15; // Each issue: -15 points
score += strengths.length * 5; // Each strength: +5 points
return Math.max(0, Math.min(100, score));
}
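One practical wrinkle: vision models often wrap their JSON in markdown fences, which would make the bare JSON.parse(text) call above throw. A defensive parsing helper (a sketch, not part of any SDK) that could replace that call:
// Strip markdown fences and parse the first JSON object in a model response.
function parseModelJson<T>(text: string): T {
  const cleaned = text.replace(/```(?:json)?/g, '').trim();
  const match = cleaned.match(/\{[\s\S]*\}/);
  if (!match) {
    throw new Error(`No JSON object found in model response: ${cleaned.slice(0, 200)}`);
  }
  return JSON.parse(match[0]) as T;
}
// Usage inside evaluateWithVision:
// const evaluation = parseModelJson<{ issues: string[]; strengths: string[]; suggestions: string[] }>(text);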
Step 4: Auto-Correction Loop
When evaluation fails, generate and apply improvements:
import Anthropic from '@anthropic-ai/sdk';
interface CorrectionPlan {
fileChanges: Record<string, string>; // file path -> new content
explanation: string;
}
async function generateCorrections(
evaluation: EvaluationResult,
sourceHash: SourceHash
): Promise<CorrectionPlan> {
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY!,
});
// Read current source files
const sourceFiles: Record<string, string> = {};
for (const file of Object.keys(sourceHash.files)) {
sourceFiles[file] = await fs.readFile(file, 'utf-8');
}
const prompt = `
The following code was evaluated and found to have issues:
**Issues Found:**
${evaluation.issues.map((issue, i) => `${i + 1}. ${issue}`).join('\n')}
**Suggestions for Improvement:**
${evaluation.suggestions.map((s, i) => `${i + 1}. ${s}`).join('\n')}
**Current Source Code:**
${Object.entries(sourceFiles)
.map(([path, content]) => `
File: ${path}
\`\`\`typescript
${content}
\`\`\`
`)
.join('\n')}
Generate fixes for all issues. Respond with JSON:
{
"fileChanges": {
"path/to/file.ts": "new file content"
},
"explanation": "what changed and why"
}
`;
const response = await client.messages.create({
model: 'claude-sonnet-4',
max_tokens: 8192,
messages: [
{
role: 'user',
content: prompt,
},
],
});
const block = response.content[0];
const text = block.type === 'text' ? block.text : '';
// Extract JSON from response
const jsonMatch = text.match(/\{[\s\S]*\}/);
if (!jsonMatch) {
throw new Error('Failed to parse correction plan');
}
return JSON.parse(jsonMatch[0]);
}
async function applyCorrections(
plan: CorrectionPlan
): Promise<void> {
for (const [filePath, newContent] of Object.entries(plan.fileChanges)) {
await fs.writeFile(filePath, newContent, 'utf-8');
}
console.log('Applied corrections:', plan.explanation);
}
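Overwriting source files automatically is risky. A variant of applyCorrections that snapshots each file first (a sketch, assuming a local ./backups directory is acceptable and reusing the path import from Step 1) makes a bad correction easy to roll back:
async function applyCorrectionsWithBackup(plan: CorrectionPlan): Promise<void> {
  const backupDir = `./backups/${Date.now()}`;
  await fs.mkdir(backupDir, { recursive: true });
  for (const [filePath, newContent] of Object.entries(plan.fileChanges)) {
    // Keep a copy of the previous version before overwriting.
    // Note: basename() flattens directory structure; fine for a sketch.
    await fs.copyFile(filePath, path.join(backupDir, path.basename(filePath)));
    await fs.writeFile(filePath, newContent, 'utf-8');
  }
  console.log(`Applied corrections (backups in ${backupDir}):`, plan.explanation);
}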
Step 5: Complete EDD Loop
Tie it all together:
async function runEDDLoop(
testInputs: TestInput[],
evaluationCriteria: string,
maxIterations: number = 10
): Promise<void> {
let iteration = 0;
let previousHash = await hashSourceFiles('./src');
while (iteration < maxIterations) {
console.log(`\n=== EDD Iteration ${iteration + 1} ===\n`);
// Step 1: Hash source
const currentHash = await hashSourceFiles('./src');
// Step 2: Run and capture
console.log('Executing code and capturing outputs...');
const outputs = await executeAndCapture(testInputs);
// Step 3: Evaluate
console.log('Evaluating outputs with AI vision...');
const evaluation = await evaluateWithVision(outputs, evaluationCriteria);
console.log(`Score: ${evaluation.score}/100`);
console.log(`Issues: ${evaluation.issues.length}`);
console.log(`Strengths: ${evaluation.strengths.length}`);
// Check if acceptable
if (evaluation.acceptable) {
console.log('\n✅ Evaluation passed! Code is ready.');
break;
}
// Step 4: Generate corrections
console.log('\nGenerating corrections...');
const corrections = await generateCorrections(evaluation, currentHash);
// Step 5: Apply corrections
console.log('Applying corrections...');
await applyCorrections(corrections);
previousHash = currentHash;
iteration++;
// Brief pause before next iteration
await new Promise(resolve => setTimeout(resolve, 1000));
}
if (iteration >= maxIterations) {
console.log('\n⚠️ Reached max iterations without passing evaluation.');
console.log('Manual intervention may be required.');
}
}
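Note that the loop above computes hashes but never calls hasSourceChanged from Step 1. One way to put it to work (a sketch): skip the expensive execution and vision calls when nothing under ./src has changed since the last evaluated hash, which matters if the loop is also triggered on a schedule rather than only after a correction.
async function shouldReEvaluate(lastEvaluated: SourceHash | null): Promise<boolean> {
  const current = await hashSourceFiles('./src');
  if (!lastEvaluated) {
    return true; // first run: always evaluate
  }
  // Only burn runtime and vision-model tokens when the code actually changed
  return hasSourceChanged(lastEvaluated, current);
}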
Step 6: Define Test Inputs and Criteria
interface TestInput {
video: string; // Path to video file
captions: Caption[];
keyFrame: number; // Frame to screenshot
}
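// Assumed caption shape for this example (not defined elsewhere in the post);
// `duration` is optional because the earlier snippets omit it.
interface Caption {
  time: number;      // seconds from video start
  text: string;
  duration?: number; // seconds the caption stays on screen
}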
const testInputs: TestInput[] = [
{
video: './fixtures/sample_video_1080p.mp4',
captions: [
{ time: 0, text: 'Welcome to our tutorial', duration: 3 },
{ time: 5, text: 'Let\'s get started', duration: 3 },
],
keyFrame: 60, // 2 seconds at 30fps
},
{
video: './fixtures/sample_video_mobile.mp4',
captions: [
{ time: 0, text: 'Mobile device test with longer caption text', duration: 4 },
],
keyFrame: 30,
},
];
const evaluationCriteria = `
Video Caption Quality Standards:
1. **Readability**: Text must be clearly legible at 1080p and 720p resolutions
2. **Positioning**: Captions should not obscure speakers' faces or important visual elements
3. **Contrast**: Text color must have sufficient contrast against background (WCAG AA minimum)
4. **Font Size**: Text must be readable on mobile devices (minimum 16px equivalent)
5. **Timing**: Captions must appear synchronized with audio (max 200ms delay)
6. **Styling**: Use drop shadow or background box to ensure readability on any background
7. **Line Breaking**: Multi-line captions should break at natural phrase boundaries
8. **Special Characters**: All Unicode characters must render correctly
Acceptable: All criteria met with no visual issues.
Unacceptable: Any criterion fails or visual artifacts present.
`;
// Run the loop
runEDDLoop(testInputs, evaluationCriteria).catch(console.error);
What EDD Catches That TDD Misses
Issue 1: Caption Positioning
TDD Test: ✅ Passes
test('renders caption at correct position', () => {
const output = renderCaption({ y: 100 });
expect(output.position.y).toBe(100);
});
Reality: Caption at y=100 overlaps speaker’s face (unreadable)
EDD Catches:
Issue: "Caption text overlaps with speaker's face in frame, making both difficult to read."
Suggestion: "Position captions in bottom 15% of frame, with minimum 20px margin from edges."
Issue 2: Font Rendering on Devices
TDD Test: ✅ Passes
test('uses specified font', () => {
const output = renderCaption({ font: 'Arial' });
expect(output.font).toBe('Arial');
});
Reality: Font renders too small on mobile, or doesn’t support emoji
EDD Catches:
Issue: "Text appears too small on mobile device screenshot (approximately 10px). Difficult to read."
Suggestion: "Increase base font size to minimum 18px for mobile devices. Use viewport-relative sizing (vw units)."
Issue 3: Color Contrast
TDD Test: ✅ Passes
test('applies white text color', () => {
const output = renderCaption({ color: '#FFFFFF' });
expect(output.textColor).toBe('#FFFFFF');
});
Reality: White text on light background is invisible
EDD Catches:
Issue: "White text has insufficient contrast against light background in video. Fails WCAG AA standard."
Suggestion: "Add dark semi-transparent background box behind text, or use drop shadow with significant offset."
Issue 4: Timing Synchronization
TDD Test: ✅ Passes
test('caption appears at timestamp 5', () => {
const captions = getCaptions(video);
expect(captions[1].timestamp).toBe(5);
});
Reality: Caption appears 500ms late (noticeable to users)
EDD Catches:
Issue: "Caption appears noticeably delayed compared to audio. Estimated 400-500ms lag."
Suggestion: "Adjust caption rendering to account for video codec decode latency. Start rendering 200ms early."
Issue 5: Visual Artifacts
TDD Test: ✅ Passes by default (no test covers this at all)
Reality: Text has jagged edges, or background box flickers
EDD Catches:
Issue: "Text edges appear aliased/jagged. Background box shows slight flicker between frames."
Suggestion: "Enable anti-aliasing for text rendering. Ensure background box is rendered in same frame as text to prevent flicker."
Best Practices
1. Start with Clear Evaluation Criteria
Bad criteria (too vague):
"Captions should look good"
Good criteria (specific, measurable):
Caption Quality Standards:
1. Text legibility: Minimum 16px at 720p resolution
2. Contrast ratio: WCAG AA (4.5:1 minimum)
3. Position: Bottom 15% of frame, 20px margins
4. Timing: Max 200ms delay from audio
5. No visual artifacts (aliasing, flicker, clipping)
2. Use Multiple Vision Models
Different models have different strengths:
const evaluations = await Promise.all([
evaluateWithGemini(outputs, criteria), // Best for video analysis
evaluateWithGPT4V(outputs, criteria), // Best for accessibility
evaluateWithClaude(outputs, criteria), // Best for detailed critique
]);
// Aggregate results
const consensus = aggregateEvaluations(evaluations);
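aggregateEvaluations is left undefined above. A minimal version (a sketch: average the scores, union the findings, and only accept when every model accepts) could look like this:
function aggregateEvaluations(evaluations: EvaluationResult[]): EvaluationResult {
  const score = Math.round(
    evaluations.reduce((sum, e) => sum + e.score, 0) / evaluations.length
  );
  return {
    acceptable: evaluations.every(e => e.acceptable), // require unanimous acceptance
    score,
    issues: [...new Set(evaluations.flatMap(e => e.issues))],
    strengths: [...new Set(evaluations.flatMap(e => e.strengths))],
    suggestions: [...new Set(evaluations.flatMap(e => e.suggestions))],
  };
}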
3. Capture Multiple Scenarios
Test across conditions:
const testScenarios = [
{ device: '1080p-desktop', video: 'sample_1080p.mp4' },
{ device: '720p-tablet', video: 'sample_720p.mp4' },
{ device: '480p-mobile', video: 'sample_480p.mp4' },
{ lighting: 'bright-scene', video: 'bright_video.mp4' },
{ lighting: 'dark-scene', video: 'dark_video.mp4' },
{ text: 'long-caption', captions: veryLongText },
{ text: 'emoji-unicode', captions: withEmojis },
];
4. Set Iteration Limits
Prevent infinite loops:
const MAX_ITERATIONS = 10; // Stop after 10 attempts
const MIN_SCORE_IMPROVEMENT = 5; // Must improve by at least 5 points per iteration
if (iteration >= MAX_ITERATIONS) {
console.error('Failed to converge after max iterations');
await notifyDevelopers(evaluation);
}
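MIN_SCORE_IMPROVEMENT is declared but not used in the snippet above. A simple way to wire it in (a sketch, reusing the evaluation and notifyDevelopers names from the surrounding snippets) is to track scores per iteration and bail out when improvement stalls:
const scoreHistory: number[] = [];
function hasStalled(history: number[], minImprovement: number): boolean {
  if (history.length < 2) return false;
  const [previous, latest] = history.slice(-2);
  return latest - previous < minImprovement;
}
// Inside the EDD loop, after each evaluation:
scoreHistory.push(evaluation.score);
if (hasStalled(scoreHistory, MIN_SCORE_IMPROVEMENT)) {
  console.error('Score is no longer improving; stopping auto-correction.');
  await notifyDevelopers(evaluation);
}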
5. Preserve Successful Outputs
Save artifacts when evaluation passes:
if (evaluation.acceptable) {
// Archive successful outputs as golden references
await archiveOutputs(outputs, './golden-references/');
// Future evaluations can compare against these
const comparison = await compareToGolden(newOutputs, goldenOutputs);
}
6. Combine EDD with TDD
EDD doesn’t replace TDD—it complements it:
// TDD: Unit tests for behavior
describe('renderCaptions', () => {
it('validates input schema', () => { /* ... */ });
it('handles missing timestamps', () => { /* ... */ });
it('escapes HTML in caption text', () => { /* ... */ });
});
// EDD: Evaluation for quality
eddEvaluator.run({
criteria: captionQualityStandards,
scenarios: allTestScenarios,
});
Workflow:
- Write TDD unit tests for behavior
- Run unit tests (must pass)
- Run EDD evaluation for quality
- Auto-correct quality issues
- Re-run unit tests (ensure still passing)
- Iterate until both TDD and EDD pass
Common Pitfalls
❌ Pitfall 1: Vague Evaluation Criteria
Problem: “Make it look nice” is subjective
Solution: Define specific, measurable criteria:
// ❌ Bad
const criteria = 'UI should look professional';
// ✅ Good
const criteria = `
1. Color contrast: WCAG AA minimum (4.5:1)
2. Typography: Consistent font sizes (16px body, 24px headings)
3. Spacing: 8px grid system for all margins/padding
4. Alignment: All text left-aligned, buttons right-aligned
5. Responsiveness: Layout works from 320px to 1920px width
`;
❌ Pitfall 2: Not Validating Vision Model Output
Problem: Vision models can hallucinate or misinterpret
Solution: Use multiple models and verify consensus:
const results = await Promise.all([
gemini.evaluate(screenshot),
gpt4v.evaluate(screenshot),
claude.evaluate(screenshot),
]);
// Only trust issues reported by 2+ models
const consensusIssues = findConsensus(results, { minAgreement: 2 });
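A naive findConsensus could look like the sketch below, assuming each model call returns the EvaluationResult shape from Step 3. It only counts issues with identical wording (case-insensitive); a production version would cluster semantically similar descriptions:
function findConsensus(
  results: EvaluationResult[],
  options: { minAgreement: number }
): string[] {
  const counts = new Map<string, number>();
  for (const result of results) {
    // Count each distinct issue at most once per model
    for (const issue of new Set(result.issues.map(i => i.trim().toLowerCase()))) {
      counts.set(issue, (counts.get(issue) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= options.minAgreement)
    .map(([issue]) => issue);
}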
❌ Pitfall 3: Infinite Correction Loops
Problem: Auto-correction oscillates between two states
Solution: Track correction history and detect cycles:
const correctionHistory: SourceHash[] = [];
if (hasCycle(correctionHistory, currentHash)) {
console.error('Detected correction cycle. Stopping.');
await notifyDevelopers({
issue: 'Auto-correction oscillating',
history: correctionHistory,
});
break;
}
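hasCycle can be as simple as checking whether the current hash set has already been produced earlier in this run (a sketch; it assumes you push each hash into correctionHistory after every applied correction):
function hasCycle(history: SourceHash[], current: SourceHash): boolean {
  const serialize = (h: SourceHash) =>
    Object.entries(h.files)
      .sort() // deterministic ordering of file entries
      .map(([file, hash]) => `${file}:${hash}`)
      .join('|');
  const currentKey = serialize(current);
  return history.some(previous => serialize(previous) === currentKey);
}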
❌ Pitfall 4: Ignoring Performance
Problem: EDD can be slow (minutes per iteration)
Solution: Run EDD in CI, not on every save:
# .github/workflows/edd.yml
name: Evaluation Driven Development
on:
pull_request:
paths:
- 'src/rendering/**'
- 'src/captions/**'
jobs:
edd:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run EDD Loop
run: npm run edd
env:
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
❌ Pitfall 5: Over-Reliance on Auto-Correction
Problem: Auto-correction might make wrong assumptions
Solution: Require human approval for significant changes:
// Approximate the size of the change using the CorrectionPlan from Step 4
const changedFiles = Object.keys(corrections.fileChanges);
const changedLines = Object.values(corrections.fileChanges)
  .reduce((total, content) => total + content.split('\n').length, 0);
if (changedFiles.length > 3 || changedLines > 100) {
console.log('Significant changes proposed. Requires human review.');
await createPullRequest(corrections);
} else {
await applyCorrections(corrections);
}
Integration with Other Patterns
EDD + Trust But Verify Protocol
EDD is the automated “verify” step:
// Trust: LLM generates implementation
const implementation = await llm.generate(prompt);
// Verify: EDD evaluates quality
const evaluation = await eddEvaluator.run(implementation);
if (!evaluation.acceptable) {
// Auto-correct and re-verify
await eddEvaluator.correctAndRetry();
}
See: Trust But Verify Protocol
EDD + Integration Tests
EDD provides qualitative assessment on top of integration tests:
// Integration test: Does it work end-to-end?
await integrationTest.run();
// EDD: Does it work *well* end-to-end?
await eddEvaluator.run();
See: Integration Over Unit Tests
EDD + Playwright Script Loop
Use browser automation to capture UI state for vision-model evaluation:
// EDD: Proactive quality evaluation with Playwright
await eddEvaluator.run();
// Playwright captures screenshots for vision model evaluation
const screenshot = await page.screenshot();
const analysis = await visionModel.evaluate(screenshot, criteria);
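For completeness, the page object above needs a browser session. A minimal Playwright setup (a sketch; the local URL is a placeholder) could look like:
import { chromium } from 'playwright';
async function captureUiForEvaluation(url: string): Promise<Buffer> {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 390, height: 844 } }); // mobile-sized viewport
  await page.goto(url);
  const screenshot = await page.screenshot({ fullPage: true });
  await browser.close();
  return screenshot;
}
// Example (placeholder URL):
// const screenshot = await captureUiForEvaluation('http://localhost:3000/captions-preview');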
Real-World Success Stories
Case Study 1: Video Caption Platform
Problem: Caption positioning issues only caught in manual QA
Before EDD:
- 2-3 day QA cycle per release
- 40% of builds had visual issues
- Manual testing across 10 device types
After EDD:
- 30-minute automated evaluation
- 95% of visual issues caught and auto-corrected
- Continuous deployment with confidence
ROI: 20 hours/week saved in manual QA
Case Study 2: UI Component Library
Problem: Components looked different across browsers/devices
Before EDD:
- Manual cross-browser testing
- Visual regressions in production
- Inconsistent user experience
After EDD:
- Automated screenshot comparison across browsers
- Visual regression detection before merge
- Consistent experience guaranteed
ROI: Zero visual regressions in production (6 months)
Case Study 3: API Response Formatting
Problem: JSON responses were technically valid but poorly formatted
Before EDD:
- API tests only validated schema
- Inconsistent field ordering
- Redundant data in responses
After EDD:
- AI evaluates response quality (structure, clarity, efficiency)
- Auto-corrects formatting issues
- Consistent, optimized API responses
ROI: 30% reduction in API response size
Measuring Success
Key Metrics
- Auto-correction success rate: (Iterations that converge) / (Total EDD runs). Target: >80%
- Visual bugs in production: Visual bugs reported per release. Target: <1 per quarter
- Manual QA time saved: (Hours before EDD) - (Hours after EDD). Target: 50%+ reduction
- Evaluation consistency: The same input evaluated multiple times should yield the same result. Target: 95%+ consistency
Tracking Dashboard
interface EDDMetrics {
totalRuns: number;
successfulConvergence: number;
avgIterationsToConverge: number;
visualBugsInProduction: number;
manualQAHoursSaved: number;
topIssueCategories: Record<string, number>;
}
const dashboard: EDDMetrics = {
totalRuns: 342,
successfulConvergence: 298, // 87% success rate
avgIterationsToConverge: 2.3,
visualBugsInProduction: 1, // Last quarter
manualQAHoursSaved: 80, // Per month
topIssueCategories: {
'contrast-issues': 45,
'positioning-problems': 38,
'font-sizing': 22,
'timing-sync': 15,
},
};
Conclusion
Evaluation Driven Development extends TDD into the realm of qualitative assessment:
- TDD: “Does it return the right value?”
- EDD: “Does a human judge this as correct?”
Key Benefits:
- Catches visual/UX issues that unit tests miss
- Auto-corrects quality problems without human intervention
- Provides continuous quality assurance
- Reduces manual QA cycles by 50%+
- Prevents visual regressions from reaching production
When to Use EDD:
✅ Visual outputs (UI, videos, images, charts)
✅ User experience quality (readability, accessibility)
✅ Subjective correctness (“does this look right?”)
✅ Cross-device/browser rendering
✅ Any output that requires human judgment
❌ Pure business logic (use TDD)
❌ API contracts (use integration tests)
❌ Simple data transformations (use unit tests)
The Future: As vision models improve, EDD will become as common as TDD is today—automated quality assurance that thinks like a human reviewer.
Related Concepts
- Trust But Verify Protocol – EDD as the automated verification step
- Integration Testing Patterns – EDD evaluates full-system quality
- Playwright Script Loop – Browser automation for capturing UI state
- Stateless Verification Loops – Infinite loop pattern for continuous verification
- Model Switching Strategy – Use vision models (Gemini, GPT-4V) for evaluation while using cheaper models for code generation
References
- Gemini Vision API Documentation – Google’s vision model for image and video analysis
- GPT-4 Vision System Card – OpenAI’s technical details on GPT-4 Vision capabilities
- Claude Vision Capabilities – Anthropic’s vision features in Claude models
- WCAG Accessibility Guidelines – Web Content Accessibility Guidelines for contrast and readability

