Measuring Context Effectiveness with Mutual Information

James Phoenix

Summary

Mutual information quantifies how much your context reduces uncertainty in LLM outputs. This article provides practical methods to measure context effectiveness: output variance testing, A/B comparison, test pass rate proxies, and semantic similarity scoring. When outputs vary widely, your context has low mutual information and needs improvement. When outputs converge, your context is effectively constraining the model.

The Problem

You’ve written a CLAUDE.md file. You’ve added type definitions. You’ve included examples. But how do you know if your context actually helps?

Signs your context might be ineffective:

  • High output variance: Same prompt produces wildly different code each time
  • Ignored instructions: LLM doesn’t follow patterns you specified
  • Frequent corrections: You’re constantly steering the model back on track
  • Test failures: Generated code fails tests despite clear specifications

Without measurement, context engineering is guesswork. With measurement, it becomes science.

Mutual Information: Quick Recap

Mutual information I(X;Y) measures how much knowing Y tells you about X:

I(X;Y) = H(X) - H(X|Y)

Where:

  • H(X) = Entropy (uncertainty) of output X without context
  • H(X|Y) = Entropy of output X given context Y
  • I(X;Y) = Reduction in uncertainty (bits) from providing context

High mutual information: Context strongly predicts output (good!)
Low mutual information: Context barely affects output (improve it!)

For the full theoretical foundation, see Information Theory for Coding Agents.
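To make the formula concrete, here is a back-of-the-envelope calculation (the counts are illustrative, not measured):

```typescript
// Illustrative MI estimate, treating observed outputs as roughly equiprobable.
// Without context: 10 distinct outputs observed across trials.
const entropyWithout = Math.log2(10);   // H(X) ≈ 3.32 bits
// With context: outputs collapse to 2 variants.
const entropyWith = Math.log2(2);       // H(X|Y) = 1 bit
// Context reduced uncertainty by ~2.32 bits.
const mutualInformation = entropyWithout - entropyWith;
console.log(mutualInformation.toFixed(2)); // "2.32"
```

Treating each unique output as equally likely is a crude entropy estimate, but it is enough to compare two contexts against each other.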

Method 1: Output Variance Testing

The simplest measurement: run the same prompt multiple times and count the unique outputs.

Implementation

interface ContextEffectivenessResult {
  uniqueOutputs: number;
  totalGenerations: number;
  entropyEstimate: number;  // log2(uniqueOutputs)
  effectiveness: 'high' | 'medium' | 'low';
  samples: string[];
}

async function measureContextEffectiveness(
  context: string,
  prompt: string,
  numTrials: number = 10,
  temperature: number = 0.7
): Promise<ContextEffectivenessResult> {
  const outputs: string[] = [];

  for (let i = 0; i < numTrials; i++) {
    const response = await llm.generate({
      systemPrompt: context,
      userPrompt: prompt,
      temperature
    });
    outputs.push(normalizeCode(response));
  }

  const uniqueOutputs = new Set(outputs).size;
  const entropyEstimate = Math.log2(uniqueOutputs);

  let effectiveness: 'high' | 'medium' | 'low';
  if (uniqueOutputs <= 2) {
    effectiveness = 'high';  // Strong constraint
  } else if (uniqueOutputs <= 5) {
    effectiveness = 'medium';  // Moderate constraint
  } else {
    effectiveness = 'low';  // Weak constraint
  }

  return {
    uniqueOutputs,
    totalGenerations: numTrials,
    entropyEstimate,
    effectiveness,
    samples: outputs.slice(0, 3)  // Keep first 3 for inspection
  };
}

function normalizeCode(code: string): string {
  // Remove whitespace variations to focus on semantic differences
  return code
    .replace(/\s+/g, ' ')
    .replace(/;\s*/g, ';')
    .trim();
}
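As a quick sanity check (repeating the helper so the snippet is self-contained), two generations that differ only in whitespace normalize to the same string, while a real semantic difference survives:

```typescript
function normalizeCode(code: string): string {
  // Remove whitespace variations to focus on semantic differences
  return code.replace(/\s+/g, ' ').replace(/;\s*/g, ';').trim();
}

const a = 'const x = 1;\n  const y = 2;';
const b = 'const x = 1; const y = 2;';
console.log(normalizeCode(a) === normalizeCode(b)); // true: only formatting differs
console.log(normalizeCode('let x = 1;') === normalizeCode(b)); // false: different code
```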

Interpretation

Unique Outputs | Entropy       | Interpretation
1-2            | 0-1 bits      | High MI: Context very effective
3-5            | 1.6-2.3 bits  | Medium MI: Context somewhat effective
6-10           | 2.6-3.3 bits  | Low MI: Context needs improvement
>10            | >3.3 bits     | Very Low MI: Context not constraining

Example: Measuring a Prompt

// Vague context
const vagueContext = `
Write clean code.
Use TypeScript.
Handle errors properly.
`;

// Specific context
const specificContext = `
Write TypeScript functions that:
- Return Result<T, Error> for operations that can fail
- Use strict null checks
- Follow the pattern:

function processUser(user: User): Result<ProcessedUser, ValidationError> {
  if (!user.email) {
    return { success: false, error: new ValidationError('Email required') };
  }
  return { success: true, data: { ...user, processed: true } };
}
`;

const prompt = "Write a function to validate a payment method";

// Measure both
const vagueResult = await measureContextEffectiveness(vagueContext, prompt);
// { uniqueOutputs: 8, effectiveness: 'low' }

const specificResult = await measureContextEffectiveness(specificContext, prompt);
// { uniqueOutputs: 2, effectiveness: 'high' }

Method 2: A/B Testing Contexts

Compare two context versions directly by measuring output quality.

Implementation

interface ABTestResult {
  contextA: {
    testPassRate: number;
    avgSimilarity: number;
    uniqueOutputs: number;
  };
  contextB: {
    testPassRate: number;
    avgSimilarity: number;
    uniqueOutputs: number;
  };
  winner: 'A' | 'B' | 'tie';
  confidence: number;
}

async function abTestContexts(
  contextA: string,
  contextB: string,
  prompt: string,
  testCases: TestCase[],
  numTrials: number = 20
): Promise<ABTestResult> {
  const resultsA = await runTrials(contextA, prompt, testCases, numTrials);
  const resultsB = await runTrials(contextB, prompt, testCases, numTrials);

  // Compare test pass rates
  const passRateDiff = resultsB.testPassRate - resultsA.testPassRate;

  // Statistical significance (simplified)
  const confidence = calculateConfidence(resultsA, resultsB);

  let winner: 'A' | 'B' | 'tie';
  if (passRateDiff > 0.1 && confidence > 0.95) {
    winner = 'B';
  } else if (passRateDiff < -0.1 && confidence > 0.95) {
    winner = 'A';
  } else {
    winner = 'tie';
  }

  return { contextA: resultsA, contextB: resultsB, winner, confidence };
}

async function runTrials(
  context: string,
  prompt: string,
  testCases: TestCase[],
  numTrials: number
) {
  let totalPassed = 0;
  const outputs: string[] = [];

  for (let i = 0; i < numTrials; i++) {
    const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
    outputs.push(output);

    const passed = testCases.every(tc => tc.validate(output));
    if (passed) totalPassed++;
  }

  return {
    testPassRate: totalPassed / numTrials,
    avgSimilarity: calculateAverageSimilarity(outputs),
    uniqueOutputs: new Set(outputs.map(normalizeCode)).size
  };
}
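The code above leans on two helpers that are left undefined. One possible sketch follows; the two-proportion z-test for calculateConfidence and token-level Jaccard overlap as a cheap stand-in for embedding similarity are assumptions on my part, not the only valid choices:

```typescript
// Sketch of the undefined helpers. The z-test and Jaccard token overlap
// are illustrative implementations, not the only valid ones.
interface TrialResults { testPassRate: number; }

function calculateConfidence(a: TrialResults, b: TrialResults, numTrials: number = 20): number {
  // Two-proportion z-test on pass rates across the two trial sets.
  const p1 = a.testPassRate, p2 = b.testPassRate;
  const pPool = (p1 + p2) / 2;
  const se = Math.sqrt(pPool * (1 - pPool) * (2 / numTrials));
  if (se === 0) return p1 === p2 ? 0 : 1;
  const z = Math.abs(p1 - p2) / se;
  return Math.min(1, 2 * (normalCdf(z) - 0.5)); // P(|Z| < z)
}

function normalCdf(z: number): number {
  // Abramowitz-Stegun style approximation of the standard normal CDF.
  const t = 1 / (1 + 0.2316419 * z);
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return 1 - p;
}

function calculateAverageSimilarity(outputs: string[]): number {
  // Mean pairwise Jaccard similarity over whitespace-delimited tokens.
  let total = 0, pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      const a = new Set(outputs[i].split(/\s+/));
      const b = new Set(outputs[j].split(/\s+/));
      const inter = [...a].filter(tok => b.has(tok)).length;
      const union = new Set([...a, ...b]).size;
      total += union === 0 ? 1 : inter / union;
      pairs++;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}
```

In production you would likely swap the Jaccard measure for embedding cosine similarity (as in Method 4) and use a proper statistics library for the significance test.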

Example: Testing Context Improvements

const baseContext = `
Use TypeScript. Handle errors with try/catch.
`;

const improvedContext = `
Use TypeScript with Result types. Never throw in business logic.

Pattern:
type Result<T, E> = { success: true; data: T } | { success: false; error: E };

function parseJSON<T>(input: string): Result<T, ParseError> {
  try {
    return { success: true, data: JSON.parse(input) };
  } catch (e) {
    return { success: false, error: new ParseError(e instanceof Error ? e.message : String(e)) };
  }
}
`;

const testCases = [
  {
    name: 'Returns Result type',
    validate: (code: string) => code.includes('Result<') || code.includes('success:')
  },
  {
    name: 'No throw statements',
    validate: (code: string) => !code.includes('throw new')
  }
];

const result = await abTestContexts(
  baseContext,
  improvedContext,
  "Write a function to parse user input",
  testCases
);

// Result: { winner: 'B', confidence: 0.98 }

Method 3: Test Pass Rate as Proxy

Use automated tests as a proxy for mutual information. Higher pass rate = higher effective MI.

Implementation

interface TestPassRateMetrics {
  passRate: number;
  failedTests: string[];
  estimatedMI: 'high' | 'medium' | 'low';
}

async function measureViaTestPassRate(
  context: string,
  prompt: string,
  testSuite: TestSuite,
  numTrials: number = 10
): Promise<TestPassRateMetrics> {
  let totalPassed = 0;
  const failedTests = new Map<string, number>();

  for (let i = 0; i < numTrials; i++) {
    const code = await llm.generate({ systemPrompt: context, userPrompt: prompt });

    for (const test of testSuite.tests) {
      try {
        const passed = await test.run(code);
        if (passed) totalPassed++;
        else {
          failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
        }
      } catch {
        failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
      }
    }
  }

  const totalTests = testSuite.tests.length * numTrials;
  const passRate = totalPassed / totalTests;

  let estimatedMI: 'high' | 'medium' | 'low';
  if (passRate > 0.9) {
    estimatedMI = 'high';
  } else if (passRate > 0.7) {
    estimatedMI = 'medium';
  } else {
    estimatedMI = 'low';
  }

  // Find consistently failing tests
  const consistentFailures = [...failedTests.entries()]
    .filter(([_, count]) => count > numTrials * 0.5)
    .map(([name]) => name);

  return { passRate, failedTests: consistentFailures, estimatedMI };
}

Interpreting Test Pass Rates

Pass Rate | Interpretation | Action
>90%      | High MI        | Context is effective
70-90%    | Medium MI      | Identify weak areas, add examples
50-70%    | Low MI         | Major context gaps, restructure
<50%      | Very Low MI    | Context may be counterproductive

Identifying Weak Areas

Consistently failing tests reveal context gaps:

const metrics = await measureViaTestPassRate(context, prompt, testSuite);

if (metrics.failedTests.includes('Returns correct error type')) {
  // Context needs better error handling examples
  context += `
Always return typed errors:
- ValidationError for input issues
- NotFoundError for missing resources
- AuthError for permission problems
`;
}

if (metrics.failedTests.includes('Handles null input')) {
  // Context needs null handling pattern
  context += `
Guard against null at function entry:
if (input === null || input === undefined) {
  return { success: false, error: new ValidationError('Input required') };
}
`;
}

Method 4: Semantic Similarity Scoring

Measure how semantically similar outputs are to a reference implementation.

Implementation

interface SimilarityMetrics {
  avgSimilarity: number;  // 0-1 scale
  minSimilarity: number;
  maxSimilarity: number;
  variance: number;
  estimatedMI: 'high' | 'medium' | 'low';
}

async function measureSemanticSimilarity(
  context: string,
  prompt: string,
  referenceOutput: string,
  numTrials: number = 10
): Promise<SimilarityMetrics> {
  const similarities: number[] = [];

  for (let i = 0; i < numTrials; i++) {
    const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
    const similarity = await calculateSemanticSimilarity(output, referenceOutput);
    similarities.push(similarity);
  }

  const avg = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const variance = similarities.reduce((sum, s) => sum + Math.pow(s - avg, 2), 0) / similarities.length;

  let estimatedMI: 'high' | 'medium' | 'low';
  if (avg > 0.85 && variance < 0.01) {
    estimatedMI = 'high';
  } else if (avg > 0.7 && variance < 0.05) {
    estimatedMI = 'medium';
  } else {
    estimatedMI = 'low';
  }

  return {
    avgSimilarity: avg,
    minSimilarity: Math.min(...similarities),
    maxSimilarity: Math.max(...similarities),
    variance,
    estimatedMI
  };
}

async function calculateSemanticSimilarity(
  output: string,
  reference: string
): Promise<number> {
  // Use embeddings for semantic comparison
  const outputEmbedding = await llm.embed(output);
  const referenceEmbedding = await llm.embed(reference);

  return cosineSimilarity(outputEmbedding, referenceEmbedding);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

When to Use Semantic Similarity

Best for:

  • Prose generation (documentation, explanations)
  • Code that should match a specific style
  • Translations or transformations

Less useful for:

  • Tasks with multiple valid solutions
  • Creative outputs where variance is acceptable

Combined Measurement Dashboard

Combine all methods for comprehensive measurement:

interface ContextEffectivenessDashboard {
  varianceTest: ContextEffectivenessResult;
  testPassRate: TestPassRateMetrics;
  semanticSimilarity?: SimilarityMetrics;
  overallScore: number;  // 0-100
  recommendations: string[];
}

async function fullContextAudit(
  context: string,
  prompt: string,
  testSuite: TestSuite,
  referenceOutput?: string
): Promise<ContextEffectivenessDashboard> {
  const varianceTest = await measureContextEffectiveness(context, prompt);
  const testPassRate = await measureViaTestPassRate(context, prompt, testSuite);

  let semanticSimilarity: SimilarityMetrics | undefined;
  if (referenceOutput) {
    semanticSimilarity = await measureSemanticSimilarity(context, prompt, referenceOutput);
  }

  // Calculate overall score
  const varianceScore = varianceTest.effectiveness === 'high' ? 100 :
                        varianceTest.effectiveness === 'medium' ? 70 : 40;
  const testScore = testPassRate.passRate * 100;
  const similarityScore = semanticSimilarity ? semanticSimilarity.avgSimilarity * 100 : testScore;

  const overallScore = (varianceScore + testScore + similarityScore) / 3;

  // Generate recommendations
  const recommendations: string[] = [];

  if (varianceTest.uniqueOutputs > 5) {
    recommendations.push('Add more specific examples to reduce output variance');
  }

  if (testPassRate.failedTests.length > 0) {
    recommendations.push(`Address failing tests: ${testPassRate.failedTests.join(', ')}`);
  }

  if (semanticSimilarity && semanticSimilarity.avgSimilarity < 0.7) {
    recommendations.push('Include reference output in context to improve similarity');
  }

  return {
    varianceTest,
    testPassRate,
    semanticSimilarity,
    overallScore,
    recommendations
  };
}

Improving Low-MI Context

When measurements show low mutual information, apply these fixes:

1. Add Concrete Examples

Low MI often means missing examples:

# Before (low MI)
Use error handling best practices.

# After (higher MI)
Handle errors with Result types:

function fetchUser(id: string): Result<User, ApiError> {
  if (!id) {
    return { success: false, error: new ValidationError('ID required') };
  }
  // ...
}

2. Add Anti-Patterns

Show what NOT to do:

# Higher MI with anti-patterns

DO NOT:
- throw new Error() in business logic
- return null for errors
- swallow errors silently

DO:
- return Result<T, E> types
- log errors before returning
- include error context

3. Add Type Constraints

Types provide high information density:

// Low MI: "Write a user validator"

// High MI:
interface UserValidation {
  validate(user: unknown): ValidationResult<User>;
}

type ValidationResult<T> =
  | { valid: true; data: T }
  | { valid: false; errors: ValidationError[] };

4. Add Test Cases as Examples

Tests are highly constraining:

The function must pass these tests:

test('returns success for valid email', () => {
  expect(validate('[email protected]')).toEqual({
    valid: true,
    data: { email: '[email protected]' }
  });
});

test('returns error for invalid email', () => {
  expect(validate('not-an-email')).toEqual({
    valid: false,
    errors: [{ field: 'email', message: 'Invalid email format' }]
  });
});

Best Practices

1. Measure Before Optimizing

// Baseline measurement
const baseline = await fullContextAudit(context, prompt, testSuite);
console.log(`Baseline score: ${baseline.overallScore}`);

// Make improvement
const improvedContext = context + additionalExamples;

// Measure improvement
const improved = await fullContextAudit(improvedContext, prompt, testSuite);
console.log(`Improved score: ${improved.overallScore}`);
console.log(`Delta: ${improved.overallScore - baseline.overallScore}`);

2. Use Multiple Measurement Methods

No single method captures everything:

Method              | Measures            | Best For
Variance            | Constraint strength | General context quality
Test pass rate      | Correctness         | Functional requirements
Semantic similarity | Style adherence     | Code style consistency

3. Track Over Time

interface ContextMetricsHistory {
  timestamp: Date;
  contextHash: string;
  metrics: ContextEffectivenessDashboard;
}

async function trackContextEvolution(
  metricsLog: ContextMetricsHistory[]
): Promise<void> {
  // Analyze trends
  const recentScores = metricsLog.slice(-10).map(m => m.metrics.overallScore);
  const trend = calculateTrend(recentScores);

  if (trend < 0) {
    console.warn('Context effectiveness declining. Review recent changes.');
  }
}
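calculateTrend is left undefined above; a minimal sketch (one reasonable choice, not the only one) is the least-squares slope of score against audit index:

```typescript
// Minimal calculateTrend sketch: least-squares slope of score vs. index.
// A negative slope means effectiveness is declining across recent audits.
function calculateTrend(scores: number[]): number {
  const n = scores.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  const yMean = scores.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (scores[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den; // average score change per audit
}
```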

4. Set Quality Gates

Require minimum MI before deploying context changes:

async function validateContextChange(
  oldContext: string,
  newContext: string,
  prompt: string,
  testSuite: TestSuite
): Promise<boolean> {
  const oldMetrics = await fullContextAudit(oldContext, prompt, testSuite);
  const newMetrics = await fullContextAudit(newContext, prompt, testSuite);

  // Gate: New context must not decrease effectiveness
  if (newMetrics.overallScore < oldMetrics.overallScore - 5) {
    console.error('Context change decreases effectiveness. Blocked.');
    return false;
  }

  // Gate: Minimum score threshold
  if (newMetrics.overallScore < 70) {
    console.error('Context below minimum quality threshold.');
    return false;
  }

  return true;
}

Common Pitfalls

Pitfall 1: Testing at Temperature 0

// BAD: Temperature 0 always gives same output
const result = await measureContextEffectiveness(context, prompt, 10, 0);
// Always shows 1 unique output - meaningless!

// GOOD: Use moderate temperature to reveal variance
const result = await measureContextEffectiveness(context, prompt, 10, 0.7);

Pitfall 2: Too Few Trials

// BAD: 3 trials isn't statistically meaningful
const result = await measureContextEffectiveness(context, prompt, 3);

// GOOD: Use 10-20 trials for reliable measurement
const result = await measureContextEffectiveness(context, prompt, 15);

Pitfall 3: Ignoring Normalization

// BAD: Whitespace differences inflate unique count
const unique = new Set(outputs).size;  // 8 "unique" outputs

// GOOD: Normalize before comparison
const unique = new Set(outputs.map(normalizeCode)).size;  // 3 truly unique

Pitfall 4: Over-Optimizing for Single Metric

Context that scores perfectly on one metric may fail on others:

// High test pass rate but high variance = brittle context
const metrics = {
  testPassRate: 0.95,
  uniqueOutputs: 9  // Different implementations all happen to pass
};

// Add more specific examples to reduce variance while maintaining pass rate

Conclusion

Measuring context effectiveness transforms prompt engineering from art to science. By applying mutual information principles through practical methods, you can:

  1. Quantify context quality with variance testing
  2. Compare context versions with A/B testing
  3. Validate correctness with test pass rates
  4. Ensure style consistency with semantic similarity
  5. Improve systematically based on measurements

The goal isn’t zero variance (that would be over-constraining). The goal is predictable, high-quality outputs that meet your requirements.

Key Takeaways:

  • High mutual information = Context strongly predicts output
  • Measure with multiple methods for complete picture
  • Output variance is the simplest, most universal metric
  • Test pass rate connects MI to practical correctness
  • Track metrics over time to catch regressions
  • Set quality gates to prevent context degradation

Context without measurement is hope. Context with measurement is engineering.
