Measuring Context Effectiveness with Mutual Information

James Phoenix

Summary

Mutual information quantifies how much your context reduces uncertainty in LLM outputs. This article provides practical methods to measure context effectiveness: output variance testing, A/B comparison, test pass rate proxies, and semantic similarity scoring. When outputs vary widely, your context has low mutual information and needs improvement. When outputs converge, your context is effectively constraining the model.

The Problem

You’ve written a CLAUDE.md file. You’ve added type definitions. You’ve included examples. But how do you know if your context actually helps?

Signs your context might be ineffective:

  • High output variance: Same prompt produces wildly different code each time
  • Ignored instructions: LLM doesn’t follow patterns you specified
  • Frequent corrections: You’re constantly steering the model back on track
  • Test failures: Generated code fails tests despite clear specifications

Without measurement, context engineering is guesswork. With measurement, it becomes science.

Mutual Information: Quick Recap

Mutual information I(X;Y) measures how much knowing Y tells you about X:

I(X;Y) = H(X) - H(X|Y)

Where:

  • H(X) = Entropy (uncertainty) of output X without context
  • H(X|Y) = Entropy of output X given context Y
  • I(X;Y) = Reduction in uncertainty (bits) from providing context

High mutual information: Context strongly predicts output (good!)
Low mutual information: Context barely affects output (improve it!)

For the full theoretical foundation, see Information Theory for Coding Agents.
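To make the formula concrete, here is a back-of-the-envelope calculation (the counts are illustrative, not measured):

```typescript
// Illustrative MI estimate, treating observed outputs as roughly equiprobable.
// Without context: 10 distinct outputs observed across trials.
const entropyWithout = Math.log2(10);   // H(X) ≈ 3.32 bits
// With context: outputs collapse to 2 variants.
const entropyWith = Math.log2(2);       // H(X|Y) = 1 bit
// Context reduced uncertainty by ~2.32 bits.
const mutualInformation = entropyWithout - entropyWith;
console.log(mutualInformation.toFixed(2)); // "2.32"
```

Treating each unique output as equally likely is a crude entropy estimate, but it is enough to compare two contexts against each other.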

Method 1: Output Variance Testing

The simplest measurement: run the same prompt multiple times and count the unique outputs.

Implementation

interface ContextEffectivenessResult {
  uniqueOutputs: number;
  totalGenerations: number;
  entropyEstimate: number;  // log2(uniqueOutputs)
  effectiveness: 'high' | 'medium' | 'low';
  samples: string[];
}

async function measureContextEffectiveness(
  context: string,
  prompt: string,
  numTrials: number = 10,
  temperature: number = 0.7
): Promise<ContextEffectivenessResult> {
  const outputs: string[] = [];

  for (let i = 0; i < numTrials; i++) {
    const response = await llm.generate({
      systemPrompt: context,
      userPrompt: prompt,
      temperature
    });
    outputs.push(normalizeCode(response));
  }

  const uniqueOutputs = new Set(outputs).size;
  const entropyEstimate = Math.log2(uniqueOutputs);

  let effectiveness: 'high' | 'medium' | 'low';
  if (uniqueOutputs <= 2) {
    effectiveness = 'high';  // Strong constraint
  } else if (uniqueOutputs <= 5) {
    effectiveness = 'medium';  // Moderate constraint
  } else {
    effectiveness = 'low';  // Weak constraint
  }

  return {
    uniqueOutputs,
    totalGenerations: numTrials,
    entropyEstimate,
    effectiveness,
    samples: outputs.slice(0, 3)  // Keep first 3 for inspection
  };
}

function normalizeCode(code: string): string {
  // Remove whitespace variations to focus on semantic differences
  return code
    .replace(/\s+/g, ' ')
    .replace(/;\s*/g, ';')
    .trim();
}
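As a quick sanity check (repeating the helper so the snippet is self-contained), two generations that differ only in whitespace normalize to the same string, while a real semantic difference survives:

```typescript
function normalizeCode(code: string): string {
  // Remove whitespace variations to focus on semantic differences
  return code.replace(/\s+/g, ' ').replace(/;\s*/g, ';').trim();
}

const a = 'const x = 1;\n  const y = 2;';
const b = 'const x = 1; const y = 2;';
console.log(normalizeCode(a) === normalizeCode(b)); // true: only formatting differs
console.log(normalizeCode('let x = 1;') === normalizeCode(b)); // false: different code
```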

Interpretation

Unique Outputs | Entropy       | Interpretation
1-2            | 0-1 bits      | High MI: Context very effective
3-5            | 1.6-2.3 bits  | Medium MI: Context somewhat effective
6-10           | 2.6-3.3 bits  | Low MI: Context needs improvement
>10            | >3.3 bits     | Very Low MI: Context not constraining

Example: Measuring a Prompt

// Vague context
const vagueContext = `
Write clean code.
Use TypeScript.
Handle errors properly.
`;

// Specific context
const specificContext = `
Write TypeScript functions that:
- Return Result<T, Error> for operations that can fail
- Use strict null checks
- Follow the pattern:

function processUser(user: User): Result<ProcessedUser, ValidationError> {
  if (!user.email) {
    return { success: false, error: new ValidationError('Email required') };
  }
  return { success: true, data: { ...user, processed: true } };
}
`;

const prompt = "Write a function to validate a payment method";

// Measure both
const vagueResult = await measureContextEffectiveness(vagueContext, prompt);
// { uniqueOutputs: 8, effectiveness: 'low' }

const specificResult = await measureContextEffectiveness(specificContext, prompt);
// { uniqueOutputs: 2, effectiveness: 'high' }

Method 2: A/B Testing Contexts

Compare two context versions directly by measuring output quality.

Implementation

interface ABTestResult {
  contextA: {
    testPassRate: number;
    avgSimilarity: number;
    uniqueOutputs: number;
  };
  contextB: {
    testPassRate: number;
    avgSimilarity: number;
    uniqueOutputs: number;
  };
  winner: 'A' | 'B' | 'tie';
  confidence: number;
}

async function abTestContexts(
  contextA: string,
  contextB: string,
  prompt: string,
  testCases: TestCase[],
  numTrials: number = 20
): Promise<ABTestResult> {
  const resultsA = await runTrials(contextA, prompt, testCases, numTrials);
  const resultsB = await runTrials(contextB, prompt, testCases, numTrials);

  // Compare test pass rates
  const passRateDiff = resultsB.testPassRate - resultsA.testPassRate;

  // Statistical significance (simplified)
  const confidence = calculateConfidence(resultsA, resultsB);

  let winner: 'A' | 'B' | 'tie';
  if (passRateDiff > 0.1 && confidence > 0.95) {
    winner = 'B';
  } else if (passRateDiff < -0.1 && confidence > 0.95) {
    winner = 'A';
  } else {
    winner = 'tie';
  }

  return { contextA: resultsA, contextB: resultsB, winner, confidence };
}

async function runTrials(
  context: string,
  prompt: string,
  testCases: TestCase[],
  numTrials: number
) {
  let totalPassed = 0;
  const outputs: string[] = [];

  for (let i = 0; i < numTrials; i++) {
    const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
    outputs.push(output);

    const passed = testCases.every(tc => tc.validate(output));
    if (passed) totalPassed++;
  }

  return {
    testPassRate: totalPassed / numTrials,
    avgSimilarity: calculateAverageSimilarity(outputs),
    uniqueOutputs: new Set(outputs.map(normalizeCode)).size
  };
}
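The code above leans on two helpers that are left undefined. One possible sketch follows; the two-proportion z-test for calculateConfidence and token-level Jaccard overlap as a cheap stand-in for embedding similarity are assumptions on my part, not the only valid choices:

```typescript
// Sketch of the undefined helpers. The z-test and Jaccard token overlap
// are illustrative implementations, not the only valid ones.
interface TrialResults { testPassRate: number; }

function calculateConfidence(a: TrialResults, b: TrialResults, numTrials: number = 20): number {
  // Two-proportion z-test on pass rates across the two trial sets.
  const p1 = a.testPassRate, p2 = b.testPassRate;
  const pPool = (p1 + p2) / 2;
  const se = Math.sqrt(pPool * (1 - pPool) * (2 / numTrials));
  if (se === 0) return p1 === p2 ? 0 : 1;
  const z = Math.abs(p1 - p2) / se;
  return Math.min(1, 2 * (normalCdf(z) - 0.5)); // P(|Z| < z)
}

function normalCdf(z: number): number {
  // Abramowitz-Stegun style approximation of the standard normal CDF.
  const t = 1 / (1 + 0.2316419 * z);
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return 1 - p;
}

function calculateAverageSimilarity(outputs: string[]): number {
  // Mean pairwise Jaccard similarity over whitespace-delimited tokens.
  let total = 0, pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      const a = new Set(outputs[i].split(/\s+/));
      const b = new Set(outputs[j].split(/\s+/));
      const inter = [...a].filter(tok => b.has(tok)).length;
      const union = new Set([...a, ...b]).size;
      total += union === 0 ? 1 : inter / union;
      pairs++;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}
```

In production you would likely swap the Jaccard measure for embedding cosine similarity (as in Method 4) and use a proper statistics library for the significance test.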

Example: Testing Context Improvements

const baseContext = `
Use TypeScript. Handle errors with try/catch.
`;

const improvedContext = `
Use TypeScript with Result types. Never throw in business logic.

Pattern:
type Result<T, E> = { success: true; data: T } | { success: false; error: E };

function parseJSON<T>(input: string): Result<T, ParseError> {
  try {
    return { success: true, data: JSON.parse(input) };
  } catch (e) {
    return { success: false, error: new ParseError(e instanceof Error ? e.message : String(e)) };
  }
}
`;

const testCases = [
  {
    name: 'Returns Result type',
    validate: (code: string) => code.includes('Result<') || code.includes('success:')
  },
  {
    name: 'No throw statements',
    validate: (code: string) => !code.includes('throw new')
  }
];

const result = await abTestContexts(
  baseContext,
  improvedContext,
  "Write a function to parse user input",
  testCases
);

// Result: { winner: 'B', confidence: 0.98 }

Method 3: Test Pass Rate as Proxy

Use automated tests as a proxy for mutual information. Higher pass rate = higher effective MI.

Implementation

interface TestPassRateMetrics {
  passRate: number;
  failedTests: string[];
  estimatedMI: 'high' | 'medium' | 'low';
}

async function measureViaTestPassRate(
  context: string,
  prompt: string,
  testSuite: TestSuite,
  numTrials: number = 10
): Promise<TestPassRateMetrics> {
  let totalPassed = 0;
  const failedTests = new Map<string, number>();

  for (let i = 0; i < numTrials; i++) {
    const code = await llm.generate({ systemPrompt: context, userPrompt: prompt });

    for (const test of testSuite.tests) {
      try {
        const passed = await test.run(code);
        if (passed) totalPassed++;
        else {
          failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
        }
      } catch {
        failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
      }
    }
  }

  const totalTests = testSuite.tests.length * numTrials;
  const passRate = totalPassed / totalTests;

  let estimatedMI: 'high' | 'medium' | 'low';
  if (passRate > 0.9) {
    estimatedMI = 'high';
  } else if (passRate > 0.7) {
    estimatedMI = 'medium';
  } else {
    estimatedMI = 'low';
  }

  // Find consistently failing tests
  const consistentFailures = [...failedTests.entries()]
    .filter(([_, count]) => count > numTrials * 0.5)
    .map(([name]) => name);

  return { passRate, failedTests: consistentFailures, estimatedMI };
}

Interpreting Test Pass Rates

Pass Rate | Interpretation | Action
>90%      | High MI        | Context is effective
70-90%    | Medium MI      | Identify weak areas, add examples
50-70%    | Low MI         | Major context gaps, restructure
<50%      | Very Low MI    | Context may be counterproductive

Identifying Weak Areas

Consistently failing tests reveal context gaps:

const metrics = await measureViaTestPassRate(context, prompt, testSuite);

if (metrics.failedTests.includes('Returns correct error type')) {
  // Context needs better error handling examples
  context += `
Always return typed errors:
- ValidationError for input issues
- NotFoundError for missing resources
- AuthError for permission problems
`;
}

if (metrics.failedTests.includes('Handles null input')) {
  // Context needs null handling pattern
  context += `
Guard against null at function entry:
if (input === null || input === undefined) {
  return { success: false, error: new ValidationError('Input required') };
}
`;
}

Method 4: Semantic Similarity Scoring

Measure how semantically similar outputs are to a reference implementation.

Implementation

interface SimilarityMetrics {
  avgSimilarity: number;  // 0-1 scale
  minSimilarity: number;
  maxSimilarity: number;
  variance: number;
  estimatedMI: 'high' | 'medium' | 'low';
}

async function measureSemanticSimilarity(
  context: string,
  prompt: string,
  referenceOutput: string,
  numTrials: number = 10
): Promise<SimilarityMetrics> {
  const similarities: number[] = [];

  for (let i = 0; i < numTrials; i++) {
    const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
    const similarity = await calculateSemanticSimilarity(output, referenceOutput);
    similarities.push(similarity);
  }

  const avg = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const variance = similarities.reduce((sum, s) => sum + Math.pow(s - avg, 2), 0) / similarities.length;

  let estimatedMI: 'high' | 'medium' | 'low';
  if (avg > 0.85 && variance < 0.01) {
    estimatedMI = 'high';
  } else if (avg > 0.7 && variance < 0.05) {
    estimatedMI = 'medium';
  } else {
    estimatedMI = 'low';
  }

  return {
    avgSimilarity: avg,
    minSimilarity: Math.min(...similarities),
    maxSimilarity: Math.max(...similarities),
    variance,
    estimatedMI
  };
}

async function calculateSemanticSimilarity(
  output: string,
  reference: string
): Promise<number> {
  // Use embeddings for semantic comparison
  const outputEmbedding = await llm.embed(output);
  const referenceEmbedding = await llm.embed(reference);

  return cosineSimilarity(outputEmbedding, referenceEmbedding);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

When to Use Semantic Similarity

Best for:

  • Prose generation (documentation, explanations)
  • Code that should match a specific style
  • Translations or transformations

Less useful for:

  • Tasks with multiple valid solutions
  • Creative outputs where variance is acceptable

Combined Measurement Dashboard

Combine all methods for comprehensive measurement:

interface ContextEffectivenessDashboard {
  varianceTest: ContextEffectivenessResult;
  testPassRate: TestPassRateMetrics;
  semanticSimilarity?: SimilarityMetrics;
  overallScore: number;  // 0-100
  recommendations: string[];
}

async function fullContextAudit(
  context: string,
  prompt: string,
  testSuite: TestSuite,
  referenceOutput?: string
): Promise<ContextEffectivenessDashboard> {
  const varianceTest = await measureContextEffectiveness(context, prompt);
  const testPassRate = await measureViaTestPassRate(context, prompt, testSuite);

  let semanticSimilarity: SimilarityMetrics | undefined;
  if (referenceOutput) {
    semanticSimilarity = await measureSemanticSimilarity(context, prompt, referenceOutput);
  }

  // Calculate overall score
  const varianceScore = varianceTest.effectiveness === 'high' ? 100 :
                        varianceTest.effectiveness === 'medium' ? 70 : 40;
  const testScore = testPassRate.passRate * 100;
  const similarityScore = semanticSimilarity ? semanticSimilarity.avgSimilarity * 100 : testScore;

  const overallScore = (varianceScore + testScore + similarityScore) / 3;

  // Generate recommendations
  const recommendations: string[] = [];

  if (varianceTest.uniqueOutputs > 5) {
    recommendations.push('Add more specific examples to reduce output variance');
  }

  if (testPassRate.failedTests.length > 0) {
    recommendations.push(`Address failing tests: ${testPassRate.failedTests.join(', ')}`);
  }

  if (semanticSimilarity && semanticSimilarity.avgSimilarity < 0.7) {
    recommendations.push('Include reference output in context to improve similarity');
  }

  return {
    varianceTest,
    testPassRate,
    semanticSimilarity,
    overallScore,
    recommendations
  };
}

Improving Low-MI Context

When measurements show low mutual information, apply these fixes:

1. Add Concrete Examples

Low MI often means missing examples:

# Before (low MI)
Use error handling best practices.

# After (higher MI)
Handle errors with Result types:

function fetchUser(id: string): Result<User, ApiError> {
  if (!id) {
    return { success: false, error: new ValidationError('ID required') };
  }
  // ...
}

2. Add Anti-Patterns

Show what NOT to do:

# Higher MI with anti-patterns

DO NOT:
- throw new Error() in business logic
- return null for errors
- swallow errors silently

DO:
- return Result<T, E> types
- log errors before returning
- include error context

3. Add Type Constraints

Types provide high information density:

// Low MI: "Write a user validator"

// High MI:
interface UserValidation {
  validate(user: unknown): ValidationResult<User>;
}

type ValidationResult<T> =
  | { valid: true; data: T }
  | { valid: false; errors: ValidationError[] };

4. Add Test Cases as Examples

Tests are highly constraining:

The function must pass these tests:

test('returns success for valid email', () => {
  expect(validate('[email protected]')).toEqual({
    valid: true,
    data: { email: '[email protected]' }
  });
});

test('returns error for invalid email', () => {
  expect(validate('not-an-email')).toEqual({
    valid: false,
    errors: [{ field: 'email', message: 'Invalid email format' }]
  });
});

Best Practices

1. Measure Before Optimizing

// Baseline measurement
const baseline = await fullContextAudit(context, prompt, testSuite);
console.log(`Baseline score: ${baseline.overallScore}`);

// Make improvement
const improvedContext = context + additionalExamples;

// Measure improvement
const improved = await fullContextAudit(improvedContext, prompt, testSuite);
console.log(`Improved score: ${improved.overallScore}`);
console.log(`Delta: ${improved.overallScore - baseline.overallScore}`);

2. Use Multiple Measurement Methods

No single method captures everything:

Method              | Measures            | Best For
Variance            | Constraint strength | General context quality
Test pass rate      | Correctness         | Functional requirements
Semantic similarity | Style adherence     | Code style consistency

3. Track Over Time

interface ContextMetricsHistory {
  timestamp: Date;
  contextHash: string;
  metrics: ContextEffectivenessDashboard;
}

async function trackContextEvolution(
  metricsLog: ContextMetricsHistory[]
): Promise<void> {
  // Analyze trends
  const recentScores = metricsLog.slice(-10).map(m => m.metrics.overallScore);
  const trend = calculateTrend(recentScores);

  if (trend < 0) {
    console.warn('Context effectiveness declining. Review recent changes.');
  }
}
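calculateTrend is left undefined above; a minimal sketch (one reasonable choice, not the only one) is the least-squares slope of score against audit index:

```typescript
// Minimal calculateTrend sketch: least-squares slope of score vs. index.
// A negative slope means effectiveness is declining across recent audits.
function calculateTrend(scores: number[]): number {
  const n = scores.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  const yMean = scores.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (scores[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den; // average score change per audit
}
```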

4. Set Quality Gates

Require minimum MI before deploying context changes:

async function validateContextChange(
  oldContext: string,
  newContext: string,
  prompt: string,
  testSuite: TestSuite
): Promise<boolean> {
  const oldMetrics = await fullContextAudit(oldContext, prompt, testSuite);
  const newMetrics = await fullContextAudit(newContext, prompt, testSuite);

  // Gate: New context must not decrease effectiveness
  if (newMetrics.overallScore < oldMetrics.overallScore - 5) {
    console.error('Context change decreases effectiveness. Blocked.');
    return false;
  }

  // Gate: Minimum score threshold
  if (newMetrics.overallScore < 70) {
    console.error('Context below minimum quality threshold.');
    return false;
  }

  return true;
}

Common Pitfalls

Pitfall 1: Testing at Temperature 0

// BAD: Temperature 0 always gives same output
const result = await measureContextEffectiveness(context, prompt, 10, 0);
// Always shows 1 unique output - meaningless!

// GOOD: Use moderate temperature to reveal variance
const result = await measureContextEffectiveness(context, prompt, 10, 0.7);

Pitfall 2: Too Few Trials

// BAD: 3 trials isn't statistically meaningful
const result = await measureContextEffectiveness(context, prompt, 3);

// GOOD: Use 10-20 trials for reliable measurement
const result = await measureContextEffectiveness(context, prompt, 15);

Pitfall 3: Ignoring Normalization

// BAD: Whitespace differences inflate unique count
const unique = new Set(outputs).size;  // 8 "unique" outputs

// GOOD: Normalize before comparison
const unique = new Set(outputs.map(normalizeCode)).size;  // 3 truly unique

Pitfall 4: Over-Optimizing for Single Metric

Context that scores perfectly on one metric may fail on others:

// High test pass rate but high variance = brittle context
const metrics = {
  testPassRate: 0.95,
  uniqueOutputs: 9  // Different implementations all happen to pass
};

// Add more specific examples to reduce variance while maintaining pass rate

Conclusion

Measuring context effectiveness transforms prompt engineering from art to science. By applying mutual information principles through practical methods, you can:

  1. Quantify context quality with variance testing
  2. Compare context versions with A/B testing
  3. Validate correctness with test pass rates
  4. Ensure style consistency with semantic similarity
  5. Improve systematically based on measurements

The goal isn’t zero variance (that would be over-constraining). The goal is predictable, high-quality outputs that meet your requirements.

Key Takeaways:

  • High mutual information = Context strongly predicts output
  • Measure with multiple methods for complete picture
  • Output variance is the simplest, most universal metric
  • Test pass rate connects MI to practical correctness
  • Track metrics over time to catch regressions
  • Set quality gates to prevent context degradation

Context without measurement is hope. Context with measurement is engineering.
