Summary
Mutual information quantifies how much your context reduces uncertainty in LLM outputs. This article provides practical methods to measure context effectiveness: output variance testing, A/B comparison, test pass rate proxies, and semantic similarity scoring. When outputs vary widely, your context has low mutual information and needs improvement. When outputs converge, your context is effectively constraining the model.
The Problem
You’ve written a CLAUDE.md file. You’ve added type definitions. You’ve included examples. But how do you know if your context actually helps?
Signs your context might be ineffective:
- High output variance: Same prompt produces wildly different code each time
- Ignored instructions: LLM doesn’t follow patterns you specified
- Frequent corrections: You’re constantly steering the model back on track
- Test failures: Generated code fails tests despite clear specifications
Without measurement, context engineering is guesswork. With measurement, it becomes science.
Mutual Information: Quick Recap
Mutual information I(X;Y) measures how much knowing Y tells you about X:
I(X;Y) = H(X) - H(X|Y)
Where:
- H(X) = entropy (uncertainty) of output X without context
- H(X|Y) = entropy of output X given context Y
- I(X;Y) = reduction in uncertainty (in bits) from providing context
High mutual information: Context strongly predicts output (good!)
Low mutual information: Context barely affects output (improve it!)
For the full theoretical foundation, see Information Theory for Coding Agents.
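To make the formula concrete, here is a minimal sketch of the plug-in (frequency-based) entropy estimate over sampled outputs. The function names `estimateEntropy` and `estimateMI` are illustrative, not part of any library API:

```typescript
// Plug-in entropy estimate: H ≈ -Σ p_i * log2(p_i),
// where p_i is the observed frequency of each distinct output.
function estimateEntropy(outputs: string[]): number {
  const counts = new Map<string, number>();
  for (const o of outputs) {
    counts.set(o, (counts.get(o) ?? 0) + 1);
  }
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / outputs.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Estimated mutual information: entropy of outputs sampled without
// context minus entropy of outputs sampled with context.
function estimateMI(withoutContext: string[], withContext: string[]): number {
  return estimateEntropy(withoutContext) - estimateEntropy(withContext);
}
```

For example, four distinct outputs without context (2 bits) collapsing to one repeated output with context (0 bits) gives an estimated MI of 2 bits. With small sample sizes this is a rough estimate, but it is enough to compare contexts against each other.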
Method 1: Output Variance Testing
The simplest measurement: run the same prompt multiple times and count the unique results.
Implementation
interface ContextEffectivenessResult {
uniqueOutputs: number;
totalGenerations: number;
entropyEstimate: number; // log2(uniqueOutputs)
effectiveness: 'high' | 'medium' | 'low';
samples: string[];
}
async function measureContextEffectiveness(
context: string,
prompt: string,
numTrials: number = 10,
temperature: number = 0.7
): Promise<ContextEffectivenessResult> {
const outputs: string[] = [];
for (let i = 0; i < numTrials; i++) {
const response = await llm.generate({
systemPrompt: context,
userPrompt: prompt,
temperature
});
outputs.push(normalizeCode(response));
}
const uniqueOutputs = new Set(outputs).size;
  const entropyEstimate = Math.log2(uniqueOutputs); // upper bound: assumes all unique outputs are equally likely
let effectiveness: 'high' | 'medium' | 'low';
if (uniqueOutputs <= 2) {
effectiveness = 'high'; // Strong constraint
} else if (uniqueOutputs <= 5) {
effectiveness = 'medium'; // Moderate constraint
} else {
effectiveness = 'low'; // Weak constraint
}
return {
uniqueOutputs,
totalGenerations: numTrials,
entropyEstimate,
effectiveness,
samples: outputs.slice(0, 3) // Keep first 3 for inspection
};
}
function normalizeCode(code: string): string {
// Remove whitespace variations to focus on semantic differences
return code
.replace(/\s+/g, ' ')
.replace(/;\s*/g, ';')
.trim();
}
Interpretation
| Unique Outputs | Entropy | Interpretation |
|---|---|---|
| 1-2 | 0-1 bits | High MI: Context very effective |
| 3-5 | 1.6-2.3 bits | Medium MI: Context somewhat effective |
| 6-10 | 2.6-3.3 bits | Low MI: Context needs improvement |
| >10 | >3.3 bits | Very Low MI: Context not constraining |
Example: Measuring a Prompt
// Vague context
const vagueContext = `
Write clean code.
Use TypeScript.
Handle errors properly.
`;
// Specific context
const specificContext = `
Write TypeScript functions that:
- Return Result<T, Error> for operations that can fail
- Use strict null checks
- Follow the pattern:
function processUser(user: User): Result<ProcessedUser, ValidationError> {
if (!user.email) {
return { success: false, error: new ValidationError('Email required') };
}
return { success: true, data: { ...user, processed: true } };
}
`;
const prompt = "Write a function to validate a payment method";
// Measure both
const vagueResult = await measureContextEffectiveness(vagueContext, prompt);
// { uniqueOutputs: 8, effectiveness: 'low' }
const specificResult = await measureContextEffectiveness(specificContext, prompt);
// { uniqueOutputs: 2, effectiveness: 'high' }
Method 2: A/B Testing Contexts
Compare two context versions directly by measuring output quality.
Implementation
interface ABTestResult {
contextA: {
testPassRate: number;
avgSimilarity: number;
uniqueOutputs: number;
};
contextB: {
testPassRate: number;
avgSimilarity: number;
uniqueOutputs: number;
};
winner: 'A' | 'B' | 'tie';
confidence: number;
}
async function abTestContexts(
contextA: string,
contextB: string,
prompt: string,
testCases: TestCase[],
numTrials: number = 20
): Promise<ABTestResult> {
const resultsA = await runTrials(contextA, prompt, testCases, numTrials);
const resultsB = await runTrials(contextB, prompt, testCases, numTrials);
// Compare test pass rates
const passRateDiff = resultsB.testPassRate - resultsA.testPassRate;
// Statistical significance (simplified)
const confidence = calculateConfidence(resultsA, resultsB);
let winner: 'A' | 'B' | 'tie';
if (passRateDiff > 0.1 && confidence > 0.95) {
winner = 'B';
} else if (passRateDiff < -0.1 && confidence > 0.95) {
winner = 'A';
} else {
winner = 'tie';
}
return { contextA: resultsA, contextB: resultsB, winner, confidence };
}
async function runTrials(
context: string,
prompt: string,
testCases: TestCase[],
numTrials: number
) {
let totalPassed = 0;
const outputs: string[] = [];
for (let i = 0; i < numTrials; i++) {
const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
outputs.push(output);
const passed = testCases.every(tc => tc.validate(output));
if (passed) totalPassed++;
}
return {
testPassRate: totalPassed / numTrials,
avgSimilarity: calculateAverageSimilarity(outputs),
uniqueOutputs: new Set(outputs.map(normalizeCode)).size
};
}
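The helpers `calculateConfidence` and `calculateAverageSimilarity` are referenced above but not defined. One possible sketch, assuming a two-proportion z-test for confidence and mean pairwise Jaccard token overlap for similarity (both choices are assumptions, not the original implementations):

```typescript
// Abramowitz-Stegun approximation of the error function,
// used to turn a z-score into a two-sided confidence level.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

// Two-proportion z-test: confidence that the pass rates differ.
// n is the number of trials per arm (assumed to match numTrials).
function calculateConfidence(
  a: { testPassRate: number },
  b: { testPassRate: number },
  n: number = 20
): number {
  const p = (a.testPassRate + b.testPassRate) / 2; // pooled proportion
  const se = Math.sqrt((2 * p * (1 - p)) / n);     // standard error of the difference
  if (se === 0) return a.testPassRate === b.testPassRate ? 0 : 1;
  const z = Math.abs(a.testPassRate - b.testPassRate) / se;
  return erf(z / Math.SQRT2); // P(|Z| < z) for a standard normal
}

// Mean pairwise Jaccard similarity over whitespace tokens.
function calculateAverageSimilarity(outputs: string[]): number {
  if (outputs.length < 2) return 1;
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      const a = new Set(outputs[i].split(/\s+/));
      const b = new Set(outputs[j].split(/\s+/));
      const inter = [...a].filter(t => b.has(t)).length;
      const union = new Set([...a, ...b]).size;
      total += union === 0 ? 1 : inter / union;
      pairs++;
    }
  }
  return total / pairs;
}
```

A z-test is a reasonable fit here because pass/fail trials are Bernoulli samples; swap in a proper statistics library for production use.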
Example: Testing Context Improvements
const baseContext = `
Use TypeScript. Handle errors with try/catch.
`;
const improvedContext = `
Use TypeScript with Result types. Never throw in business logic.
Pattern:
type Result<T, E> = { success: true; data: T } | { success: false; error: E };
function parseJSON<T>(input: string): Result<T, ParseError> {
try {
return { success: true, data: JSON.parse(input) };
} catch (e) {
return { success: false, error: new ParseError(e.message) };
}
}
`;
const testCases = [
{
name: 'Returns Result type',
validate: (code: string) => code.includes('Result<') || code.includes('success:')
},
{
name: 'No throw statements',
validate: (code: string) => !code.includes('throw new')
}
];
const result = await abTestContexts(
baseContext,
improvedContext,
"Write a function to parse user input",
testCases
);
// Result: { winner: 'B', confidence: 0.98 }
Method 3: Test Pass Rate as Proxy
Use automated tests as a proxy for mutual information. Higher pass rate = higher effective MI.
Implementation
interface TestPassRateMetrics {
passRate: number;
failedTests: string[];
estimatedMI: 'high' | 'medium' | 'low';
}
async function measureViaTestPassRate(
context: string,
prompt: string,
testSuite: TestSuite,
numTrials: number = 10
): Promise<TestPassRateMetrics> {
let totalPassed = 0;
const failedTests = new Map<string, number>();
for (let i = 0; i < numTrials; i++) {
const code = await llm.generate({ systemPrompt: context, userPrompt: prompt });
for (const test of testSuite.tests) {
try {
const passed = await test.run(code);
if (passed) totalPassed++;
else {
failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
}
} catch {
failedTests.set(test.name, (failedTests.get(test.name) ?? 0) + 1);
}
}
}
const totalTests = testSuite.tests.length * numTrials;
const passRate = totalPassed / totalTests;
let estimatedMI: 'high' | 'medium' | 'low';
if (passRate > 0.9) {
estimatedMI = 'high';
} else if (passRate > 0.7) {
estimatedMI = 'medium';
} else {
estimatedMI = 'low';
}
// Find consistently failing tests
const consistentFailures = [...failedTests.entries()]
.filter(([_, count]) => count > numTrials * 0.5)
.map(([name]) => name);
return { passRate, failedTests: consistentFailures, estimatedMI };
}
Interpreting Test Pass Rates
| Pass Rate | Interpretation | Action |
|---|---|---|
| >90% | High MI | Context is effective |
| 70-90% | Medium MI | Identify weak areas, add examples |
| 50-70% | Low MI | Major context gaps, restructure |
| <50% | Very Low MI | Context may be counterproductive |
Identifying Weak Areas
Consistently failing tests reveal context gaps:
const metrics = await measureViaTestPassRate(context, prompt, testSuite);
if (metrics.failedTests.includes('Returns correct error type')) {
// Context needs better error handling examples
context += `
Always return typed errors:
- ValidationError for input issues
- NotFoundError for missing resources
- AuthError for permission problems
`;
}
if (metrics.failedTests.includes('Handles null input')) {
// Context needs null handling pattern
context += `
Guard against null at function entry:
if (input === null || input === undefined) {
return { success: false, error: new ValidationError('Input required') };
}
`;
}
Method 4: Semantic Similarity Scoring
Measure how semantically similar outputs are to a reference implementation.
Implementation
interface SimilarityMetrics {
avgSimilarity: number; // 0-1 scale
minSimilarity: number;
maxSimilarity: number;
variance: number;
estimatedMI: 'high' | 'medium' | 'low';
}
async function measureSemanticSimilarity(
context: string,
prompt: string,
referenceOutput: string,
numTrials: number = 10
): Promise<SimilarityMetrics> {
const similarities: number[] = [];
for (let i = 0; i < numTrials; i++) {
const output = await llm.generate({ systemPrompt: context, userPrompt: prompt });
const similarity = await calculateSemanticSimilarity(output, referenceOutput);
similarities.push(similarity);
}
const avg = similarities.reduce((a, b) => a + b, 0) / similarities.length;
const variance = similarities.reduce((sum, s) => sum + Math.pow(s - avg, 2), 0) / similarities.length;
let estimatedMI: 'high' | 'medium' | 'low';
if (avg > 0.85 && variance < 0.01) {
estimatedMI = 'high';
} else if (avg > 0.7 && variance < 0.05) {
estimatedMI = 'medium';
} else {
estimatedMI = 'low';
}
return {
avgSimilarity: avg,
minSimilarity: Math.min(...similarities),
maxSimilarity: Math.max(...similarities),
variance,
estimatedMI
};
}
async function calculateSemanticSimilarity(
output: string,
reference: string
): Promise<number> {
// Use embeddings for semantic comparison
const outputEmbedding = await llm.embed(output);
const referenceEmbedding = await llm.embed(reference);
return cosineSimilarity(outputEmbedding, referenceEmbedding);
}
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
When to Use Semantic Similarity
Best for:
- Prose generation (documentation, explanations)
- Code that should match a specific style
- Translations or transformations
Less useful for:
- Tasks with multiple valid solutions
- Creative outputs where variance is acceptable
Combined Measurement Dashboard
Combine all methods for comprehensive measurement:
interface ContextEffectivenessDashboard {
varianceTest: ContextEffectivenessResult;
testPassRate: TestPassRateMetrics;
semanticSimilarity?: SimilarityMetrics;
overallScore: number; // 0-100
recommendations: string[];
}
async function fullContextAudit(
context: string,
prompt: string,
testSuite: TestSuite,
referenceOutput?: string
): Promise<ContextEffectivenessDashboard> {
const varianceTest = await measureContextEffectiveness(context, prompt);
const testPassRate = await measureViaTestPassRate(context, prompt, testSuite);
let semanticSimilarity: SimilarityMetrics | undefined;
if (referenceOutput) {
semanticSimilarity = await measureSemanticSimilarity(context, prompt, referenceOutput);
}
// Calculate overall score
const varianceScore = varianceTest.effectiveness === 'high' ? 100 :
varianceTest.effectiveness === 'medium' ? 70 : 40;
const testScore = testPassRate.passRate * 100;
  const similarityScore = semanticSimilarity ? semanticSimilarity.avgSimilarity * 100 : testScore; // fall back to testScore so a missing reference output doesn't skew the average
const overallScore = (varianceScore + testScore + similarityScore) / 3;
// Generate recommendations
const recommendations: string[] = [];
if (varianceTest.uniqueOutputs > 5) {
recommendations.push('Add more specific examples to reduce output variance');
}
if (testPassRate.failedTests.length > 0) {
recommendations.push(`Address failing tests: ${testPassRate.failedTests.join(', ')}`);
}
if (semanticSimilarity && semanticSimilarity.avgSimilarity < 0.7) {
recommendations.push('Include reference output in context to improve similarity');
}
return {
varianceTest,
testPassRate,
semanticSimilarity,
overallScore,
recommendations
};
}
Improving Low-MI Context
When measurements show low mutual information, apply these fixes:
1. Add Concrete Examples
Low MI often means missing examples:
# Before (low MI)
Use error handling best practices.
# After (higher MI)
Handle errors with Result types:
function fetchUser(id: string): Result<User, ApiError> {
if (!id) {
return { success: false, error: new ValidationError('ID required') };
}
// ...
}
2. Add Anti-Patterns
Show what NOT to do:
# Higher MI with anti-patterns
DO NOT:
- throw new Error() in business logic
- return null for errors
- swallow errors silently
DO:
- return Result<T, E> types
- log errors before returning
- include error context
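An anti-pattern list like the one above can double as an automated check against generated code. A sketch, where the regexes are illustrative assumptions rather than a complete linter:

```typescript
// Scan generated code for the anti-patterns listed above.
// Returns the names of any violations found.
function findAntiPatterns(code: string): string[] {
  const antiPatterns: Array<[string, RegExp]> = [
    ['throw in business logic', /throw\s+new\s+Error/],
    ['returns null for errors', /return\s+null/],
    ['swallowed error', /catch\s*(\([^)]*\))?\s*\{\s*\}/], // empty catch body
  ];
  return antiPatterns
    .filter(([, pattern]) => pattern.test(code))
    .map(([name]) => name);
}
```

Violations found this way plug directly into the test-case format used in Method 2: each anti-pattern becomes a `validate` function that returns true when the pattern is absent.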
3. Add Type Constraints
Types provide high information density:
// Low MI: "Write a user validator"
// High MI:
interface UserValidation {
validate(user: unknown): ValidationResult<User>;
}
type ValidationResult<T> =
| { valid: true; data: T }
| { valid: false; errors: ValidationError[] };
4. Add Test Cases as Examples
Tests are highly constraining:
The function must pass these tests:
test('returns success for valid email', () => {
expect(validate('[email protected]')).toEqual({
valid: true,
data: { email: '[email protected]' }
});
});
test('returns error for invalid email', () => {
expect(validate('not-an-email')).toEqual({
valid: false,
errors: [{ field: 'email', message: 'Invalid email format' }]
});
});
Best Practices
1. Measure Before Optimizing
// Baseline measurement
const baseline = await fullContextAudit(context, prompt, testSuite);
console.log(`Baseline score: ${baseline.overallScore}`);
// Make improvement
const improvedContext = context + additionalExamples;
// Measure improvement
const improved = await fullContextAudit(improvedContext, prompt, testSuite);
console.log(`Improved score: ${improved.overallScore}`);
console.log(`Delta: ${improved.overallScore - baseline.overallScore}`);
2. Use Multiple Measurement Methods
No single method captures everything:
| Method | Measures | Best For |
|---|---|---|
| Variance | Constraint strength | General context quality |
| Test pass rate | Correctness | Functional requirements |
| Semantic similarity | Style adherence | Code style consistency |
3. Track Over Time
interface ContextMetricsHistory {
timestamp: Date;
contextHash: string;
metrics: ContextEffectivenessDashboard;
}
async function trackContextEvolution(
metricsLog: ContextMetricsHistory[]
): Promise<void> {
// Analyze trends
const recentScores = metricsLog.slice(-10).map(m => m.metrics.overallScore);
const trend = calculateTrend(recentScores);
if (trend < 0) {
console.warn('Context effectiveness declining. Review recent changes.');
}
}
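The `calculateTrend` helper above is not defined; a minimal least-squares slope over the score series would serve (an assumption, not part of the original):

```typescript
// Least-squares slope of scores against their index:
// positive = improving over time, negative = declining.
function calculateTrend(scores: number[]): number {
  const n = scores.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = scores.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (scores[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}
```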
4. Set Quality Gates
Require minimum MI before deploying context changes:
async function validateContextChange(
oldContext: string,
newContext: string,
prompt: string,
testSuite: TestSuite
): Promise<boolean> {
const oldMetrics = await fullContextAudit(oldContext, prompt, testSuite);
const newMetrics = await fullContextAudit(newContext, prompt, testSuite);
// Gate: New context must not decrease effectiveness
if (newMetrics.overallScore < oldMetrics.overallScore - 5) {
console.error('Context change decreases effectiveness. Blocked.');
return false;
}
// Gate: Minimum score threshold
if (newMetrics.overallScore < 70) {
console.error('Context below minimum quality threshold.');
return false;
}
return true;
}
Common Pitfalls
Pitfall 1: Testing at Temperature 0
// BAD: Temperature 0 nearly always gives the same output
const result = await measureContextEffectiveness(context, prompt, 10, 0);
// Typically shows 1 unique output - meaningless!
// GOOD: Use moderate temperature to reveal variance
const result = await measureContextEffectiveness(context, prompt, 10, 0.7);
Pitfall 2: Too Few Trials
// BAD: 3 trials isn't statistically meaningful
const result = await measureContextEffectiveness(context, prompt, 3);
// GOOD: Use 10-20 trials for reliable measurement
const result = await measureContextEffectiveness(context, prompt, 15);
Pitfall 3: Ignoring Normalization
// BAD: Whitespace differences inflate unique count
const unique = new Set(outputs).size; // 8 "unique" outputs
// GOOD: Normalize before comparison
const unique = new Set(outputs.map(normalizeCode)).size; // 3 truly unique
Pitfall 4: Over-Optimizing for Single Metric
Context that scores perfectly on one metric may fail on others:
// High test pass rate but high variance = brittle context
const metrics = {
testPassRate: 0.95,
uniqueOutputs: 9 // Different implementations all happen to pass
};
// Add more specific examples to reduce variance while maintaining pass rate
Conclusion
Measuring context effectiveness transforms prompt engineering from art to science. By applying mutual information principles through practical methods, you can:
- Quantify context quality with variance testing
- Compare context versions with A/B testing
- Validate correctness with test pass rates
- Ensure style consistency with semantic similarity
- Improve systematically based on measurements
The goal isn’t zero variance (that would be over-constraining). The goal is predictable, high-quality outputs that meet your requirements.
Key Takeaways:
- High mutual information = Context strongly predicts output
- Measure with multiple methods for complete picture
- Output variance is the simplest, most universal metric
- Test pass rate connects MI to practical correctness
- Track metrics over time to catch regressions
- Set quality gates to prevent context degradation
Context without measurement is hope. Context with measurement is engineering.
Related
- Information Theory for Coding Agents – Theoretical foundations of entropy and mutual information
- Context Rot Auto-Compacting – Maintaining context quality over long sessions
- Progressive Disclosure Context – Load context only when needed
- Quality Gates as Information Filters – How verification reduces entropy
- Context Debugging Framework – Systematic approach to context issues
- Hierarchical Context Patterns – Organizing context effectively
- Test-Driven Prompting – Using tests to constrain LLM output
- Constraint-Based Prompting – Adding constraints to increase MI
References
- Claude Shannon – A Mathematical Theory of Communication – Foundational information theory paper
- Measuring Information Transfer in Neural Networks – Deep learning perspective on mutual information
- Prompt Engineering Best Practices – Official Anthropic documentation

