Summary
Information theory provides the mathematical foundation for understanding how coding agents process and generate code. Key concepts include entropy (measuring uncertainty), information content (the value of constraints), mutual information (context effectiveness), and channel capacity (context window limits). This framework explains why quality gates, types, and tests work: they reduce entropy by filtering the space of candidate programs down to the valid ones.
Introduction
Why do quality gates work? Why are types more valuable than comments? Why does better context lead to better code?
The answer lies in information theory—the mathematical study of information, communication, and uncertainty. Developed by Claude Shannon in 1948, information theory provides a rigorous framework for understanding how coding agents process and generate code.
This article explores four core concepts from information theory and their applications to AI-assisted coding:
- Entropy: Measuring uncertainty in code generation
- Information Content: Quantifying the value of constraints
- Mutual Information: Understanding how context reduces uncertainty
- Channel Capacity: Optimizing context windows as information channels
Concept 1: Entropy (Uncertainty)
The Entropy Formula
H(X) = -∑ P(x) log₂ P(x)
Where:
- H(X) = Entropy (uncertainty) measured in bits
- X = Set of all possible outputs
- P(x) = Probability of output x
- ∑ = Sum over all possible outputs
What Entropy Measures
Entropy quantifies how surprised you’d be by an outcome:
- High entropy: Many outcomes equally likely → High surprise → Unpredictable
- Low entropy: Few outcomes likely → Low surprise → Predictable
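To make the numbers concrete, here is a minimal sketch (TypeScript, with illustrative probabilities only) that computes entropy directly from a probability distribution:
// Shannon entropy in bits: H = -∑ p·log₂(p), skipping zero-probability outcomes.
function entropyBits(probabilities: number[]): number {
  return -probabilities
    .filter((p) => p > 0)
    .reduce((sum, p) => sum + p * Math.log2(p), 0);
}

// Five equally likely implementations: maximal uncertainty for 5 outcomes.
console.log(entropyBits([0.2, 0.2, 0.2, 0.2, 0.2]).toFixed(2)); // "2.32"

// One implementation dominates: low uncertainty.
console.log(entropyBits([0.9, 0.05, 0.05]).toFixed(2)); // "0.57"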
Entropy in Code Generation
When an LLM generates code, entropy measures the spread of the probability distribution over all possible programs.
Without constraints (high entropy):
// Prompt: "Write a function to process data"
// Possible outputs (all equally likely):
function processData(data: any): any { ... } // P = 0.2
function processData(data: any): void { ... } // P = 0.2
function processData(data: any): boolean { ... } // P = 0.2
function processData(data: any[]): string[] { ... } // P = 0.2
function processData(data: unknown): never { ... } // P = 0.2
// Entropy: H = -∑ P log₂ P = -(5 × 0.2 × log₂(0.2)) ≈ 2.32 bits
With constraints (low entropy):
// Prompt + types + tests + context:
function processData(
data: Array<{ id: number; value: string }>
): ProcessResult {
// Only 1-2 valid implementations that satisfy all constraints
// Entropy: H ≈ 0.5 bits (low)
}
Quality Gates as Entropy Filters
Each quality gate eliminates invalid states, reducing entropy:
All syntactically valid programs: H = 20 bits (1M+ programs)
↓
[Type Checker]
↓
Type-safe programs: H = 15 bits (32K programs)
↓
[Linter]
↓
Type-safe, clean programs: H = 12 bits (4K programs)
↓
[Tests]
↓
Type-safe, clean, correct programs: H = 5 bits (32 programs)
Key insight: Entropy reductions add in bits, which means the set of valid programs shrinks multiplicatively. Going from H = 20 to H = 5 removes 15 bits, cutting the number of valid programs by a factor of 2^15 = 32,768×.
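A small sketch of this arithmetic (TypeScript; the bit counts are the illustrative figures from the diagram above, not measurements):
// Candidate programs remaining after stacking gates, starting from 2^H candidates.
function remainingPrograms(initialEntropyBits: number, bitsPerGate: number[]): number {
  const bitsRemoved = bitsPerGate.reduce((sum, bits) => sum + bits, 0);
  return 2 ** Math.max(0, initialEntropyBits - bitsRemoved);
}

console.log(remainingPrograms(20, []));        // 1048576 (all syntactically valid programs)
console.log(remainingPrograms(20, [5]));       // 32768 (after the type checker)
console.log(remainingPrograms(20, [5, 3]));    // 4096 (after types + linter)
console.log(remainingPrograms(20, [5, 3, 7])); // 32 (after types + linter + tests)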
Practical Application: Measuring Code Predictability
You can estimate entropy through test failure rates:
# High entropy (unpredictable)
test_failure_rate = 0.40  # many wrong implementations
entropy_estimate = "HIGH"

# Low entropy (predictable)
test_failure_rate = 0.05  # few wrong implementations
entropy_estimate = "LOW"
Lower failure rate → Lower entropy → More predictable LLM behavior.
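You can also estimate entropy directly by sampling: run the agent several times on the same prompt and measure how spread out the outputs are. A minimal sketch (TypeScript; the sampled outputs below are hypothetical):
// Empirical entropy of sampled outputs: each distinct output is an outcome,
// and its observed frequency is used as its probability.
function empiricalEntropyBits(samples: string[]): number {
  const counts = new Map<string, number>();
  for (const sample of samples) counts.set(sample, (counts.get(sample) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / samples.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Hypothetical run: 10 generations, 3 distinct implementations (6, 3, and 1 occurrences).
const samples = ["implA", "implA", "implA", "implA", "implA", "implA", "implB", "implB", "implB", "implC"];
console.log(empiricalEntropyBits(samples).toFixed(2)); // "1.30"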
Concept 2: Information Content
The Information Content Formula
I(x) = -log₂ P(x)
Where:
- I(x) = Information content of event x (measured in bits)
- P(x) = Probability of event x
What Information Content Measures
Information content quantifies how much you learn when an event occurs:
- Rare events (low P): High information content (surprising)
- Common events (high P): Low information content (expected)
Examples:
# Event: "The function returns a value"
P(returns_value) = 0.9
I(returns_value) = -log₂(0.9) ≈ 0.15 bits # Low information (expected)
# Event: "The function signature matches specification exactly"
P(exact_match) = 0.1
I(exact_match) = -log₂(0.1) ≈ 3.32 bits # High information (surprising)
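The same arithmetic as a tiny helper (TypeScript; the probabilities are the illustrative values above):
// Information content in bits of an event with probability p (0 < p <= 1).
function informationBits(p: number): number {
  return -Math.log2(p);
}

console.log(informationBits(0.9).toFixed(2)); // "0.15" (expected event, little information)
console.log(informationBits(0.1).toFixed(2)); // "3.32" (surprising event, lots of information)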
Information Content in Constraints
Different constraints provide different amounts of information:
Types (high information content):
// This constraint eliminates ~90% of possible implementations
function processUser(user: User): Promise<Result>
// Information provided:
I(type_constraint) ≈ 3.3 bits // Eliminates 90% → P = 0.1 remain
Tests (very high information content):
// This test eliminates ~95% of type-safe implementations
test('processUser returns success=true for valid user', () => {
expect(result.success).toBe(true);
});
// Information provided:
I(test) ≈ 4.3 bits // Eliminates 95% → P = 0.05 remain
Comments (low information content):
// This comment eliminates ~10% of implementations (low constraint)
// Process the user data
// Information provided:
I(comment) ≈ 0.15 bits // Eliminates 10% → P = 0.9 remain
Why Types > Comments
Types provide more information per token than comments:
Type hint: ≈3.3 bits of constraint in 5 tokens ≈ 0.66 bits/token
Comment: ≈0.15 bits of constraint in 10 tokens ≈ 0.015 bits/token
With these illustrative numbers, types carry more than an order of magnitude more information per token.
This is why types, tests, and working examples are more valuable than verbose documentation—they encode more information per unit of context.
Practical Application: Prioritizing Context
When filling a context window, prioritize high-information content:
High information density (include first):
- Type definitions
- Test cases
- Working code examples
- Explicit rules with examples
Low information density (include last or omit):
- Generic comments
- Verbose prose explanations
- Outdated documentation
Concept 3: Mutual Information
The Mutual Information Formula
I(X;Y) = H(X) - H(X|Y)
Where:
- I(X;Y) = Mutual information between X and Y (bits)
- H(X) = Entropy of X (without knowing Y)
- H(X|Y) = Conditional entropy of X given Y (knowing Y)
What Mutual Information Measures
Mutual information quantifies how much knowing Y tells you about X:
- High mutual information: Y strongly predicts X
- Low mutual information: Y doesn’t help predict X
Example:
# X = LLM output (function implementation)
# Y = Context provided (CLAUDE.md, types, tests)
# Without context (Y):
H(X) = 15 bits # Many possible implementations
# With context (Y):
H(X|Y) = 3 bits # Few possible implementations
# Mutual information:
I(X;Y) = H(X) - H(X|Y) = 15 - 3 = 12 bits
# Interpretation: Context provides 12 bits of information,
# reducing uncertainty by 2^12 = 4,096×
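In practice the distributions are unknown, but the same subtraction can be approximated empirically: sample the model with and without the context, estimate the entropy of each set of outputs (as in Concept 1), and take the difference. A sketch with hypothetical numbers (TypeScript):
// Mutual information estimated as the drop in output entropy once context is added.
// The two entropy values would come from an empirical estimator over sampled outputs;
// the numbers here are illustrative.
function mutualInformationBits(entropyWithoutContext: number, entropyWithContext: number): number {
  return Math.max(0, entropyWithoutContext - entropyWithContext);
}

const hWithoutContext = 3.1; // e.g. 10 samples, nearly all distinct implementations
const hWithContext = 0.5;    // e.g. 10 samples, one dominant implementation
console.log(mutualInformationBits(hWithoutContext, hWithContext).toFixed(1)); // "2.6" bits removed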
Mutual Information in Context Effectiveness
Different context provides different mutual information with output:
High mutual information (context strongly determines output):
// Context: Complete working example
function processUser(user: User): Result {
if (!user.email) {
return { success: false, error: 'Email required' };
}
return { success: true, user };
}
// I(context; output) ≈ 10 bits (very predictive)
Low mutual information (context doesn’t help much):
// Context: Vague comment
// "Process the user somehow and return something"
// I(context; output) ≈ 1 bit (not predictive)
Optimizing for High Mutual Information
To maximize mutual information between context and output:
- Show, don’t tell: Working examples > explanations
- Be specific: Concrete constraints > vague guidelines
- Include anti-patterns: Show what NOT to do
- Provide multiple examples: Multiple examples > single example
Example: Low mutual information:
# Context
Write clean, maintainable code.
Use good patterns.
Handle errors properly.
Improved: High mutual information:
# Context
✅ DO THIS:
function authenticate(email: string, password: string): AuthResult {
// Validate input
if (!email || !password) {
return { success: false, error: 'Email and password required' };
}
// ... implementation
}
❌ DON'T DO THIS:
function authenticate(email: string, password: string): User {
if (!email) throw new Error('Email required'); // Don't throw
// ... implementation
}
The second context has higher mutual information—it more strongly determines what the output should look like.
Practical Application: Context Debugging
If LLM outputs are unpredictable, check mutual information:
if output_is_unpredictable:
# Low mutual information between context and output
# Add high-MI context:
context.add(working_examples)
context.add(explicit_types)
context.add(test_cases)
context.add(anti_patterns)
Concept 4: Channel Capacity
The Channel Capacity Formula
C = max I(X;Y)
Where:
- C = Channel capacity (maximum information transfer rate)
- I(X;Y) = Mutual information between input and output
- max = Maximum over all possible input distributions
What Channel Capacity Measures
Channel capacity quantifies the maximum amount of information that can be reliably transmitted through a channel.
For coding agents, the context window is the channel:
Input (X) → [Context Window (limited capacity)] → Output (Y)
Context Window as Information Bottleneck
LLM context windows have finite capacity:
# Example: Claude Sonnet 4 context window
max_tokens = 200_000
# If each token encodes ~4 bits of information
channel_capacity = 200_000 tokens × 4 bits/token
= 800_000 bits
= 100 KB of information
You cannot exceed this capacity. If you try to provide more information, either:
- Truncation: Early context gets cut off
- Dilution: Information density decreases (more tokens, same info)
- Compression: Information gets lost through summarization
Optimizing for Channel Capacity
To maximize information transfer within capacity limits:
Strategy 1: Maximize Information Density
// ❌ Low density (many tokens, little information)
const context = `
The function should process the user data.
It takes a user object as input.
It returns a result object.
The result has a success field.
The result has an error field if there's an error.
`; // 50 tokens, ~2 bits of information
// ✅ High density (few tokens, much information)
interface Result { success: boolean; error?: string; }
function processUser(user: User): Result;
// 15 tokens, ~8 bits of information
Strategy 2: Hierarchical Context Loading
Load only relevant context to avoid wasting capacity:
// ❌ Wastes capacity (loads everything)
context = root_claude_md + all_domain_mds + all_schemas;
// 50K tokens, but only 10K relevant
// ✅ Optimizes capacity (loads selectively)
context = root_claude_md + relevant_domain_md + relevant_schema;
// 12K tokens, all relevant
Strategy 3: Prompt Caching
Cache stable, high-information content so it is processed once and reused cheaply across requests:
// Cached context (processed once; reused at lower cost and latency on later requests)
const cached = claude_md + schemas + standards; // 10K tokens
// Dynamic context (changes every request)
const dynamic = task_description + file_contents; // 5K tokens
// Note: cached tokens still occupy the context window. Caching saves cost and latency;
// it does not raise the 200K-token limit, so keep the cached prefix lean and stable.
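A minimal sketch of how the split might look in application code (TypeScript). The helper and the context sources are hypothetical; with the Anthropic Messages API, the stable prefix is the portion you would mark for caching (a cache_control block with type "ephemeral"), per the prompt-caching documentation.
// Split the prompt into a stable, cacheable prefix and a per-request dynamic suffix.
// claudeMd, schemas, standards, taskDescription, fileContents are hypothetical strings.
interface PromptParts {
  cacheablePrefix: string; // identical across requests, eligible for prompt caching
  dynamicSuffix: string;   // changes every request, always processed fresh
}

function buildPrompt(
  claudeMd: string,
  schemas: string,
  standards: string,
  taskDescription: string,
  fileContents: string
): PromptParts {
  return {
    cacheablePrefix: [claudeMd, schemas, standards].join("\n\n"),
    dynamicSuffix: [taskDescription, fileContents].join("\n\n"),
  };
}
Keeping the prefix byte-for-byte identical across requests is what makes caching effective; any change to it invalidates the cached portion from that point onward.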
Practical Application: Context Window Utilization
Monitor how much of your channel capacity is used:
interface ChannelMetrics {
totalCapacity: number; // 200,000 tokens
used: number; // 45,000 tokens
utilization: number; // 22.5%
informationDensity: number; // bits per token
}
const metrics: ChannelMetrics = {
totalCapacity: 200_000,
used: 45_000,
utilization: 0.225, // 22.5% used
informationDensity: 3.5, // 3.5 bits/token (good)
};
// If utilization > 80%, you're near capacity
// If density < 2, you're wasting capacity with low-information content
Integrating All Four Concepts
These concepts work together to explain AI-assisted coding:
The Full Picture
┌─────────────────────────────────────────────────────────────┐
│ CODING AGENT SYSTEM │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. CONTEXT (Input) │
│ • Limited by channel capacity (200K tokens) │
│ • Must maximize information density (bits/token) │
│ • High mutual information with desired output │
│ │
│ ↓ │
│ │
│ 2. LLM GENERATION (Processing) │
│ • Samples from probability distribution │
│ • Entropy = uncertainty in distribution │
│ • Better context → lower entropy │
│ │
│ ↓ │
│ │
│ 3. QUALITY GATES (Filtering) │
│ • Each gate filters invalid states │
│ • Reduces entropy exponentially │
│ • Information content = bits eliminated │
│ │
│ ↓ │
│ │
│ 4. OUTPUT (Result) │
│ • Low entropy = predictable, correct code │
│ • High mutual information with input context │
│ • Satisfies all constraints (passed all filters) │
│ │
└─────────────────────────────────────────────────────────────┘
Example: Authentication Function
Step 1: Context (Channel Capacity)
// Context: 5K tokens, information density = 4 bits/token
// Total information: 20,000 bits
interface User { id: number; email: string; passwordHash: string; }
interface AuthResult { success: boolean; user?: User; error?: string; }
function authenticate(email: string, password: string): Promise<AuthResult>;
test('valid credentials return success', async () => {
const result = await authenticate('[email protected]', 'correct');
expect(result.success).toBe(true);
});
// CLAUDE.md: Never throw exceptions in auth functions.
// Always return AuthResult with success boolean.
Step 2: LLM Generation (Entropy)
# Before context: H(output) = 15 bits (32K possible implementations)
# After context: H(output|context) = 3 bits (8 possible implementations)
# Mutual information:
I(context; output) = 15 - 3 = 12 bits
# Context reduced uncertainty by 2^12 = 4,096×
Step 3: Quality Gates (Information Content)
// Type checker:
I(type_gate) = 3 bits // Eliminates 87.5% of the remaining implementations
H: 3 bits → ~0 bits (only type-safe implementations remain)
// Linter and tests:
// Once H is near 0 bits, further gates cannot push it below 0 (entropy is non-negative).
// They mostly confirm the surviving 1-2 implementations rather than eliminate more.
Step 4: Output (Low Entropy)
// Final implementation (1 of ~2 valid options)
async function authenticate(
email: string,
password: string
): Promise<AuthResult> {
if (!email || !password) {
return { success: false, error: 'Email and password required' };
}
const user = await findUserByEmail(email);
if (!user) {
return { success: false, error: 'Invalid credentials' };
}
const valid = await verifyPassword(password, user.passwordHash);
if (!valid) {
return { success: false, error: 'Invalid credentials' };
}
return { success: true, user };
}
// Final entropy: H ≈ 0.5 bits (only minor style variations possible)
Practical Applications
Application 1: Context Budget Allocation
Use information theory to allocate your context window:
interface ContextBudget {
component: string;
tokens: number;
informationBits: number;
priority: number;
}
const budget: ContextBudget[] = [
{
component: 'Type definitions',
tokens: 2000,
informationBits: 8000, // 4 bits/token (high density)
priority: 1, // Include first
},
{
component: 'Test cases',
tokens: 3000,
informationBits: 10500, // 3.5 bits/token
priority: 2,
},
{
component: 'Working examples',
tokens: 2500,
informationBits: 8750, // 3.5 bits/token
priority: 3,
},
{
component: 'Documentation comments',
tokens: 5000,
informationBits: 5000, // 1 bit/token (low density)
priority: 4, // Include last or omit
},
];
// Sort by information density (bits/token), include until capacity reached
const CHANNEL_CAPACITY = 200_000; // context window size in tokens

const optimizedContext = budget
  .slice() // avoid mutating the original budget array
  .sort((a, b) => b.informationBits / b.tokens - a.informationBits / a.tokens)
  .reduce(
    (acc, item) => {
      if (acc.totalTokens + item.tokens <= CHANNEL_CAPACITY) {
        acc.items.push(item);
        acc.totalTokens += item.tokens;
        acc.totalInformation += item.informationBits;
      }
      return acc;
    },
    { items: [] as ContextBudget[], totalTokens: 0, totalInformation: 0 }
  );
Application 2: Quality Gate ROI Analysis
Calculate the information gain (entropy reduction) per gate:
interface GateROI {
gate: string;
setupCost: number; // Hours to implement
entropyReduction: number; // Bits eliminated
roi: number; // Bits per hour
}
const gateAnalysis: GateROI[] = [
{
gate: 'TypeScript (strict mode)',
setupCost: 2,
entropyReduction: 5, // Eliminates ~97% of type-unsafe code
roi: 2.5, // 2.5 bits/hour
},
{
gate: 'Integration tests',
setupCost: 8,
entropyReduction: 8, // Eliminates ~99.6% of incorrect behavior
roi: 1.0, // 1 bit/hour
},
{
gate: 'Custom ESLint rules',
setupCost: 4,
entropyReduction: 3, // Eliminates ~87.5% of style violations
roi: 0.75, // 0.75 bits/hour
},
];
// Prioritize gates by ROI
// TypeScript first (highest ROI), custom linting last (lowest ROI)
Application 3: Context Effectiveness Measurement
Measure mutual information between context and output:
from math import log2

def measure_context_effectiveness(context: str, num_trials: int = 10) -> str:
    """
    Generate code multiple times with the same context.
    High variance = low mutual information (context not effective).
    Low variance = high mutual information (context very effective).
    """
    outputs = []
    for _ in range(num_trials):
        output = llm.generate(context)  # llm: your model client (assumed to be defined)
        outputs.append(output)

    # Estimate entropy from the number of distinct outputs
    unique_outputs = len(set(outputs))
    entropy_estimate = log2(unique_outputs)

    # High entropy = many different outputs = low mutual information
    if entropy_estimate > 3:    # more than 8 different outputs
        return "LOW_MI: Context not effective, improve specificity"
    elif entropy_estimate > 1:  # 3-8 different outputs
        return "MEDIUM_MI: Context somewhat effective"
    else:                       # 1-2 outputs
        return "HIGH_MI: Context very effective"
Application 4: Cost-Benefit Analysis of Context
Balance information gain vs. token cost:
interface ContextItem {
content: string;
tokens: number;
informationBits: number;
costPerToken: number; // $/token
}
function calculateContextValue(item: ContextItem): number {
const informationValue = item.informationBits * 0.001; // $0.001 per bit
const tokenCost = item.tokens * item.costPerToken;
return informationValue - tokenCost;
}
const contextItems: ContextItem[] = [
{
content: 'Type definitions',
tokens: 2000,
informationBits: 8000,
costPerToken: 0.000003, // $3 per 1M tokens
},
{
content: 'Verbose tutorial',
tokens: 10000,
informationBits: 5000,
costPerToken: 0.000003,
},
];
contextItems.forEach(item => {
const value = calculateContextValue(item);
console.log(`${item.content}: value = ${value > 0 ? 'POSITIVE' : 'NEGATIVE'}`);
});
// Output (both values are nominally positive with these toy numbers):
// Type definitions: ≈ $8.00 of information value for $0.006 in token cost (4 bits/token)
// Verbose tutorial: ≈ $5.00 of information value for $0.03 in token cost (0.5 bits/token)
//
// The tutorial spends 5× the tokens to deliver less information. When channel capacity
// is the binding constraint, rank items by bits per token:
// include the type definitions, omit the verbose tutorial.
Best Practices
1. Maximize Information Density
Prefer high-information-per-token content:
✅ HIGH DENSITY:
- Type signatures
- Working code examples
- Test cases
- Explicit constraints
❌ LOW DENSITY:
- Verbose explanations
- Generic advice
- Redundant documentation
2. Measure Context Effectiveness
Track mutual information between context and output:
# Generate same function 10 times
outputs = [generate(context) for _ in range(10)]
# Count unique implementations
unique = len(set(outputs))
# If unique > 5: Context has low mutual information (improve it)
# If unique ≤ 2: Context has high mutual information (good!)
3. Stack Quality Gates for Exponential Reduction
Each gate compounds previous reductions:
No gates: H = 20 bits (1M programs)
Types: H = 15 bits (32K programs) [96.9% reduction]
Types + Linter: H = 12 bits (4K programs) [99.6% reduction]
Types + Lint + Tests: H = 5 bits (32 programs) [99.997% reduction]
4. Optimize for Channel Capacity
Stay under context window limits while maximizing information:
const CHANNEL_CAPACITY = 200_000; // context window size in tokens

function optimizeContext(items: ContextItem[]): ContextItem[] {
  return items
    .slice() // keep the input array untouched
    // Sort by information density (bits/token), highest first
    .sort((a, b) =>
      (b.informationBits / b.tokens) - (a.informationBits / a.tokens)
    )
    // Take items until capacity is reached
    .reduce(
      (acc, item) => {
        if (acc.totalTokens + item.tokens <= CHANNEL_CAPACITY) {
          acc.items.push(item);
          acc.totalTokens += item.tokens;
        }
        return acc;
      },
      { items: [] as ContextItem[], totalTokens: 0 }
    ).items;
}
}
5. Use Information Theory to Debug
If outputs are unpredictable:
- Measure entropy: How many different outputs?
- Add high-MI context: Types, tests, examples
- Verify entropy reduction: Test again
- Repeat until entropy is low: <2 bits (1-4 outputs)
Common Misconceptions
❌ Misconception 1: “More context is always better”
Truth: Only relevant, high-information context helps. Irrelevant context wastes channel capacity and may increase noise.
❌ Misconception 2: “Comments provide as much information as types”
Truth: Types provide 3-10× more information per token than comments because they’re machine-verifiable constraints.
❌ Misconception 3: “Zero entropy is the goal”
Truth: Zero entropy means exactly one possible output, which is too restrictive. Target is low entropy (2-5 bits) with multiple valid implementations.
❌ Misconception 4: “Information theory only applies to probabilistic systems”
Truth: While LLMs are probabilistic, information theory concepts (entropy, mutual information) are useful metaphors for reasoning about predictability and constraint value even in deterministic systems.
Measuring Success
Track these information-theoretic metrics:
1. Entropy Estimate (via test failures)
# High entropy: 30-50% test failure rate
# Medium entropy: 10-20% test failure rate
# Low entropy: <5% test failure rate
2. Information Density
# Bits per token in context
high_density = (3.5, 5.0)    # types, tests, examples
medium_density = (2.0, 3.5)  # structured docs
low_density = (0.5, 2.0)     # comments, prose
3. Channel Utilization
utilization = tokens_used / channel_capacity
# Target: 60-80% utilization
# <60%: Underutilizing capacity
# >80%: Risk of truncation
4. Mutual Information (via output variance)
# Generate 10 times, count unique outputs
unique_outputs = 1-2 → High MI (good context)
unique_outputs = 3-5 → Medium MI (improve context)
unique_outputs = 6+ → Low MI (ineffective context)
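A small sketch that pulls these four checks into one report (TypeScript; the thresholds are the heuristics listed above, and the sample input is hypothetical):
interface GenerationMetrics {
  testFailureRate: number;   // 0..1, from CI runs
  bitsPerToken: number;      // estimated information density of the context
  tokensUsed: number;
  contextCapacity: number;   // e.g. 200_000
  uniqueOutputs: number;     // distinct outputs over ~10 repeated generations
}

function assessMetrics(m: GenerationMetrics): string[] {
  const notes: string[] = [];
  if (m.testFailureRate > 0.2) notes.push("High entropy: add types/tests to constrain outputs");
  if (m.bitsPerToken < 2) notes.push("Low information density: replace prose with types and examples");
  const utilization = m.tokensUsed / m.contextCapacity;
  if (utilization > 0.8) notes.push("Near capacity: risk of truncation, prune low-density context");
  if (utilization < 0.6) notes.push("Underutilized capacity: room for more high-density context");
  if (m.uniqueOutputs >= 6) notes.push("Low mutual information: context is not constraining the output");
  return notes.length > 0 ? notes : ["All information-theoretic metrics look healthy"];
}

console.log(assessMetrics({
  testFailureRate: 0.05,
  bitsPerToken: 3.5,
  tokensUsed: 140_000,
  contextCapacity: 200_000,
  uniqueOutputs: 2,
})); // ["All information-theoretic metrics look healthy"]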
Conclusion
Information theory provides a rigorous mathematical framework for understanding AI-assisted coding:
Four Core Concepts:
- Entropy (H): Measures uncertainty—quality gates reduce entropy exponentially
- Information Content (I): Quantifies constraint value—types > comments
- Mutual Information (I(X;Y)): Captures context effectiveness—examples > explanations
- Channel Capacity (C): Models context window limits—maximize information density
Key Insights:
- Quality gates work by filtering the state space, reducing entropy
- Types and tests provide more information per token than documentation
- Better context increases mutual information with desired output
- Context windows are information channels with finite capacity
- Constraints compound multiplicatively, not additively
Practical Applications:
- Prioritize high-information-density context (types, tests, examples)
- Stack quality gates for exponential entropy reduction
- Measure context effectiveness via output variance
- Optimize for channel capacity by maximizing bits/token
- Debug unpredictability by adding high-mutual-information constraints
The Result: Code generation that’s predictable, correct, and efficient—not by luck, but by mathematical design.
Related Concepts
- Entropy in Code Generation: Deep dive into entropy mechanics
- Quality Gates as Information Filters: How gates reduce state space
- Prompt Caching Strategy: Optimizing for channel capacity
- Hierarchical CLAUDE.md Files: Maximizing information density
- Test-Based Regression Patching: Building entropy filters incrementally
Mathematical Foundation
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x), \quad I(x) = -\log_2 P(x), \quad I(X;Y) = H(X) - H(X|Y), \quad C = \max_{P(X)} I(X;Y)$$
Understanding the Information Theory Formulas
Four key formulas explain how coding agents work. Let’s break them down:
Formula 1: Entropy – H(X) = -∑ P(x) log₂ P(x)
H(X) measures uncertainty in the system (in bits).
Breaking it down:
- X = Set of all possible outputs (e.g., all possible function implementations)
- P(x) = Probability that output x occurs (0 to 1)
- ∑ = Sum over all possible outputs (“for each possible output, calculate this…”)
- log₂ = Logarithm base 2 (“how many bits to represent this?”)
- Negative sign (-) = Makes result positive
What it means:
- High H(X): Many outputs equally likely → Unpredictable
- Low H(X): Few outputs likely → Predictable
- H(X) = 0: Only one possible output → Perfectly predictable
Example:
4 equally-likely return types: boolean, number, string, null
P(each) = 0.25
H(X) = -(0.25×log₂(0.25) + 0.25×log₂(0.25) + 0.25×log₂(0.25) + 0.25×log₂(0.25))
= -(4 × 0.25 × (-2))
= -(4 × -0.5)
= 2 bits
Interpretation: Need 2 bits to specify which of 4 options (2² = 4)
Formula 2: Information Content – I(x) = -log₂ P(x)
I(x) measures how much you learn when event x occurs (in bits).
Breaking it down:
- x = A specific event or outcome
- P(x) = Probability of event x
- log₂ = Logarithm base 2
- Negative sign = Converts negative log to positive
What it means:
- Rare events (low P): High information (surprising!)
- Common events (high P): Low information (expected)
Example:
Event: "Function passes all type checks"
Without type hints:
P(passes) = 0.5 (50% of code is type-safe)
I(passes) = -log₂(0.5) = 1 bit (not very informative)
With type hints:
P(passes) = 0.95 (95% of typed code is type-safe)
I(passes) = -log₂(0.95) ≈ 0.07 bits (less informative because expected)
But wait! The type hints themselves provide information:
I(type_hints) = -log₂(0.1) ≈ 3.3 bits (they eliminate ~90% of candidate implementations)
Formula 3: Mutual Information – I(X;Y) = H(X) - H(X|Y)
I(X;Y) measures how much knowing Y tells you about X (in bits).
Breaking it down:
- X = Output we want to predict (LLM-generated code)
- Y = Context we provide (types, tests, examples)
- H(X) = Entropy of X without knowing Y (“uncertainty before context”)
- H(X|Y) = Entropy of X given Y (“uncertainty after context”)
- I(X;Y) = Reduction in uncertainty (“how much context helped”)
What it means:
- High I(X;Y): Context strongly predicts output → Good context
- Low I(X;Y): Context doesn’t help → Improve context
- I(X;Y) = 0: Context and output independent → Useless context
Example:
X = LLM output (function implementation)
Y = Context (CLAUDE.md + types + tests)
Without context:
H(X) = 15 bits (32,768 possible implementations)
With context:
H(X|Y) = 3 bits (8 possible implementations)
Mutual information:
I(X;Y) = H(X) - H(X|Y)
= 15 - 3
= 12 bits
Interpretation: Context reduced uncertainty by 2^12 = 4,096×
Context is VERY effective!
Formula 4: Channel Capacity – C = max I(X;Y)
C measures maximum information that can be transmitted through a channel.
Breaking it down:
- C = Channel capacity (maximum bits per transmission)
- max = Maximum over all possible input distributions
- I(X;Y) = Mutual information (from Formula 3)
What it means:
- Context window = Information channel
- Finite token limit = Finite capacity
- Must maximize information per token
Example:
Claude Sonnet context window: 200,000 tokens
If each token encodes 4 bits:
C = 200,000 tokens × 4 bits/token
= 800,000 bits
= 100 KB of information
You cannot exceed this capacity!
Strategy: Maximize bits/token by using:
- Types (4-5 bits/token)
- Tests (3-4 bits/token)
- Examples (3-4 bits/token)
Avoid:
- Verbose prose (1-2 bits/token)
- Redundant docs (0.5-1 bits/token)
How They Work Together
1. Start with high entropy: H(X) = 15 bits
(Many possible outputs)
2. Add context (Y) with high mutual information: I(X;Y) = 12 bits
(Context strongly predicts output)
3. Conditional entropy drops: H(X|Y) = H(X) - I(X;Y) = 3 bits
(Few outputs remain likely)
4. Quality gates provide information: I(gate) = 3 bits
(Eliminates invalid outputs)
5. Final entropy: H(X|Y,gates) ≈ 0 bits
(Only correct implementations remain)
All while staying within channel capacity: C = 200K tokens
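The same walkthrough as a worked computation (TypeScript; all bit values are the illustrative figures used above):
// Residual entropy after context and gates: subtract what each provides,
// clamping at zero because entropy cannot be negative.
function residualEntropyBits(
  priorEntropyBits: number,
  contextMutualInformationBits: number,
  gateInformationBits: number[]
): number {
  const gateBits = gateInformationBits.reduce((sum, bits) => sum + bits, 0);
  return Math.max(0, priorEntropyBits - contextMutualInformationBits - gateBits);
}

console.log(residualEntropyBits(15, 12, []));     // 3 (after context only)
console.log(residualEntropyBits(15, 12, [3]));    // 0 (the type gate removes the rest)
console.log(residualEntropyBits(15, 12, [3, 1])); // 0 (further gates confirm; entropy stays at 0)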
Key Insights
From Formula 1 (Entropy):
- Lower entropy = More predictable LLM behavior
- Each constraint that halves valid outputs reduces entropy by 1 bit
From Formula 2 (Information Content):
- Types provide more bits/token than comments
- Tests provide more bits/token than documentation
- Prioritize high-information constraints
From Formula 3 (Mutual Information):
- Good context has high mutual information with output
- Measure by generating multiple times—low variance = high MI
- Examples and tests have higher MI than prose
From Formula 4 (Channel Capacity):
- Context window is finite resource
- Maximize information density (bits/token)
- Use hierarchical loading and prompt caching
Together: These formulas explain why certain patterns work—they maximize information transfer while minimizing entropy, all within channel capacity constraints.
Related Concepts
- Entropy in Code Generation – Deep dive into how entropy measures uncertainty in LLM outputs
- LLM as Recursive Function Generator – The retrieve-generate-verify model that operationalizes information theory
- Invariants in Programming and LLM Generation – How invariants encode constraints that reduce entropy
- Quality Gates as Information Filters – How verification gates perform set intersection
- Making Invalid States Impossible – Prevention vs validation through state space reduction
- Prompt Caching Strategy – Optimizing for channel capacity
- Hierarchical Context Patterns – Maximizing information density in context
- Test-Based Regression Patching – Building entropy filters incrementally
References
- Claude Shannon – A Mathematical Theory of Communication – The foundational 1948 paper that established information theory
- Information Theory – MIT OpenCourseWare – Comprehensive course materials on information theory
- Entropy (Information Theory) – Wikipedia – Detailed explanation of entropy and its applications
- Mutual Information – Intuitive Explanation – Video explanation of mutual information with visual examples

