Prompt Caching Strategy for 90% Cost Reduction

James Phoenix

Summary

Structure prompts to maximize Claude’s prompt caching by putting stable context first and variable requests last. This achieves up to 90% cost reduction for iterative workflows by caching repeated context like CLAUDE.md files, schemas, and coding standards. For 100 requests/day with 5K tokens of repeated context: $45/month → $5.40/month ≈ $475/year in savings.

The Problem

Repeated context (CLAUDE.md files, schemas, rules) costs tokens on every request to Claude API, making iterative development expensive. For teams doing 100+ requests per day with 5K tokens of repeated context, costs add up to $1.50 daily ($45/month) just for redundant context.

The Solution

Claude caches a stable prompt prefix (1,024 tokens minimum) of identical content across requests for 5 minutes. By structuring prompts to put stable context first (CLAUDE.md, schemas, standards) and dynamic requests last, you achieve 90% cost reduction on cached tokens ($3.00 → $0.30 per 1M tokens). The cache persists during active development sessions, covering most follow-up requests.

The Problem

When working with AI coding agents like Claude, you face a hidden cost multiplier: repeated context.

Every time you make an API request, you’re charged for all input tokens – including context that hasn’t changed since your last request. For teams doing iterative development with AI agents, this creates explosive costs:

The Cost Breakdown

Consider a typical development session:

  • Request 1: Load CLAUDE.md (2K tokens) + schema (1.5K tokens) + coding standards (1.5K tokens) + task (100 tokens) = 5,100 tokens
  • Request 2: Same context (5K tokens) + new task (100 tokens) = 5,100 tokens
  • Request 3: Same context (5K tokens) + new task (100 tokens) = 5,100 tokens
  • Request 100: Same context (5K tokens) + new task (100 tokens) = 5,100 tokens

Total: 510,000 tokens for context that’s ~98% identical across all requests.

Real-World Impact

At Claude Sonnet pricing ($3 per 1M input tokens):

100 requests/day × 5K tokens × 30 days = 15M tokens/month
Cost: 15M tokens × $3 per 1M = $45/month

For a team of 5 developers:

5 developers × $45/month = $225/month = $2,700/year

And this is just for repeated context – not the actual work being done.

Why This Happens

LLMs are stateless. They don’t remember previous conversations or context. Every request starts from scratch, requiring you to provide all necessary context again:

  1. Project documentation (CLAUDE.md files)
  2. Schemas (JSON schema, database schema)
  3. Coding standards (linting rules, naming conventions)
  4. Architectural patterns (design patterns, frameworks)

Without caching, you’re paying full price for this repeated context on every single request.

The Solution

Claude offers prompt caching, which can reduce costs by 90% for repeated context.

How Prompt Caching Works

Claude caches the prefix of your prompt up to a cache_control breakpoint (minimum 1024 tokens) if:

  1. The content is identical across requests
  2. The content appears at the beginning of your prompt
  3. Requests occur within 5 minutes of each other (each cache hit refreshes the timer)

Cached tokens cost 10x less to read than regular tokens:

  • Regular input: $3.00 per 1M tokens
  • Cached read: $0.30 per 1M tokens (90% reduction)
  • Cache write: $3.75 per 1M tokens (one-time 25% premium when the cache is created)
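To make the arithmetic concrete, here’s a back-of-the-envelope sketch of per-request context cost (prices as above; the numbers are illustrative):

```typescript
// Rough per-request cost of a 5,000-token stable context
const REGULAR = 3.0 / 1_000_000;      // $3.00 per 1M input tokens
const CACHED_READ = 0.3 / 1_000_000;  // $0.30 per 1M tokens
const CACHE_WRITE = 3.75 / 1_000_000; // $3.75 per 1M tokens (one-time premium)

const contextTokens = 5_000;
console.log(`Uncached read: $${(contextTokens * REGULAR).toFixed(4)}`);     // $0.0150
console.log(`Cache write:   $${(contextTokens * CACHE_WRITE).toFixed(4)}`); // $0.0188
console.log(`Cached read:   $${(contextTokens * CACHED_READ).toFixed(4)}`); // $0.0015
```

After the first request pays the write premium, every warm request reads the same 5K tokens for a tenth of the regular price.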

The Key Insight

Structure matters. Prompt caching only works for content at the beginning of your prompt. If you mix stable context with dynamic requests, caching breaks.

❌ Bad (no caching):
"Implement feature X using [CLAUDE.md content] following [schema] with [standards]"

✅ Good (caching works):
"[CLAUDE.md content]
[schema]
[standards]

Implement feature X"

Implementation

Step 1: Identify Stable vs. Dynamic Content

Stable content (changes rarely):

  • Root CLAUDE.md files
  • Domain CLAUDE.md files
  • JSON schemas
  • Database schemas
  • Coding standards
  • Framework documentation
  • Architectural diagrams
  • Reusable code patterns

Dynamic content (changes every request):

  • Specific tasks or questions
  • File paths
  • Function names
  • Variable values
  • User input
  • Iteration-specific instructions

Step 2: Structure Your Prompts

Use this template for all AI requests:

# SYSTEM CONTEXT (CACHED)

## Project Architecture
[Content from root CLAUDE.md]

## Domain Patterns
[Content from relevant domain CLAUDE.md]

## Schema Definitions
[JSON schemas, database schemas]

## Coding Standards
[Linting rules, naming conventions, patterns]

## Framework Guidelines
[tRPC patterns, React patterns, etc.]

---

# CURRENT REQUEST (NOT CACHED)

## Task
[Specific task for this request]

## Context
[File paths, current code, specific requirements]
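A minimal sketch of a renderer for this template (file paths and section names are illustrative, not prescribed):

```typescript
import fs from 'node:fs/promises';

// Illustrative sources; substitute your project's actual files
const STABLE_SOURCES = [
  { heading: 'Project Architecture', path: 'CLAUDE.md' },
  { heading: 'Schema Definitions', path: 'schemas/article.schema.json' },
  { heading: 'Coding Standards', path: 'eslint.config.js' },
];

// Build the cached prefix once and reuse it verbatim on every request
async function buildStableContext(): Promise<string> {
  const sections = await Promise.all(
    STABLE_SOURCES.map(
      async ({ heading, path }) => `## ${heading}\n${await fs.readFile(path, 'utf-8')}`,
    ),
  );
  return `# SYSTEM CONTEXT (CACHED)\n\n${sections.join('\n\n')}`;
}

// Append the dynamic request last so the prefix stays byte-identical
function buildPrompt(stableContext: string, task: string): string {
  return `${stableContext}\n\n---\n\n# CURRENT REQUEST (NOT CACHED)\n\n## Task\n${task}`;
}
```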

Step 3: Implement in Your Workflow

For Claude Code

Claude Code automatically structures prompts for caching by:

  1. Loading all CLAUDE.md files at the beginning of the prompt
  2. Loading schemas and standards after CLAUDE.md files
  3. Adding the specific task at the end

No configuration needed – it just works.

For API Calls

When using the Claude API directly, structure your messages:

import Anthropic from '@anthropic-ai/sdk';
import fs from 'node:fs/promises';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Load stable context once
const stableContext = `
# Project Architecture
${await fs.readFile('CLAUDE.md', 'utf-8')}

# Schema Definitions
${await fs.readFile('schemas/article.schema.json', 'utf-8')}

# Coding Standards
${await fs.readFile('eslint.config.js', 'utf-8')}
`;

// Make requests with cached context
const response = await client.messages.create({
  model: 'claude-sonnet-4',
  max_tokens: 4096,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: stableContext,
          cache_control: { type: 'ephemeral' }, // Mark for caching
        },
        {
          type: 'text',
          text: 'Implement user authentication endpoint',
        },
      ],
    },
  ],
});

For Custom Tools

If you’re building custom AI-assisted tools:

import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Load stable context
with open('CLAUDE.md') as f:
    claude_md = f.read()

with open('schemas/article.schema.json') as f:
    schema = f.read()

stable_context = f"""
# Project Architecture
{claude_md}

# Schema Definitions
{schema}
"""

# Make cached request
message = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": stable_context,
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": "Generate article for prompt-caching-strategy",
                },
            ],
        }
    ],
)

Step 4: Verify Caching is Working

Check the API response for cache metrics:

const response = await client.messages.create({...});

console.log('Cache Performance:', {
  cacheCreationTokens: response.usage.cache_creation_input_tokens,
  cacheReadTokens: response.usage.cache_read_input_tokens,
  regularTokens: response.usage.input_tokens,
});

// First request:
// { cacheCreationTokens: 5000, cacheReadTokens: 0, regularTokens: 100 }
// Cost: 5,000 × $3.75/1M (cache write) + 100 × $3/1M ≈ $0.019 per request

// Subsequent requests (within 5 min):
// { cacheCreationTokens: 0, cacheReadTokens: 5000, regularTokens: 100 }
// Cost: 5,000 × $0.30/1M + 100 × $3/1M = $0.0018 per request
// Savings: 88% vs. the uncached $0.0153 per request

Cost Savings Analysis

Scenario 1: Individual Developer

Usage:

  • 50 requests/day
  • 5K tokens stable context
  • 100 tokens dynamic request

Without caching:

50 requests × 5.1K tokens × 22 workdays = 5.61M tokens/month
Cost: 5.61M tokens × $3 per 1M = $16.83/month ≈ $202/year

With caching (one cache write per day, then warm-cache reads; the small write premium is ignored here):

First request: 5.1K tokens × $3/1M = $0.0153
Next 49 requests: (5K × $0.30/1M + 0.1K × $3/1M) × 49 = $0.0018 × 49 = $0.0882
Daily total: $0.1035
Monthly: $0.1035 × 22 = $2.28/month ≈ $27/year

Savings: $202 − $27 = $175/year (86% reduction)
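The same model as a small calculator you can adapt (a sketch assuming one cache write per developer per day and a warm cache for the rest; the write premium is omitted as above):

```typescript
// Monthly cost model matching the scenarios in this section
const PRICES = { regular: 3.0, cachedRead: 0.3 }; // $ per 1M input tokens
const PER_M = 1_000_000;

function monthlyCost(opts: {
  devs: number; requestsPerDay: number; contextTokens: number;
  taskTokens: number; workdays: number; caching: boolean;
}): number {
  const { devs, requestsPerDay, contextTokens, taskTokens, workdays, caching } = opts;
  if (!caching) {
    const tokens = devs * requestsPerDay * (contextTokens + taskTokens) * workdays;
    return (tokens / PER_M) * PRICES.regular;
  }
  const firstRequest = ((contextTokens + taskTokens) / PER_M) * PRICES.regular;
  const warmRequest =
    (contextTokens / PER_M) * PRICES.cachedRead + (taskTokens / PER_M) * PRICES.regular;
  return devs * workdays * (firstRequest + (requestsPerDay - 1) * warmRequest);
}

const scenario1 = { devs: 1, requestsPerDay: 50, contextTokens: 5_000, taskTokens: 100, workdays: 22 };
console.log(monthlyCost({ ...scenario1, caching: false }).toFixed(2)); // 16.83
console.log(monthlyCost({ ...scenario1, caching: true }).toFixed(2));  // 2.28
```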

Scenario 2: Small Team (5 developers)

Usage:

  • 100 requests/day per developer
  • 5K tokens stable context
  • 100 tokens dynamic request

Without caching:

5 devs × 100 requests × 5.1K tokens × 22 days = 56.1M tokens/month
Cost: 56.1M tokens × $3 per 1M ≈ $168/month = $2,016/year

With caching:

Each dev: ($0.0153 + 99 × $0.0018) × 22 days ≈ $4.26/month
5 devs: ~$21/month ≈ $256/year

Savings: $2,016 − $256 = $1,760/year (87% reduction)

Scenario 3: Large Team (20 developers)

Usage:

  • 150 requests/day per developer
  • 5K tokens stable context
  • 100 tokens dynamic request

Without caching:

20 devs × 150 requests × 5.1K tokens × 22 days = 336.6M tokens/month
Cost: 336.6M tokens × $3 per 1M ≈ $1,010/month = $12,120/year

With caching:

Each dev: ($0.0153 + 149 × $0.0018) × 22 days ≈ $6.24/month
20 devs: ~$125/month ≈ $1,500/year

Savings: $12,120 − $1,500 = $10,620/year (88% reduction)

Best Practices

1. Front-Load Stable Context

Always put stable content at the beginning of your prompts:

✅ Correct order:
1. CLAUDE.md files (root → domain)
2. Schemas (JSON, database)
3. Coding standards
4. Framework patterns
5. Specific task

❌ Wrong order:
1. Specific task
2. CLAUDE.md files
3. Schemas

2. Maintain Consistent Formatting

Caching requires exact byte-for-byte matches. Ensure:

  • Same whitespace (spaces vs tabs)
  • Same line endings (LF vs CRLF)
  • Same encoding (UTF-8)
  • No dynamic timestamps or UUIDs in cached content
// ❌ Bad: includes timestamp
const context = `
Generated at: ${new Date().toISOString()}
${claudeMd}
`;

// ✅ Good: stable content only
const context = claudeMd;

3. Understand Cache Lifetime

Caches last 5 minutes from the last request that used them – every cache hit refreshes the timer:

12:00 - Request 1: Cache created (expires 12:05)
12:02 - Request 2: Cache hit (timer refreshed, expires 12:07)
12:04 - Request 3: Cache hit (expires 12:09)
12:06 - Request 4: Cache hit (expires 12:11)
12:12 - Request 5: Cache expired, recreated

Strategy: For long development sessions, make requests at least every 4 minutes to keep cache warm.
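One way to do that is a lightweight keep-alive that re-sends the cached prefix with a trivial task (a hypothetical sketch – each ping still pays the cache-read price, so weigh that against simply recreating the cache):

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Re-send the cached prefix every 4 minutes so the 5-minute TTL keeps refreshing
async function keepCacheWarm(client: Anthropic, stableContext: string): Promise<void> {
  await client.messages.create({
    model: 'claude-sonnet-4',
    max_tokens: 1,
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: stableContext, cache_control: { type: 'ephemeral' } },
        { type: 'text', text: 'ack' }, // trivial task; the reply is discarded
      ],
    }],
  });
}

// e.g. setInterval(() => keepCacheWarm(client, stableContext), 4 * 60 * 1000);
```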

4. Batch Similar Tasks

Group related tasks to maximize cache reuse:

✅ Good:
- Request 1-10: API endpoint development (use api/ CLAUDE.md)
- Request 11-20: Database migrations (use database/ CLAUDE.md)
- Request 21-30: UI components (use ui/ CLAUDE.md)

❌ Bad:
- Request 1: API endpoint
- Request 2: UI component
- Request 3: Database migration
- Request 4: API endpoint (cache expired)
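A minimal sketch (the Task shape is hypothetical) that orders a work queue so requests sharing a cached prefix run back-to-back:

```typescript
interface Task {
  domain: 'api' | 'database' | 'ui'; // selects which domain CLAUDE.md gets loaded
  prompt: string;
}

// Stable sort by domain so each domain's cache is created once, then reused
function batchByDomain(tasks: Task[]): Task[] {
  return [...tasks].sort((a, b) => a.domain.localeCompare(b.domain));
}
```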

5. Monitor Cache Performance

Track cache hit rates to optimize:

const cacheMetrics = {
  totalRequests: 0,
  cacheHits: 0,
  cacheMisses: 0,
  tokensSaved: 0,
};

function trackCacheUsage(response: Message) { // Message from '@anthropic-ai/sdk/resources/messages'
  cacheMetrics.totalRequests++;

  const cacheRead = response.usage.cache_read_input_tokens ?? 0; // field may be null
  if (cacheRead > 0) {
    cacheMetrics.cacheHits++;
    cacheMetrics.tokensSaved += cacheRead;
  } else {
    cacheMetrics.cacheMisses++;
  }
  
  const hitRate = (cacheMetrics.cacheHits / cacheMetrics.totalRequests) * 100;
  console.log(`Cache hit rate: ${hitRate.toFixed(1)}%`);
  console.log(`Tokens saved: ${cacheMetrics.tokensSaved.toLocaleString()}`);
}

Target: 80%+ cache hit rate for iterative development.

6. Combine with Hierarchical CLAUDE.md

Use hierarchical CLAUDE.md files to minimize context size:

Instead of loading ALL documentation:
- Root CLAUDE.md: 2K tokens
- api/ CLAUDE.md: 3K tokens
- database/ CLAUDE.md: 2.5K tokens
- ui/ CLAUDE.md: 3K tokens
Total: 10.5K tokens

Load only relevant documentation:
- Root CLAUDE.md: 2K tokens
- api/ CLAUDE.md: 3K tokens
Total: 5K tokens (52% reduction)
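A sketch of the selective loading (the domain-to-file mapping is illustrative):

```typescript
import fs from 'node:fs/promises';

// Map each task domain to the root file plus its one domain file
const DOMAIN_DOCS: Record<string, string[]> = {
  api: ['CLAUDE.md', 'api/CLAUDE.md'],
  database: ['CLAUDE.md', 'database/CLAUDE.md'],
  ui: ['CLAUDE.md', 'ui/CLAUDE.md'],
};

// Load only the documentation the current task actually needs
async function loadContextFor(domain: keyof typeof DOMAIN_DOCS): Promise<string> {
  const files = await Promise.all(
    DOMAIN_DOCS[domain].map((path) => fs.readFile(path, 'utf-8')),
  );
  return files.join('\n\n');
}
```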

Combined savings:

  • Hierarchical context: 52% reduction in context size
  • Prompt caching: 90% reduction in cost for repeated context
  • Combined: 0.48 × 0.10 ≈ 0.05 of the original cost – roughly a 95% reduction

Common Pitfalls

❌ Pitfall 1: Dynamic Content in Cached Section

Problem: Including timestamps or UUIDs breaks caching

// ❌ Bad
const context = `
Session ID: ${uuid()}
Timestamp: ${Date.now()}
${claudeMd}
`;

Solution: Keep cached content completely static

// ✅ Good
const cachedContext = claudeMd;
const dynamicRequest = `
Session ID: ${uuid()}
Task: Implement feature X
`;

❌ Pitfall 2: Task Before Context

Problem: Putting dynamic requests before stable context prevents caching

Solution: Always structure prompts with context first, task last

❌ Pitfall 3: Inconsistent Whitespace

Problem: Different whitespace breaks byte-for-byte matching

// Request 1:
const context = claudeMd; // uses spaces

// Request 2:
const context = claudeMd.replace(/ /g, '\t'); // uses tabs
// Cache miss! Different bytes

Solution: Normalize whitespace before caching
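For example, a small normalizer (hypothetical helper) run over the stable context before every request:

```typescript
// Normalize line endings and trailing whitespace so the cached
// prefix is byte-identical across requests
function normalizeContext(text: string): string {
  return text
    .replace(/\r\n/g, '\n')        // CRLF -> LF
    .split('\n')
    .map((line) => line.trimEnd()) // drop trailing whitespace
    .join('\n');
}
```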

❌ Pitfall 4: Not Monitoring Cache Metrics

Problem: Assuming caching works without verification

Solution: Always check cache_read_input_tokens in responses

❌ Pitfall 5: Caching Too Little Content

Problem: Content below the 1,024-token minimum is never cached

Solution: Ensure the stable prefix you mark for caching is ≥1024 tokens
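One way to check is the SDK’s token-counting endpoint before marking content for caching (a sketch; a rough length/4 heuristic also works if you’d rather skip the extra API call):

```typescript
import Anthropic from '@anthropic-ai/sdk';

const MIN_CACHEABLE_TOKENS = 1024;

// Only worth attaching cache_control if the stable prefix clears the minimum
async function isCacheable(client: Anthropic, stableContext: string): Promise<boolean> {
  const { input_tokens } = await client.messages.countTokens({
    model: 'claude-sonnet-4',
    messages: [{ role: 'user', content: stableContext }],
  });
  return input_tokens >= MIN_CACHEABLE_TOKENS;
}
```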

Integration with Other Patterns

Combine with Lean Root CLAUDE.md

Lean root files cache efficiently:

Lean root (30 lines) + domain files (100 lines) = 2K tokens cached
Monolithic root (500 lines) = 5K tokens cached

Benefit: Faster cache creation, lower initial cost

Combine with Semantic Naming

Semantic file names help organize cached content:

# CACHED CONTEXT

## Architecture Patterns
${readFile('architecture-patterns.md')}

## API Conventions
${readFile('api-conventions.md')}

Combine with Quality Gates

Cache quality gate configurations:

# CACHED CONTEXT

## Linting Rules
${readFile('.eslintrc.json')}

## Type Checking
${readFile('tsconfig.json')}

## Testing Standards
${readFile('vitest.config.ts')}

Measuring Success

Key Metrics

  1. Cache hit rate: cache_read_tokens / total_input_tokens
    • Target: >80% for iterative development
  2. Cost reduction: (old_cost − new_cost) / old_cost
    • Target: >70% reduction
  3. Cost saved: sum(cache_read_tokens) × $2.70 per 1M tokens (the read discount)
    • Track cumulative savings
  4. Cache lifetime utilization: requests_per_cache / max_possible
    • Target: >10 requests per cache

Tracking Dashboard

interface CacheDashboard {
  period: string;
  totalRequests: number;
  cacheHitRate: number;
  tokensSaved: number;
  costSavings: number;
  avgCacheLifetime: number;
}

const dashboard: CacheDashboard = {
  period: '2025-11',
  totalRequests: 2340,
  cacheHitRate: 0.87, // 87%
  tokensSaved: 10_234_500,
  costSavings: 27.63, // dollars
  avgCacheLifetime: 12.3, // requests
};

Conclusion

Prompt caching is the highest-impact, lowest-effort optimization for AI-assisted development costs.

Key Takeaways:

  1. Structure prompts with stable context first, dynamic requests last
  2. Front-load CLAUDE.md files and schemas at the beginning
  3. Maintain consistency in formatting and content
  4. Monitor cache metrics to verify effectiveness
  5. Combine with hierarchical CLAUDE.md for maximum savings
  6. Target 80%+ cache hit rate for iterative workflows

The result: 90% cost reduction on repeated context, saving hundreds to thousands of dollars per year while enabling more AI-assisted development.

For a team of 20 developers, that’s $10,000+/year in savings – a meaningful budget for better tooling, more compute, or more AI-assisted development.
