Providers & requests

Prefix cache

Also called: prompt cache, prefix caching

A prefix cache lets a provider reuse the unchanged front of your request instead of reprocessing it, so repeated prefixes are cheaper and faster. It is the main reason keeping the start of your prompt stable pays off.

James Phoenix
Understanding Data Updated July 2, 2026

Because the model is stateless, every request re-sends the whole conversation, and the front of it, the system prompt and tool definitions, barely changes from call to call. Reprocessing that identical prefix each time is wasteful, so providers offer a prefix cache: they remember the work done on a prefix they have seen before and skip straight to the new part.

How it actually helps

The cache keys on the exact leading tokens of your request. If the first stretch of tokens matches a previous request, the provider reuses the computed state for them:

  • It is cheaper. The cached prefix is billed as cache tokens at a reduced rate instead of full input tokens.
  • It is faster. Skipping the reprocessing of a large stable prefix cuts the time to first token, which you feel most on long sessions.

The catch: order and stability

The cache only helps for an exact prefix match, and it matches from the very start. That has a real design consequence: put the stable stuff first and the volatile stuff last. If you change something near the top of the prompt, or shuffle the order of what you send, you invalidate the cache from that point on and pay full price again.

Tip
Keep the system prompt and tool definitions fixed and at the front, and append new material to the end. Small habits like not reordering your context are what let the prefix cache actually earn its discount across a session.

Related terms

Building with AI agents?

This dictionary is part of how I think about agentic engineering. If you want the same thinking applied to your codebase, that is what I do.

See how I can help