Because the model is stateless, every request re-sends the whole conversation, and the front of it, the system prompt and tool definitions, barely changes from call to call. Reprocessing that identical prefix each time is wasteful, so providers offer a prefix cache: they remember the work done on a prefix they have seen before and skip straight to the new part.
How it actually helps
The cache keys on the exact leading tokens of your request. If the first stretch of tokens matches a previous request, the provider reuses the computed state for them:
- It is cheaper. The cached prefix is billed as cache tokens at a reduced rate instead of full input tokens.
- It is faster. Skipping the reprocessing of a large stable prefix cuts the time to first token, which you feel most on long sessions.
The catch: order and stability
The cache only helps for an exact prefix match, and it matches from the very start. That has a real design consequence: put the stable stuff first and the volatile stuff last. If you change something near the top of the prompt, or shuffle the order of what you send, you invalidate the cache from that point on and pay full price again.
Related terms
Cache tokens
Cache tokens are input tokens served from the prefix cache at a reduced rate. They are how prompt caching shows up as a separate line in your usage numbers.
Read definition →Input tokens
Input tokens are the tokens you send in a request: the system prompt, the conversation history, loaded files, and tool definitions. You are billed for them, and they count against the context window.
Read definition →Stateless
Stateless means the model API keeps no memory between requests. Each call starts blank, so every request must carry all the context the model needs. This is foundational to how agents are built.
Read definition →