Prefix cache

Because the model is stateless, every request re-sends the whole conversation, and the front of it, the system prompt and tool definitions, barely changes from call to call. Reprocessing that identical prefix each time is wasteful, so providers offer a prefix cache: they remember the work done on a prefix they have seen before and skip straight to the new part.

How it actually helps

The cache keys on the exact leading tokens of your request. If the first stretch of tokens matches a previous request, the provider reuses the state it already computed for them, which helps in two ways:

It is cheaper. The cached prefix is billed as cache tokens at a reduced rate instead of full input tokens.
It is faster. Skipping the reprocessing of a large stable prefix cuts the time to first token, which you feel most on long sessions.
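To make the cost side concrete, here is a minimal sketch of the arithmetic. The per-token rates and token counts are hypothetical, chosen only for illustration, and the helper name `input_cost` is made up for this example. Real pricing varies by provider and may include other charges this sketch ignores.

```python
def input_cost(cached_tokens: int, fresh_tokens: int,
               input_rate: float, cache_rate: float) -> float:
    """Input-side cost of one request, given per-token rates."""
    return cached_tokens * cache_rate + fresh_tokens * input_rate


# Hypothetical rates: cached tokens billed at a tenth of the normal input rate.
INPUT_RATE = 3.00 / 1_000_000   # currency units per input token (assumed)
CACHE_RATE = 0.30 / 1_000_000   # currency units per cache token (assumed)

# A request with a 20,000-token stable prefix and 500 new tokens.
without_cache = input_cost(0, 20_500, INPUT_RATE, CACHE_RATE)
with_cache = input_cost(20_000, 500, INPUT_RATE, CACHE_RATE)

print(f"without cache: {without_cache:.4f}")
print(f"with cache:    {with_cache:.4f}")
```

Under these assumed numbers, most of the request is billed at the reduced rate, so the larger and more stable the prefix, the bigger the saving per call.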

The catch: order and stability

The cache only helps for an exact prefix match, and it matches from the very start. That has a real design consequence: put the stable material first and the volatile material last. If you change something near the top of the prompt, or shuffle the order of what you send, the match ends at the first token that differs, and everything after that point is reprocessed and billed as full input tokens again.
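The matching rule can be sketched as a longest-common-prefix check. This is a simplified illustration, not any provider's actual implementation: the function name `shared_prefix_len` is hypothetical, and whole prompt chunks stand in for tokens to keep the example readable.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count how many leading items match exactly, stopping at the first difference."""
    matched = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        matched += 1
    return matched


# Stand-in "tokens": each string represents one chunk of the prompt.
previous = ["system", "tools", "user:1", "assistant:1"]

appended = previous + ["user:2"]                                   # new material at the end
edited_middle = ["system", "tools", "user:1 (edited)", "assistant:1", "user:2"]
edited_top = ["system v2", "tools", "user:1", "assistant:1", "user:2"]
reordered = ["tools", "system", "user:1", "assistant:1", "user:2"]

for name, request in [("appended", appended), ("edited_middle", edited_middle),
                      ("edited_top", edited_top), ("reordered", reordered)]:
    print(name, shared_prefix_len(previous, request))
```

Appending keeps the whole earlier request reusable, an edit in the middle keeps only what came before it, and an edit or reorder at the top leaves nothing to reuse.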

Tip

Keep the system prompt and tool definitions fixed and at the front, and append new material to the end. Small habits like not reordering your context are what let the prefix cache actually earn its discount across a session.
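One way to apply this habit is sketched below, assuming the prompt is assembled as plain text. The function `build_prompt` and the tool fields are hypothetical. Tool definitions are serialized with sorted keys so that the same definitions always produce the same text, even if a dictionary was built in a different key order.

```python
import json


def build_prompt(system_prompt: str, tools: list[dict],
                 history: list[str], new_message: str) -> str:
    """Assemble a prompt with the stable part first and the volatile part last."""
    # Stable part: identical text on every call as long as the inputs are unchanged.
    stable = system_prompt + "\n" + json.dumps(tools, sort_keys=True)
    # Volatile part: earlier turns stay in order, new material is appended at the end.
    volatile = "\n".join(history + [new_message])
    return stable + "\n" + volatile


system_prompt = "You are a helpful assistant."
tools_a = [{"name": "search", "description": "Search the web"}]
tools_b = [{"description": "Search the web", "name": "search"}]  # same tool, different key order

first = build_prompt(system_prompt, tools_a, [], "hi")
second = build_prompt(system_prompt, tools_b, ["hi", "hello"], "what next?")

# The second request begins with the whole first request, so its prefix can be reused.
print(second.startswith(first))
```

The design choice here is that nothing in the stable part depends on the turn number, the time, or dictionary ordering, so each new request extends the previous one instead of diverging from it.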

Related terms

Cache tokens

Input tokens

Stateless
