Attention is how a model decides which parts of the text in front of it matter for what it is about to say next. For each token it produces, it weighs how strongly every other token in the context relates to the current one, then leans on the relevant ones.
How to picture it
Imagine the model reading your whole context window at once and, at every step, asking "which of these earlier tokens should influence this next one most?" A function name it needs to match, a constraint you stated three paragraphs up, the variable it just declared: attention is what lets a token here connect to the one that matters over there. It is the reason a model can keep a long piece of code coherent instead of treating each line in isolation.
A useful mental model:
- Attention is about the relationships between tokens, not just their order.
- Every token can, in principle, attend to every other token in the context.
- The strength of those connections is learned during training, not hand-written.
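To make the mental model concrete, here is a minimal sketch of scaled dot-product attention, the common textbook formulation, for a single token. The vectors are made-up toy values and the function names are hypothetical; in a real model the query, key, and value vectors come from learned projections, and there are many attention heads and layers.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Score each key by its dot product with the query,
    # scaled by the square root of the vector dimension.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    # Blend the value vectors, weighted by how strongly each token is attended to.
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy vectors, invented for illustration (not taken from any real model).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights = attention_weights(query, keys)
output = attend(query, keys, values)
print(weights)
print(output)
```

The weights always sum to 1, and the tokens whose keys line up with the query get the larger share. Nothing in the code depends on where a token sits in the list, only on how its key relates to the query.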
Why it matters for coding
You do not tune attention directly, but nearly every context habit is really about helping it. Attention is not infinite: it spreads across everything in the window, which is the idea behind an attention budget, and it grows less reliable as the context fills up, which is attention degradation.
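The budget idea can be illustrated with a toy calculation. Assuming attention weights come from a softmax, so that they sum to 1 across the window, the share landing on one relevant token shrinks as unrelated tokens are added around it. The scores below are invented, and the function name is hypothetical; a trained model learns scores that partly counteract this, so treat it as an intuition aid rather than a measurement.

```python
import math

def relevant_token_weight(num_distractors, relevant_score=2.0, distractor_score=0.0):
    # Softmax weight on one relevant token when num_distractors other
    # tokens, all sharing the same lower score, compete for attention.
    relevant = math.exp(relevant_score)
    others = num_distractors * math.exp(distractor_score)
    return relevant / (relevant + others)

# The relevant token's score never changes; only the crowd around it grows.
for n in (10, 100, 1000):
    print(n, round(relevant_token_weight(n), 4))
```

The relevant token is just as relevant in every row, yet its share of the total falls as the window fills. That is the practical reason to keep unneeded material out of the context.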
Related terms
Context window
The context window is the maximum amount of text, measured in tokens, that a model can consider for a single request. It is a hard ceiling, and it is the main resource you manage when working with an agent.
Attention budget
The attention budget is the idea that a model's effective attention is a finite resource spread across the whole context window. The more you put in, the thinner the attention on each piece.
Attention degradation
Attention degradation is the quality drop a model shows as its context grows: recall weakens and it misses or confuses buried details, often well below the hard token limit. It is also called context rot.