There are two phases in a model's life. Training is the one-off, enormously expensive process of learning the parameters. Inference is what happens every time you use the finished model: you pass in some text and it computes the next tokens. Training happens once; inference happens on every single request.
What actually happens on a request
At inference time the model does one thing repeatedly: predict the next token, append it, and predict again. It generates the response one token at a time, each new token conditioned on everything before it. That is why responses stream in, and why a long answer takes longer and costs more than a short one.
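The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: `TOY_NEXT_TOKEN`, `predict_next`, and `generate` are hypothetical names, and the lookup table stands in for the billions of parameters a real model would use to score candidate tokens.

```python
from typing import Dict, List

# Hypothetical stand-in for a trained model: a fixed table mapping the most
# recent token to the next one. A real model computes its prediction from
# its parameters and from the whole sequence so far, not just the last token.
TOY_NEXT_TOKEN: Dict[str, str] = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}


def predict_next(tokens: List[str]) -> str:
    """Return the next token given everything generated so far."""
    return TOY_NEXT_TOKEN.get(tokens[-1], "<end>")


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Predict a token, append it, and predict again until done."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens


print(generate(["<start>"]))  # ['<start>', 'the', 'cat', 'sat']
```

Each pass through the loop is one prediction step, so a response with more tokens means more steps, which is where the extra time and cost of a long answer come from.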
Why the distinction is useful
Keeping training and inference separate clears up a lot of confusion:
- The model does not learn from your conversation. Chatting with it is inference, not training. Nothing you say updates its parameters, which is why a new session starts with no memory of earlier ones.
- Cost and latency live at inference. When you think about speed or spend for a coding agent, you are thinking about inference, because that is the part that runs on every request.
- Sampling makes it non-deterministic. Inference usually picks among likely next tokens with a bit of randomness, which is why the same prompt can give two different answers.
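The sampling point can be made concrete with a small sketch. The scores below are made-up numbers, `sample_next` is a hypothetical helper, and the `temperature` parameter is an assumption borrowed from common practice rather than something this page defines. The sketch only shows the general idea of picking among likely tokens with some randomness.

```python
import math
import random
from typing import Dict


def sample_next(scores: Dict[str, float], temperature: float,
                rng: random.Random) -> str:
    """Pick one token from raw scores using softmax sampling."""
    if temperature <= 0:
        # Treat zero temperature as greedy: always take the top-scoring token.
        return max(scores, key=scores.get)
    top = max(scores.values())
    # Subtracting the top score keeps exp() numerically stable.
    weights = [math.exp((s - top) / temperature) for s in scores.values()]
    return rng.choices(list(scores), weights=weights, k=1)[0]


# Made-up scores for three candidate next tokens.
scores = {"return": 2.0, "print": 1.5, "raise": 0.5}
rng = random.Random()

print([sample_next(scores, 1.0, rng) for _ in range(5)])  # varies run to run
print([sample_next(scores, 0.0, rng) for _ in range(5)])  # always 'return'
```

With a positive temperature, the same scores can yield different tokens on different runs, which is the same effect as one prompt giving two different answers. Always taking the top-scoring token removes that source of randomness in this sketch.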
Related terms
Model
A model is the trained artifact at the centre of every AI coding tool: a large file of numbers (parameters) that, given some text, produces a likely continuation. When people say "which model are you using," this is the thing they mean.
Training
Training is the process that produces a model: showing it enormous amounts of text and adjusting its parameters until it gets good at predicting what comes next. It happens once, before you ever use the model.
Token
A token is the unit of text a model reads and writes: a chunk that is usually part of a word, not a whole word or a single character. Everything is measured in tokens, including your context window and your bill.