Inference

There are two phases in a model's life. Training is the one-off, enormously expensive process of learning the parameters. Inference is what happens every time you use the finished model: you pass in some text and it computes the next tokens. Training happens once; inference happens on every single request.

What actually happens on a request

At inference time the model does one thing repeatedly: predict the next token, append it, and predict again. It generates the response one token at a time, each new token conditioned on everything before it. That is why responses stream in, and why a long answer takes longer and costs more than a short one.
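The predict, append, repeat loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_TABLE`, `toy_next_token_probs`, and `generate` are hypothetical names, the "model" is a tiny lookup table rather than a neural network, and it always picks the single most likely token.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "model": maps the last token to probabilities for the next one.
# A real model conditions on every token so far; this stand-in looks only at the
# last token to keep the example small.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
}


def toy_next_token_probs(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for one forward pass: return next-token probabilities."""
    return TOY_TABLE.get(tokens[-1], {"<end>": 1.0})


def generate(prompt: List[str], max_new_tokens: int = 10) -> Tuple[List[str], int]:
    """Predict the next token, append it, and repeat until <end> or the limit."""
    tokens = list(prompt)
    steps = 0  # one step = one call to the model
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        steps += 1
        next_token = max(probs, key=probs.get)  # greedy: take the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens, steps


tokens, steps = generate(["<start>"])
print(tokens, steps)
```

Each generated token costs one more call to the model, so the step count grows with the length of the answer. That is the reason a longer response takes longer and costs more.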

Why the distinction is useful

Keeping training and inference separate clears up a lot of confusion:

The model does not learn from your conversation. Chatting with it is inference, not training. Nothing you say updates its parameters, which is why every new session starts with no memory of earlier ones.
Cost and latency live at inference. When you think about speed or spend for a coding agent, you are thinking about inference, because that is the part that runs on every request.
Sampling makes it non-deterministic. Inference usually picks among likely next tokens with a bit of randomness, which is why the same prompt can give two different answers.
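The sampling point can be made concrete with a small sketch. It assumes one common scheme, temperature-scaled softmax sampling. The text above does not say which scheme a given model uses, and `sample_next_token` and the example scores are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def sample_next_token(
    scores: Dict[str, float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the next token from raw scores, with randomness set by temperature."""
    if temperature <= 0:
        # Assumed convention: no randomness, always take the top-scoring token.
        return max(scores, key=scores.get)
    rng = rng or random.Random()
    tokens = list(scores)
    scaled = [scores[t] / temperature for t in tokens]
    top = max(scaled)
    # Softmax weights; subtracting the max keeps exp() numerically stable.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores for one prompt.
scores = {"yes": 2.0, "no": 1.5, "maybe": 0.5}

print([sample_next_token(scores, temperature=1.0) for _ in range(5)])  # varies run to run
print([sample_next_token(scores, temperature=0.0) for _ in range(5)])  # always the top token
```

With a positive temperature, likely tokens are chosen more often but not always, so the same prompt can produce different continuations. In this sketch, setting the temperature to zero removes the randomness.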

Related terms

Model

Training

Token
