Inference

Also called: generation

Inference is the act of running a trained model to get an answer: text goes in, a prediction comes out. Every message you send to a coding agent triggers at least one inference call. Inference sits at the opposite end of the model's lifecycle from training.

James Phoenix
Understanding Data Updated July 2, 2026

There are two phases in a model's life. Training is the one-off, enormously expensive process of learning the parameters. Inference is what happens every time you use the finished model: you pass in some text and the model computes the next tokens. Training is done ahead of time; inference runs on every single request.

What actually happens on a request

At inference time the model does one thing repeatedly: predict the next token, append it, and predict again. It generates the response one token at a time, each new token conditioned on everything before it. That is why responses stream in, and why a long answer takes longer and costs more than a short one.
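The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model or API is implemented: `toy_next_token`, `generate`, `END`, and `max_new_tokens` are hypothetical names, and the "model" is a hard-coded lookup table standing in for a real network.

```python
from typing import Callable, List

# Hypothetical marker the toy model emits when the answer is finished.
END = "<end>"

# Assumption: a "model" is any function from the tokens so far to the next token.
NextTokenFn = Callable[[List[str]], str]


def toy_next_token(tokens: List[str]) -> str:
    """Toy stand-in for a trained model: a fixed lookup on the last token.

    A real model would compute this from its learned parameters,
    looking at the whole sequence rather than just the last token.
    """
    table = {"the": "cat", "cat": "sat", "sat": END}
    last = tokens[-1] if tokens else ""
    return table.get(last, END)


def generate(prompt: List[str], next_token: NextTokenFn,
             max_new_tokens: int = 16) -> List[str]:
    """Predict the next token, append it, and predict again."""
    tokens = list(prompt)          # everything the model conditions on
    generated: List[str] = []
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # one model call per new token
        if tok == END:
            break
        tokens.append(tok)         # the new token becomes part of the input
        generated.append(tok)
    return generated


if __name__ == "__main__":
    print(generate(["the"], toy_next_token))
```

The number of loop iterations equals the number of tokens produced, which is why a longer answer means more model calls, more time, and more cost.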

Why the distinction is useful

Keeping training and inference separate clears up a lot of confusion:

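  • The model does not learn from your conversation. Chatting with it is inference, not training. Nothing you say updates its parameters, which is why it starts every new session blank. Anything it appears to remember within a conversation is there because the earlier messages are passed in again as input.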
  • Cost and latency live at inference. When you think about speed or spend for a coding agent, you are thinking about inference, because that is the part that runs on every request.
  • Sampling makes it non-deterministic. Inference usually picks among likely next tokens with a bit of randomness, which is why the same prompt can give two different answers.
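The sampling point can be made concrete with a small sketch. It assumes the model has already assigned a score to each candidate next token, and it uses a `temperature` parameter to control the amount of randomness. The function name, the example scores, and the choice of temperature-scaled softmax are illustrative assumptions, not a description of any specific model's sampler.

```python
import math
import random
from typing import Dict, Optional


def sample_next_token(scores: Dict[str, float],
                      temperature: float = 1.0,
                      rng: Optional[random.Random] = None) -> str:
    """Pick one token from hypothetical model scores.

    temperature <= 0 is treated as greedy decoding: always take the top score.
    Higher temperatures flatten the distribution and add randomness.
    """
    if temperature <= 0:
        return max(scores, key=scores.get)

    rng = rng or random.Random()
    top = max(scores.values())
    # Softmax weights, shifted by the top score for numerical stability.
    weights = [math.exp((s - top) / temperature) for s in scores.values()]
    return rng.choices(list(scores.keys()), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Made-up scores for three candidate next tokens.
    scores = {"return": 2.0, "print": 1.5, "raise": 0.5}
    print([sample_next_token(scores) for _ in range(5)])          # varies per run
    print([sample_next_token(scores, temperature=0) for _ in range(5)])
```

With a positive temperature the same scores can yield different tokens on different runs, which is the source of the varying answers. Treating temperature zero as "always take the top score" removes that source of randomness in this sketch.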
