Send a model the exact same prompt twice and you can get two different answers. That is non-determinism, and it is by design, not a bug. During next-token prediction the model produces a probability distribution over possible next tokens, and then it usually samples from that distribution rather than always taking the single most likely token. A setting called temperature controls how much randomness gets mixed in. Higher temperature means more variety; lower temperature means more predictable, repeatable output.
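To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling. It assumes the common softmax-over-scaled-scores formulation; the token list, the scores, and the function names are invented for illustration and are not any provider's actual implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens, logits, temperature, rng):
    """Pick one token: greedy when temperature is 0, random sampling otherwise."""
    if temperature == 0:
        # Assumption: treat temperature 0 as "always take the top-scoring token".
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the next token.
tokens = ["return", "print", "raise"]
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on "return"
high = softmax_with_temperature(logits, 2.0)  # much flatter distribution

rng = random.Random()  # unseeded, so repeated runs can differ
picks = [sample_next_token(tokens, logits, 2.0, rng) for _ in range(5)]
```

At the lower temperature almost all of the probability sits on the top-scoring token, so repeated samples rarely differ. At the higher temperature the other tokens get a real share, which is where the run-to-run variety comes from.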
Why providers do this
A little randomness makes output feel less robotic and helps the model escape repetitive ruts. The trade-off is reproducibility. Even at very low temperature you are not guaranteed identical results, because providers batch requests together on shared hardware, where tiny numerical variations creep in.
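One commonly cited source of tiny numerical variations is that floating-point addition is not associative: summing the same numbers in a different order can change the last bits of the result. The text above does not name the mechanism, so treat this as an illustrative assumption rather than a description of any particular provider's stack.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results are not bit-for-bit equal, although the gap is tiny.
orders_agree = left_to_right == right_to_left
difference = abs(left_to_right - right_to_left)
```

A gap this small is usually harmless, but when two candidate tokens have nearly equal scores it can be enough to change which one comes out on top, and every later token then builds on a different prefix.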
What it means for coding work
This is easy to forget until it bites you:
- A passing run is not proof. An agent solving a task once does not mean it will solve it every time. If reliability matters, run it more than once.
- Do not hard-code exact wording. Tests or scripts that assume the model returns a specific string will be flaky. Assert on behaviour or structure instead.
- Bugs can be intermittent. A prompt that fails one time in five is still broken. Chase the pattern, not the single lucky success.
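The points above can be sketched as a small test harness: run the same prompt several times, check the parsed structure rather than the exact string, and report a pass rate instead of a single pass or fail. Here `fake_model` is a hypothetical stand-in for a real model call, and the JSON shape it returns is an assumption made for this example.

```python
import json
import random

def fake_model(prompt, rng):
    """Hypothetical stand-in for a model call: same meaning, varying wording."""
    phrasing = rng.choice(["The sum is 4.", "2 + 2 equals 4.", "Answer: 4"])
    return json.dumps({"answer": 4, "explanation": phrasing})

def check_structure(raw):
    """Check the parsed structure and value, not the exact output string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("answer") == 4

def pass_rate(prompt, runs, seed=0):
    """Run the same prompt several times and return the fraction that passed."""
    rng = random.Random(seed)
    passed = sum(check_structure(fake_model(prompt, rng)) for _ in range(runs))
    return passed / runs

rate = pass_rate("What is 2 + 2? Reply as JSON.", runs=20)
```

Because the check ignores the `explanation` wording, it passes for every phrasing the stand-in produces. A string-equality assertion on the raw output would fail whenever the phrasing changed. With a real model, a rate below 1.0 is the signal to chase the pattern rather than trust one lucky run.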
Related terms
Inference
Inference is the act of running a trained model to get an answer: text goes in, a prediction comes out. Every message you send to a coding agent is an inference. It is the opposite end of the lifecycle from training.
Next-token prediction
Next-token prediction is the one job a language model does: given the text so far, predict a likely next token, add it, and repeat. It is both the training objective and what runs at inference.
Effort
Effort is a dial for how much internal reasoning a model spends before it answers. Turn it up for genuinely hard problems; you pay for it in latency and extra output tokens.