Newer models let you set an effort level: roughly, how much thinking the model does at inference before it commits to an answer. Low effort answers quickly and cheaply. High effort lets the model work through a problem internally, often producing a chain of reasoning, before it writes the reply you see. You may also see this called reasoning effort or just "thinking."
What you are actually buying
Under the hood, more effort means the model generates more tokens while reasoning, even if a lot of that reasoning is hidden from the final answer. That has direct consequences:
- Latency goes up. More internal work means you wait longer for the first useful output.
- Cost goes up. Reasoning tokens are output tokens, and you pay for them.
- Quality goes up on hard tasks, not easy ones. On a tricky bug spanning several files, high effort earns its keep. On a rename or a one-line change, it is wasted money and time.
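To make the cost point concrete, here is a minimal sketch of the arithmetic. The prices and token counts are made-up illustrative numbers, not any provider's real rates, and `Pricing` and `request_cost` are hypothetical names. The only thing taken from the text above is that reasoning tokens are billed as output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per million tokens (assumed for illustration).
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int,
                 visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Reasoning tokens are added to the visible output tokens because
    both are billed at the output rate.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_mtok
             + billed_output * pricing.output_per_mtok)
    return total / 1_000_000


pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

# Same prompt and same visible answer length; only the hidden reasoning differs.
low_effort = request_cost(pricing, input_tokens=2_000,
                          visible_output_tokens=500, reasoning_tokens=0)
high_effort = request_cost(pricing, input_tokens=2_000,
                           visible_output_tokens=500, reasoning_tokens=8_000)

print(f"low effort:  ${low_effort:.4f}")
print(f"high effort: ${high_effort:.4f}")
```

With these assumed numbers the visible answer is identical in length, yet the high-effort request costs roughly ten times as much, entirely because of the hidden reasoning tokens.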
Match the dial to the task
The skill is choosing deliberately rather than leaving the dial maxed out. Reach for high effort when the problem is genuinely hard: subtle logic, tangled dependencies, tricky planning. Drop to low effort for the routine bulk of coding work, where a fast response is not just cheaper but also keeps you in flow.
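One way to make that choice explicit is a small lookup that maps the kind of task to an effort level. This is only a sketch: the task labels, the `Effort` enum, and `choose_effort` are hypothetical names invented here, and a real tool would expose its own setting and values.

```python
from enum import Enum


class Effort(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical task labels for problems that tend to be genuinely hard.
HARD_TASK_KINDS = frozenset({
    "subtle_logic",
    "tangled_dependencies",
    "planning",
    "multi_file_bug",
})


def choose_effort(task_kind: str) -> Effort:
    """Default to low effort; escalate only for known-hard task kinds."""
    return Effort.HIGH if task_kind in HARD_TASK_KINDS else Effort.LOW


for kind in ("rename", "one_line_change", "multi_file_bug", "planning"):
    print(f"{kind}: {choose_effort(kind).value}")
```

Defaulting to low and escalating by exception mirrors the advice above: most coding work is routine, so high effort should be the deliberate choice rather than the starting point.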
Related terms
Inference
Inference is the act of running a trained model to get an answer: text goes in, a prediction comes out. Every message you send to a coding agent is an inference. It is the opposite end of the lifecycle from training.
Non-determinism
Non-determinism is why the same prompt can give you different answers. At inference the model samples among likely next tokens with a controlled amount of randomness, so runs vary.
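As a rough illustration of that sampling step, the sketch below draws a next token from a toy score table. It assumes temperature scaling, one common way to control the randomness; the tokens, scores, and the function name `sample_next_token` are invented for the example and do not describe any particular model.

```python
import math
import random


def sample_next_token(logits: dict, temperature: float,
                      rng: random.Random) -> str:
    """Pick one token from a {token: score} table.

    temperature <= 0 is treated as greedy decoding (always the top token).
    Higher temperatures flatten the distribution, so runs vary more.
    """
    if temperature <= 0:
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Toy scores for three candidate next tokens (made-up values).
toy_logits = {"return": 2.0, "yield": 1.5, "raise": 0.2}

rng = random.Random()  # unseeded, so repeated runs can differ
print([sample_next_token(toy_logits, 1.0, rng) for _ in range(5)])
print([sample_next_token(toy_logits, 0.0, rng) for _ in range(5)])
```

The first list can change from run to run, while the second is always the top-scoring token, which is the difference between sampled and greedy output.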
Output tokens
Output tokens are the tokens a model generates in its response, including any hidden reasoning. They are usually priced higher than input tokens, and turning up effort produces more of them.