Temperature is a numeric sampling parameter — typically between 0 and 2 — that controls how random or deterministic an LLM's outputs are. At temperature 0 the model always picks the highest-probability next token, producing nearly identical outputs across runs. At temperature 1.0 the model samples from the natural probability distribution. Above 1.0 the distribution is flattened, making low-probability tokens more likely and outputs more creative — and more likely to go off the rails.
Mathematically, each logit is divided by the temperature before softmax turns the logits into probabilities. Dividing by a small temperature sharpens the distribution; dividing by a large one flattens it. Because dividing by 0 is undefined, implementations treat temperature 0 as a special case and take the argmax directly, so no real sampling happens.
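As a rough illustration of the scaling step, the sketch below computes softmax over logits divided by the temperature, and handles temperature 0 separately as an argmax. The function names and the three example logits are made up for this example; real inference stacks do this on tensors, not Python lists.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy_token for temperature 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Temperature 0 special case: index of the largest logit, no sampling."""
    return max(range(len(logits)), key=lambda i: logits[i])

example_logits = [2.0, 1.0, 0.0]  # illustrative values only
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
print("greedy:", greedy_token(example_logits))
```

With these logits, the probability of the top token shrinks as the temperature rises from 0.5 to 2.0, which is the flattening described above.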
The right temperature depends entirely on the task:
- Temperature 0 — deterministic tasks: classification, extraction, structured output, code generation where you want the same answer every time.
- Temperature 0.2–0.5 — reasoning tasks where you want consistency but allow some flexibility.
- Temperature 0.7 — a common default in many tools and tutorials; balanced for general chat. Check your provider's documentation, since some major APIs default to 1.0.
- Temperature 0.9–1.2 — creative tasks: brainstorming, marketing copy, fiction, ideation.
- Temperature >1.5 — chaos. Occasionally useful for research, but rarely suitable for production.
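One way to keep these choices in a single place is a small lookup table in application code. The sketch below is only an illustration: the task labels and the `pick_temperature` helper are hypothetical, and the values are picked from within the ranges listed above.

```python
# Hypothetical task labels; values are chosen from the ranges in the list above.
TEMPERATURE_BY_TASK = {
    "classification": 0.0,
    "extraction": 0.0,
    "structured_output": 0.0,
    "code_generation": 0.0,
    "reasoning": 0.3,
    "general_chat": 0.7,
    "brainstorming": 1.0,
    "fiction": 1.0,
}

def pick_temperature(task, default=0.7):
    """Return the configured temperature for a task, or a default for unknown tasks."""
    return TEMPERATURE_BY_TASK.get(task, default)

print(pick_temperature("extraction"))
print(pick_temperature("brainstorming"))
```

Centralizing the values this way makes it easier to audit which endpoints run hot and to change them without touching call sites.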
In 2026 most production systems run at temperature 0 or close to it, because reproducibility, debuggability and evaluation all benefit from more predictable output. Creative endpoints (image prompt generators, copy variations) push higher.
Temperature is usually paired with top-p (nucleus sampling) and sometimes top-k. Top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches or exceeds p (typically 0.9 or 0.95); top-k keeps only the k most likely tokens. Together they let you tune both the sharpness of the distribution and the size of the candidate set.
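The candidate-set filters can be sketched in a few lines. This is a minimal illustration over a plain list of probabilities, not any particular library's implementation; the function names are invented here, and the cutoff uses "reaches or exceeds p", a boundary convention that can differ between implementations.

```python
def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

example_probs = [0.5, 0.3, 0.15, 0.05]  # illustrative distribution
print(top_p_filter(example_probs, 0.9))  # keeps token indices 0, 1 and 2
print(top_k_filter(example_probs, 2))    # keeps token indices 0 and 1
```

In a typical sampling loop, temperature scaling is applied to the logits first, and the filtered, renormalized distribution is then sampled from.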
A common confusion is treating temperature 0 as fully deterministic. In practice, frontier APIs still produce slightly different outputs across runs at temperature 0 because of GPU non-determinism, batching effects and provider-side load balancing. If you need true reproducibility, you also need a fixed seed (where supported) and ideally self-hosted inference.
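One root cause is that floating-point addition is not associative, so summing the same numbers in a different order (as parallel hardware may do) can change the last bits of a logit. The toy sketch below shows how that can flip a greedy choice when two candidates are nearly tied. The numbers and the `greedy_pick` helper are contrived for illustration and do not model any specific GPU kernel.

```python
def greedy_pick(logit_a, logit_b):
    """Pick token A only if its logit is strictly larger, otherwise token B."""
    return "A" if logit_a > logit_b else "B"

terms = [0.1, 0.2, 0.3]  # contributions that sum to token A's logit

# Same terms, two summation orders, as different reduction schedules might produce.
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

rival_logit = 0.6  # hypothetical competing token B

print(left_to_right == right_to_left)             # False
print(greedy_pick(left_to_right, rival_logit))    # A
print(greedy_pick(right_to_left, rival_logit))    # B
```

Once a single token differs, every later token is conditioned on a different prefix, so small numeric differences can grow into visibly different outputs.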
For a US team building production LLM apps, the operational rule is: lower temperature for anything user-facing that needs to be reliable; higher only when the task explicitly benefits from variety. And always run your evaluation harness at the same temperature you intend to ship at — temperature changes can move benchmark numbers by several points.
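To see why the evaluation temperature matters, consider a toy "model" that is just a fixed set of logits over three candidate answers, where index 0 is correct. The sketch below measures accuracy at several temperatures; everything in it (the logits, the `accuracy` helper, the trial count) is an assumption made up for illustration, not a real evaluation harness.

```python
import math
import random

def sample_answer(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from the scaled softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def accuracy(logits, correct_index, temperature, trials=2000, seed=0):
    """Fraction of trials in which the sampled answer matches the correct one."""
    rng = random.Random(seed)  # fixed seed so the measurement is repeatable
    hits = sum(
        sample_answer(logits, temperature, rng) == correct_index
        for _ in range(trials)
    )
    return hits / trials

toy_logits = [2.0, 1.0, 0.0]  # the correct answer (index 0) is the most likely
for t in (0.0, 0.7, 1.0):
    print(t, accuracy(toy_logits, correct_index=0, temperature=t))
```

In this toy setup the greedy score is perfect while the sampled scores are lower, so a benchmark run at temperature 0 would overstate what users see from an endpoint that ships at a higher temperature.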