Inference is what happens every time you actually use an AI model — the model takes an input and produces an output, with no further learning. Training is build-time; inference is run-time. The same trained weights handle billions of requests across a model's lifetime.
For LLMs, inference is autoregressive generation: the model predicts the next token, appends it to the input, and repeats until a stop condition is met, such as an end-of-sequence token or a length limit. That loop is why streaming responses look like typing: tokens are generated one at a time. The cost and latency of inference scale roughly with how many tokens come in and how many go out.
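The predict, append, repeat loop can be sketched in a few lines. This is a minimal illustration, not how any real LLM is implemented: the "model" is a hypothetical lookup table (`TOY_TABLE`), and the names `generate`, `toy_next_token` and the `<eos>` stop token are assumptions made for the example.

```python
from typing import Callable, List


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    stop_token: str = "<eos>",
    max_new_tokens: int = 16,
) -> List[str]:
    """Greedy autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):      # length limit is one stop condition
        tok = next_token(tokens)         # the model sees everything so far
        if tok == stop_token:            # end-of-sequence is the other
            break
        tokens.append(tok)               # output becomes part of the next input
    return tokens[len(prompt):]          # return only the newly generated tokens


# Hypothetical stand-in for a model: the next token depends only on the last one.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}


def toy_next_token(tokens: List[str]) -> str:
    return TOY_TABLE.get(tokens[-1], "<eos>")


print(generate(toy_next_token, ["the"]))  # ['cat', 'sat']
```

Each pass through the loop corresponds to one token appearing in a streamed response, which is why output length drives latency so directly.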
The economics of inference dominate AI product design in 2026. A query to GPT-5 might cost $0.005; a query to Claude Sonnet 4 with a long context window might cost ten times more; a query to a small open-weight Llama running on your own GPU might cost a tenth as much. Designing a product means routing the right query to the right model, caching repeated work, and capping output length so a runaway response cannot rack up dollars.
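A rough sketch of the arithmetic behind routing and output caps follows. The model names and per-million-token prices are hypothetical placeholders chosen to make the numbers easy to follow, not real price lists, and the `route` rule is deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    usd_per_million_input: float
    usd_per_million_output: float


def query_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens in and tokens out, each at its own rate."""
    return (
        input_tokens * price.usd_per_million_input
        + output_tokens * price.usd_per_million_output
    ) / 1_000_000


# Hypothetical prices, for illustration only.
SMALL = ModelPrice("small-model", 0.20, 0.80)
LARGE = ModelPrice("large-model", 2.00, 10.00)


def route(needs_reasoning: bool) -> ModelPrice:
    """Toy routing rule: send only the hard queries to the expensive model."""
    return LARGE if needs_reasoning else SMALL


def worst_case_cost(price: ModelPrice, input_tokens: int, max_output_tokens: int) -> float:
    """With an output cap, the most a single request can cost is bounded."""
    return query_cost(price, input_tokens, max_output_tokens)


print(f"{query_cost(route(False), 1000, 500):.4f}")  # 0.0006
print(f"{query_cost(route(True), 1000, 500):.4f}")   # 0.0070
```

The cap matters because output length is the one term in the formula that the caller does not control unless a limit is set.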
Several techniques shrink inference cost without losing too much quality:
- Quantisation: storing weights in 8-bit or 4-bit instead of 16-bit, which cuts memory 2–4x and often speeds up inference, usually with little quality loss.
- Distillation: training a smaller student model to imitate a larger teacher; inference is cheaper forever after.
- Speculative decoding: using a small model to draft several tokens and a large model to verify them together in a single pass.
- KV-cache reuse: keeping the attention keys and values already computed for earlier tokens, so follow-up requests do not recompute them.
- Prompt caching: frontier APIs now reuse the encoded prefix of long system prompts, sometimes cutting the cost of the cached input by 90%.
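To make the first item in the list concrete, here is a minimal sketch of symmetric 8-bit quantisation on a plain Python list. It assumes a single scale for the whole tensor and a non-empty input; production schemes typically use per-channel or per-group scales and packed storage, so treat the function names and the approach as illustrative only.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127].

    Assumes `weights` is non-empty.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the difference is the quantisation error."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Each weight is off by at most about half a quantisation step.
print(max_error <= scale / 2 + 1e-12)  # True
```

Each stored value now needs 8 bits instead of 16, plus one shared scale, which is where the memory saving comes from.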
For a US business deploying AI, the practical question is the inference budget per user per month. A free-tier consumer product can spend pennies; a $99/month SaaS can spend dollars; an enterprise contract with seat-based pricing can spend hundreds. Knowing the unit economics from day one prevents the painful realisation, six months in, that every active user is unprofitable.
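The per-user budget check is simple enough to keep in a spreadsheet, but a small sketch shows the shape of the calculation. All figures below (20 requests a day, $0.009 per request, the two budgets) are made-up assumptions for illustration, as are the function names.

```python
def monthly_cost_per_user(
    requests_per_day: float,
    usd_per_request: float,
    active_days: int = 30,
) -> float:
    """Inference spend for one active user over a month."""
    return requests_per_day * usd_per_request * active_days


def within_budget(monthly_cost_usd: float, budget_usd: float) -> bool:
    """True when the user's inference spend fits the budget for their plan."""
    return monthly_cost_usd <= budget_usd


# Hypothetical usage: 20 requests a day at $0.009 each.
cost = monthly_cost_per_user(requests_per_day=20, usd_per_request=0.009)
print(f"{cost:.2f}")                          # 5.40

print(within_budget(cost, budget_usd=10.00))  # True: fits a paid plan that allots $10
print(within_budget(cost, budget_usd=0.05))   # False: far over a pennies-only free tier
```

Running the same numbers against each pricing tier early on shows which tiers need a cheaper model, tighter limits, or a usage cap.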
Inference is also where reliability lives. Models drift, providers go down, prompt regressions slip in. Treating inference as a serious production system — with retries, fallbacks, monitoring and evaluation — separates AI products that scale from demos that crumble at 10x traffic.
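Retries and fallbacks can be sketched as one small wrapper. The providers here are hypothetical local functions standing in for real API clients, and the wrapper catches every exception for brevity; a production version would catch only the transient errors its client library raises and would log each failure for monitoring.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
) -> str:
    """Try each provider in order, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:  # assumption: any error is treated as retryable
                last_error = exc
                time.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


# Hypothetical providers for the demo.
primary_calls = []


def flaky_primary(prompt: str) -> str:
    primary_calls.append(prompt)
    raise ConnectionError("primary provider is down")


def backup(prompt: str) -> str:
    return "backup: " + prompt


print(call_with_fallback([flaky_primary, backup], "hi", base_delay_s=0.0))  # backup: hi
print(len(primary_calls))  # 2
```

The primary is tried twice before the wrapper moves on, so a brief outage costs a little latency rather than a failed request.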