The context window is the maximum number of tokens an LLM can read and reason over in a single request — the upper bound on prompt length plus generated output. In 2026 the practical range stretches from 8k tokens on small open-weight models up to 2 million tokens on frontier flagships, with research releases pushing past 10 million.
The race to longer context has changed what LLMs are useful for. A 200k-token window holds a long book, a small-to-medium codebase or a large batch of customer support transcripts in one shot. That removes the need for a retrieval pipeline in many simple use cases: you can paste the document and ask. For more complex use cases, retrieval is often still cheaper and more accurate than relying on the model to find a needle in a hundred-page haystack.
Three numbers describe a model's context capability:
- Window size: the published maximum. GPT-5, Claude Sonnet 4 and Gemini 2.5 sit in the 200k–2M range.
- Effective context: the largest prompt size at which retrieval accuracy stays high. It is often far smaller than the published maximum; benchmarks like RULER and Needle-in-a-Haystack measure it.
- Cost per token at depth: the per-token price is usually the same at any length, but total cost grows linearly with the tokens you send. A 100k-token query can cost 50–100x more than a 1k-token one when the replies are short.
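The cost multiplier in the last bullet can be checked with a little arithmetic. The sketch below assumes flat per-token pricing; the prices (3 dollars per million input tokens, 15 dollars per million output tokens) and the function names are illustrative assumptions, not any provider's actual rates.

```python
def request_cost(prompt_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.0,     # assumed price, USD per 1M input tokens
                 output_price_per_m: float = 15.0    # assumed price, USD per 1M output tokens
                 ) -> float:
    """Dollar cost of one request under flat per-token pricing."""
    return (prompt_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def cost_ratio(long_prompt: int, short_prompt: int, output_tokens: int) -> float:
    """How many times more the long-prompt request costs, for the same reply length."""
    return (request_cost(long_prompt, output_tokens)
            / request_cost(short_prompt, output_tokens))


if __name__ == "__main__":
    # Compare a 100k-token prompt with a 1k-token prompt at several reply lengths.
    for reply_tokens in (0, 100, 500):
        ratio = cost_ratio(100_000, 1_000, reply_tokens)
        print(f"reply={reply_tokens:>3} tokens  ratio={ratio:.1f}x")
```

Under these assumed prices the multiplier is largest when the reply is very short and shrinks as the reply grows, because output cost is the same for both requests and dilutes the difference in input cost.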
The engineering tradeoffs:
- Long context replaces some RAG — for one-shot questions over a single document, just paste it.
- RAG still wins for large corpora — embedding 10 million documents and retrieving the relevant 5 will always be cheaper than feeding all 10 million to the model.
- Long context enables long agents — agents that hold conversational state over thousands of tool calls need windows that can grow.
- Latency matters — first-token latency on a 1M-token prompt can exceed 30 seconds even on the fastest providers.
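The first two tradeoffs can be turned into a rough decision rule: paste the material when it fits comfortably inside the effective context, retrieve otherwise. This is a minimal sketch, not a recommended policy. The 4-characters-per-token estimate is a common rule of thumb for English text rather than a real tokenizer, and `effective_fraction`, `reserved_output_tokens` and both function names are hypothetical values chosen for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumed ~4 characters per token, not a tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def choose_strategy(documents: list[str], window_tokens: int,
                    effective_fraction: float = 0.5,       # assumed share of the window that stays accurate
                    reserved_output_tokens: int = 2_000    # assumed room kept for the reply
                    ) -> str:
    """Return "paste" if all documents fit the usable budget, else "retrieve"."""
    budget = int(window_tokens * effective_fraction) - reserved_output_tokens
    total = sum(estimate_tokens(doc) for doc in documents)
    return "paste" if total <= budget else "retrieve"


if __name__ == "__main__":
    one_report = ["x" * 40_000]            # roughly 10k tokens
    large_corpus = ["x" * 400_000] * 10    # roughly 1M tokens
    print(choose_strategy(one_report, window_tokens=200_000))
    print(choose_strategy(large_corpus, window_tokens=200_000))
```

In practice the effective fraction would come from measuring the chosen model on your own data, and a real tokenizer would replace the character-count estimate.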
For a US team designing an AI feature in 2026, the right question is not "what is the longest context we can use?" but "what is the smallest context that still gives the model what it needs?" Smaller is usually faster, cheaper and more accurate. The headline number on a model card is a capability, not a recommendation.