Attention: Definition & Meaning | AI Glossary

Attention is the mechanism inside a transformer that lets each token in a sequence look at every other token and decide how much to listen to each one. It is what gives modern LLMs their ability to track long-range dependencies: connecting a pronoun to its antecedent fifty words earlier, holding the structure of an essay across thousands of tokens, or relating a function call to its definition somewhere else in a code file.

Mathematically, attention works by projecting each token into three vectors: a query, a key and a value. The model computes a similarity score between the query of token i and the key of every token j (a dot product, scaled by the square root of the key dimension in the standard formulation), normalises those scores with a softmax into a probability distribution (the attention weights), and uses them to take a weighted sum of the value vectors. The output is a new representation for token i that has been informed by every token it attended to.
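The steps in the previous paragraph can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the function names, the random projection matrices and the toy sizes are assumptions made for the example, and the division by the square root of the key dimension follows the standard scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (n_tokens, d_model)."""
    q = x @ w_q  # queries, one row per token
    k = x @ w_k  # keys
    v = x @ w_v  # values
    d_k = q.shape[-1]
    # scores[i, j] = similarity between the query of token i and the key of token j.
    scores = q @ k.T / np.sqrt(d_k)
    # Each row becomes a probability distribution: the attention weights.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of the value vectors.
    return weights @ v, weights

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = attention(x, w_q, w_k, w_v)
print(out.shape)             # one new representation per token
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

In a trained model the three projection matrices are learned parameters rather than random draws.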

The famous tagline "attention is all you need" is the title of the 2017 paper that introduced the transformer. It referred to the discovery that you can build a sequence model out of stacked attention and feed-forward layers, dropping the recurrent or convolutional structure that previously seemed necessary. That replacement turned out to be wildly more parallelisable on GPUs, which is a large part of why transformers scaled.

Several flavours show up:

Self-attention — tokens attend to other tokens within the same sequence; the standard inside an LLM.
Cross-attention — tokens in one sequence attend to tokens in another; how the decoder in an encoder-decoder model attends to the encoder's output, and how diffusion models attend to text prompts.
Causal (masked) attention — tokens can only attend to earlier positions; required for autoregressive generation so the model cannot peek at future tokens.
Multi-head attention — the operation is run in parallel with multiple sets of learned projections, letting the model capture different relational patterns simultaneously.
Grouped-query / multi-query attention — variants that share keys and values across heads to reduce inference memory; nearly universal in modern open-weight LLMs.
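Several of the flavours above can be combined in one short sketch. The example below is an illustration under stated assumptions, not any particular model's implementation: it takes queries, keys and values that have already been projected per head, shares each key/value head across a group of query heads (grouped-query attention), and applies a causal mask. The function name `causal_gqa` and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_gqa(queries, keys, values):
    """queries: (n_q_heads, n, d); keys and values: (n_kv_heads, n, d)."""
    n_q_heads, n, d = queries.shape
    n_kv_heads = keys.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each key/value head serves `group` consecutive query heads.
    keys = np.repeat(keys, group, axis=0)
    values = np.repeat(values, group, axis=0)
    # One (n, n) score matrix per query head.
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: position i must not see positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

# Toy sizes, purely illustrative: 4 query heads sharing 2 key/value heads.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 5, 8))
keys = rng.normal(size=(2, 5, 8))
values = rng.normal(size=(2, 5, 8))

out, weights = causal_gqa(queries, keys, values)
print(out.shape)
print(np.round(weights[0], 2))  # lower-triangular: no weight on future tokens
```

With as many key/value heads as query heads this reduces to ordinary multi-head attention, and with a single key/value head it becomes multi-query attention. The memory saving comes from storing only the smaller set of key and value heads; the sketch repeats them in memory only to keep the code short.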

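The practical consequence is that attention is quadratic in sequence length: every token is scored against every other token, so doubling the context quadruples the number of query-key scores to compute. A naive implementation that stores the full score matrix quadruples that memory as well, while the cache of keys and values grows only linearly and therefore doubles. That quadratic scaling is why frontier teams have invested so heavily in flash attention, sliding-window attention, ring attention and approximate alternatives. A 2-million-token context window in 2026 is an engineering miracle that would have been infeasible in 2022.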
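A quick back-of-the-envelope calculation makes the scaling concrete. The helper names and the head configuration below are hypothetical, chosen only for illustration; the counts are per layer, and the score count is per head.

```python
def score_matrix_entries(n_tokens):
    # One score for every (query, key) pair.
    return n_tokens * n_tokens

def kv_cache_entries(n_tokens, n_kv_heads, d_head):
    # Keys plus values stored for every token.
    return 2 * n_tokens * n_kv_heads * d_head

# Hypothetical configuration: 8 key/value heads of width 128.
for n in (4096, 8192):
    print(n, score_matrix_entries(n), kv_cache_entries(n, n_kv_heads=8, d_head=128))
```

Doubling `n` multiplies the number of scores by four but the cached keys and values only by two, which is why the pairwise score computation is the part that long-context techniques target most directly.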

Related terms