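The transformer is the neural network architecture introduced in the 2017 Google paper "Attention Is All You Need". It has since become the backbone of essentially every frontier LLM and, increasingly, of image, audio and video models. GPT-5, Claude Sonnet 4, Gemini 2.5, Llama and Mistral are all transformer-based (for the closed models, as far as is publicly known). Diffusion models such as Stable Diffusion 3 and Sora use transformer backbones too, although earlier Stable Diffusion releases paired attention layers with a convolutional U-Net.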
Before the transformer, sequence modelling relied on recurrent networks (LSTMs, GRUs) that processed tokens one at a time. Attention already existed as an add-on to those networks; the transformer's breakthrough was to drop recurrence and build the whole model around attention, which lets every token in a sequence look directly at every other token in parallel. This unlocked two things at once: dramatic quality improvements (the model can capture long-range dependencies) and massive parallelisation (training no longer has to step through a sequence token by token), which together made trillion-parameter models possible.
A transformer block has three main pieces:
- Self-attention: each token computes a weighted combination of the other tokens (in decoder-only LLMs, only the tokens before it) using learned query, key and value projections. This is the only place in the block where information moves between positions.
- Feedforward network: a per-token MLP that further transforms the attended representation.
- Residual connections + layer normalisation: keep activations well scaled and gradients flowing through dozens of stacked blocks.
Stack 30–100 of these blocks, train on trillions of tokens, and you have an LLM.
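The three pieces above can be sketched in plain Python to show how they fit together. This is a minimal illustration under several assumptions, not a production implementation: it uses a single attention head, a causal mask, the pre-norm ordering (normalise before each sub-layer), a ReLU feedforward layer, no learned layer-norm parameters and no positional encoding. All names (`transformer_block`, `init_params`, the `params` keys) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def self_attention(x, wq, wk, wv, causal=True):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # one score per (query, key) pair
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Token i may only attend to positions 0..i.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights

def feed_forward(x, w1, w2):
    """Per-token two-layer MLP with a ReLU in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, params, causal=True):
    """Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    attn_out, weights = self_attention(
        layer_norm(x), params["wq"], params["wk"], params["wv"], causal
    )
    x = add(x, matmul(attn_out, params["wo"]))
    x = add(x, feed_forward(layer_norm(x), params["w1"], params["w2"]))
    return x, weights

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(d_model, d_ff, rng):
    """Random, untrained weights for one block."""
    params = {name: random_matrix(d_model, d_model, rng) for name in ("wq", "wk", "wv", "wo")}
    params["w1"] = random_matrix(d_model, d_ff, rng)
    params["w2"] = random_matrix(d_ff, d_model, rng)
    return params

rng = random.Random(0)
d_model, d_ff, seq_len = 8, 32, 4           # toy sizes, chosen for readability
x = random_matrix(seq_len, d_model, rng)    # stand-in for embedded tokens
blocks = [init_params(d_model, d_ff, rng) for _ in range(3)]
for params in blocks:                       # "stacking" is just a loop
    x, weights = transformer_block(x, params)
print(len(x), len(x[0]))                    # shape is preserved: seq_len x d_model
```

Because every block maps a `seq_len x d_model` matrix to another of the same shape, blocks can be stacked to any depth. The `weights` matrix is `seq_len x seq_len`, which is where the quadratic cost of attention comes from.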
In 2026 the architecture has been refined but not replaced. Variants and improvements:
- Mixture of Experts (MoE): sparse routing where each token activates only a subset of the parameters; how Mixtral 8x7B, DeepSeek V3 and rumoured GPT-5 architectures get capability without proportional inference cost.
- FlashAttention: an exact, memory-efficient attention implementation that avoids materialising the full attention matrix in GPU memory; it made long context windows economically feasible.
- Mamba and state-space models: credible challengers whose compute scales linearly in context length; gaining ground in long-context audio and DNA tasks but not yet displacing transformers in mainstream LLMs.
- Multimodal transformers: images and audio are split into patches, embedded as tokens and fed in alongside text; this powers GPT-5's image understanding and Gemini's video reasoning.
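The sparse routing in the MoE bullet above can be illustrated with a toy top-k router. The sketch assumes 8 experts with 2 active per token and gate weights renormalised over the selected experts, which is one common design rather than the only one. The experts here are trivial stand-in functions, and every name (`route_top_k`, `moe_layer`, `make_expert`) is hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # renormalise over the chosen experts only
    return list(zip(top, gates))

def moe_layer(token, router_weights, experts, k=2):
    """Gate-weighted sum of the k selected experts; the other experts never run."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

def make_expert(scale, calls, idx):
    """Toy expert that scales its input and records that it was called."""
    def expert(token):
        calls.append(idx)
        return [scale * v for v in token]
    return expert

rng = random.Random(0)
d_model, n_experts = 4, 8
router_weights = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_experts)]
calls = []
experts = [make_expert(i + 1, calls, i) for i in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output = moe_layer(token, router_weights, experts, k=2)
print(f"experts run: {len(calls)} of {n_experts}")
```

In a real model each expert is a full feedforward network, so running 2 of 8 means only a fraction of the layer's parameters contribute compute for a given token, even though all of them must still be held in memory.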
As a US developer, you will rarely write transformer code by hand. PyTorch and Hugging Face Transformers give you a working implementation in a few lines. The conceptual understanding still pays off when reasoning about latency (attention compute grows quadratically with context length), cost (longer prompts cost more compute) and capability (some tasks suit MoE, others suit dense models).
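The quadratic latency point can be made concrete with back-of-the-envelope arithmetic. The sketch below uses rough per-layer FLOP approximations: a multiply-add counts as 2 FLOPs, the feedforward layer is assumed to be 4x the model width, and savings from causal masking or optimised kernels are ignored. The model width of 4096 is a hypothetical value, not a figure for any specific model.

```python
def attention_flops(seq_len, d_model):
    """Approximate FLOPs for the score matrix (QK^T) plus the weighted sum over V."""
    return 4 * seq_len ** 2 * d_model

def linear_flops(seq_len, d_model, ff_mult=4):
    """Approximate FLOPs for the Q, K, V, output and feedforward projections."""
    weights_per_layer = 4 * d_model ** 2 + 2 * ff_mult * d_model ** 2
    return 2 * seq_len * weights_per_layer

d_model = 4096  # hypothetical model width
for seq_len in (1_000, 8_000, 32_000, 128_000):
    attn = attention_flops(seq_len, d_model)
    lin = linear_flops(seq_len, d_model)
    share = attn / (attn + lin)
    print(f"{seq_len:>7} tokens: attention is {share:.0%} of per-layer compute")
```

Doubling the context doubles the projection cost but quadruples the attention cost. Under these assumptions the attention term overtakes the projections once the context exceeds about six times the model width, which is why very long prompts get disproportionately expensive.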