Mixture of Experts (MoE) is an architecture where a model is built out of many specialised sub-networks ("experts") and a routing layer decides which experts handle each input token. Only a small fraction of the experts activate per token, so a model with hundreds of billions of total parameters can run at the cost of one with tens of billions. MoE is how frontier providers in 2026 deliver flagship-level capability at competitive inference prices.
A typical MoE layer replaces the dense feedforward block of a transformer with a router plus N expert MLPs. The router computes a small score for each expert, picks the top 1–4, and sends the token through only those. Other experts contribute nothing to that token's output. Across a long sequence, every expert ends up being used somewhere, and during training they specialise — some on code, some on math, some on languages, some on factual recall — through pure optimisation pressure.
Why MoE matters in 2026:
- Capability per dollar — Mixtral 8x7B, DeepSeek V3, Qwen 3 MoE and (rumoured) GPT-5 use sparse routing to deliver flagship capability at a fraction of dense-model inference cost.
- Scalability — adding capacity by adding experts is cheaper than scaling depth or width of a dense model.
- Specialisation — experts naturally divide labour, which can improve quality on heterogeneous workloads.
The downsides:
- Memory — total parameters still have to fit in GPU memory, even if only a fraction activate per token. A 200B-parameter MoE needs the same VRAM as a 200B-parameter dense model.
- Load balancing — naive routing produces "dead experts" that never activate. Modern recipes use auxiliary losses to keep utilisation balanced.
- Inference engineering — efficient MoE inference requires careful batch packing and is more complex than dense transformer inference.
- Fine-tuning is harder — small datasets can collapse the router and degrade quality. Frozen-router fine-tuning and LoRA-on-experts are active areas of research.
For a US developer, the MoE distinction matters mostly when self-hosting open-weight models: a Mixtral 8x22B has 141B total parameters but activates 39B per token, so VRAM requirements are still 141B-class even though latency is 39B-class. When using closed APIs, MoE is invisible — you just notice that frontier models keep getting cheaper without losing capability, which is largely an MoE-driven phenomenon.