Large Language Models

Mixture of Experts (MoE)

An architecture where each input only activates a subset of the model's parameters, decoupling capacity from cost.

In common use since 1991

Mixture of Experts (MoE) is an architecture in which a model is built from many specialised sub-networks ("experts") and a routing layer decides which experts handle each input token. Only a small fraction of the experts activate per token, so a model with hundreds of billions of total parameters can run at roughly the per-token compute cost of a dense model with tens of billions. MoE is a large part of how frontier providers in 2026 deliver flagship-level capability at competitive inference prices.

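A typical MoE layer replaces the dense feedforward block of a transformer with a router plus N expert MLPs. The router computes a score for each expert, selects the top k (commonly 1 or 2, though some recent designs route each token to 8 or more), and sends the token through only those; the layer output is the gate-weighted sum of the selected experts' outputs. The other experts contribute nothing to that token's output. Across a long sequence, most experts end up being used somewhere, and during training they specialise through optimisation pressure alone. The specialisation is often described by domain (code, math, languages, factual recall), although published analyses of open models such as Mixtral suggest it tends to follow token-level and syntactic patterns more than clean topic boundaries.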
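
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the dimensions, the random weights, the ReLU expert MLPs, and the choice to apply softmax over only the selected experts' scores are all assumptions made for the example, and the names make_expert and moe_layer are hypothetical.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix by a length-cols vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def make_expert(rng, d_model, d_hidden):
    # Hypothetical expert: a two-layer MLP with ReLU and random weights.
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)
    return expert

def moe_layer(token, router_weights, experts, top_k=2):
    # 1. The router produces one score per expert.
    scores = matvec(router_weights, token)
    # 2. Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # 3. Turn the chosen scores into gate weights that sum to 1 (an assumption here).
    gates = softmax([scores[i] for i in chosen])
    # 4. Run only the chosen experts and mix their outputs by gate weight.
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS = 8, 16, 4
experts = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_EXPERTS)]
router_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

output, chosen = moe_layer(token, router_weights, experts, top_k=2)
print("experts used for this token:", chosen)
```

With top_k=2 and four experts, half of the expert parameters are never touched for this token, which is the source of the compute saving. Production systems do the same thing in batched tensor form and group tokens by expert before running them.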

Why MoE matters in 2026:

  • Capability per dollar — Mixtral 8x7B, DeepSeek V3 and Qwen 3 MoE use sparse routing to deliver flagship capability at a fraction of dense-model inference cost, and GPT-5 is rumoured to do the same.
  • Scalability — adding experts increases capacity without a proportional increase in per-token compute, which makes it cheaper than scaling the depth or width of a dense model.
  • Specialisation — experts naturally divide labour, which can improve quality on heterogeneous workloads.

The downsides:

  • Memory — all parameters still have to fit in GPU memory, even if only a fraction activate per token. A 200B-parameter MoE needs roughly the same VRAM for its weights as a 200B-parameter dense model.
  • Load balancing — naive routing tends to collapse onto a few favoured experts and leave "dead experts" that never activate. Modern recipes typically add auxiliary load-balancing losses to keep utilisation even.
  • Inference engineering — efficient MoE inference requires careful batch packing and is more complex than dense transformer inference.
  • Fine-tuning is harder — small datasets can collapse the router and degrade quality. Frozen-router fine-tuning and LoRA-on-experts are active areas of research.
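
To make the load-balancing point concrete, here is a sketch of one widely used auxiliary loss, in the style introduced with the Switch Transformer: N times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). It is shown as an assumption about what "auxiliary loss" can look like, not as the recipe of any model named above; the function name load_balance_loss, the top-1 routing, and the toy probabilities are illustrative.

```python
def load_balance_loss(router_probs, n_experts):
    # router_probs: one probability vector (length n_experts) per token.
    n_tokens = len(router_probs)
    counts = [0] * n_experts        # tokens whose top-1 choice is each expert
    prob_sums = [0.0] * n_experts   # summed router probability per expert
    for probs in router_probs:
        top = max(range(n_experts), key=lambda i: probs[i])
        counts[top] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    frac_tokens = [c / n_tokens for c in counts]
    mean_prob = [s / n_tokens for s in prob_sums]
    # Equals 1.0 under perfectly uniform routing and grows as routing collapses.
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

N_EXPERTS = 4
base = [0.7, 0.1, 0.1, 0.1]

# Balanced: each of four tokens prefers a different expert.
balanced = [base[-shift:] + base[:-shift] for shift in range(N_EXPERTS)]
# Collapsed: every token prefers expert 0, so experts 1-3 are "dead".
collapsed = [base[:] for _ in range(N_EXPERTS)]

balanced_loss = load_balance_loss(balanced, N_EXPERTS)
collapsed_loss = load_balance_loss(collapsed, N_EXPERTS)
print(f"balanced:  {balanced_loss:.2f}")   # 1.00
print(f"collapsed: {collapsed_loss:.2f}")  # 2.80
```

In training, a term like this is scaled by a small coefficient and added to the main loss, so the router is nudged toward even utilisation without overriding the task objective.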

For a US developer, the MoE distinction matters mostly when self-hosting open-weight models: Mixtral 8x22B has 141B total parameters but activates 39B per token, so VRAM requirements are still 141B-class even though per-token compute, and with it latency, is closer to 39B-class. When using closed APIs, MoE is invisible: you just notice that frontier models keep getting cheaper without losing capability, which is in large part an MoE-driven phenomenon.
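
The gap between memory and compute can be checked with back-of-the-envelope arithmetic. The sketch below uses the Mixtral 8x22B figures quoted above and assumes 2 bytes per parameter (16-bit weights); it counts weights only and ignores the KV cache, activations and framework overhead, so real VRAM needs will be higher. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    # Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes).
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 141e9   # all experts must be resident in memory
ACTIVE_PARAMS = 39e9   # parameters actually used for each token

total_gb = weight_memory_gb(TOTAL_PARAMS)
active_gb = weight_memory_gb(ACTIVE_PARAMS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights to store:        {total_gb:.0f} GB")      # 282 GB
print(f"weights read per token:  {active_gb:.0f} GB")     # 78 GB
print(f"active share of weights: {active_fraction:.0%}")  # 28%
```

Under these assumptions the model must be provisioned like a 141B model but does per-token work closer to a 39B one. Quantising to fewer bytes per parameter shrinks both numbers proportionally without changing the ratio.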
