Distillation (also called knowledge distillation) is the technique of training a smaller, cheaper "student" model to imitate the outputs of a larger "teacher" model. The student inherits much of the teacher's capability while running at a fraction of the cost. Distillation is the engine behind cheap-but-capable model variants like GPT-5 mini, Claude Haiku, Gemini Flash and most open-weight 7B models that punch above their weight.
The basic recipe: take a strong teacher (usually a frontier model) and a smaller student architecture. Generate a large dataset of teacher outputs: responses to prompts, full probability distributions over tokens, or chain-of-thought explanations. Train the student to match the teacher's outputs on that data. The student learns the teacher's behaviour without needing to discover it from scratch.
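The data-generation step of this recipe can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `build_distillation_records`, `to_jsonl` and `toy_teacher` are hypothetical names, and `toy_teacher` is a stand-in for whatever real teacher model call you would make.

```python
import json
from typing import Callable, Iterable


def build_distillation_records(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict]:
    """Query the teacher once per prompt and keep (prompt, response) pairs."""
    records = []
    for prompt in prompts:
        response = teacher(prompt)
        records.append({"prompt": prompt, "response": response})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialise records as JSON Lines: one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def toy_teacher(prompt: str) -> str:
    # Placeholder (assumption): a real pipeline would call the teacher model here.
    return f"Teacher answer to: {prompt}"


records = build_distillation_records(
    ["Summarise this ticket.", "Classify this email."],
    toy_teacher,
)
jsonl = to_jsonl(records)
```

The resulting JSON Lines text is the kind of file a supervised fine-tuning job for the student would typically consume, though the exact field names depend on the training framework.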
Three main flavours:
- Hard-label distillation: train the student on (prompt, teacher_response) pairs as if they were ordinary supervised data. Simplest, and very effective when the teacher is much better than the student would be from scratch.
- Soft-label distillation: match the teacher's full probability distribution over tokens, not just its top choice, typically by minimising the divergence between temperature-softened teacher and student distributions. Captures more of the teacher's "uncertainty profile". This is the classic formulation from Hinton et al. (2015).
- Reasoning distillation: generate the teacher's chain of thought and train the student to reproduce both the reasoning and the answer. Powers the recent wave of small reasoning models.
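To make the difference between hard and soft labels concrete, here is a small pure-Python sketch of both losses for a single token position. It assumes you have raw logits from teacher and student over the same vocabulary; the function names and the default temperature of 2.0 are illustrative choices, not values from any particular system. The soft-label loss follows the Hinton-style setup: KL divergence between temperature-softened distributions, scaled by T² so that gradient magnitudes stay comparable across temperatures.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def hard_label_loss(teacher_logits, student_logits):
    """Cross-entropy against only the teacher's top choice."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -log_softmax(student_logits)[target]


teacher = [2.0, 1.0, 0.1]
matching_student = [2.0, 1.0, 0.1]
mismatched_student = [0.1, 1.0, 2.0]

loss_match = soft_label_loss(teacher, matching_student)
loss_mismatch = soft_label_loss(teacher, mismatched_student)
```

The soft-label loss is zero only when the student reproduces the whole distribution, whereas the hard-label loss looks solely at the probability the student assigns to the teacher's top token.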
In 2026 distillation is everywhere. The economic logic is overwhelming: a frontier teacher costs $5–15 per million output tokens, while a distilled student running locally or on a small GPU costs cents for the same volume. For high-volume workloads, the savings reach six and seven figures. Modern distillation pipelines are increasingly automated: a strong teacher generates thousands of synthetic training examples per hour, and the student trains continuously on the result.
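A quick worked example shows how the arithmetic plays out. The teacher price below is the midpoint of the $5–15 range quoted above; the student price ($0.20 per million tokens) and the workload (20 billion output tokens per month) are assumptions chosen purely for illustration.

```python
def monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Cost in USD for a given monthly output-token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


TOKENS_PER_MONTH = 20_000_000_000  # assumed workload: 20B output tokens per month
TEACHER_PRICE = 10.00              # USD per million tokens, midpoint of $5-15
STUDENT_PRICE = 0.20               # USD per million tokens (assumption)

teacher_cost = monthly_cost(TOKENS_PER_MONTH, TEACHER_PRICE)  # 200,000 USD
student_cost = monthly_cost(TOKENS_PER_MONTH, STUDENT_PRICE)  # 4,000 USD

monthly_savings = teacher_cost - student_cost  # 196,000 USD
annual_savings = 12 * monthly_savings          # 2,352,000 USD
cost_ratio = teacher_cost / student_cost       # 50x cheaper
```

Under these assumed numbers the savings are six figures per month and seven figures per year. A real comparison would also need to account for the one-off cost of generating the training data and fine-tuning the student.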
Two important constraints:
- Licensing: many providers' terms of service prohibit using their model outputs to train competing models. OpenAI, Anthropic and Google all have such clauses; enforcement is uneven but the legal risk is real, especially for commercial models. Open-weight teachers (Llama, Mistral, DeepSeek) usually avoid the problem, but not automatically: each ships under its own license, and some (Llama's community license, for example) attach conditions to models trained on their outputs, so check the terms for the specific model and version.
- Quality ceiling: a student is, in general, bounded by its teacher. Distilling from GPT-5 cannot produce a model better than GPT-5 on the tasks where the teacher itself is the bottleneck.
For a US business team, distillation is usually the highest-ROI path to cheap inference once you have stabilised on a workflow. Identify the high-volume, well-defined tasks; have a frontier model generate thousands of clean examples; fine-tune a small open-weight model on the result; ship at roughly 1/50th the cost.
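"Clean" examples rarely come straight out of the teacher, so a filtering pass usually sits between generation and fine-tuning. The sketch below shows one plausible version of that step. The function name, the length limit and the refusal markers are all assumptions for illustration; real pipelines typically add task-specific checks on top.

```python
def clean_examples(
    records,
    max_chars=4000,
    refusal_markers=("I can't", "I cannot", "As an AI"),
):
    """Drop empty, over-long, refused and duplicate teacher examples."""
    seen_prompts = set()
    cleaned = []
    for record in records:
        prompt = record["prompt"].strip()
        response = record["response"].strip()
        if not prompt or not response:
            continue  # nothing to learn from an empty pair
        if len(response) > max_chars:
            continue  # likely a runaway generation
        if any(response.startswith(marker) for marker in refusal_markers):
            continue  # do not teach the student to refuse
        key = prompt.lower()
        if key in seen_prompts:
            continue  # keep only the first copy of a repeated prompt
        seen_prompts.add(key)
        cleaned.append({"prompt": prompt, "response": response})
    return cleaned


raw = [
    {"prompt": "Classify: refund request", "response": "billing"},
    {"prompt": "classify: refund request ", "response": "billing"},
    {"prompt": "Classify: password reset", "response": ""},
    {"prompt": "Classify: angry customer", "response": "I cannot help with that."},
    {"prompt": "Classify: shipping delay", "response": "logistics"},
]
cleaned = clean_examples(raw)
```

Here only the first and last records survive: the second is a duplicate after normalisation, the third has an empty response, and the fourth is a refusal.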