RLHF (reinforcement learning from human feedback) is the training technique that turned instruction-tuned LLMs into the helpful, polished assistants we use in 2026. It works by collecting human preference judgements over pairs of model outputs, training a reward model to predict those judgements, and then optimising the LLM to produce responses the reward model rates highly.
The full RLHF pipeline:
- Start with an instruction-tuned model, i.e. one that has already been through supervised fine-tuning (SFT).
- Sample many (prompt, response_A, response_B) triples and ask humans which response is better.
- Train a reward model on the preferences: given a prompt and a response, it outputs a scalar score, and it is trained so that the human-preferred response in each pair scores higher than the rejected one.
- Optimise the LLM with reinforcement learning (typically PPO) to maximise the reward model's predictions, with a KL-divergence penalty to prevent the policy from drifting too far from the SFT baseline.
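The reward-model and KL-penalty steps above can be sketched numerically. This is a minimal illustration, not a training loop: it assumes the common Bradley-Terry pairwise loss for the reward model and a single-sample log-ratio estimate of the KL term. The function names and the default `beta=0.1` are assumptions made for the example, and real systems compute these quantities over batches of token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss for one preference pair.

    The loss shrinks as the reward model scores the human-preferred
    response further above the rejected one.
    """
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalised_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # assumed penalty strength, tuned in practice
) -> float:
    """Reward-model score minus beta times the policy/reference log-ratio.

    The log-ratio is a single-sample estimate of the KL divergence between
    the current policy and the frozen SFT reference for this response.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Equal scores: the reward model is undecided, so the loss is log(2).
    print(pairwise_reward_loss(0.0, 0.0))
    # The policy assigns the response more probability than the reference
    # does, so part of the reward is taken back as a drift penalty.
    print(kl_penalised_reward(reward=1.0, logp_policy=-12.0, logp_reference=-15.0))
```

The RL step then maximises the penalised reward rather than the raw score, which is what keeps the policy close to the SFT model.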
The result is typically a model that is more helpful, better at following instructions and more willing to refuse harmful requests. It is not guaranteed to be more honest, more concise or better calibrated about uncertainty; the downsides listed below work against all three. ChatGPT was the first widely used product to demonstrate the dramatic effect of RLHF at scale; every frontier provider in 2026 uses some variant.
RLHF has well-documented downsides:
- Mode collapse: output diversity narrows as the model converges on the style of response that raters rank well rather than the most informative one. Long, hedged, polite responses can dominate even when a short direct one is better.
- Reward hacking: the reward model is itself a noisy proxy and can be gamed; models discover shortcuts that score well without being good.
- Sycophancy: humans tend to prefer responses that agree with their premises; models trained on those preferences learn to flatter.
- Cost and complexity: running the full PPO loop requires careful infrastructure, several models in memory at once (policy, reference, reward model and usually a value model) and carefully tuned hyperparameters.
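The reward hacking and verbosity problems above can be shown with a toy example. Everything here is hypothetical: the `Candidate` class, the `true_quality` numbers and the `length_bias` term are invented to show how a proxy reward with a small unintended bias can prefer a worse response. It is not a model of any real reward model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    true_quality: float  # hypothetical ground truth, not observable in practice


def proxy_reward(candidate: Candidate, length_bias: float = 0.05) -> float:
    """Hypothetical flawed reward model: tracks quality but also pays per word."""
    return candidate.true_quality + length_bias * len(candidate.text.split())


candidates = [
    Candidate("Use a mutex.", true_quality=0.9),
    Candidate(
        "Great question! There are many possible considerations here, and it "
        "really depends on several factors, but broadly speaking you might "
        "possibly want to think about locking.",
        true_quality=0.4,
    ),
]

# Optimising against the proxy selects the long, hedged answer...
best_by_proxy = max(candidates, key=proxy_reward)
# ...while the (normally unobservable) true quality favours the short one.
best_by_truth = max(candidates, key=lambda c: c.true_quality)

if __name__ == "__main__":
    print("proxy picks:", best_by_proxy.text[:30])
    print("truth picks:", best_by_truth.text[:30])
```

A policy trained hard against such a proxy would be pushed toward whatever the bias rewards, which is one reason the KL penalty and periodic fresh human evaluation matter.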
Newer alternatives are eating into RLHF's dominance:
- DPO (Direct Preference Optimisation): skips the separate reward model and RL loop, optimising the policy directly on preference pairs against a frozen reference model. Simpler, cheaper, and often comparable in quality.
- KTO (Kahneman-Tversky Optimisation): works with single-response thumbs-up / thumbs-down feedback rather than pairs.
- RLAIF (RL from AI Feedback): replaces human raters with a strong AI judge. Far more scalable, with quality reported to be close to RLHF on many tasks.
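To make the DPO entry above concrete, here is a minimal sketch of the per-pair DPO loss using plain floats. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model. The argument names and the default `beta=0.1` are assumptions for the example; in practice this runs over batches of tensors inside a training framework.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may move
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # The loss falls as the policy favours the chosen response, relative
    # to the reference, more strongly than it favours the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model appears anywhere in the computation: the log-ratio against the reference plays the role of an implicit reward, which is why DPO needs only two models in memory instead of PPO's several.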
For a US team fine-tuning open-weight models in 2026, DPO is now the default starting point because it is much easier to operate than PPO. Full PPO-based RLHF remains largely the territory of frontier labs and well-funded specialists.