Regularization

Techniques that constrain a model so it generalises to new data instead of memorising the training set.

In common use since 1962

Regularization is the family of techniques that prevent a model from memorising its training data and push it to learn patterns that generalise. A model with enough capacity can reach near-perfect training accuracy simply by memorising every example, yet perform far worse on data it has not seen. Regularization is the discipline of trading a little training-set performance for a lot of held-out performance.

The classic regularizers come from statistics:

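  • L1 regularization (Lasso) — adds the sum of absolute weights, scaled by a coefficient, to the loss. Pushes many weights to exactly zero, producing sparse models that are easier to interpret.
  • L2 regularization (Ridge / weight decay) — adds the sum of squared weights, scaled by a coefficient, to the loss. Keeps all weights small without zeroing them, smoothing the function the model represents. Under plain SGD this is equivalent to weight decay; with adaptive optimisers such as Adam the two differ, which is why AdamW applies the decay separately from the gradient. Still the default for most neural networks in 2026.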
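
The difference between the two penalties can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the names lam (penalty coefficient), lr (step size), soft_threshold and l2_shrink are chosen here for the example, and the weights are made-up numbers. The soft-threshold step is the standard proximal update for an L1 term; it shows why L1 lands on exact zeros while an L2 gradient step only scales weights down.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam * sum(|w|), added to the data loss."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge term: lam * sum(w^2), added to the data loss."""
    return lam * sum(w * w for w in weights)


def soft_threshold(w, t):
    """Proximal step for the L1 term: move w toward zero by t, clamping at exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l2_shrink(w, lr, lam):
    """Gradient step on lam * w^2 alone: multiplies w by (1 - 2 * lr * lam)."""
    return w * (1.0 - 2.0 * lr * lam)


# Hypothetical weights and hyperparameters, for illustration only.
weights = [0.8, -0.03, 0.02, -1.5]
lam, lr = 0.1, 0.5

sparse = [soft_threshold(w, lr * lam) for w in weights]  # small weights become 0.0
shrunk = [l2_shrink(w, lr, lam) for w in weights]        # every weight stays non-zero

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))
print("after L1 step:", sparse)
print("after L2 step:", shrunk)
```

With these values the two small weights are set to exactly zero by the L1 step, while the L2 step leaves all four weights non-zero but smaller.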

Deep learning added several more:

  • Dropout — randomly zero out a fraction of activations during training; forces the network to develop redundant representations.
  • Early stopping — halt training when validation loss stops improving, regardless of what training loss is doing.
  • Data augmentation — synthesise variations of training examples (flips, crops, paraphrases). Effectively gives the model a much larger dataset.
  • Batch normalisation / layer normalisation — normalise activations as they flow through the network; speeds training and acts as a mild regularizer.
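
Two of the techniques above, dropout and early stopping, are simple enough to sketch without a framework. The code below is an illustrative sketch under stated assumptions: it uses the common "inverted" dropout formulation (survivors are rescaled by 1/(1-p) so nothing changes at inference time) and a patience-based stopping rule. The function names and the patience and min_delta parameters are chosen for this example and are not taken from any particular library.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p,
    and rescale the survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # identity at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Scan a validation-loss history and return (stop_epoch, best_epoch).

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss by more than `min_delta`. stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation curve: improves for three epochs, then drifts upward.
history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping(history, patience=3)
print("stop at epoch", stop_epoch, "and restore weights from epoch", best_epoch)

print(dropout([1.0] * 8, p=0.5, rng=random.Random(0)))
```

For the history shown, the rule stops at epoch 5 and points back to epoch 2 as the checkpoint to keep. In practice the stopping rule is paired with checkpointing so the best weights can be restored.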

For LLMs in 2026, the dominant regularizer is implicit: an enormous, diverse pretraining dataset that is hard to memorise. Explicit regularizers play a supporting role, chiefly weight decay in the optimiser and, in recipes that still use it, dropout. Fine-tuning recipes use small learning rates and short schedules to avoid overfitting the small fine-tuning set and overwriting what the base model already knows. LoRA and QLoRA constrain the update further by training only a small set of adapter weights, leaving the bulk of the model frozen.
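
To make the adapter idea concrete, here is a small sketch of a LoRA-style layer in plain Python. It assumes the usual formulation, in which a frozen weight matrix W is augmented by a low-rank product B·A scaled by alpha / rank; the helper names, the matrix sizes, and the alpha value are illustrative choices rather than settings from any specific recipe.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def full_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights with adapters: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Toy layer: 2 inputs, 2 outputs, rank-1 adapter (all values hypothetical).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]   # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [2.0]]
x = [1.0, 2.0]

print(lora_forward(W, A, B_zero, x, alpha=2.0, rank=1))
print(lora_forward(W, A, B_trained, x, alpha=2.0, rank=1))

# Parameter count for one 4096 x 4096 projection with a rank-8 adapter.
print(lora_params(4096, 4096, 8), "trainable vs", full_params(4096, 4096), "frozen")
```

For a 4096 x 4096 projection, a rank-8 adapter trains 65,536 weights against 16,777,216 in the full matrix, roughly 0.4%. That small trainable footprint is what limits how far fine-tuning can move the model.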

The practical signal that you need more regularization is the gap between training and validation loss. Training loss going down while validation loss goes up is overfitting in its purest form. The fix is rarely a fancier algorithm; it is usually more data, fewer parameters, more dropout, shorter training, or some combination. An engineer asked to "improve the model" should always check the train/val gap before reaching for a new architecture.
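
The check described above can be automated with a few lines of code. The sketch below is one possible heuristic, not a standard API: the function name diagnose, the window parameter, and the verdict strings are assumptions made for this example, and the loss curves are invented.

```python
def diagnose(train_losses, val_losses, window=3):
    """Compare loss trends over the last `window` epochs.

    Returns a dict with the current train/val gap and a coarse verdict.
    """
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    if len(train_losses) <= window:
        raise ValueError("need more than `window` epochs of history")

    train_delta = train_losses[-1] - train_losses[-1 - window]
    val_delta = val_losses[-1] - val_losses[-1 - window]
    gap = val_losses[-1] - train_losses[-1]

    if train_delta < 0 and val_delta > 0:
        verdict = "overfitting"        # training improves while validation degrades
    elif train_delta < 0 and val_delta < 0:
        verdict = "still improving"    # both curves are heading down
    else:
        verdict = "inconclusive"
    return {"gap": gap, "verdict": verdict}


# Invented curves: validation loss bottoms out at epoch 2 and then climbs.
train = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18]
val = [1.1, 0.85, 0.75, 0.78, 0.84, 0.93]
print(diagnose(train, val))
```

On the curves shown, the verdict is "overfitting", which points toward the remedies listed above rather than a new architecture.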
