Gradient descent is the optimisation algorithm at the heart of essentially every neural network trained in the last decade. The idea is geometrically simple: think of the loss function as a landscape, where every point represents a configuration of model weights and the height represents how wrong the model is. Gradient descent reads the slope at the current point and steps downhill, repeating until it finds a low spot.
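As a rough illustration of the "read the slope, step downhill" loop, here is a minimal sketch on a made-up two-parameter landscape. The bowl-shaped `loss`, its hand-written `grad`, the starting point and the `learning_rate` value are all assumptions chosen for the example, not anything a real model would use.

```python
def loss(w):
    # Hypothetical toy landscape: a bowl whose lowest point is at (3, -1).
    x, y = w
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def grad(w):
    # Partial derivatives of the toy loss with respect to x and y.
    x, y = w
    return (2.0 * (x - 3.0), 4.0 * (y + 1.0))


def gradient_descent(w, learning_rate=0.1, steps=100):
    # Repeatedly read the slope at the current point and step against it.
    for _ in range(steps):
        g = grad(w)
        w = tuple(wi - learning_rate * gi for wi, gi in zip(w, g))
    return w


w_start = (0.0, 0.0)
w_final = gradient_descent(w_start)
print(loss(w_start), loss(w_final))
```

On this toy bowl the final point lands very close to (3, -1). Real loss landscapes are far less friendly, which is why the variants and tuning knobs matter.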
Mathematically, the gradient is the vector of partial derivatives of the loss with respect to every parameter. Backpropagation is the algorithm that computes that vector efficiently for a deep network by applying the chain rule backwards from the loss, layer by layer. Together, gradient descent and backpropagation are what allow a network with billions of parameters to be trained in a practical amount of time.
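To make the chain-rule idea concrete, the sketch below backpropagates by hand through a hypothetical two-weight network, `y_hat = w2 * tanh(w1 * x)`, and compares the result with a finite-difference estimate. The network, the input values and the helper names are illustrative assumptions. Real frameworks automate the `backward` step.

```python
import math


def forward(w1, w2, x, y):
    # Toy network: one tanh hidden unit, one output weight, squared error.
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    return loss, h, y_hat


def backward(w1, w2, x, y):
    # Apply the chain rule from the loss back to each weight.
    _, h, y_hat = forward(w1, w2, x, y)
    d_yhat = 2.0 * (y_hat - y)        # dLoss/dy_hat
    d_w2 = d_yhat * h                 # dLoss/dw2
    d_h = d_yhat * w2                 # dLoss/dh
    d_w1 = d_h * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def numerical_gradient(w1, w2, x, y, eps=1e-6):
    # Central finite differences: slow, but a useful correctness check.
    d_w1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
    d_w2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)
    return d_w1, d_w2


analytic = backward(0.5, -1.2, x=2.0, y=1.0)
numeric = numerical_gradient(0.5, -1.2, x=2.0, y=1.0)
print(analytic, numeric)
```

The finite-difference version needs two forward passes per parameter, while the backward pass gets every derivative in one sweep. That difference is the efficiency argument at scale.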
In practice, vanilla gradient descent, which computes the gradient over the entire dataset before every step, is rarely used. The variants you actually meet in 2026:
- Stochastic gradient descent (SGD) — uses one example or a small mini-batch per step. Noisier but vastly cheaper than computing the full gradient.
- Momentum — keeps a running average of past gradients so updates accelerate in consistent directions and dampen in oscillating ones.
- Adam / AdamW — the dominant optimiser for transformers; combines momentum with per-parameter learning rate scaling. AdamW, which applies weight decay directly to the weights instead of folding it into the gradient, is the default for almost every LLM trained today.
- Lion — a newer optimiser from Google that uses sign-based updates; often matches Adam at lower memory cost.
- Sophia — second-order-aware optimiser that has shown promise on LLM training but is not yet mainstream.
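The update rules behind three of these variants can be sketched in a few lines each. This is a simplified illustration using plain Python lists, not a reference implementation. The function names, the `state` dictionaries, the hyperparameter defaults and the toy gradient are assumptions made for the example. Sophia is left out because it also needs a curvature estimate.

```python
import math


def momentum_step(w, g, state, lr=0.1, mu=0.9):
    # Accumulate a velocity from past gradients, then step along it.
    state["v"] = [mu * v + gi for v, gi in zip(state["v"], g)]
    return [wi - lr * v for wi, v in zip(w, state["v"])]


def adamw_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.01):
    # Running averages of the gradient (m) and the squared gradient (v).
    state["t"] += 1
    t = state["t"]
    state["m"] = [b1 * m + (1 - b1) * gi for m, gi in zip(state["m"], g)]
    state["v"] = [b2 * v + (1 - b2) * gi * gi for v, gi in zip(state["v"], g)]
    new_w = []
    for wi, m, v in zip(w, state["m"], state["v"]):
        m_hat = m / (1 - b1 ** t)           # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        wi = wi - lr * weight_decay * wi    # decoupled weight decay
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_w


def lion_step(w, g, state, lr=0.01, b1=0.9, b2=0.99, weight_decay=0.0):
    # Only the sign of the blended gradient is used, so each weight moves
    # by lr (plus weight decay) regardless of the gradient's size.
    new_w, new_m = [], []
    for wi, gi, m in zip(w, g, state["m"]):
        blended = b1 * m + (1 - b1) * gi
        direction = (blended > 0) - (blended < 0)
        new_w.append(wi - lr * (direction + weight_decay * wi))
        new_m.append(b2 * m + (1 - b2) * gi)
    state["m"] = new_m
    return new_w


def toy_loss(w):
    # Hypothetical bowl-shaped loss with its minimum at (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def run(step_fn, state, steps=200):
    w = [0.0, 0.0]
    for _ in range(steps):
        w = step_fn(w, toy_grad(w), state)
    return w


w_momentum = run(momentum_step, {"v": [0.0, 0.0]})
w_adamw = run(adamw_step, {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]})
lion_state = {"m": [0.0, 0.0]}
w_lion_one_step = lion_step([0.0, 0.0], toy_grad([0.0, 0.0]), lion_state)
print(toy_loss(w_momentum), toy_loss(w_adamw), w_lion_one_step)
```

Note the memory difference visible in the `state` dictionaries: the Adam-style step keeps two running values per weight, while the sign-based step keeps one.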
The practical knobs an engineer tunes are the learning rate (how big each step is), the schedule (how the learning rate changes over time, usually warmup-then-decay) and the batch size (how many examples per step). Get the learning rate wrong by a factor of 10 and training can either stall or diverge. A reliable habit is to start with a learning rate one order of magnitude smaller than seems sensible, then scale up.
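Both points can be sketched in a few lines. The first function is one common shape for a warmup-then-decay schedule (linear warmup followed by cosine decay). The second shows how a tenfold change in learning rate plays out on a hypothetical one-parameter loss. The function names, the peak and minimum rates, the step counts and the toy loss are all assumptions for illustration.

```python
import math


def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    # Linear warmup from near zero up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def final_loss(learning_rate, steps=50):
    # Toy loss L(w) = 2 * w^2 with gradient 4 * w, starting from w = 1.
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 4.0 * w
    return 2.0 * w * w


for lr in (0.006, 0.06, 0.6):
    # Each rate is ten times the previous one.
    print(lr, final_loss(lr))
```

On this toy problem the smallest rate has barely moved after 50 steps, the middle one converges, and the largest overshoots further on every step and blows up.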
For a US engineer adopting frontier models, gradient descent is mostly under the hood — you do not implement it, you call a training loop that uses it. But understanding why it works and where it fails (saddle points, vanishing gradients, instability at large scale) is what separates someone who can debug a training run from someone who can only restart it.