Backpropagation is the algorithm that makes neural network training feasible at scale. Given a network and a loss value, it efficiently computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer, working backwards from the output to the input.
Without backpropagation, computing those gradients would mean differentiating the entire network by hand, or perturbing one parameter at a time and rerunning the network for each one, which is a non-starter when the network has billions of parameters. With backpropagation, the gradients for every parameter come from one forward pass plus one backward pass, and the backward pass typically costs only a small constant multiple of the forward pass. That efficiency is what made deep learning practical.
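To make the cost argument concrete, here is a minimal sketch of the naive alternative: estimating each partial derivative by nudging one parameter and rerunning the model. The toy `loss` function, the parameter values and the step size `eps` are all illustrative assumptions, not part of any framework. The counter shows the pattern that does not scale: one extra forward evaluation per parameter.

```python
# Naive gradient estimation by finite differences (illustrative sketch).
# Each parameter needs its own extra forward evaluation, which is the
# cost that backpropagation avoids.

forward_calls = 0  # counts how many times the toy "network" is evaluated


def loss(params, x=2.0, target=1.0):
    """Hypothetical toy model: y = w2 * (w1 * x + b1) + b2, squared error."""
    global forward_calls
    forward_calls += 1
    w1, b1, w2, b2 = params
    y = w2 * (w1 * x + b1) + b2
    return (y - target) ** 2


def finite_difference_grad(params, eps=1e-6):
    """Estimate d(loss)/d(param) for every parameter, one at a time."""
    base = loss(params)
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps  # nudge a single parameter
        grads.append((loss(bumped) - base) / eps)
    return grads


params = [0.5, -0.3, 1.5, 0.1]
grads = finite_difference_grad(params)
print(forward_calls)  # one baseline call plus one call per parameter
print(grads)
```

With four parameters this costs five evaluations. The same scheme on a model with billions of parameters would need billions of forward passes for a single gradient, and the result would still only be an approximation.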
The steps in modern terms:
- Forward pass: feed the input through the network and store the intermediate activations the backward pass will need.
- Loss computation: compare the output to the target and compute a scalar loss.
- Backward pass: starting from the loss, apply the chain rule layer by layer to compute the gradient for every parameter.
- Optimiser step: hand the gradients to an optimiser (SGD, Adam, etc.), which updates the weights.
- Repeat: once per mini-batch, for anywhere from thousands to millions of steps.
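The steps above can be sketched end to end on a deliberately tiny network with the gradients written out by hand. Everything here is an assumption made for illustration: the two-layer scalar model, the toy data drawn from y = 2x, the learning rate, and the use of plain SGD with a batch of one example.

```python
import math

# Hypothetical toy data: a few (input, target) pairs from y = 2x.
data = [(i / 4.0, 2.0 * i / 4.0) for i in range(-4, 5)]

# Tiny network: h = tanh(w1 * x + b1), y = w2 * h + b2
params = {"w1": 0.5, "b1": 0.0, "w2": 0.5, "b2": 0.0}
learning_rate = 0.05  # assumed value, chosen only for this toy problem


def train_step(p, x, target):
    # 1. Forward pass: keep the intermediates the backward pass will need.
    z = p["w1"] * x + p["b1"]
    h = math.tanh(z)
    y = p["w2"] * h + p["b2"]

    # 2. Loss computation: scalar squared error.
    loss = (y - target) ** 2

    # 3. Backward pass: chain rule from the loss back to each parameter.
    d_y = 2.0 * (y - target)
    grads = {"w2": d_y * h, "b2": d_y}
    d_h = d_y * p["w2"]
    d_z = d_h * (1.0 - h * h)  # derivative of tanh is 1 - tanh^2
    grads["w1"] = d_z * x
    grads["b1"] = d_z

    # 4. Optimiser step: plain SGD.
    for name in p:
        p[name] -= learning_rate * grads[name]
    return loss


# 5. Repeat: one step per mini-batch (here, a batch of one example).
epoch_losses = []
for epoch in range(200):
    total = sum(train_step(params, x, t) for x, t in data)
    epoch_losses.append(total / len(data))

print(epoch_losses[0], epoch_losses[-1])  # the mean loss should fall
```

Note that the backward pass reuses `h` from the forward pass. That reuse is the reason activations have to be stored, and it is where the memory cost of training comes from.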
In 2026 you almost never implement backpropagation by hand. PyTorch, JAX and TensorFlow handle it through automatic differentiation: you write the model as ordinary Python code, the framework records or traces the operations as a computational graph, and a single call (loss.backward() in PyTorch, jax.grad in JAX, tf.GradientTape in TensorFlow) applies the chain rule for you. The conceptual understanding still matters: you cannot debug a NaN loss or a vanishing gradient if you do not know what the framework is doing on your behalf.
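To show roughly what a framework does on your behalf, here is a minimal sketch of reverse-mode automatic differentiation on scalars. The `Value` class and its methods are hypothetical names invented for this example, and it is not how PyTorch, JAX or TensorFlow are implemented internally. It only illustrates the idea: record the graph during the forward computation, then walk it backwards applying the chain rule.

```python
import math


class Value:
    """A scalar that remembers how it was computed (hypothetical mini-autodiff)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(v):
            if v not in seen:
                seen.add(v)
                for parent in v._parents:
                    visit(parent)
                order.append(v)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for v in reversed(order):
            v._backward()


# y = w2 * tanh(w1 * x + b1), differentiated with respect to every input.
x, w1, b1, w2 = Value(0.5), Value(0.8), Value(0.1), Value(-1.2)
h = (w1 * x + b1).tanh()
y = w2 * h
y.backward()
print(w1.grad, b1.grad, w2.grad)
```

One call to `y.backward()` fills in the gradient for every input at once, which is the property that separates backpropagation from the one-parameter-at-a-time approach.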
Three practical issues every deep learning engineer eventually meets:
- Vanishing gradients: gradients shrink toward zero as they are multiplied back through many layers, so early layers stop learning. Mitigated by ReLU activations, residual connections (ResNet) and careful initialisation.
- Exploding gradients: the same repeated multiplication can make gradients blow up instead, often during early training of large models. Mitigated by gradient clipping and learning-rate warmup schedules.
- Memory cost: storing activations for the backward pass uses memory that grows linearly with network depth (and with batch size). Gradient checkpointing trades compute for memory by keeping only some activations and recomputing the rest during the backward pass.
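The first two issues are easy to see numerically. The sketch below is a simplified illustration under stated assumptions: a chain of scalar layers with weight 1, so the gradient is just the product of the activation derivatives, plus a hypothetical `clip_by_global_norm` helper that rescales a gradient vector when its norm exceeds a threshold. Real frameworks ship their own clipping utilities, and this helper is not a copy of any of them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(depth, activation):
    """d(output)/d(input) for `depth` stacked scalar layers with weight 1.

    Hypothetical setup: each layer computes a = activation(a_prev), so the
    gradient is the product of the local activation derivatives.
    """
    a, grad = 0.5, 1.0
    for _ in range(depth):
        if activation == "sigmoid":
            a = sigmoid(a)
            grad *= a * (1.0 - a)  # sigmoid derivative, never above 0.25
        else:  # "relu"
            grad *= 1.0 if a > 0 else 0.0  # ReLU derivative is 0 or 1
            a = max(0.0, a)
    return grad


def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


sigmoid_grad = chain_gradient(20, "sigmoid")
relu_grad = chain_gradient(20, "relu")
print(sigmoid_grad)  # tiny: twenty factors, each at most 0.25
print(relu_grad)     # unchanged, because every factor is 1 on the active path

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # same direction, norm rescaled from 5 down to about 1
```

Clipping by the global norm keeps the direction of the update and only limits its size, which is why it is usually preferred over clamping each gradient component separately.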
For a US engineer building on top of frontier APIs, backpropagation lives entirely inside the lab that trained the model. You never call it. But understanding why it works helps you reason about why fine-tuning is possible, why distillation can produce smaller models that approach the quality of larger ones, and why the entire field looks the way it does.