How Neural Networks Learn
An 8-minute read
A neural network starts as pure noise — random numbers that produce garbage. Through millions of iterations of making predictions, measuring mistakes, and nudging weights in the right direction, it learns. Here's exactly how that process works.
A freshly initialized neural network is, in a precise sense, stupid. Its weights are set to small random numbers. Pass an image of a cat through it and it will confidently output “submarine.” Ask it to predict the next word in a sentence and it will produce random tokens with equal probability. It knows nothing.
Six hours and fifty million examples later, the same network can recognize cats with 99% accuracy. The weights haven’t been replaced — they’ve been nudged, millions of times, each nudge guided by a measure of how wrong the current weights are.
This process — the conversion of random noise into reliable intelligence through iterated error correction — is called training. Understanding it clarifies something fundamental about what machine learning actually is.
The setup: predictions and losses
Training requires three things: a model with adjustable weights, a dataset of examples, and a loss function.
The model produces predictions. The dataset provides the correct answers. The loss function measures the gap between them — it outputs a single number that quantifies “how wrong is this prediction?” A perfect prediction produces zero loss. A terrible prediction produces high loss.
Common loss functions include:
Cross-entropy loss for classification tasks (is this image a cat, dog, or bird?). It measures the difference between the model’s predicted probability distribution over classes and the true distribution (which is 1 for the right class and 0 for everything else). High when the model is confident and wrong. Low when the model assigns high probability to the correct class.
Mean squared error for regression tasks (what will this stock price be tomorrow?). It takes the average of the squared differences between predictions and true values. Squaring makes large errors disproportionately costly.
Cross-entropy loss, again, for language models, which are essentially classification tasks run repeatedly: “given everything before, what is the next token?” The model outputs a probability distribution over the entire vocabulary at each step, and the loss rewards putting high probability on the actual next token. (Temperature, which reshapes that distribution, is a sampling-time knob, not part of the training loss.)
The loss function turns a vague objective (“predict correctly”) into a concrete mathematical quantity that can be optimized.
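The two losses above fit in a few lines of plain Python. This is a toy illustration of the definitions, not a library implementation:

```python
import math

def mean_squared_error(predictions, targets):
    """Average of squared differences: large errors cost disproportionately more."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

# Confident and correct: low loss.
print(cross_entropy([0.05, 0.90, 0.05], true_class=1))   # ~0.105
# Confident and wrong: high loss.
print(cross_entropy([0.90, 0.05, 0.05], true_class=1))   # ~3.0

print(mean_squared_error([2.5, 0.0], [3.0, -0.5]))       # 0.25
```

Both functions collapse a prediction into a single number, which is exactly what gradient descent needs.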
Gradient descent: walking downhill
Imagine the loss function as a landscape. Every possible combination of weights corresponds to a point on this landscape. The height at each point is the loss value those weights produce. Your goal is to find the lowest point in the landscape — the weights that minimize loss.
You can’t see the whole landscape (it has billions of dimensions for modern models), but you can always tell which direction is downhill from where you’re standing. That direction is the gradient — a vector pointing in the direction of steepest increase in loss. If you go the opposite direction (the negative gradient), you go downhill.
Gradient descent is the algorithm of repeatedly stepping in the direction of the negative gradient:
- Make predictions with current weights
- Calculate the loss
- Calculate the gradient (which direction increases loss the most?)
- Update weights by a small step in the opposite direction
- Repeat
Each step makes the predictions slightly less wrong. Over millions of iterations, the weights settle into a configuration that produces reliably low loss on the training data.
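For a toy one-weight model with loss L(w) = (w − 3)², whose gradient is 2(w − 3), the whole loop is a few lines. A sketch with a hand-picked learning rate, not production code:

```python
def loss(w):
    return (w - 3.0) ** 2        # minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # derivative of the loss

w = 0.0                          # arbitrary starting weight
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)   # step opposite the gradient

print(w)   # ~3.0: the weight has settled near the minimum
```

Real training does the same thing with billions of weights, where the "gradient" is a vector with one entry per weight.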
The step size is controlled by the learning rate — one of the most important hyperparameters in training. Step too large and you overshoot good solutions, bouncing around the loss landscape without ever settling. Step too small and training takes forever, or gets stuck in a shallow dip (a local minimum) when a better solution would be reachable with a slightly larger step.
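Both failure modes are easy to reproduce on the same toy loss L(w) = (w − 3)². The learning rates here are chosen by hand purely to illustrate:

```python
def run_gd(learning_rate, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient of (w - 3)^2
    return w

print(run_gd(0.1))     # lands very close to 3.0
print(run_gd(1.1))     # too large: every step overshoots, w diverges wildly
print(run_gd(0.001))   # too small: after 50 steps, still far from 3.0
```

On this quadratic, any learning rate above 1.0 makes each step overshoot the minimum by more than the previous error, so the weight oscillates outward instead of settling.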
In practice, training uses stochastic gradient descent (SGD): instead of computing the gradient on the full dataset (expensive), you compute it on a small random batch of examples. This is noisier but much faster, and the noise actually helps — it prevents the optimizer from getting stuck in narrow local minima.
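A minimal SGD sketch: fitting a one-parameter linear model y ≈ w·x to synthetic data whose true slope is 2. The dataset, batch size, and learning rate are all invented for the example:

```python
import random

random.seed(0)
# Synthetic dataset: y = 2x plus a little noise, so the "right" weight is w ~ 2.
data = []
for _ in range(1000):
    x = random.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + random.gauss(0.0, 0.05)))

w = 0.0
learning_rate = 0.1
batch_size = 32
for step in range(2000):
    batch = random.sample(data, batch_size)        # small random batch, not the full dataset
    # Gradient of the batch mean squared error (1/B) * sum (w*x - y)^2 w.r.t. w.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= learning_rate * grad                      # noisy but cheap update

print(w)   # close to 2.0
```

Each batch gives a slightly different gradient, so the trajectory jitters — but it jitters its way toward the same answer a full-dataset gradient would find, at a fraction of the cost per step.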
Backpropagation: assigning credit
The gradient tells us how to update weights. But how do you calculate the gradient for a network with billions of weights spread across dozens of layers?
This is what backpropagation solves. It’s the algorithm that efficiently computes gradients for every weight in the network simultaneously.
The key insight is the chain rule from calculus. If you have a chain of functions — input → layer 1 → layer 2 → … → output → loss — the gradient of the loss with respect to a weight in layer 1 equals the product of the local derivatives at every link in the chain between that weight and the loss.
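The chain rule in miniature, for a two-link chain, with the analytic product checked against a finite-difference estimate. A hand-worked scalar example:

```python
x, w = 3.0, 0.5

# Forward: a chain of simple functions.
a = w * x          # "layer 1":  a = w*x
b = a ** 2         # "layer 2":  b = a^2  (this is the loss here)

# Backward: multiply the local derivatives along the chain.
db_da = 2 * a            # d(b)/d(a)
da_dw = x                # d(a)/d(w)
grad_w = db_da * da_dw   # chain rule: d(b)/d(w)

print(grad_w)   # 9.0

# Sanity check with a finite difference.
eps = 1e-6
numeric = (((w + eps) * x) ** 2 - (w * x) ** 2) / eps
print(numeric)  # ~9.0
```

A real network is the same idea with millions of links, which is why storing the intermediate values (like `a` above) during the forward pass matters.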
Backpropagation computes this efficiently by:
- Running the forward pass: computing predictions and the loss, storing the intermediate values at each layer
- Computing the gradient of the loss with respect to the output
- Working backwards through each layer, using the stored intermediate values and the chain rule to compute the gradient at that layer
- Propagating those gradients back to the previous layer
The name is literal: gradients flow backward through the network. By the end, every weight has a gradient — a number indicating how much that weight contributed to the current error, and in which direction adjusting it would reduce that error.
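Those four steps, for a tiny two-layer network with one neuron per layer, look like this in plain Python. A teaching sketch, not how frameworks actually implement it:

```python
import math

def forward(x, w1, w2):
    """Forward pass: store the intermediates the backward pass will need."""
    h = math.tanh(w1 * x)   # hidden activation
    y = w2 * h              # output
    return h, y

def backward(x, target, w1, w2, h, y):
    """Backward pass: apply the chain rule layer by layer, output to input."""
    dL_dy = 2.0 * (y - target)            # gradient of (y - target)^2 at the output
    dL_dw2 = dL_dy * h                    # layer-2 weight gradient
    dL_dh = dL_dy * w2                    # propagate the signal back to the hidden layer
    dL_dw1 = dL_dh * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return dL_dw1, dL_dw2

x, target = 1.0, 0.5
w1, w2 = 0.3, -0.7
h, y = forward(x, w1, w2)
g1, g2 = backward(x, target, w1, w2, h, y)

# Check the layer-1 gradient against a finite difference.
eps = 1e-6
_, y_plus = forward(x, w1 + eps, w2)
numeric_g1 = ((y_plus - target) ** 2 - (y - target) ** 2) / eps
print(g1, numeric_g1)   # the two agree to several decimal places
```

Note the backward pass reuses `h` from the forward pass rather than recomputing it — that reuse of stored intermediates is what makes backpropagation efficient.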
Backpropagation made training deep networks practical. Before efficient implementations of backpropagation (and before GPUs made the parallel computation tractable), training networks with more than a couple of layers was prohibitively expensive. Modern deep learning frameworks (PyTorch, TensorFlow) compute backpropagation automatically — you define the forward computation, and the gradient flows backward for free.
Batches, epochs, and iterations
A few terms that come up constantly:
A batch is the set of training examples processed together before a weight update. Typical batch sizes are 32, 128, or 256 examples. Larger batches give more stable gradient estimates but require more memory and can make training less exploratory.
An epoch is one complete pass through the entire training dataset. Most models are trained for many epochs — the same data is seen multiple times. The model doesn’t just memorize examples on the first pass; it updates weights incrementally, and the patterns only emerge through repetition.
An iteration (or step) is one batch → forward pass → backward pass → weight update cycle. A training run of 10 epochs on a dataset of 100,000 examples with a batch size of 100 involves 1,000 iterations per epoch, or 10,000 total iterations.
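The arithmetic, assuming the batch size divides the dataset evenly:

```python
dataset_size = 100_000
batch_size = 100
epochs = 10

iterations_per_epoch = dataset_size // batch_size   # 1,000 weight updates per pass
total_iterations = iterations_per_epoch * epochs    # 10,000 updates in the full run
print(iterations_per_epoch, total_iterations)
```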
Modern large model training involves trillions of tokens processed over hundreds of thousands of iterations.
What gets learned: representations
When a neural network trains on images, the early layers don’t learn to recognize cats directly. They learn to recognize edges, then textures, then parts (ears, eyes), then objects. Each layer builds more abstract representations on top of the previous layer’s features.
This hierarchical feature learning is one of the most important properties of deep networks — and it emerges automatically from gradient descent. The network discovers useful intermediate representations because those representations reduce loss.
In language models, early layers learn syntactic properties (part of speech, sentence structure). Middle layers learn semantic relationships (word meaning, analogies). Later layers learn task-specific patterns. The layers aren’t programmed with these roles — they emerge from training because they’re useful for predicting the next token.
This is sometimes called representation learning: the network learns not just to predict, but to build internal representations of its domain that are useful for prediction.
Generalization: learning patterns, not examples
The goal of training isn’t to minimize loss on the training data — it’s to minimize loss on new, unseen data. A model that memorized every training example would score zero loss on training but fail completely on new inputs.
Generalization is the ability to apply learned patterns to new inputs. It happens when the model learns genuine regularities in the data rather than superficial correlations.
The main failure mode is overfitting: the model memorizes training examples rather than learning general patterns. An overfit model can achieve near-zero training loss while performing no better than chance on new data.
Several techniques combat overfitting:
Dropout randomly sets some neuron activations to zero during training. This prevents neurons from co-adapting — from learning patterns that only work because other specific neurons are active. It forces redundancy and more distributed representations.
Weight decay (L2 regularization) adds a penalty proportional to the magnitude of weights to the loss function. This encourages smaller weights, which correspond to simpler models that are less likely to overfit.
Early stopping monitors performance on a held-out validation set during training. When validation loss stops improving (even if training loss continues to decrease), training is stopped. The model at the point of best validation performance is saved.
Data augmentation artificially expands the training dataset by applying transformations (rotations, crops, color shifts for images; paraphrasing for text) that preserve labels but create superficially different inputs. This forces the model to learn invariances — that a cat is still a cat when rotated 30 degrees.
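Of these, dropout is the easiest to see in code: it is just a random mask applied at training time. The version below uses "inverted" scaling — surviving activations are divided by the keep probability so the expected activation matches evaluation time — which is a common formulation; frameworks handle this internally:

```python
import random

def dropout(activations, drop_prob, training=True):
    """Randomly zero activations during training; scale the survivors so the
    expected value is unchanged (inverted dropout). No-op at evaluation time."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if random.random() < keep_prob else 0.0
            for a in activations]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5))   # some zeroed, rest doubled
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, training=False))  # untouched
```

Because any neuron can vanish on any given step, no neuron can rely on a specific partner being present — which is exactly the co-adaptation the technique is designed to break.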
The loss landscape is wild
The theoretical picture of gradient descent in a smooth bowl-shaped landscape is misleading. The actual loss landscapes of large neural networks are high-dimensional, non-convex, and full of saddle points, plateaus, and local minima.
Several surprising findings from empirical research:
Local minima are usually fine. In high-dimensional spaces, most local minima have similar loss values to the global minimum. The feared catastrophe of getting trapped in a terrible local minimum rarely occurs in practice for large networks.
Saddle points are the real obstacle. A saddle point is a spot where the gradient is zero but you’re not at a minimum — you’re at a point that’s a minimum in some dimensions but a maximum in others. Early optimizers could get trapped here. Modern optimizers (especially those with momentum) escape saddle points naturally.
Sharp vs. flat minima matter for generalization. Models that land in “flat” minima — regions where the loss is low over a wide range of weights — generalize better than models that land in narrow minima. Recent research suggests this is partly why certain training techniques and batch sizes improve real-world performance.
Scale changes the landscape. Large models appear to have qualitatively different loss landscapes than small models — smoother, with fewer problematic local minima and stronger generalization properties. This is part of why scaling laws work.
Key terms
Loss function A mathematical function that measures the difference between a model’s predictions and the correct answers. Returns a single number (the loss). Training tries to minimize this number.
Gradient A vector that points in the direction of steepest increase in the loss function. Taking a step opposite to the gradient reduces the loss.
Gradient descent The training algorithm: iteratively update weights by taking small steps in the direction that reduces loss.
Backpropagation The algorithm that efficiently computes gradients for every weight in a neural network by propagating error signals backward through the layers.
Learning rate The step size used during weight updates. Controls how aggressively weights are adjusted each iteration.
Epoch One complete pass through the entire training dataset.
Overfitting When a model learns training examples too specifically and fails to generalize to new data.
Regularization Techniques (dropout, weight decay, early stopping) that reduce overfitting by discouraging models from becoming too complex or specific.
Common misconceptions
“Backpropagation is how the brain learns.” Biological neurons don’t use backpropagation. Real brains learn through mechanisms like Hebbian learning (“neurons that fire together, wire together”), neuromodulation, and synaptic plasticity — none of which require a global error signal propagating backward through the network. Backpropagation is an efficient mathematical algorithm for gradient computation, not a model of neuroscience.
“More training always helps.” Past a certain point, more training on the same data leads to overfitting. Models improve until they start memorizing noise in the training data. Training schedules, early stopping, and regularization exist precisely because unconstrained training eventually hurts performance.
“The network understands why it got something wrong.” The gradient tells the network in which direction to adjust weights to reduce loss — nothing more. The network isn’t reasoning about its mistakes. It’s following a mathematical prescription for reducing a number. Understanding, if it happens, emerges from the aggregate effect of millions of such adjustments.
“Training finds the ‘correct’ weights.” The weights a trained model ends up with are not unique. Different random initializations, different batch orderings, and different hardware can all produce different final weights that achieve similar loss. The loss landscape has many approximately-equivalent solutions. There is no single right answer that training converges to — there’s a region of good solutions, and gradient descent finds one.