How Neural Networks Learn
An 8-minute read
A neural network starts as pure noise — random numbers that produce garbage. Through millions of iterations of making predictions, measuring mistakes, and nudging weights in the right direction, it learns. Here's exactly how that process works.
A freshly initialized neural network is, in a precise sense, stupid. Its weights are set to small random numbers. Pass an image of a cat through it and it will confidently output “submarine.” Ask it to predict the next word in a sentence and it will produce random tokens with equal probability. It knows nothing.
Six hours and fifty million examples later, the same network can recognize cats with 99% accuracy. The weights haven’t been replaced — they’ve been nudged, millions of times, each nudge guided by a measure of how wrong the current weights are.
This process — the conversion of random noise into reliable intelligence through iterated error correction — is called training. Understanding it clarifies something fundamental about what machine learning actually is.
The setup: predictions and losses
Training requires three things: a model with adjustable weights, a dataset of examples, and a loss function.
The model produces predictions. The dataset provides the correct answers. The loss function measures the gap between them — it outputs a single number that quantifies “how wrong is this prediction?” A perfect prediction produces zero loss. A terrible prediction produces high loss.
Common loss functions include:
Cross-entropy loss for classification tasks (is this image a cat, dog, or bird?). It measures the difference between the model’s predicted probability distribution over classes and the true distribution (which is 1 for the right class and 0 for everything else). High when the model is confident and wrong. Low when the model assigns high probability to the correct class.
Mean squared error for regression tasks (what will this stock price be tomorrow?). It takes the average of the squared differences between predictions and true values. Squaring makes large errors disproportionately costly.
Cross-entropy loss, again, for language models, which are essentially classification tasks run repeatedly: “given everything before, what is the next token?” The model outputs a probability distribution over the entire vocabulary at each step, and the loss rewards putting high probability on the actual next token. (Temperature, which reshapes that distribution, is a sampling-time knob, not part of the training loss.)
The loss function turns a vague objective (“predict correctly”) into a concrete mathematical quantity that can be optimized.
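The two losses above fit in a few lines of plain Python. This is a toy illustration of the definitions, not a library implementation:

```python
import math

def mean_squared_error(predictions, targets):
    """Average of squared differences: large errors cost disproportionately more."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

# Confident and correct: low loss.
print(cross_entropy([0.05, 0.90, 0.05], true_class=1))   # ~0.105
# Confident and wrong: high loss.
print(cross_entropy([0.90, 0.05, 0.05], true_class=1))   # ~3.0

print(mean_squared_error([2.5, 0.0], [3.0, -0.5]))       # 0.25
```

Both functions collapse a prediction into a single number, which is exactly what gradient descent needs.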
Gradient descent: walking downhill
Imagine the loss function as a landscape. Every possible combination of weights corresponds to a point on this landscape. The height at each point is the loss value those weights produce. Your goal is to find the lowest point in the landscape — the weights that minimize loss.
You can’t see the whole landscape (it has billions of dimensions for modern models), but you can always tell which direction is downhill from where you’re standing. That direction is the gradient — a vector pointing in the direction of steepest increase in loss. If you go the opposite direction (the negative gradient), you go downhill.
Gradient descent is the algorithm of repeatedly stepping in the direction of the negative gradient:
- Make predictions with current weights
- Calculate the loss
- Calculate the gradient (which direction increases loss the most?)
- Update weights by a small step in the opposite direction
- Repeat
Each step makes the predictions slightly less wrong. Over millions of iterations, the weights settle into a configuration that produces reliably low loss on the training data.
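For a toy one-weight model with loss L(w) = (w − 3)², whose gradient is 2(w − 3), the whole loop is a few lines. A sketch with a hand-picked learning rate, not production code:

```python
def loss(w):
    return (w - 3.0) ** 2        # minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # derivative of the loss

w = 0.0                          # arbitrary starting weight
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)   # step opposite the gradient

print(w)   # ~3.0: the weight has settled near the minimum
```

Real training does the same thing with billions of weights, where the "gradient" is a vector with one entry per weight.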
The step size is controlled by the learning rate — one of the most important hyperparameters in training. Step too large and you overshoot good solutions, bouncing around the loss landscape without ever settling. Step too small and training takes forever, or gets stuck in a shallow dip (a local minimum) when a better solution would be reachable with a slightly larger step.
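Both failure modes are easy to reproduce on the same toy loss L(w) = (w − 3)². The learning rates here are chosen by hand purely to illustrate:

```python
def run_gd(learning_rate, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient of (w - 3)^2
    return w

print(run_gd(0.1))     # lands very close to 3.0
print(run_gd(1.1))     # too large: every step overshoots, w diverges wildly
print(run_gd(0.001))   # too small: after 50 steps, still far from 3.0
```

On this quadratic, any learning rate above 1.0 makes each step overshoot the minimum by more than the previous error, so the weight oscillates outward instead of settling.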
In practice, training uses stochastic gradient descent (SGD): instead of computing the gradient on the full dataset (expensive), you compute it on a small random batch of examples. This is noisier but much faster, and the noise actually helps — it prevents the optimizer from getting stuck in narrow local minima.
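A minimal SGD sketch: fitting a one-parameter linear model y ≈ w·x to synthetic data whose true slope is 2. The dataset, batch size, and learning rate are all invented for the example:

```python
import random

random.seed(0)
# Synthetic dataset: y = 2x plus a little noise, so the "right" weight is w ~ 2.
data = []
for _ in range(1000):
    x = random.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + random.gauss(0.0, 0.05)))

w = 0.0
learning_rate = 0.1
batch_size = 32
for step in range(2000):
    batch = random.sample(data, batch_size)        # small random batch, not the full dataset
    # Gradient of the batch mean squared error (1/B) * sum (w*x - y)^2 w.r.t. w.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= learning_rate * grad                      # noisy but cheap update

print(w)   # close to 2.0
```

Each batch gives a slightly different gradient, so the trajectory jitters — but it jitters its way toward the same answer a full-dataset gradient would find, at a fraction of the cost per step.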
Backpropagation: assigning credit
The gradient tells us how to update weights. But how do you calculate the gradient for a network with billions of weights spread across dozens of layers?
This is what backpropagation solves. It’s the algorithm that efficiently computes gradients for every weight in the network simultaneously.
The key insight is the chain rule from calculus. If you have a chain of functions — input → layer 1 → layer 2 → … → output → loss — the gradient of the loss with respect to a weight in layer 1 equals the product of the local derivatives at every link in the chain between that weight and the loss.
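The chain rule in miniature, for a two-link chain, with the analytic product checked against a finite-difference estimate. A hand-worked scalar example:

```python
x, w = 3.0, 0.5

# Forward: a chain of simple functions.
a = w * x          # "layer 1":  a = w*x
b = a ** 2         # "layer 2":  b = a^2  (this is the loss here)

# Backward: multiply the local derivatives along the chain.
db_da = 2 * a            # d(b)/d(a)
da_dw = x                # d(a)/d(w)
grad_w = db_da * da_dw   # chain rule: d(b)/d(w)

print(grad_w)   # 9.0

# Sanity check with a finite difference.
eps = 1e-6
numeric = (((w + eps) * x) ** 2 - (w * x) ** 2) / eps
print(numeric)  # ~9.0
```

A real network is the same idea with millions of links, which is why storing the intermediate values (like `a` above) during the forward pass matters.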
Backpropagation computes this efficiently by:
- Running the forward pass: computing predictions and the loss, storing the intermediate values at each layer
- Computing the gradient of the loss with respect to the output
- Working backwards through each layer, using the stored intermediate values and the chain rule to compute the gradient at that layer
- Propagating those gradients back to the previous layer
The name is literal: gradients flow backward through the network. By the end, every weight has a gradient — a number indicating how much that weight contributed to the current error, and in which direction adjusting it would reduce that error.
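Those four steps, for a tiny two-layer network with one neuron per layer, look like this in plain Python. A teaching sketch, not how frameworks actually implement it:

```python
import math

def forward(x, w1, w2):
    """Forward pass: store the intermediates the backward pass will need."""
    h = math.tanh(w1 * x)   # hidden activation
    y = w2 * h              # output
    return h, y

def backward(x, target, w1, w2, h, y):
    """Backward pass: apply the chain rule layer by layer, output to input."""
    dL_dy = 2.0 * (y - target)            # gradient of (y - target)^2 at the output
    dL_dw2 = dL_dy * h                    # layer-2 weight gradient
    dL_dh = dL_dy * w2                    # propagate the signal back to the hidden layer
    dL_dw1 = dL_dh * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return dL_dw1, dL_dw2

x, target = 1.0, 0.5
w1, w2 = 0.3, -0.7
h, y = forward(x, w1, w2)
g1, g2 = backward(x, target, w1, w2, h, y)

# Check the layer-1 gradient against a finite difference.
eps = 1e-6
_, y_plus = forward(x, w1 + eps, w2)
numeric_g1 = ((y_plus - target) ** 2 - (y - target) ** 2) / eps
print(g1, numeric_g1)   # the two agree to several decimal places
```

Note the backward pass reuses `h` from the forward pass rather than recomputing it — that reuse of stored intermediates is what makes backpropagation efficient.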
Backpropagation made training deep networks practical. Before efficient implementations of backpropagation (and before GPUs made the parallel computation tractable), training networks with more than a couple of layers was prohibitively expensive. Modern deep learning frameworks (PyTorch, TensorFlow) compute backpropagation automatically — you define the forward computation, and the gradient flows backward for free.
Batches, epochs, and iterations
A few terms that come up constantly:
A batch is the set of training examples processed together before a weight update. Typical batch sizes are 32, 128, or 256 examples. Larger batches give more stable gradient estimates but require more memory and can make training less exploratory.
An epoch is one complete pass through the entire training dataset. Most models are trained for many epochs — the same data is seen multiple times. The model doesn’t just memorize examples on the first pass; it updates weights incrementally, and the patterns only emerge through repetition.
An iteration (or step) is one batch → forward pass → backward pass → weight update cycle. A training run of 10 epochs on a dataset of 100,000 examples with a batch size of 100 involves 1,000 iterations per epoch, or 10,000 total iterations.
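The arithmetic, assuming the batch size divides the dataset evenly:

```python
dataset_size = 100_000
batch_size = 100
epochs = 10

iterations_per_epoch = dataset_size // batch_size   # 1,000 weight updates per pass
total_iterations = iterations_per_epoch * epochs    # 10,000 updates in the full run
print(iterations_per_epoch, total_iterations)
```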
Modern large model training involves trillions of tokens processed over hundreds of thousands of iterations.
What gets learned: representations
When a neural network trains on images, the early layers don’t learn to recognize cats directly. They learn to recognize edges, then textures, then parts (ears, eyes), then objects. Each layer builds more abstract representations on top of the previous layer’s features.
This hierarchical feature learning is one of the most important properties of deep networks — and it emerges automatically from gradient descent. The network discovers useful intermediate representations because those representations reduce loss.
In language models, early layers learn syntactic properties (part of speech, sentence structure). Middle layers learn semantic relationships (word meaning, analogies). Later layers learn task-specific patterns. The layers aren’t programmed with these roles — they emerge from training because they’re useful for predicting the next token.
This is sometimes called representation learning: the network learns not just to predict, but to build internal representations of its domain that are useful for prediction.
Generalization: learning patterns, not examples
The goal of training isn’t to minimize loss on the training data — it’s to minimize loss on new, unseen data. A model that memorized every training example would score zero loss on training but fail completely on new inputs.
Generalization is the ability to apply learned patterns to new inputs. It happens when the model learns genuine regularities in the data rather than superficial correlations.
The main failure mode is overfitting: the model memorizes training examples rather than learning general patterns. An overfit model can achieve near-zero training loss while performing no better than chance on new data.
Several techniques combat overfitting:
Dropout randomly sets some neuron activations to zero during training. This prevents neurons from co-adapting — from learning patterns that only work because other specific neurons are active. It forces redundancy and more distributed representations.
Weight decay (L2 regularization) adds a penalty proportional to the magnitude of weights to the loss function. This encourages smaller weights, which correspond to simpler models that are less likely to overfit.
Early stopping monitors performance on a held-out validation set during training. When validation loss stops improving (even if training loss continues to decrease), training is stopped. The model at the point of best validation performance is saved.
Data augmentation artificially expands the training dataset by applying transformations (rotations, crops, color shifts for images; paraphrasing for text) that preserve labels but create superficially different inputs. This forces the model to learn invariances — that a cat is still a cat when rotated 30 degrees.
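Of these, dropout is the easiest to see in code: it is just a random mask applied at training time. The version below uses "inverted" scaling — surviving activations are divided by the keep probability so the expected activation matches evaluation time — which is a common formulation; frameworks handle this internally:

```python
import random

def dropout(activations, drop_prob, training=True):
    """Randomly zero activations during training; scale the survivors so the
    expected value is unchanged (inverted dropout). No-op at evaluation time."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if random.random() < keep_prob else 0.0
            for a in activations]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5))   # some zeroed, rest doubled
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, training=False))  # untouched
```

Because any neuron can vanish on any given step, no neuron can rely on a specific partner being present — which is exactly the co-adaptation the technique is designed to break.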
The loss landscape is wild
The theoretical picture of gradient descent in a smooth bowl-shaped landscape is misleading. The actual loss landscapes of large neural networks are high-dimensional, non-convex, and full of saddle points, plateaus, and local minima.
Several surprising findings from empirical research:
Local minima are usually fine. In high-dimensional spaces, most local minima have similar loss values to the global minimum. The feared catastrophe of getting trapped in a terrible local minimum rarely occurs in practice for large networks.
Saddle points are the real obstacle. A saddle point is a spot where the gradient is zero but you’re not at a minimum — you’re at a point that’s a minimum in some dimensions but a maximum in others. Early optimizers could get trapped here. Modern optimizers (especially those with momentum) escape saddle points naturally.
Sharp vs. flat minima matter for generalization. Models that land in “flat” minima — regions where the loss is low over a wide range of weights — generalize better than models that land in narrow minima. Recent research suggests this is partly why certain training techniques and batch sizes improve real-world performance.
Scale changes the landscape. Large models appear to have qualitatively different loss landscapes than small models — smoother, with fewer problematic local minima and stronger generalization properties. This is part of why scaling laws work.
Key terms
Loss function A mathematical function that measures the difference between a model’s predictions and the correct answers. Returns a single number (the loss). Training tries to minimize this number.
Gradient A vector that points in the direction of steepest increase in the loss function. Taking a step opposite to the gradient reduces the loss.
Gradient descent The training algorithm: iteratively update weights by taking small steps in the direction that reduces loss.
Backpropagation The algorithm that efficiently computes gradients for every weight in a neural network by propagating error signals backward through the layers.
Learning rate The step size used during weight updates. Controls how aggressively weights are adjusted each iteration.
Epoch One complete pass through the entire training dataset.
Overfitting When a model learns training examples too specifically and fails to generalize to new data.
Regularization Techniques (dropout, weight decay, early stopping) that reduce overfitting by discouraging models from becoming too complex or specific.
Common misconceptions
“Backpropagation is how the brain learns.” Biological neurons don’t use backpropagation. Real brains learn through mechanisms like Hebbian learning (“neurons that fire together, wire together”), neuromodulation, and synaptic plasticity — none of which require a global error signal propagating backward through the network. Backpropagation is an efficient mathematical algorithm for gradient computation, not a model of neuroscience.
“More training always helps.” Past a certain point, more training on the same data leads to overfitting. Models improve until they start memorizing noise in the training data. Training schedules, early stopping, and regularization exist precisely because unconstrained training eventually hurts performance.
“The network understands why it got something wrong.” The gradient tells the network in which direction to adjust weights to reduce loss — nothing more. The network isn’t reasoning about its mistakes. It’s following a mathematical prescription for reducing a number. Understanding, if it happens, emerges from the aggregate effect of millions of such adjustments.
“Training finds the ‘correct’ weights.” The weights a trained model ends up with are not unique. Different random initializations, different batch orderings, and different hardware can all produce different final weights that achieve similar loss. The loss landscape has many approximately-equivalent solutions. There is no single right answer that training converges to — there’s a region of good solutions, and gradient descent finds one.