Backpropagation Isn't Magic: It's the Chain Rule Run Backwards
A network with random weights makes garbage predictions on the first pass. The entire question that makes deep learning work is: given how wrong an output was, exactly how much should each individual weight, buried three layers back, be blamed and adjusted? Backpropagation is the algorithm that answers this, and underneath the intimidating name it's just the chain rule applied methodically, layer by layer, backward.
Training happens in two passes. The forward pass is straightforward: input data flows through the network, each neuron computes a weighted sum and applies an activation function, the output of one layer becomes the input of the next, and the final layer (often softmax) produces a prediction.
The backward pass is where the actual learning happens. First compute the error at the output. Then compute gradients using the chain rule, and propagate them backward from output toward input. For a weight in the final layer , the gradient of the cost with respect to that weight decomposes into three multiplied pieces:
where is the pre-activation, is the activated output, and is the squared-error cost. Each piece has a clean form: , , and . Multiply the three together and you have exactly how much nudging that one weight would change the cost.
In practice this collapses into an error term at each neuron. At the output node: . At a hidden node, the error depends on the errors of every node it feeds into: . That's the whole mechanism of "backward": a hidden neuron's blame is a weighted sum of the blame of everything downstream of it. Once every neuron has its error term, the update rule is identical everywhere: . Run a real network through this and you can watch a weight like 0.5 shift to 0.4848 after a single pass, a fraction of a percent, one example, one step closer.