Backpropagation Isn't Magic: It's the Chain Rule Run Backwards

backpropagation chain-rule gradient neural-networks math

A network with random weights makes garbage predictions on the first pass. The central question of training is: given how wrong an output was, exactly how much should each individual weight, buried three layers back, be blamed and adjusted? Backpropagation is the algorithm that answers this, and underneath the intimidating name it's just the chain rule applied methodically, layer by layer, backward.

Training happens in two passes. The forward pass is straightforward: input data flows through the network, each neuron computes a weighted sum and applies an activation function, the output of one layer becomes the input of the next, and the final layer (often softmax) produces a prediction.
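A minimal sketch of that forward pass, assuming one sigmoid hidden layer followed by a softmax output. The layer sizes, the weight values, and the helper names (`dense`, `forward`) are made up for illustration and are not from any particular library:

```python
import math

def sigmoid(z):
    """Logistic activation, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sum per neuron; weights[j][i] connects input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    # Layer 1: weighted sum, then activation.
    hidden = [sigmoid(z) for z in dense(x, hidden_w, hidden_b)]
    # Layer 2: the hidden outputs become the inputs; softmax gives the prediction.
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 2 output classes.
probs = forward(
    [1.0, 0.5],
    hidden_w=[[0.1, -0.2], [0.4, 0.3]], hidden_b=[0.0, 0.1],
    out_w=[[0.5, -0.5], [-0.3, 0.8]], out_b=[0.0, 0.0],
)
print(probs)
```

With arbitrary weights like these the two probabilities carry no meaning yet, which is the "garbage prediction" the backward pass is meant to correct.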

The backward pass is where the actual learning happens. First compute the error at the output. Then compute gradients using the chain rule, and propagate them backward from output toward input. For a weight $w^{(L)}$ in the final layer $L$, the gradient of the cost with respect to that weight decomposes into three multiplied pieces:

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial C_0}{\partial a^{(L)}}$$

where $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$ is the pre-activation, $a^{(L)} = \sigma(z^{(L)})$ is the activated output, and $C_0 = (a^{(L)} - y)^2$ is the squared-error cost. Each piece has a clean form: $\partial C_0 / \partial a^{(L)} = 2(a^{(L)} - y)$, $\partial a^{(L)} / \partial z^{(L)} = \sigma'(z^{(L)})$, and $\partial z^{(L)} / \partial w^{(L)} = a^{(L-1)}$. Multiply the three together and you have exactly how much nudging that one weight would change the cost.
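One way to convince yourself the three-piece product is right is to compare it against a brute-force numerical derivative. The sketch below assumes a single neuron with a logistic sigmoid for $\sigma$, for which $\sigma'(z) = a(1 - a)$. The input values and the function names are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    """C_0 = (a - y)^2 with a = sigma(w * a_prev + b)."""
    a = sigmoid(w * a_prev + b)
    return (a - y) ** 2

def grad_w(w, b, a_prev, y):
    """dC_0/dw assembled from the three chain-rule pieces."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2.0 * (a - y)   # derivative of the squared error
    da_dz = a * (1.0 - a)   # sigma'(z), valid for the logistic sigmoid
    dz_dw = a_prev          # z is linear in w
    return dz_dw * da_dz * dC_da

# Hypothetical values chosen only for illustration.
w, b, a_prev, y = 0.5, 0.1, 0.8, 1.0

analytic = grad_w(w, b, a_prev, y)

# Central finite difference: nudge w slightly and measure the change in cost.
eps = 1e-6
numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)

print(analytic, numeric)
```

The two numbers should agree to many decimal places. The gradient is negative here because the output sits below the target, so increasing the weight would lower the cost.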

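In practice this collapses into an error term at each neuron. With sigmoid activations, $\sigma'(z)$ is simply $O(1 - O)$, where $O$ is the neuron's output. At the output node: $\text{Error}_i = O_i(1 - O_i)(y - O_i)$. At a hidden node, the error depends on the errors of every node it feeds into: $\text{Error}_i = O_i(1 - O_i) \sum_k \text{Error}_k \cdot w_{ik}$. That's the whole mechanism of "backward": a hidden neuron's blame is a weighted sum of the blame of everything downstream of it. Once every neuron has its error term, the update rule is identical everywhere: $w_{ij}(\text{new}) = w_{ij}(\text{old}) + \eta \cdot O_i \cdot \text{Error}_j$, where $O_i$ is the output of the neuron the weight comes from and $\text{Error}_j$ is the error of the neuron it feeds into. Because the error is written as $(y - O_i)$ rather than $(O_i - y)$, it already points downhill, which is why the update adds instead of subtracting (the constant factor of 2 from the squared error is absorbed into the learning rate $\eta$). Run a real network through this and you can watch a weight like 0.5 shift to 0.4848 after a single pass: a change of about 3 percent, from one example, one step closer.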
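The error terms and the update rule can be sketched as one training step on a tiny network. This assumes sigmoid activations throughout, no biases, and a weight update that uses the error of the neuron each weight feeds into. The network shape, the starting weights, and the learning rate are hypothetical, and this is not the network behind the 0.5 to 0.4848 figure:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w_hidden, w_out):
    """Forward pass: returns (hidden outputs, final output)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out

def train_step(x, y, w_hidden, w_out, eta):
    """One forward and backward pass; returns the updated weights."""
    hidden, out = predict(x, w_hidden, w_out)

    # Output node: O(1 - O)(y - O).
    err_out = out * (1 - out) * (y - out)

    # Hidden nodes: O(1 - O) times the downstream error weighted by the
    # connecting weight (a single downstream node here, so the sum has one term).
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]

    # Update: w += eta * (output of source neuron) * (error of destination neuron).
    new_w_out = [w + eta * h * err_out for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + eta * xi * e for w, xi in zip(row, x)]
                    for row, e in zip(w_hidden, err_hidden)]
    return new_w_hidden, new_w_out

# Hypothetical setup: 2 inputs -> 2 hidden neurons -> 1 output.
x, y, eta = [1.0, 0.0], 1.0, 0.5
w_hidden = [[0.2, -0.3], [0.4, 0.1]]  # w_hidden[j][i]: input i -> hidden j
w_out = [0.5, -0.2]                   # w_out[j]: hidden j -> output

_, before = predict(x, w_hidden, w_out)
w_hidden, w_out = train_step(x, y, w_hidden, w_out, eta)
_, after = predict(x, w_hidden, w_out)

print(before, after)
```

Note that the hidden errors are computed from the old output weights, before any weight is changed. After the step, the prediction for this one example should sit slightly closer to the target than it did before.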