Padding and Pooling Aren't Details: They Decide What a Convolution Keeps and Throws Away

A convolution is conceptually simple: slide a small filter across the input, multiply element-wise at each position, and sum the result. A $3\times3$ patch multiplied element-wise with a $3\times3$ kernel and summed becomes a single output value; do that at every position and you get a full feature map. But three choices, easy to gloss over, determine what that output actually captures.
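The multiply-and-sum step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming stride 1 and no padding; the function name `conv2d_valid`, the input values, and the vertical-edge kernel are made up for the example (in a real network the kernel values would be learned). Like most deep learning code, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    n_rows, n_cols = len(image), len(image[0])
    f_rows, f_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(n_rows - f_rows + 1):
        row = []
        for j in range(n_cols - f_cols + 1):
            # Element-wise multiply the patch with the kernel, then sum.
            total = 0
            for a in range(f_rows):
                for b in range(f_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4x4 input and a hand-written 3x3 kernel, for illustration only.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-2, -2], [2, -2]]
```

A $4\times4$ input and a $3\times3$ kernel leave only four valid positions, so the feature map is $2\times2$.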

Filters (kernels) are small matrices whose values are learned during training, not hand-designed; each one ends up detecting whatever feature training finds most useful for reducing the loss. Stride is the step size the filter moves across the input: larger strides skip more positions, producing smaller feature maps and faster computation, at the cost of coarser coverage. Padding adds extra pixels, usually zeros, around the border of the input before convolving.
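To see what "skipping positions" means for stride, here is a small sketch that lists the top-left offsets a filter visits along one axis. The helper name `filter_offsets` and the sizes are assumptions chosen for illustration.

```python
def filter_offsets(n, f, s):
    """Top-left offsets a size-f filter visits along one axis of length n."""
    return list(range(0, n - f + 1, s))


# A length-7 axis with a size-3 filter, at three different strides.
offsets_s1 = filter_offsets(7, 3, 1)
offsets_s2 = filter_offsets(7, 3, 2)
offsets_s3 = filter_offsets(7, 3, 3)

print(offsets_s1)  # [0, 1, 2, 3, 4]
print(offsets_s2)  # [0, 2, 4]
print(offsets_s3)  # [0, 3]
```

The number of offsets is the output length along that axis, so moving from stride 1 to stride 3 shrinks it from 5 to 2.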

Padding exists because of an uncomfortable fact about convolution: without it, the filter covers corner pixels in far fewer positions than edge pixels, and edge pixels in fewer positions than interior pixels. Left alone, this loses information at the borders and skews how thoroughly different parts of the image get learned from. Padding with zeros around the border reduces this imbalance and can also keep the feature map from shrinking after every convolution. The size math is exact for stride 1: an $n \times n$ input padded by $p$ becomes $(n+2p) \times (n+2p)$, and after convolving with an $f \times f$ filter the output is $(n + 2p - f + 1) \times (n + 2p - f + 1)$. With $p = (f-1)/2$ and an odd $f$, the output is the same size as the input. Padding isn't free, though: the added pixels increase computational cost, and excessive padding can dilute feature learning near the edges instead of helping it.
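The border imbalance is easy to count directly. The sketch below assumes stride 1 and a square input and tallies, for each original pixel, how many filter positions include it, with and without padding. The name `coverage_counts` and the $4\times4$ / $3\times3$ sizes are illustrative choices.

```python
def coverage_counts(n, f, p=0):
    """Count how many stride-1 filter positions touch each original pixel.

    Returns the output side length (n + 2p - f + 1) and an n x n grid of counts.
    """
    out = n + 2 * p - f + 1
    counts = [[0] * n for _ in range(n)]
    for i in range(out):
        for j in range(out):
            for a in range(f):
                for b in range(f):
                    # Map padded coordinates back to the unpadded image.
                    r, c = i + a - p, j + b - p
                    if 0 <= r < n and 0 <= c < n:
                        counts[r][c] += 1
    return out, counts


out_no_pad, counts_no_pad = coverage_counts(4, 3, p=0)
out_pad, counts_pad = coverage_counts(4, 3, p=1)

print(out_no_pad, counts_no_pad[0], counts_no_pad[1])  # 2 [1, 2, 2, 1] [2, 4, 4, 2]
print(out_pad, counts_pad[0], counts_pad[1])           # 4 [4, 6, 6, 4] [6, 9, 9, 6]
```

Without padding, an interior pixel is used four times as often as a corner pixel and the output shrinks to $2\times2$. With $p=1$ the ratio drops to $9{:}4$ and the output stays $4\times4$. In this example padding narrows the gap but does not remove it.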

Pooling operates on the output of a convolution, and its job is different from the filter's: reduce the spatial dimensions of the feature map while keeping the information that matters. With an $n \times n$ feature map, window size $f$, and stride $s$, each output side has length $\lfloor (n - f)/s \rfloor + 1$, or $\lfloor (n + 2p - f)/s \rfloor + 1$ if padding $p$ is used. Max pooling takes the largest value in each window, keeping the strongest signal. Average pooling takes the mean, smoothing things out. Global pooling collapses an entire feature map down to a single value per channel.
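The three pooling variants differ only in how each window is reduced, which a short sketch makes concrete. It assumes a single-channel feature map with no padding; `pool2d`, `mean`, and the sample values are hypothetical names and data for illustration.

```python
def pool2d(feature_map, f, s, reduce_fn):
    """Apply `reduce_fn` to each f x f window, moving with stride s (no padding)."""
    n_h, n_w = len(feature_map), len(feature_map[0])
    out_h = (n_h - f) // s + 1  # floor((n - f) / s) + 1
    out_w = (n_w - f) // s + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * s + a][j * s + b]
                for a in range(f)
                for b in range(f)
            ]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(fm, 2, 2, max)    # strongest value per window
avg_pooled = pool2d(fm, 2, 2, mean)   # mean value per window
global_avg = mean([v for row in fm for v in row])  # whole map -> one number

print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
print(global_avg)  # 3.625
```

With $f = 2$ and $s = 2$ the $4\times4$ map becomes $2\times2$, matching $\lfloor (4 - 2)/2 \rfloor + 1 = 2$.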

The contrast with a fully connected layer is the whole point of the architecture. FC layers connect every neuron to every neuron in the previous layer: dense connections, no weight sharing, high parameter count, high overfitting risk. Convolutional layers connect each neuron only to a local patch and share the same weights across every spatial position, so they end up with far fewer parameters and better built-in regularization. Padding, stride, and pooling add no parameters of their own; they constrain which parts of the input each neuron is even allowed to look at.
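A back-of-the-envelope count shows the size of the gap. The shapes below (a $32\times32$ RGB input, 64 output units or channels, $3\times3$ filters) are assumptions picked for illustration, and the two layers produce differently shaped outputs, so treat this as a rough comparison, not a like-for-like benchmark.

```python
def fc_params(n_in, n_out):
    """One weight per input-output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


def conv_params(f, c_in, c_out):
    """One f x f x c_in filter per output channel, plus one bias each.

    The count does not depend on the image's height or width, because the
    same filter weights are reused at every spatial position.
    """
    return f * f * c_in * c_out + c_out


# Assumed example shapes: 32x32 RGB image, 64 units / output channels.
n_inputs = 32 * 32 * 3
fc_count = fc_params(n_inputs, 64)
conv_count = conv_params(3, 3, 64)

print(fc_count)    # 196672
print(conv_count)  # 1792
```

Doubling the image side would quadruple the fully connected count while leaving the convolutional count unchanged.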