Convolutional neural networks
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
You should understand:
- Training neural networks: the training loop, initialization, and practical considerations that apply to all networks including CNNs
Why convolutions?
Fully connected layers treat every input independently. For a 224x224 color image, that is 150,528 input values. A single hidden layer with 1000 neurons would need over 150 million weights. That is wasteful and overfits quickly.
Images have spatial structure. A cat’s ear looks the same whether it is in the top-left or bottom-right of the image. Convolutions exploit this with two ideas:
- Parameter sharing: the same small filter scans across the entire image. Instead of learning separate weights for every position, you learn one set of filter weights.
- Translation equivariance: if the input shifts, the output shifts by the same amount. The network detects features regardless of their position.
These properties make CNNs dramatically more parameter-efficient for spatial data than fully connected networks.
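The savings are easy to quantify with plain arithmetic. A quick sketch comparing the fully connected layer above with a (hypothetical but typical) first conv layer of 64 filters of size 3x3:

```python
# Fully connected: all 224*224*3 = 150,528 inputs connect to each of 1000 neurons.
fc_weights = 224 * 224 * 3 * 1000

# Convolutional: 64 filters of size 3x3 over 3 input channels, one bias per filter,
# shared across every spatial position.
conv_weights = (3 * 3 * 3 + 1) * 64

print(fc_weights)    # 150528000
print(conv_weights)  # 1792
```

Five orders of magnitude fewer parameters, because the filter weights are reused at every position.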
Building intuition
Where a fully connected network ignores spatial layout entirely, a CNN looks at small patches and slides across the image. Imagine looking through a small window that moves across a photograph. At each position, you see a tiny piece of the scene. The same window scans the entire photo, searching for one specific pattern.
Consider a 5x5 image patch and a 3x3 filter that detects a cross-like pattern:
| Image patch | Col 0 | Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|---|---|
| Row 0 | 1 | 0 | 2 | 1 | 0 |
| Row 1 | 0 | 1 | 0 | 1 | 1 |
| Row 2 | 1 | 0 | 1 | 0 | 0 |
| Row 3 | 0 | 0 | 1 | 1 | 0 |
| Row 4 | 1 | 1 | 0 | 0 | 1 |
The 3x3 filter:
| 1 | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |
At each position, multiply the overlapping values and sum. The result is a 3x3 feature map:
| 6 | 1 | 4 |
|---|---|---|
| 1 | 4 | 2 |
| 3 | 2 | 3 |
High values mean the patch matches the filter pattern. Position (0,0) scored 6 because the top-left 3x3 region aligns well with the cross shape. The full step-by-step calculation is in Example 1 below.
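You can verify the whole feature map with a few lines of NumPy. This is a sketch of a "valid" sliding-window operation (technically cross-correlation, which is what deep learning frameworks call convolution):

```python
import numpy as np

# The 5x5 image patch and 3x3 cross-shaped filter from the tables above.
image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 0, 1, 1],
                  [1, 0, 1, 0, 0],
                  [0, 0, 1, 1, 0],
                  [1, 1, 0, 0, 1]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

h, w = kernel.shape
# At each position: overlay the kernel, multiply element-wise, and sum.
out = np.array([[np.sum(image[i:i+h, j:j+w] * kernel)
                 for j in range(image.shape[1] - w + 1)]
                for i in range(image.shape[0] - h + 1)])
print(out)
# [[6 1 4]
#  [1 4 2]
#  [3 2 3]]
```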
How a CNN processes an image
graph LR A["Input image"] --> B["Conv + ReLU"] B --> C["Pooling"] C --> D["Conv + ReLU"] D --> E["Pooling"] E --> F["Flatten"] F --> G["Fully connected"] G --> H["Output class"]
Each convolution extracts features. Pooling shrinks spatial dimensions. After several rounds, the network flattens everything into a vector and classifies it. Early layers detect edges and textures. Deeper layers combine those into complex shapes like wheels, ears, or faces.
With that picture in mind, let’s formalize each step.
The convolution operation
Figure: feature map activations after the first convolutional layer, showing response magnitudes for 8 learned filters.
A convolution slides a small filter (also called a kernel) across the input, computing a dot product at each position.
Key terms:
- Filter size: the spatial dimensions of the kernel (e.g., 3x3, 5x5)
- Stride: how many pixels the filter moves at each step
- Padding: extra zeros added around the input border to control the output size
Output dimension formula
For an input of width $W$, filter size $F$, padding $P$, and stride $S$:

$$W_{out} = \frac{W - F + 2P}{S} + 1$$

This formula applies independently to height and width. It tells you exactly how large the output feature map will be.
With “same” padding ($P = (F-1)/2$ and $S = 1$), the output has the same spatial dimensions as the input. With “valid” padding ($P = 0$), the output shrinks.
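The formula translates directly into code. A minimal helper (integer division matches the usual floor behavior when the stride does not divide evenly):

```python
def conv_output_size(w, f, p, s):
    """Output spatial size for input width w, filter size f, padding p, stride s."""
    return (w - f + 2 * p) // s + 1

# "Same" padding: a 3x3 filter with padding 1, stride 1 preserves a 224-wide input.
print(conv_output_size(224, 3, 1, 1))  # 224

# "Valid" padding: no zeros added, the output shrinks.
print(conv_output_size(224, 3, 0, 1))  # 222

# Stride 2 roughly halves the spatial size.
print(conv_output_size(224, 3, 1, 2))  # 112
```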
How a filter slides across an image
graph TD
A["Input (e.g. 5x5)"] --> B["Pad with zeros
if padding > 0"]
B --> C["Place filter at
top-left position"]
C --> D["Compute dot product
for one output value"]
D --> E["Slide right
by stride S"]
E --> F{"Reached
right edge?"}
F -->|"No"| D
F -->|"Yes"| G["Move down by stride S
reset to left edge"]
G --> H{"Reached
bottom edge?"}
H -->|"No"| D
H -->|"Yes"| I["Output
feature map"]
With stride 1, the filter moves one pixel at a time. With stride 2, it skips every other position, cutting the output size roughly in half. Padding adds zeros around the border so the filter can center on edge pixels, preserving the spatial dimensions.
Pooling
After convolution and activation, pooling reduces spatial dimensions. It makes the representation smaller and more robust to small shifts in the input.
| Pooling type | Operation | Output for the 2x2 region [1, 3; 2, 4] | When to use | Effect on gradients |
|---|---|---|---|---|
| Max pooling | Take the maximum value | 4 | Default choice; preserves strongest activations | Gradient flows only to max element |
| Average pooling | Compute the mean | 2.5 | Global average pooling in final layer | Gradient splits equally among elements |
How max pooling works
graph TD A["Feature map 4x4"] --> B["Split into 2x2 regions"] B --> C["Region 1 max of 1, 3, 2, 4 = 4"] B --> D["Region 2 max of 0, 2, 1, 3 = 3"] B --> E["Region 3 max of 5, 1, 0, 2 = 5"] B --> F["Region 4 max of 3, 1, 4, 0 = 4"] C --> G["Pooled output 2x2"] D --> G E --> G F --> G
Max pooling with a 2x2 window and stride 2 is the standard. It cuts each spatial dimension in half, reducing computation for subsequent layers by 4x.
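In NumPy, 2x2 max pooling with stride 2 can be written with a reshape trick. Here is a sketch using one possible 4x4 feature map consistent with the four regions in the flowchart above:

```python
import numpy as np

fmap = np.array([[1, 3, 0, 2],
                 [2, 4, 1, 3],
                 [5, 1, 3, 1],
                 [0, 2, 4, 0]])

# reshape(2, 2, 2, 2) groups the map into non-overlapping 2x2 regions;
# taking the max over axes 1 and 3 reduces each region to a single value.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 3]
#  [5 4]]
```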
Receptive field
The receptive field of a neuron is the region of the original input that can influence its value. In the first conv layer, a 3x3 filter has a 3x3 receptive field. After stacking more conv layers, the receptive field grows. After pooling, it grows even faster.
Two stacked 3x3 conv layers have a 5x5 effective receptive field, and three have a 7x7 receptive field. This is why modern architectures prefer multiple small filters over one large filter: same receptive field, fewer parameters, more nonlinearity.
Receptive field growth through layers
graph LR A["Input pixel"] --> B["Layer 1: 3x3 conv Receptive field: 3x3"] B --> C["Layer 2: 3x3 conv Receptive field: 5x5"] C --> D["Layer 3: 3x3 conv Receptive field: 7x7"] D --> E["2x2 max pool Receptive field: 14x14"]
Each 3x3 conv layer adds 2 pixels to the receptive field. Pooling halves the spatial resolution, so every subsequent layer's receptive field grows twice as fast in input pixels, roughly doubling its effective size. Three conv layers followed by one pool give later layers a roughly 14x14 view of the input built from nothing but 3x3 filters.
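For stride-1 stacks, the growth can be computed exactly with a simple recurrence. A sketch (the `receptive_field` helper is illustrative, not from any library):

```python
def receptive_field(layers):
    """Receptive field after each layer in a stack of (kernel, stride) pairs."""
    rf, jump = 1, 1
    fields = []
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # stride compounds the step between output positions
        fields.append(rf)
    return fields

# Three stacked 3x3 convs, stride 1: the field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

The parameter argument also checks out: for $C$ channels in and out, two stacked 3x3 convs cost $2 \cdot 9C^2 = 18C^2$ weights, versus $25C^2$ for a single 5x5 filter with the same 5x5 receptive field.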
The CNN pattern
Most CNNs follow a common structure:
graph LR A["Input image"] --> B["Conv + ReLU"] B --> C["Conv + ReLU"] C --> D["Max Pool"] D --> E["Conv + ReLU"] E --> F["Conv + ReLU"] F --> G["Max Pool"] G --> H["Flatten"] H --> I["FC + ReLU"] I --> J["FC + Softmax"] J --> K["Class probabilities"]
The pattern: convolve to detect features, activate with ReLU, pool to downsample. Repeat. Then flatten the spatial dimensions into a vector, and use fully connected layers for the final classification. As you go deeper, spatial dimensions shrink while the number of channels grows.
Key architectures
LeNet-5 (1998)
The original CNN for digit recognition. Two conv layers, two pooling layers, three FC layers. Small by today’s standards, but it proved the concept.
AlexNet (2012)
The model that reignited interest in deep learning. 8 layers (5 conv, 3 FC). Key innovations: ReLU activations, dropout for regularization, and training on GPUs. Won the 2012 ImageNet competition by a wide margin.
VGG-16 (2014)
Showed that depth matters. 16 layers, all using 3x3 filters. Simple and uniform architecture. But 138 million parameters, most of them in the FC layers.
ResNet (2015)
The breakthrough that enabled very deep networks (50, 101, even 152 layers). The key idea: skip connections (also called residual connections).
Instead of learning a mapping $H(x)$ directly, the network learns the residual $F(x) = H(x) - x$. The output is $H(x) = F(x) + x$:
graph TD X["Input x"] --> Conv1["Conv + BN + ReLU"] Conv1 --> Conv2["Conv + BN"] X --> Skip["Skip connection (identity)"] Conv2 --> Add["Add: F(x) + x"] Skip --> Add Add --> ReLU2["ReLU"] ReLU2 --> Out["Output"]
Why this works: the gradient can flow directly through the skip connection, bypassing the conv layers entirely. Even if the conv layers have vanishing gradients, the skip connection provides a highway for gradient flow. This is why ResNets can be 100+ layers deep without training difficulties.
Batch normalization (BN) after each conv layer also stabilizes training by normalizing intermediate activations.
Architecture comparison
Evolution of CNN architectures
graph LR A["LeNet-5 (1998) 7 layers, 60K params"] --> B["AlexNet (2012) 8 layers, 61M params"] B --> C["VGG-16 (2014) 16 layers, 138M params"] C --> D["ResNet-50 (2015) 50 layers, 25.6M params"] style A fill:#e0f0ff,stroke:#333,color:#000 style D fill:#c0ffc0,stroke:#333,color:#000
Each generation went deeper. AlexNet proved GPUs could train large CNNs. VGG showed that stacking small 3x3 filters beats fewer large filters. ResNet introduced skip connections that made 50+ layers trainable without degradation.
| Architecture | Year | Depth | Parameters | Key innovation | Top-5 error (ImageNet) |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 7 | 60K | First successful CNN | N/A (MNIST) |
| AlexNet | 2012 | 8 | 61M | ReLU, dropout, GPU training | 15.3% |
| VGG-16 | 2014 | 16 | 138M | Uniform 3x3 filters | 7.3% |
| ResNet-50 | 2015 | 50 | 25.6M | Skip connections | 3.57% |
Notice that ResNet-50 has fewer parameters than VGG-16 but is much deeper and more accurate. Skip connections and batch normalization made this possible.
For tasks beyond image classification, these architectures serve as backbones for transfer learning: you take a pretrained ResNet, remove the final FC layer, and fine-tune on your specific task.
Example 1: Convolution by hand
Apply a 3x3 filter to a 5x5 input with stride 1 and no padding.
Input (5x5):
| 1 | 0 | 2 | 1 | 0 |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 |

Filter (3x3):
| 1 | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |

Output size: $(5 - 3 + 2 \cdot 0)/1 + 1 = 3$, so the output is 3x3.
At each position, we overlay the filter on the input patch and sum the element-wise products. Because this filter is 1 only at the four corners and the center, each sum is just the corners plus the center of the patch.
Position (0,0): patch is rows 0-2, cols 0-2: 1·1 + 0·0 + 2·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 = 6
Position (0,1): patch is rows 0-2, cols 1-3: 0·1 + 2·0 + 1·1 + 1·0 + 0·1 + 1·0 + 0·1 + 1·0 + 0·1 = 1
Position (0,2): patch is rows 0-2, cols 2-4: 2·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 1·1 + 0·0 + 0·1 = 4
Position (1,0): patch is rows 1-3, cols 0-2: 0·1 + 1·0 + 0·1 + 1·0 + 0·1 + 1·0 + 0·1 + 0·0 + 1·1 = 1
Position (1,1): patch is rows 1-3, cols 1-3: 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 0·1 + 1·0 + 1·1 = 4
Position (1,2): patch is rows 1-3, cols 2-4: 0·1 + 1·0 + 1·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 0·1 = 2
Position (2,0): patch is rows 2-4, cols 0-2: 1·1 + 0·0 + 1·1 + 0·0 + 0·1 + 1·0 + 1·1 + 1·0 + 0·1 = 3
Position (2,1): patch is rows 2-4, cols 1-3: 0·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 1·1 + 0·0 + 0·1 = 2
Position (2,2): patch is rows 2-4, cols 2-4: 1·1 + 0·0 + 0·1 + 1·0 + 1·1 + 0·0 + 0·1 + 0·0 + 1·1 = 3

Output (3x3):
| 6 | 1 | 4 |
|---|---|---|
| 1 | 4 | 2 |
| 3 | 2 | 3 |
This filter detects a cross-like pattern (nonzero at corners and center). Positions where the input matches this pattern get higher values.
Example 2: Tracking dimensions through a CNN
Start with a 224x224x3 input image and pass it through three layers.
Layer 1: Conv with 64 filters of size 3x3, stride 1, padding 1.
Output: 224 x 224 x 64. Same spatial size (thanks to padding), but now 64 channels.
Layer 2: Max pooling with 2x2 window, stride 2.
Output: 112 x 112 x 64. Spatial dimensions halved. Channel count unchanged.
Layer 3: Conv with 128 filters of size 3x3, stride 1, padding 1.
Output: 112 x 112 x 128. Spatial size preserved, channels doubled.
| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| Input | - | 224 x 224 x 3 | 0 |
| Conv1 | 3x3, 64 filters, s1, p1 | 224 x 224 x 64 | (3·3·3 + 1)·64 = 1,792 |
| Pool1 | 2x2 max pool, s2 | 112 x 112 x 64 | 0 |
| Conv2 | 3x3, 128 filters, s1, p1 | 112 x 112 x 128 | (3·3·64 + 1)·128 = 73,856 |
Notice how spatial dimensions decrease (224 to 112) while depth increases (3 to 64 to 128). This is the classic CNN trade-off: compress spatial information while expanding feature representations.
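The same bookkeeping can be automated. A sketch with two illustrative helpers (`conv2d_shape` and `pool2d_shape` are not library functions, just shape arithmetic):

```python
def conv2d_shape(h, w, c, filters, k=3, s=1, p=1):
    """Output shape and parameter count (weights + biases) of a conv layer."""
    out_h = (h - k + 2 * p) // s + 1
    out_w = (w - k + 2 * p) // s + 1
    params = (k * k * c + 1) * filters
    return (out_h, out_w, filters), params

def pool2d_shape(h, w, c, k=2, s=2):
    """Output shape of a max-pool layer (no learnable parameters)."""
    return ((h - k) // s + 1, (w - k) // s + 1, c), 0

shape, p1 = conv2d_shape(224, 224, 3, 64)   # (224, 224, 64), 1792 params
shape, _  = pool2d_shape(*shape)            # (112, 112, 64)
shape, p2 = conv2d_shape(*shape, 128)       # (112, 112, 128), 73856 params
print(shape, p1, p2)
```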
Example 3: ResNet skip connection
Trace a forward pass through a residual block with specific values.
Input: $x = [1,\ 2]$ (a 2D vector for simplicity)

Block has two layers (no bias, using ReLU). To keep the arithmetic simple, let both weight matrices scale their input by 0.1:

$$W_1 = W_2 = \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix}$$

Layer 1: Linear + ReLU. $a = \mathrm{ReLU}(W_1 x) = \mathrm{ReLU}([0.1,\ 0.2]) = [0.1,\ 0.2]$

Layer 2: Linear only (ReLU comes after the skip). $F(x) = W_2 a = [0.01,\ 0.02]$

Add the skip connection: $F(x) + x = [0.01,\ 0.02] + [1,\ 2] = [1.01,\ 2.02]$

Final ReLU: $\mathrm{ReLU}([1.01,\ 2.02]) = [1.01,\ 2.02]$

Without skip connection: the output would be just $[0.01,\ 0.02]$, a tiny signal.
With skip connection: the output is $[1.01,\ 2.02]$, preserving the original input’s magnitude. The skip connection ensures that even if the learned transformation is small, the signal is not lost. During backpropagation, the gradient flows directly through the addition, making deep training stable.
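A residual block of this shape is a few lines of NumPy. The 0.1-scale diagonal weights below are illustrative values chosen to make the residual deliberately tiny:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.0], [0.0, 0.1]])  # illustrative small weights
W2 = np.array([[0.1, 0.0], [0.0, 0.1]])

a = relu(W1 @ x)          # layer 1: linear + ReLU
Fx = W2 @ a               # layer 2: linear only

plain = relu(Fx)          # without the skip: a tiny signal
residual = relu(Fx + x)   # with the skip: the input's magnitude survives
print(plain)     # [0.01 0.02]
print(residual)  # [1.01 2.02]
```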
What comes next
CNNs handle spatial data by exploiting local patterns and translation equivariance. But many problems involve sequential data: text, speech, time series. The order of elements matters, and the sequence length can vary.
Recurrent neural networks and LSTMs tackle this by introducing memory: a hidden state that carries information from one time step to the next.