Convolutional neural networks
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
You should understand:
- Training neural networks: the training loop, initialization, and practical considerations that apply to all networks including CNNs
Why convolutions?
Fully connected layers treat every input independently. For a 224x224 color image, that is 150,528 input values. A single hidden layer with 1000 neurons would need over 150 million weights. That is wasteful and overfits quickly.
Images have spatial structure. A cat’s ear looks the same whether it is in the top-left or bottom-right of the image. Convolutions exploit this with two ideas:
- Parameter sharing: the same small filter scans across the entire image. Instead of learning separate weights for every position, you learn one set of filter weights.
- Translation equivariance: if the input shifts, the output shifts by the same amount. The network detects features regardless of their position.
These properties make CNNs dramatically more parameter-efficient for spatial data than fully connected networks.
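The savings are easy to quantify with plain arithmetic. A quick sketch comparing the fully connected layer above with a (hypothetical but typical) first conv layer of 64 filters of size 3x3:

```python
# Fully connected: all 224*224*3 = 150,528 inputs connect to each of 1000 neurons.
fc_weights = 224 * 224 * 3 * 1000

# Convolutional: 64 filters of size 3x3 over 3 input channels, one bias per filter,
# shared across every spatial position.
conv_weights = (3 * 3 * 3 + 1) * 64

print(fc_weights)    # 150528000
print(conv_weights)  # 1792
```

Five orders of magnitude fewer parameters, because the filter weights are reused at every position.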
Building intuition
Where a fully connected network ignores spatial layout entirely, a CNN looks at small patches and slides across the image. Imagine looking through a small window that moves across a photograph. At each position, you see a tiny piece of the scene. The same window scans the entire photo, searching for one specific pattern.
Consider a 5x5 image patch and a 3x3 filter that detects a cross-like pattern:
| Image patch | Col 0 | Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|---|---|
| Row 0 | 1 | 0 | 2 | 1 | 0 |
| Row 1 | 0 | 1 | 0 | 1 | 1 |
| Row 2 | 1 | 0 | 1 | 0 | 0 |
| Row 3 | 0 | 0 | 1 | 1 | 0 |
| Row 4 | 1 | 1 | 0 | 0 | 1 |
The 3x3 filter:
| 1 | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |
At each position, multiply the overlapping values and sum. The result is a 3x3 feature map:
| 6 | 1 | 4 |
|---|---|---|
| 1 | 4 | 2 |
| 3 | 2 | 3 |
High values mean the patch matches the filter pattern. Position (0,0) scored 6 because the top-left 3x3 region aligns well with the cross shape. The full step-by-step calculation is in Example 1 below.
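You can verify the whole feature map with a few lines of NumPy. This is a sketch of a "valid" sliding-window operation (technically cross-correlation, which is what deep learning frameworks call convolution):

```python
import numpy as np

# The 5x5 image patch and 3x3 cross-shaped filter from the tables above.
image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 0, 1, 1],
                  [1, 0, 1, 0, 0],
                  [0, 0, 1, 1, 0],
                  [1, 1, 0, 0, 1]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

h, w = kernel.shape
# At each position: overlay the kernel, multiply element-wise, and sum.
out = np.array([[np.sum(image[i:i+h, j:j+w] * kernel)
                 for j in range(image.shape[1] - w + 1)]
                for i in range(image.shape[0] - h + 1)])
print(out)
# [[6 1 4]
#  [1 4 2]
#  [3 2 3]]
```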
How a CNN processes an image
graph LR A["Input image"] --> B["Conv + ReLU"] B --> C["Pooling"] C --> D["Conv + ReLU"] D --> E["Pooling"] E --> F["Flatten"] F --> G["Fully connected"] G --> H["Output class"]
Each convolution extracts features. Pooling shrinks spatial dimensions. After several rounds, the network flattens everything into a vector and classifies it. Early layers detect edges and textures. Deeper layers combine those into complex shapes like wheels, ears, or faces.
With that picture in mind, let’s formalize each step.
The convolution operation
Figure: feature map activations after the first convolutional layer, showing response magnitudes for 8 learned filters.
A convolution slides a small filter (also called a kernel) across the input, computing a dot product at each position.
Key terms:
- Filter size: the spatial dimensions of the kernel (e.g., 3x3, 5x5)
- Stride: how many pixels the filter moves at each step
- Padding: extra zeros added around the input border to control the output size
Output dimension formula
For an input of width $W$, filter size $F$, padding $P$, and stride $S$:

$$W_{out} = \frac{W - F + 2P}{S} + 1$$

This formula applies independently to height and width. It tells you exactly how large the output feature map will be.
With “same” padding ($P = (F-1)/2$ and $S = 1$), the output has the same spatial dimensions as the input. With “valid” padding ($P = 0$), the output shrinks.
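The formula translates directly into code. A minimal helper (integer division matches the usual floor behavior when the stride does not divide evenly):

```python
def conv_output_size(w, f, p, s):
    """Output spatial size for input width w, filter size f, padding p, stride s."""
    return (w - f + 2 * p) // s + 1

# "Same" padding: a 3x3 filter with padding 1, stride 1 preserves a 224-wide input.
print(conv_output_size(224, 3, 1, 1))  # 224

# "Valid" padding: no zeros added, the output shrinks.
print(conv_output_size(224, 3, 0, 1))  # 222

# Stride 2 roughly halves the spatial size.
print(conv_output_size(224, 3, 1, 2))  # 112
```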
How a filter slides across an image
graph TD
A["Input (e.g. 5x5)"] --> B["Pad with zeros
if padding > 0"]
B --> C["Place filter at
top-left position"]
C --> D["Compute dot product
for one output value"]
D --> E["Slide right
by stride S"]
E --> F{"Reached
right edge?"}
F -->|"No"| D
F -->|"Yes"| G["Move down by stride S
reset to left edge"]
G --> H{"Reached
bottom edge?"}
H -->|"No"| D
H -->|"Yes"| I["Output
feature map"]
With stride 1, the filter moves one pixel at a time. With stride 2, it skips every other position, cutting the output size roughly in half. Padding adds zeros around the border so the filter can center on edge pixels, preserving the spatial dimensions.
Pooling
After convolution and activation, pooling reduces spatial dimensions. It makes the representation smaller and more robust to small shifts in the input.
| Pooling type | Operation | Output for the 2x2 region [1, 3; 2, 4] | When to use | Effect on gradients |
|---|---|---|---|---|
| Max pooling | Take the maximum value | 4 | Default choice; preserves strongest activations | Gradient flows only to max element |
| Average pooling | Compute the mean | 2.5 | Global average pooling in final layer | Gradient splits equally among elements |
How max pooling works
graph TD A["Feature map 4x4"] --> B["Split into 2x2 regions"] B --> C["Region 1 max of 1, 3, 2, 4 = 4"] B --> D["Region 2 max of 0, 2, 1, 3 = 3"] B --> E["Region 3 max of 5, 1, 0, 2 = 5"] B --> F["Region 4 max of 3, 1, 4, 0 = 4"] C --> G["Pooled output 2x2"] D --> G E --> G F --> G
Max pooling with a 2x2 window and stride 2 is the standard. It cuts each spatial dimension in half, reducing computation for subsequent layers by 4x.
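In NumPy, 2x2 max pooling with stride 2 can be written with a reshape trick. Here is a sketch using one possible 4x4 feature map consistent with the four regions in the flowchart above:

```python
import numpy as np

fmap = np.array([[1, 3, 0, 2],
                 [2, 4, 1, 3],
                 [5, 1, 3, 1],
                 [0, 2, 4, 0]])

# reshape(2, 2, 2, 2) groups the map into non-overlapping 2x2 regions;
# taking the max over axes 1 and 3 reduces each region to a single value.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 3]
#  [5 4]]
```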
Receptive field
The receptive field of a neuron is the region of the original input that can influence its value. In the first conv layer, a 3x3 filter has a 3x3 receptive field. After stacking more conv layers, the receptive field grows. After pooling, it grows even faster.
Two stacked 3x3 conv layers have a 5x5 effective receptive field, and three have a 7x7 receptive field. This is why modern architectures prefer multiple small filters over one large filter: same receptive field, fewer parameters, more nonlinearity.
Receptive field growth through layers
graph LR A["Input pixel"] --> B["Layer 1: 3x3 conv Receptive field: 3x3"] B --> C["Layer 2: 3x3 conv Receptive field: 5x5"] C --> D["Layer 3: 3x3 conv Receptive field: 7x7"] D --> E["2x2 max pool Receptive field: 14x14"]
Each 3x3 conv layer adds 2 pixels to the receptive field. Pooling halves the spatial resolution, so every subsequent layer's receptive field grows twice as fast in input pixels, roughly doubling its effective size. Three conv layers followed by one pool give later layers a roughly 14x14 view of the input built from nothing but 3x3 filters.
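For stride-1 stacks, the growth can be computed exactly with a simple recurrence. A sketch (the `receptive_field` helper is illustrative, not from any library):

```python
def receptive_field(layers):
    """Receptive field after each layer in a stack of (kernel, stride) pairs."""
    rf, jump = 1, 1
    fields = []
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # stride compounds the step between output positions
        fields.append(rf)
    return fields

# Three stacked 3x3 convs, stride 1: the field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

The parameter argument also checks out: for $C$ channels in and out, two stacked 3x3 convs cost $2 \cdot 9C^2 = 18C^2$ weights, versus $25C^2$ for a single 5x5 filter with the same 5x5 receptive field.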
The CNN pattern
Most CNNs follow a common structure:
graph LR A["Input image"] --> B["Conv + ReLU"] B --> C["Conv + ReLU"] C --> D["Max Pool"] D --> E["Conv + ReLU"] E --> F["Conv + ReLU"] F --> G["Max Pool"] G --> H["Flatten"] H --> I["FC + ReLU"] I --> J["FC + Softmax"] J --> K["Class probabilities"]
The pattern: convolve to detect features, activate with ReLU, pool to downsample. Repeat. Then flatten the spatial dimensions into a vector, and use fully connected layers for the final classification. As you go deeper, spatial dimensions shrink while the number of channels grows.
Key architectures
LeNet-5 (1998)
The original CNN for digit recognition. Two conv layers, two pooling layers, three FC layers. Small by today’s standards, but it proved the concept.
AlexNet (2012)
The model that reignited interest in deep learning. 8 layers (5 conv, 3 FC). Key innovations: ReLU activations, dropout for regularization, and training on GPUs. Won the 2012 ImageNet competition by a wide margin.
VGG-16 (2014)
Showed that depth matters. 16 layers, all using 3x3 filters. Simple and uniform architecture. But 138 million parameters, most of them in the FC layers.
ResNet (2015)
The breakthrough that enabled very deep networks (50, 101, even 152 layers). The key idea: skip connections (also called residual connections).
Instead of learning a mapping $H(x)$ directly, the network learns the residual $F(x) = H(x) - x$. The output is $H(x) = F(x) + x$:
graph TD X["Input x"] --> Conv1["Conv + BN + ReLU"] Conv1 --> Conv2["Conv + BN"] X --> Skip["Skip connection (identity)"] Conv2 --> Add["Add: F(x) + x"] Skip --> Add Add --> ReLU2["ReLU"] ReLU2 --> Out["Output"]
Why this works: the gradient can flow directly through the skip connection, bypassing the conv layers entirely. Even if the conv layers have vanishing gradients, the skip connection provides a highway for gradient flow. This is why ResNets can be 100+ layers deep without training difficulties.
Batch normalization (BN) after each conv layer also stabilizes training by normalizing intermediate activations.
Architecture comparison
Evolution of CNN architectures
graph LR A["LeNet-5 (1998) 7 layers, 60K params"] --> B["AlexNet (2012) 8 layers, 61M params"] B --> C["VGG-16 (2014) 16 layers, 138M params"] C --> D["ResNet-50 (2015) 50 layers, 25.6M params"] style A fill:#e0f0ff,stroke:#333,color:#000 style D fill:#c0ffc0,stroke:#333,color:#000
Each generation went deeper. AlexNet proved GPUs could train large CNNs. VGG showed that stacking small 3x3 filters beats fewer large filters. ResNet introduced skip connections that made 50+ layers trainable without degradation.
| Architecture | Year | Depth | Parameters | Key innovation | Top-5 error (ImageNet) |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 7 | 60K | First successful CNN | N/A (MNIST) |
| AlexNet | 2012 | 8 | 61M | ReLU, dropout, GPU training | 15.3% |
| VGG-16 | 2014 | 16 | 138M | Uniform 3x3 filters | 7.3% |
| ResNet-50 | 2015 | 50 | 25.6M | Skip connections | 3.57% |
Notice that ResNet-50 has fewer parameters than VGG-16 but is much deeper and more accurate. Skip connections and batch normalization made this possible.
For tasks beyond image classification, these architectures serve as backbones for transfer learning: you take a pretrained ResNet, remove the final FC layer, and fine-tune on your specific task.
Example 1: Convolution by hand
Apply a 3x3 filter to a 5x5 input with stride 1 and no padding.
Input (5x5):
| 1 | 0 | 2 | 1 | 0 |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 |

Filter (3x3):
| 1 | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |

Output size: $(5 - 3 + 2 \cdot 0)/1 + 1 = 3$, so the output is 3x3.
At each position, we overlay the filter on the input patch and sum the element-wise products. Because this filter is 1 only at the four corners and the center, each sum is just the corners plus the center of the patch.
Position (0,0): patch is rows 0-2, cols 0-2: 1·1 + 0·0 + 2·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1 = 6
Position (0,1): patch is rows 0-2, cols 1-3: 0·1 + 2·0 + 1·1 + 1·0 + 0·1 + 1·0 + 0·1 + 1·0 + 0·1 = 1
Position (0,2): patch is rows 0-2, cols 2-4: 2·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 1·1 + 0·0 + 0·1 = 4
Position (1,0): patch is rows 1-3, cols 0-2: 0·1 + 1·0 + 0·1 + 1·0 + 0·1 + 1·0 + 0·1 + 0·0 + 1·1 = 1
Position (1,1): patch is rows 1-3, cols 1-3: 1·1 + 0·0 + 1·1 + 0·0 + 1·1 + 0·0 + 0·1 + 1·0 + 1·1 = 4
Position (1,2): patch is rows 1-3, cols 2-4: 0·1 + 1·0 + 1·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 0·1 = 2
Position (2,0): patch is rows 2-4, cols 0-2: 1·1 + 0·0 + 1·1 + 0·0 + 0·1 + 1·0 + 1·1 + 1·0 + 0·1 = 3
Position (2,1): patch is rows 2-4, cols 1-3: 0·1 + 1·0 + 0·1 + 0·0 + 1·1 + 1·0 + 1·1 + 0·0 + 0·1 = 2
Position (2,2): patch is rows 2-4, cols 2-4: 1·1 + 0·0 + 0·1 + 1·0 + 1·1 + 0·0 + 0·1 + 0·0 + 1·1 = 3

Output (3x3):
| 6 | 1 | 4 |
|---|---|---|
| 1 | 4 | 2 |
| 3 | 2 | 3 |
This filter detects a cross-like pattern (nonzero at corners and center). Positions where the input matches this pattern get higher values.
Example 2: Tracking dimensions through a CNN
Start with a 224x224x3 input image and pass it through three layers.
Layer 1: Conv with 64 filters of size 3x3, stride 1, padding 1.
Output: 224 x 224 x 64. Same spatial size (thanks to padding), but now 64 channels.
Layer 2: Max pooling with 2x2 window, stride 2.
Output: 112 x 112 x 64. Spatial dimensions halved. Channel count unchanged.
Layer 3: Conv with 128 filters of size 3x3, stride 1, padding 1.
Output: 112 x 112 x 128. Spatial size preserved, channels doubled.
| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| Input | - | 224 x 224 x 3 | 0 |
| Conv1 | 3x3, 64 filters, s1, p1 | 224 x 224 x 64 | (3·3·3 + 1)·64 = 1,792 |
| Pool1 | 2x2 max pool, s2 | 112 x 112 x 64 | 0 |
| Conv2 | 3x3, 128 filters, s1, p1 | 112 x 112 x 128 | (3·3·64 + 1)·128 = 73,856 |
Notice how spatial dimensions decrease (224 to 112) while depth increases (3 to 64 to 128). This is the classic CNN trade-off: compress spatial information while expanding feature representations.
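The same bookkeeping can be automated. A sketch with two illustrative helpers (`conv2d_shape` and `pool2d_shape` are not library functions, just shape arithmetic):

```python
def conv2d_shape(h, w, c, filters, k=3, s=1, p=1):
    """Output shape and parameter count (weights + biases) of a conv layer."""
    out_h = (h - k + 2 * p) // s + 1
    out_w = (w - k + 2 * p) // s + 1
    params = (k * k * c + 1) * filters
    return (out_h, out_w, filters), params

def pool2d_shape(h, w, c, k=2, s=2):
    """Output shape of a max-pool layer (no learnable parameters)."""
    return ((h - k) // s + 1, (w - k) // s + 1, c), 0

shape, p1 = conv2d_shape(224, 224, 3, 64)   # (224, 224, 64), 1792 params
shape, _  = pool2d_shape(*shape)            # (112, 112, 64)
shape, p2 = conv2d_shape(*shape, 128)       # (112, 112, 128), 73856 params
print(shape, p1, p2)
```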
Example 3: ResNet skip connection
Trace a forward pass through a residual block with specific values.
Input: $x = [1,\ 2]$ (a 2D vector for simplicity)

Block has two layers (no bias, using ReLU). To keep the arithmetic simple, let both weight matrices scale their input by 0.1:

$$W_1 = W_2 = \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix}$$

Layer 1: Linear + ReLU. $a = \mathrm{ReLU}(W_1 x) = \mathrm{ReLU}([0.1,\ 0.2]) = [0.1,\ 0.2]$

Layer 2: Linear only (ReLU comes after the skip). $F(x) = W_2 a = [0.01,\ 0.02]$

Add the skip connection: $F(x) + x = [0.01,\ 0.02] + [1,\ 2] = [1.01,\ 2.02]$

Final ReLU: $\mathrm{ReLU}([1.01,\ 2.02]) = [1.01,\ 2.02]$

Without skip connection: the output would be just $[0.01,\ 0.02]$, a tiny signal.
With skip connection: the output is $[1.01,\ 2.02]$, preserving the original input’s magnitude. The skip connection ensures that even if the learned transformation is small, the signal is not lost. During backpropagation, the gradient flows directly through the addition, making deep training stable.
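A residual block of this shape is a few lines of NumPy. The 0.1-scale diagonal weights below are illustrative values chosen to make the residual deliberately tiny:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.0], [0.0, 0.1]])  # illustrative small weights
W2 = np.array([[0.1, 0.0], [0.0, 0.1]])

a = relu(W1 @ x)          # layer 1: linear + ReLU
Fx = W2 @ a               # layer 2: linear only

plain = relu(Fx)          # without the skip: a tiny signal
residual = relu(Fx + x)   # with the skip: the input's magnitude survives
print(plain)     # [0.01 0.02]
print(residual)  # [1.01 2.02]
```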
What comes next
CNNs handle spatial data by exploiting local patterns and translation equivariance. But many problems involve sequential data: text, speech, time series. The order of elements matters, and the sequence length can vary.
Recurrent neural networks and LSTMs tackle this by introducing memory: a hidden state that carries information from one time step to the next.