
Convolutional neural networks

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

You should understand:

  • Training neural networks: the training loop, initialization, and practical considerations that apply to all networks including CNNs

Why convolutions?

Fully connected layers treat every input independently. For a 224x224 color image, that is 150,528 input values. A single hidden layer with 1000 neurons would need over 150 million weights. That is wasteful and overfits quickly.

Images have spatial structure. A cat’s ear looks the same whether it is in the top-left or bottom-right of the image. Convolutions exploit this with two ideas:

  1. Parameter sharing: the same small filter scans across the entire image. Instead of learning separate weights for every position, you learn one set of filter weights.
  2. Translation equivariance: if the input shifts, the output shifts by the same amount. The network detects features regardless of their position.

These properties make CNNs dramatically more parameter-efficient for spatial data than fully connected networks.
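To make the gap concrete, a quick back-of-the-envelope count in Python (the 64-filter, 3x3 conv layer is an illustrative choice for the comparison):

```python
# Parameter count: fully connected layer vs. a conv layer on a 224x224x3 image.
fc_params = 224 * 224 * 3 * 1000      # dense hidden layer with 1000 neurons
conv_params = (3 * 3 * 3) * 64 + 64   # 64 shared 3x3x3 filters, plus biases

print(f"{fc_params:,}")    # 150,528,000
print(f"{conv_params:,}")  # 1,792
```

Parameter sharing is what makes the difference: the conv layer reuses the same 27 weights per filter at every spatial position.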


Building intuition

A fully connected network treats every pixel independently. A CNN looks at small patches and slides across the image. Imagine looking through a small window that moves across a photograph. At each position, you see a tiny piece of the scene. The same window scans the entire photo, searching for one specific pattern.

Consider a 5x5 image patch and a 3x3 filter that detects a cross-like pattern:

Image patch:

|       | Col 0 | Col 1 | Col 2 | Col 3 | Col 4 |
|-------|-------|-------|-------|-------|-------|
| Row 0 | 1     | 0     | 2     | 1     | 0     |
| Row 1 | 0     | 1     | 0     | 1     | 1     |
| Row 2 | 1     | 0     | 1     | 0     | 0     |
| Row 3 | 0     | 0     | 1     | 1     | 0     |
| Row 4 | 1     | 1     | 0     | 0     | 1     |

The 3x3 filter:

1 0 1
0 1 0
1 0 1

At each position, multiply the overlapping values and sum. The result is a 3x3 feature map:

6 1 4
1 4 2
3 2 3

High values mean the patch matches the filter pattern. Position (0,0) scored 6 because the top-left 3x3 region aligns well with the cross shape. The full step-by-step calculation is in Example 1 below.

How a CNN processes an image

graph LR
  A["Input image"] --> B["Conv + ReLU"]
  B --> C["Pooling"]
  C --> D["Conv + ReLU"]
  D --> E["Pooling"]
  E --> F["Flatten"]
  F --> G["Fully connected"]
  G --> H["Output class"]

Each convolution extracts features. Pooling shrinks spatial dimensions. After several rounds, the network flattens everything into a vector and classifies it. Early layers detect edges and textures. Deeper layers combine those into complex shapes like wheels, ears, or faces.

With that picture in mind, let’s formalize each step.


The convolution operation

[Figure: feature map activations after the first convolutional layer, showing response magnitudes for 8 learned filters.]

A convolution slides a small filter (also called a kernel) across the input, computing a dot product at each position.

Key terms:

  • Filter size: the spatial dimensions of the kernel (e.g., 3x3, 5x5)
  • Stride: how many pixels the filter moves at each step
  • Padding: extra zeros added around the input border to control the output size

Output dimension formula

For an input of width W, filter size F, padding P, and stride S:

W_{\text{out}} = \frac{W - F + 2P}{S} + 1

This formula applies independently to height and width. It tells you exactly how large the output feature map will be.

With “same” padding (P = \lfloor F/2 \rfloor and S = 1), the output has the same spatial dimensions as the input. With “valid” padding (P = 0), the output shrinks.
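The formula is easy to sanity-check in code. A minimal Python helper (integer division mirrors how frameworks floor the output size):

```python
def conv_output_size(w, f, p, s):
    """Output width for input width w, filter size f, padding p, stride s."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(5, 3, 0, 1))    # 3   ("valid" padding: output shrinks)
print(conv_output_size(5, 3, 1, 1))    # 5   ("same" padding: size preserved)
print(conv_output_size(224, 3, 1, 2))  # 112 (stride 2 roughly halves the size)
```

Apply it once for height and once for width; they are independent.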

How a filter slides across an image

graph TD
  A["Input (e.g. 5x5)"] --> B["Pad with zeros
if padding > 0"]
  B --> C["Place filter at
top-left position"]
  C --> D["Compute dot product
for one output value"]
  D --> E["Slide right
by stride S"]
  E --> F{"Reached
right edge?"}
  F -->|"No"| D
  F -->|"Yes"| G["Move down by stride S
reset to left edge"]
  G --> H{"Reached
bottom edge?"}
  H -->|"No"| D
  H -->|"Yes"| I["Output
feature map"]

With stride 1, the filter moves one pixel at a time. With stride 2, it skips every other position, cutting the output size roughly in half. Padding adds zeros around the border so the filter can center on edge pixels, preserving the spatial dimensions.


Pooling

After convolution and activation, pooling reduces spatial dimensions. It makes the representation smaller and more robust to small shifts in the input.

| Pooling type    | Operation              | Output for 2x2 region [1, 3, 2, 4] | When to use                                    | Effect on gradients                  |
|-----------------|------------------------|------------------------------------|------------------------------------------------|--------------------------------------|
| Max pooling     | Take the maximum value | 4                                  | Default choice; preserves strongest activations | Gradient flows only to max element   |
| Average pooling | Compute the mean       | 2.5                                | Global average pooling in final layer           | Gradient splits equally among elements |

How max pooling works

graph TD
  A["Feature map 4x4"] --> B["Split into
2x2 regions"]
  B --> C["Region 1
max of 1, 3, 2, 4 = 4"]
  B --> D["Region 2
max of 0, 2, 1, 3 = 3"]
  B --> E["Region 3
max of 5, 1, 0, 2 = 5"]
  B --> F["Region 4
max of 3, 1, 4, 0 = 4"]
  C --> G["Pooled output 2x2"]
  D --> G
  E --> G
  F --> G

Max pooling with a 2x2 window and stride 2 is the standard. It cuts each spatial dimension in half, reducing computation for subsequent layers by 4x.
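The 4x4 example above can be reproduced with a small NumPy sketch (the reshape trick groups each non-overlapping 2x2 region):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D feature map (NumPy sketch)."""
    h, w = x.shape
    # Split into (h/2, 2, w/2, 2) blocks, then take the max inside each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 0, 2],
                 [2, 4, 1, 3],
                 [5, 1, 3, 1],
                 [0, 2, 4, 0]])
print(max_pool_2x2(fmap))
# [[4 3]
#  [5 4]]
```

The four region maxima (4, 3, 5, 4) match the diagram above.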


Receptive field

The receptive field of a neuron is the region of the original input that can influence its value. In the first conv layer, a 3x3 filter has a 3x3 receptive field. After stacking more conv layers, the receptive field grows. After pooling, it grows even faster.

Two stacked 3x3 conv layers have a 5x5 effective receptive field, and three have a 7x7 receptive field. This is why modern architectures prefer multiple small filters over one large filter: same receptive field, fewer parameters, more nonlinearity.

Receptive field growth through layers

graph LR
  A["Input pixel"] --> B["Layer 1: 3x3 conv
Receptive field: 3x3"]
  B --> C["Layer 2: 3x3 conv
Receptive field: 5x5"]
  C --> D["Layer 3: 3x3 conv
Receptive field: 7x7"]
  D --> E["2x2 max pool
Receptive field: 8x8"]

Each 3x3 conv layer (stride 1) adds 2 pixels to the receptive field. The 2x2 pool itself widens it only slightly (7x7 to 8x8 here), but it doubles the effective stride, so every layer after the pool grows the receptive field twice as fast: a 3x3 conv placed after the pool adds 4 pixels instead of 2.
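The growth pattern follows a standard recurrence: each layer widens the receptive field by (k - 1) times the accumulated stride. A small Python sketch (by this arithmetic, the 2x2 pool brings the field from 7x7 to 8x8; the "doubling" shows up in how fast layers after the pool grow):

```python
def receptive_field(layers):
    """Receptive field of the last layer; layers are (kernel, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # widen by (k - 1) steps of the accumulated stride
        jump *= s             # strides compound for all later layers
    return rf

print(receptive_field([(3, 1)] * 2))             # 5: two stacked 3x3 convs
print(receptive_field([(3, 1)] * 3))             # 7: three stacked 3x3 convs
print(receptive_field([(3, 1)] * 3 + [(2, 2)]))  # 8: then a 2x2 pool, stride 2
```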


The CNN pattern

Most CNNs follow a common structure:

graph LR
  A["Input image"] --> B["Conv + ReLU"]
  B --> C["Conv + ReLU"]
  C --> D["Max Pool"]
  D --> E["Conv + ReLU"]
  E --> F["Conv + ReLU"]
  F --> G["Max Pool"]
  G --> H["Flatten"]
  H --> I["FC + ReLU"]
  I --> J["FC + Softmax"]
  J --> K["Class probabilities"]

The pattern: convolve to detect features, activate with ReLU, pool to downsample. Repeat. Then flatten the spatial dimensions into a vector, and use fully connected layers for the final classification. As you go deeper, spatial dimensions shrink while the number of channels grows.


Key architectures

LeNet-5 (1998)

The original CNN for digit recognition. Two conv layers, two pooling layers, three FC layers. Small by today’s standards, but it proved the concept.

AlexNet (2012)

The model that reignited interest in deep learning. 8 layers (5 conv, 3 FC). Key innovations: ReLU activations, dropout for regularization, and training on GPUs. Won the 2012 ImageNet competition by a wide margin.

VGG-16 (2014)

Showed that depth matters. 16 layers, all using 3x3 filters. Simple and uniform architecture. But 138 million parameters, most of them in the FC layers.

ResNet (2015)

The breakthrough that enabled very deep networks (50, 101, even 152 layers). The key idea: skip connections (also called residual connections).

Instead of learning a mapping H(\mathbf{x}) directly, the network learns the residual F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}. The output is:

\mathbf{y} = F(\mathbf{x}) + \mathbf{x}

graph TD
  X["Input x"] --> Conv1["Conv + BN + ReLU"]
  Conv1 --> Conv2["Conv + BN"]
  X --> Skip["Skip connection (identity)"]
  Conv2 --> Add["Add: F(x) + x"]
  Skip --> Add
  Add --> ReLU2["ReLU"]
  ReLU2 --> Out["Output"]

Why this works: the gradient can flow directly through the skip connection, bypassing the conv layers entirely. Even if the conv layers have vanishing gradients, the skip connection provides a highway for gradient flow. This is why ResNets can be 100+ layers deep without training difficulties.

Batch normalization (BN) after each conv layer also stabilizes training by normalizing intermediate activations.
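A training-mode sketch of that normalization step in NumPy (gamma and beta stand in for the learned scale and shift; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta

# Two features on wildly different scales become comparable after BN.
acts = np.array([[1.0, 200.0],
                 [3.0, 400.0],
                 [5.0, 600.0]])
out = batch_norm(acts)
print(out.mean(axis=0))  # ~[0, 0]
print(out.std(axis=0))   # ~[1, 1]
```

Keeping activations in a stable range like this is what lets deeper networks train with higher learning rates.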

Architecture comparison

Evolution of CNN architectures

graph LR
  A["LeNet-5 (1998)
7 layers, 60K params"] --> B["AlexNet (2012)
8 layers, 61M params"]
  B --> C["VGG-16 (2014)
16 layers, 138M params"]
  C --> D["ResNet-50 (2015)
50 layers, 25.6M params"]
  style A fill:#e0f0ff,stroke:#333,color:#000
  style D fill:#c0ffc0,stroke:#333,color:#000

Each generation went deeper. AlexNet proved GPUs could train large CNNs. VGG showed that stacking small 3x3 filters beats fewer large filters. ResNet introduced skip connections that made 50+ layers trainable without degradation.

| Architecture | Year | Depth | Parameters | Key innovation              | Top-5 error (ImageNet) |
|--------------|------|-------|------------|-----------------------------|------------------------|
| LeNet-5      | 1998 | 7     | 60K        | First successful CNN        | N/A (MNIST)            |
| AlexNet      | 2012 | 8     | 61M        | ReLU, dropout, GPU training | 15.3%                  |
| VGG-16       | 2014 | 16    | 138M       | Uniform 3x3 filters         | 7.3%                   |
| ResNet-50    | 2015 | 50    | 25.6M      | Skip connections            | 3.57%                  |

Notice that ResNet-50 has fewer parameters than VGG-16 but is much deeper and more accurate. Skip connections and batch normalization made this possible.

For tasks beyond image classification, these architectures serve as backbones for transfer learning: you take a pretrained ResNet, remove the final FC layer, and fine-tune on your specific task.


Example 1: Convolution by hand

Apply a 3x3 filter to a 5x5 input with stride 1 and no padding.

Input (5x5):

\begin{bmatrix} 1 & 0 & 2 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \end{bmatrix}

Filter (3x3):

\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}

Output size: (5 - 3 + 0)/1 + 1 = 3, so the output is 3x3.

At each position, we overlay the filter on the input patch and sum the element-wise products.

Position (0,0): patch is rows 0-2, cols 0-2:

1 \cdot 1 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 = 1 + 2 + 1 + 1 + 1 = 6

Position (0,1): patch is rows 0-2, cols 1-3:

0 \cdot 1 + 2 \cdot 0 + 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1

Position (0,2): patch is rows 0-2, cols 2-4:

2 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 = 2 + 1 + 1 = 4

Position (1,0): 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1

Position (1,1): 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 = 4

Position (1,2): 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 = 2

Position (2,0): 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 = 3

Position (2,1): 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 = 2

Position (2,2): 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 = 3

Output (3x3):

\begin{bmatrix} 6 & 1 & 4 \\ 1 & 4 & 2 \\ 3 & 2 & 3 \end{bmatrix}

This filter detects a cross-like pattern (nonzero at corners and center). Positions where the input matches this pattern get higher values.
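Example 1 can be checked with a direct NumPy implementation (strictly speaking this is a cross-correlation, which is what deep learning frameworks compute under the name "convolution"):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D cross-correlation with stride 1 and no padding, as in Example 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Overlay the kernel on the patch and sum the element-wise products.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 0, 1, 1],
                  [1, 0, 1, 0, 0],
                  [0, 0, 1, 1, 0],
                  [1, 1, 0, 0, 1]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d_valid(image, kernel))
# [[6. 1. 4.]
#  [1. 4. 2.]
#  [3. 2. 3.]]
```

Real implementations vectorize this loop, but the arithmetic is exactly the hand calculation above.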


Example 2: Tracking dimensions through a CNN

Start with a 224x224x3 input image and pass it through three layers.

Layer 1: Conv with 64 filters of size 3x3, stride 1, padding 1.

H_{\text{out}} = \frac{224 - 3 + 2(1)}{1} + 1 = \frac{223}{1} + 1 = 224

Output: 224 x 224 x 64. Same spatial size (thanks to padding), but now 64 channels.

Layer 2: Max pooling with 2x2 window, stride 2.

H_{\text{out}} = \frac{224 - 2}{2} + 1 = 112

Output: 112 x 112 x 64. Spatial dimensions halved. Channel count unchanged.

Layer 3: Conv with 128 filters of size 3x3, stride 1, padding 1.

H_{\text{out}} = \frac{112 - 3 + 2(1)}{1} + 1 = 112

Output: 112 x 112 x 128. Spatial size preserved, channels doubled.

| Layer | Operation                 | Output shape     | Parameters                            |
|-------|---------------------------|------------------|---------------------------------------|
| Input | -                         | 224 x 224 x 3    | 0                                     |
| Conv1 | 3x3, 64 filters, s1, p1   | 224 x 224 x 64   | (3 x 3 x 3) x 64 + 64 = 1,792         |
| Pool1 | 2x2 max pool, s2          | 112 x 112 x 64   | 0                                     |
| Conv2 | 3x3, 128 filters, s1, p1  | 112 x 112 x 128  | (3 x 3 x 64) x 128 + 128 = 73,856     |

Notice how spatial dimensions decrease (224 to 112) while depth increases (3 to 64 to 128). This is the classic CNN trade-off: compress spatial information while expanding feature representations.
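The same bookkeeping can be automated. A small Python sketch of the shape and parameter arithmetic used in the table above:

```python
def conv_layer(shape, filters, f=3, s=1, p=1):
    """Output shape and parameter count for a conv layer on an (H, W, C) input."""
    h, w, c = shape
    out = ((h - f + 2 * p) // s + 1, (w - f + 2 * p) // s + 1, filters)
    return out, (f * f * c) * filters + filters  # weights per filter, plus biases

def pool_layer(shape, f=2, s=2):
    """Output shape for max pooling; pooling has no parameters."""
    h, w, c = shape
    return ((h - f) // s + 1, (w - f) // s + 1, c), 0

shape = (224, 224, 3)
shape, p1 = conv_layer(shape, 64)    # (224, 224, 64), 1,792 params
shape, _ = pool_layer(shape)         # (112, 112, 64)
shape, p2 = conv_layer(shape, 128)   # (112, 112, 128), 73,856 params
print(shape, p1, p2)  # (112, 112, 128) 1792 73856
```

Tracing shapes like this before training is a cheap way to catch dimension mismatches.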


Example 3: ResNet skip connection

Trace a forward pass through a residual block with specific values.

Input: \mathbf{x} = [1.0, -0.5] (2D vector for simplicity)

Block has two layers (no bias, using ReLU):

W_1 = \begin{bmatrix} 0.2 & 0.1 \\ -0.1 & 0.3 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.3 & -0.2 \\ 0.1 & 0.4 \end{bmatrix}

Layer 1: Linear + ReLU.

\mathbf{z}_1 = W_1 \mathbf{x} = \begin{bmatrix} 0.2(1.0) + 0.1(-0.5) \\ -0.1(1.0) + 0.3(-0.5) \end{bmatrix} = \begin{bmatrix} 0.15 \\ -0.25 \end{bmatrix}

\mathbf{a}_1 = \text{ReLU}(\mathbf{z}_1) = \begin{bmatrix} 0.15 \\ 0 \end{bmatrix}

Layer 2: Linear only (ReLU comes after the skip).

F(\mathbf{x}) = W_2 \mathbf{a}_1 = \begin{bmatrix} 0.3(0.15) + (-0.2)(0) \\ 0.1(0.15) + 0.4(0) \end{bmatrix} = \begin{bmatrix} 0.045 \\ 0.015 \end{bmatrix}

Add the skip connection:

F(\mathbf{x}) + \mathbf{x} = \begin{bmatrix} 0.045 + 1.0 \\ 0.015 + (-0.5) \end{bmatrix} = \begin{bmatrix} 1.045 \\ -0.485 \end{bmatrix}

Final ReLU:

\text{output} = \text{ReLU}\left(\begin{bmatrix} 1.045 \\ -0.485 \end{bmatrix}\right) = \begin{bmatrix} 1.045 \\ 0 \end{bmatrix}

Without skip connection: the output would be just F(\mathbf{x}) = [0.045, 0.015], a tiny signal.

With skip connection: the output is [1.045, 0], preserving the original input’s magnitude. The skip connection ensures that even if the learned transformation F(\mathbf{x}) is small, the signal is not lost. During backpropagation, the gradient flows directly through the addition, making deep training stable.
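The whole block is a few lines of NumPy; this sketch reproduces the numbers above:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Forward pass of the toy residual block from Example 3 (no bias, no BN)."""
    a1 = np.maximum(w1 @ x, 0)    # layer 1: linear + ReLU
    fx = w2 @ a1                  # layer 2: linear only (F(x))
    return np.maximum(fx + x, 0)  # add the skip connection, then final ReLU

x = np.array([1.0, -0.5])
w1 = np.array([[0.2, 0.1], [-0.1, 0.3]])
w2 = np.array([[0.3, -0.2], [0.1, 0.4]])
print(residual_block(x, w1, w2))  # ≈ [1.045, 0.0]
```

Dropping the `+ x` term reproduces the "without skip" case: the output collapses to the tiny residual signal.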


What comes next

CNNs handle spatial data by exploiting local patterns and translation equivariance. But many problems involve sequential data: text, speech, time series. The order of elements matters, and the sequence length can vary.

Recurrent neural networks and LSTMs tackle this by introducing memory: a hidden state that carries information from one time step to the next.
