
ResNet: deep residual learning


Prerequisites

You should understand:


The problem: deeper is not always better

In theory, a deeper network should perform at least as well as a shallower one. If the extra layers learned the identity mapping (just pass the input through unchanged), the deeper network would behave exactly like the shallower one. In practice, this does not happen.

When researchers stacked more layers onto standard CNNs, they observed something unexpected: training accuracy got worse, not just test accuracy. This is not overfitting — an overfit model would have high training accuracy but low test accuracy. This is the degradation problem: the optimization itself struggles when networks get deep.

In the original ResNet paper's CIFAR-10 experiment, a 56-layer plain network has higher training error than a 20-layer one. Both the training and test curves are worse. The deeper network cannot even learn what the shallower network learned. This is not a capacity problem: the deeper network has strictly more capacity. The problem is that standard gradient-based optimization fails to find good weights when the network is too deep.

Why does this happen?

Consider a 50-layer plain network. During backpropagation, the gradient must flow back through all 50 layers. At each layer, the gradient is multiplied by the layer’s weight matrix and the derivative of the activation function. If these multiplications consistently produce values less than 1, the gradient shrinks exponentially — this is the vanishing gradient problem. By the time the gradient reaches the early layers, it is essentially zero. Those layers stop learning.

The opposite can also happen: if the multiplications produce values greater than 1, the gradient grows exponentially (exploding gradients). While techniques like batch normalization and careful initialization help, they don’t fully solve the problem for very deep networks.
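
To get a feel for how quickly repeated multiplication shrinks or inflates a gradient, here is a minimal sketch. It assumes every layer scales the gradient by the same scalar factor, which is a deliberate simplification of the per-layer Jacobians; the function name and the factors 0.8 and 1.2 are illustrative choices, not values from a real network.

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Scale applied to a gradient after passing back through `depth` layers,
    assuming every layer multiplies it by the same scalar factor."""
    return per_layer_factor ** depth


for depth in (10, 20, 50):
    shrink = gradient_scale(0.8, depth)  # factor below 1: gradient vanishes
    grow = gradient_scale(1.2, depth)    # factor above 1: gradient explodes
    print(f"depth={depth:2d}  0.8^depth={shrink:.2e}  1.2^depth={grow:.2e}")
```

Even a modest per-layer factor compounds into several orders of magnitude once the depth reaches a few dozen layers.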


The residual learning idea

The CNN post introduced skip connections briefly. Here we go deeper into why they work and how they are designed.

The core insight

Instead of asking a block of layers to learn the desired output $H(\mathbf{x})$ directly, let them learn only the residual: the difference between the desired output and the input.

$$F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$$

The block’s output becomes:

$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$

where $F(\mathbf{x})$ is whatever the stacked layers compute, and $\mathbf{x}$ is added back via a shortcut connection (also called a skip connection).

Why is this easier to optimize?

Think about it this way. If the optimal transformation for a block is close to the identity (just pass the input through), then:

  • A plain network must learn weights such that $H(\mathbf{x}) = \mathbf{x}$. This requires the weight matrices to converge to an identity-like configuration, which is not trivial.
  • A residual network only needs $F(\mathbf{x}) = 0$. Driving weights toward zero is much easier: it is the natural tendency of weight decay and initialization near zero.

In practice, most layers in a deep network need to make only small modifications to their input. Residual learning makes “do nothing” the default, and the network only has to learn the small deviations.
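
The "do nothing by default" argument can be checked on a toy linear layer. This is a sketch under simplifying assumptions: a single bias-free linear map stands in for the stacked layers, and the helper names are made up for illustration.

```python
def plain_layer(x, weights):
    """A plain linear layer: y = W x (no shortcut)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def residual_layer(x, weights):
    """The same layer wrapped in a shortcut: y = W x + x."""
    fx = plain_layer(x, weights)
    return [f + xi for f, xi in zip(fx, x)]


x = [2.0, -1.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]
identity_weights = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(plain_layer(x, zero_weights))      # all zeros: the input is lost
print(plain_layer(x, identity_weights))  # identity needs one specific W
print(residual_layer(x, zero_weights))   # identity with all-zero weights
```

The plain layer reproduces its input only for one particular weight matrix, while the residual layer does so whenever the weights are zero.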

Gradient flow through residual blocks

This is the mathematical reason ResNets train so well. Consider backpropagation through a residual block:

$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$

The gradient of the loss $\mathcal{L}$ with respect to $\mathbf{x}$ is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left( \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + \mathbf{I} \right)$$

where $\mathbf{I}$ is the identity matrix. The gradient has two paths:

  1. Through the residual function: $\frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}}$. This can vanish if $F$ has small gradients
  2. Through the skip connection: $\frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \mathbf{I}$. This always passes the gradient through unchanged

Even if the residual path vanishes completely, the skip connection guarantees gradient flow. With $L$ residual blocks stacked, the gradient at the input of block 1 still has a direct path through all $L$ skip connections. This is why ResNets can be 152 layers deep without degradation.
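
The two-path gradient can be verified numerically in the scalar case. The sketch below assumes a hypothetical residual branch $F(x) = w \tanh(x)$, chosen only because it saturates for large inputs, and estimates derivatives with central finite differences.

```python
import math


def residual_fn(x: float, w: float) -> float:
    """Hypothetical scalar residual branch F(x) = w * tanh(x)."""
    return w * math.tanh(x)


def block(x: float, w: float) -> float:
    """Residual block y = F(x) + x."""
    return residual_fn(x, w) + x


def numerical_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 3.0  # tanh is nearly saturated here, so dF/dx is small
for w in (1.0, 0.01, 0.0):
    grad_residual = numerical_grad(lambda v: residual_fn(v, w), x)
    grad_block = numerical_grad(lambda v: block(v, w), x)
    print(f"w={w:<5} dF/dx={grad_residual:.5f}  dy/dx={grad_block:.5f}")
```

In each case the block's derivative is the residual derivative plus one, so it stays close to 1 even when the residual branch contributes almost nothing.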


The residual block

A residual block is the fundamental building unit of ResNet. There are two types.

Basic block (used in ResNet-18 and ResNet-34)

The basic block has two $3 \times 3$ convolution layers with batch normalization and ReLU activation:

graph LR
  X["x (input)"] --> Conv1["3×3 Conv, BN, ReLU"]
  Conv1 --> Conv2["3×3 Conv, BN"]
  X --> Add["⊕"]
  Conv2 --> Add
  Add --> ReLU2["ReLU"]
  ReLU2 --> Y["y (output)"]
  style X fill:#e8f4f8,stroke:#2196F3
  style Y fill:#e8f4f8,stroke:#2196F3
  style Add fill:#fff3e0,stroke:#FF9800

In equations:

$$\mathbf{y} = \text{ReLU}\big(\text{BN}(\text{Conv}_2(\text{ReLU}(\text{BN}(\text{Conv}_1(\mathbf{x}))))) + \mathbf{x}\big)$$

Notice that ReLU is applied after the addition. This is important — applying ReLU before the addition would break the identity shortcut, because ReLU would clip negative values in the residual before they could cancel with the shortcut.

Bottleneck block (used in ResNet-50, 101, 152)

For deeper networks, the basic block is too expensive. A $3 \times 3$ convolution on 256 channels involves $256 \times 256 \times 3 \times 3 \approx 590\text{K}$ parameters per layer. The bottleneck block reduces this by using three layers:

  1. 1×1 Conv: reduce channels (e.g., 256 → 64) — the “bottleneck”
  2. 3×3 Conv: process at reduced dimensionality (64 channels)
  3. 1×1 Conv: expand back (64 → 256)
graph LR
  X["x (256-ch)"] --> Conv1["1×1 Conv, BN, ReLU<br/>256 → 64"]
  Conv1 --> Conv2["3×3 Conv, BN, ReLU<br/>64 → 64"]
  Conv2 --> Conv3["1×1 Conv, BN<br/>64 → 256"]
  X --> Add["⊕"]
  Conv3 --> Add
  Add --> ReLU3["ReLU"]
  ReLU3 --> Y["y (256-ch)"]
  style X fill:#e8f4f8,stroke:#2196F3
  style Y fill:#e8f4f8,stroke:#2196F3
  style Add fill:#fff3e0,stroke:#FF9800

Parameter comparison for layers operating on 256 channels:

  • Basic block: $2 \times (256 \times 256 \times 3 \times 3) = 1{,}179{,}648$ parameters
  • Bottleneck block: $(256 \times 64 \times 1 \times 1) + (64 \times 64 \times 3 \times 3) + (64 \times 256 \times 1 \times 1) = 16{,}384 + 36{,}864 + 16{,}384 = 69{,}632$ parameters

The bottleneck block uses 17× fewer parameters while having three layers instead of two. This is how ResNet-50 can have 50 layers and still be computationally tractable.
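
The comparison above is easy to reproduce. A small sketch, counting convolution weights only (no biases, no batch norm); `conv_params` is a helper name introduced here for illustration.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return c_in * c_out * kernel * kernel


basic = 2 * conv_params(256, 256, 3)
bottleneck = (
    conv_params(256, 64, 1)     # 1x1 reduce
    + conv_params(64, 64, 3)    # 3x3 at reduced width
    + conv_params(64, 256, 1)   # 1x1 expand
)

print(basic)                         # 1179648
print(bottleneck)                    # 69632
print(round(basic / bottleneck, 1))  # 16.9
```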

Handling dimension mismatches

The addition $F(\mathbf{x}) + \mathbf{x}$ requires that $F(\mathbf{x})$ and $\mathbf{x}$ have the same dimensions. This works fine when the input and output have the same number of channels and spatial size. But when the block changes the number of channels or uses a stride to reduce spatial resolution, the shortcut needs a projection:

$$\mathbf{y} = F(\mathbf{x}) + W_s \mathbf{x}$$

where $W_s$ is a $1 \times 1$ convolution that matches the dimensions. For example, when going from 64 channels to 128 channels with stride 2, the projection is a $1 \times 1$ conv with 128 filters and stride 2.
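
As a rough check of the projection example, the sketch below computes the output shape and weight count of the shortcut. It uses the standard convolution output-size formula and assumes a $56 \times 56$ input with no padding on the $1 \times 1$ conv; the helper names are illustrative.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def projection_shortcut(c_in: int, c_out: int, size: int, stride: int):
    """Output shape and weight count of a 1x1 projection shortcut."""
    out_size = conv_output_size(size, kernel=1, stride=stride)
    weights = c_in * c_out  # 1x1 kernel, biases ignored
    return (out_size, out_size, c_out), weights


shape, weights = projection_shortcut(c_in=64, c_out=128, size=56, stride=2)
print(shape)    # (28, 28, 128)
print(weights)  # 8192
```

The projection is cheap compared with the $3 \times 3$ convolutions in the residual branch, which is why it is only used where the shapes actually change.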


ResNet architectures

All ResNet variants share the same overall structure:

  1. Initial convolution: $7 \times 7$ conv with 64 filters, stride 2, followed by batch norm, ReLU, and $3 \times 3$ max pool with stride 2. This reduces a $224 \times 224$ input to $56 \times 56$.
  2. Four stages of residual blocks, doubling channels and halving spatial resolution at each stage transition: $56 \times 56 \times 64 \to 28 \times 28 \times 128 \to 14 \times 14 \times 256 \to 7 \times 7 \times 512$ (channel counts for the basic-block variants; bottleneck stages output 4× as many channels)
  3. Global average pooling: reduces $7 \times 7 \times 512$ to a 512-dimensional vector (or 2048 for bottleneck variants)
  4. Fully connected layer: 512 (or 2048) → number of classes

| Architecture | Block type | Blocks per stage | Total layers | Parameters |
|---|---|---|---|---|
| ResNet-18 | Basic | [2, 2, 2, 2] | 18 | 11.7M |
| ResNet-34 | Basic | [3, 4, 6, 3] | 34 | 21.8M |
| ResNet-50 | Bottleneck | [3, 4, 6, 3] | 50 | 25.6M |
| ResNet-101 | Bottleneck | [3, 4, 23, 3] | 101 | 44.5M |
| ResNet-152 | Bottleneck | [3, 8, 36, 3] | 152 | 60.2M |

ResNet family — all trained on ImageNet

How to count layers: each basic block has 2 conv layers, each bottleneck block has 3. For ResNet-50: $(3+4+6+3) \times 3 = 48$ conv layers in residual blocks $+\,1$ initial conv $+\,1$ FC layer $= 50$. The $1 \times 1$ projection shortcuts are not included in the count.
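
The counting rule can be applied to every variant in the table. A short sketch; the dictionary layout and function name are just one convenient way to organize it.

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

RESNETS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}


def count_layers(block_type: str, blocks_per_stage: list) -> int:
    """Initial conv + convs inside residual blocks + final FC layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1


for name, (block_type, blocks) in RESNETS.items():
    print(name, count_layers(block_type, blocks))
```

Each computed depth matches the number in the architecture's name.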

ResNet-18 detailed architecture

Let’s trace the full structure of ResNet-18 to make it concrete:

| Stage | Output size | Layer details | Blocks |
|---|---|---|---|
| Input | 224 × 224 × 3 | | |
| Conv1 | 112 × 112 × 64 | 7×7 conv, stride 2, BN, ReLU | |
| Pool | 56 × 56 × 64 | 3×3 max pool, stride 2 | |
| Stage 1 | 56 × 56 × 64 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 2 | 28 × 28 × 128 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 3 | 14 × 14 × 256 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 4 | 7 × 7 × 512 | 3×3 conv → 3×3 conv (×2) | 2 |
| Avg Pool | 1 × 1 × 512 | Global average pooling | |
| FC | 1000 | Fully connected | |

ResNet-18 layer-by-layer breakdown

The first block in stages 2, 3, and 4 uses stride 2 in its first convolution to halve the spatial resolution, and a $1 \times 1$ projection shortcut to match the increased channel count.
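
The spatial sizes in the table follow from the usual convolution output-size formula. The sketch below traces them; the padding values (3 for the $7 \times 7$ conv, 1 for the pool and the $3 \times 3$ convs) are assumptions, since the text above does not state them, and only the first convolution of each stage is modeled because the remaining ones keep the size unchanged.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_spatial_sizes(input_size: int = 224) -> dict:
    """Trace the feature-map side length through ResNet-18 (assumed paddings)."""
    sizes = {"input": input_size}
    size = conv_output_size(input_size, kernel=7, stride=2, padding=3)
    sizes["conv1"] = size
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes["pool"] = size
    # Stage 1 keeps the resolution; stages 2-4 start with a stride-2 conv.
    for stage, stride in enumerate([1, 2, 2, 2], start=1):
        size = conv_output_size(size, kernel=3, stride=stride, padding=1)
        sizes[f"stage{stage}"] = size
    return sizes


print(resnet18_spatial_sizes())
```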


Example 1: tracing a forward pass through a basic block

Let’s work through concrete numbers. Consider a basic residual block with input $\mathbf{x} = [2.0, -1.0, 0.5, 1.5]$ (a simplified 4-dimensional input for clarity).

Layer 1: $3 \times 3$ Conv → BN → ReLU

Suppose after convolution and batch normalization, we get:

$$\text{BN}(\text{Conv}_1(\mathbf{x})) = [0.8, -0.3, 1.2, -0.5]$$

After ReLU (clamp negatives to zero):

$$\text{ReLU}([0.8, -0.3, 1.2, -0.5]) = [0.8, 0, 1.2, 0]$$

Layer 2: $3 \times 3$ Conv → BN (no ReLU yet)

$$\text{BN}(\text{Conv}_2([0.8, 0, 1.2, 0])) = [0.1, -0.2, 0.3, -0.1]$$

This is $F(\mathbf{x})$, the residual.

Add the skip connection:

$$F(\mathbf{x}) + \mathbf{x} = [0.1, -0.2, 0.3, -0.1] + [2.0, -1.0, 0.5, 1.5] = [2.1, -1.2, 0.8, 1.4]$$

Final ReLU:

$$\mathbf{y} = \text{ReLU}([2.1, -1.2, 0.8, 1.4]) = [2.1, 0, 0.8, 1.4]$$

Notice how the residual $F(\mathbf{x})$ is small (magnitudes around 0.1 to 0.3). The block makes a minor adjustment to the input rather than computing the output from scratch. The skip connection preserves the input’s magnitude even when the learned residual is tiny.

Without the skip connection, the output would just be $\text{ReLU}([0.1, -0.2, 0.3, -0.1]) = [0.1, 0, 0.3, 0]$, and most of the input information would be lost.
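
The same trace in a few lines of code. Since the example supplies the post-batch-norm values directly rather than actual convolution weights, the sketch takes them as given constants and only performs the ReLU and addition steps.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def add(u, v):
    """Element-wise sum of two equal-length lists."""
    return [a + b for a, b in zip(u, v)]


x = [2.0, -1.0, 0.5, 1.5]
bn1_out = [0.8, -0.3, 1.2, -0.5]    # BN(Conv1(x)), taken from the example
hidden = relu(bn1_out)              # input to the second conv
residual = [0.1, -0.2, 0.3, -0.1]   # BN(Conv2(hidden)), taken from the example

y_residual = relu(add(residual, x))  # with the skip connection
y_plain = relu(residual)             # without the skip connection

print([round(a, 2) for a in y_residual])
print([round(a, 2) for a in y_plain])
```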


Example 2: gradient flow comparison

Consider 20 layers stacked together. For each layer $l$, the gradient is multiplied by $\frac{\partial \mathbf{y}_l}{\partial \mathbf{x}_l}$.

Plain network: the gradient through 20 layers is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_{20}} \prod_{l=1}^{20} \frac{\partial H_l}{\partial \mathbf{x}_l}$$

If each factor has magnitude around 0.8 (less than 1), the gradient shrinks by a factor of $0.8^{20} \approx 0.012$. Only 1.2% of the gradient survives. At 50 layers: $0.8^{50} \approx 1.4 \times 10^{-5}$. The early layers are effectively frozen.

Residual network: each block contributes $\frac{\partial F_l}{\partial \mathbf{x}_l} + \mathbf{I}$. The gradient through 20 blocks expands to a sum of $2^{20}$ terms (each term corresponds to a path that either goes through or skips each block). Even if many terms are small, the term that skips all residual paths (the product of all identity shortcuts) contributes:

$$\prod_{l=1}^{20} \mathbf{I} = \mathbf{I}$$

This means the gradient has a direct path from the loss to the earliest layers with no attenuation. In practice, the gradient is a mix of short and long paths, providing a rich signal at every depth.
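
A scalar version of this comparison makes the path expansion concrete. The sketch below replaces every Jacobian with a single number, which is a strong simplification; the residual factor of 0.1 is a hypothetical value chosen for illustration. Paths are grouped by how many residual branches they pass through, so term $k$ is $\binom{20}{k} f^k$.

```python
import math


def plain_gradient_scale(factor: float, depth: int) -> float:
    """Product of `depth` identical per-layer factors."""
    return factor ** depth


def residual_path_terms(residual_factor: float, depth: int) -> list:
    """Expansion of (1 + f)^depth, grouped by the number of residual
    branches a path uses: term k = C(depth, k) * f^k."""
    return [math.comb(depth, k) * residual_factor ** k for k in range(depth + 1)]


depth = 20
terms = residual_path_terms(residual_factor=0.1, depth=depth)
n_paths = sum(math.comb(depth, k) for k in range(depth + 1))

print(plain_gradient_scale(0.8, depth))  # about 0.0115
print(n_paths)                           # 1048576, i.e. 2**20
print(terms[0])                          # 1.0: the all-shortcut path
```

The first term is exactly 1 regardless of the residual factor, which is the scalar counterpart of the identity product above.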


Batch normalization placement

The original ResNet uses post-activation batch normalization: Conv → BN → ReLU. Later research (He et al., 2016, “Identity Mappings in Deep Residual Networks”) showed that pre-activation works better for very deep networks:

Original (post-activation): $\mathbf{y} = \text{ReLU}(\text{BN}(\text{Conv}(\mathbf{x})) + \mathbf{x})$

Pre-activation: $\mathbf{y} = \text{Conv}(\text{ReLU}(\text{BN}(\mathbf{x}))) + \mathbf{x}$

In the pre-activation design, the shortcut path is a clean identity — nothing (no BN, no ReLU) sits on the shortcut. This makes the gradient flow even cleaner. Pre-activation ResNets showed improvements especially for networks deeper than 100 layers (ResNet-200 and beyond).
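
The difference in the shortcut path shows up even in a scalar toy model. In this sketch, batch norm is treated as the identity and the convolution as multiplication by a single weight `w`; both are simplifying assumptions, and the function names are made up. With `w = 0` the residual branch is silent, so whatever happens to the signal is caused by the main path alone.

```python
def relu(v: float) -> float:
    return max(0.0, v)


def post_activation_block(x: float, w: float) -> float:
    """Original ordering: y = ReLU(BN(Conv(x)) + x), with BN as identity."""
    return relu(w * x + x)


def pre_activation_block(x: float, w: float) -> float:
    """Pre-activation ordering: y = Conv(ReLU(BN(x))) + x, with BN as identity."""
    return w * relu(x) + x


def stack(block, x: float, w: float, depth: int) -> float:
    """Apply the same block `depth` times."""
    for _ in range(depth):
        x = block(x, w)
    return x


x = -1.5
print(stack(post_activation_block, x, 0.0, 10))  # 0.0: the ReLU after the add clips it
print(stack(pre_activation_block, x, 0.0, 10))   # -1.5: passed through untouched
```

For non-negative inputs both orderings pass the signal through unchanged in this toy setup; the negative input is used only to make the ReLU on the main path visible.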


Example 3: parameter count of a bottleneck block

Let’s count parameters for one bottleneck block in ResNet-50’s conv3_x stage (Stage 2 in the numbering used above), where the input has 512 channels and the bottleneck narrows to 128 channels, with output of 512 channels:

1×1 Conv (reduce): $512 \times 128 \times 1 \times 1 = 65{,}536$ weights, plus 128 biases (usually omitted when BN follows, and not counted below)

3×3 Conv (process): $128 \times 128 \times 3 \times 3 = 147{,}456$ weights

1×1 Conv (expand): $128 \times 512 \times 1 \times 1 = 65{,}536$ weights

Batch norm (per layer): $2 \times C$ parameters (scale $\gamma$ and shift $\beta$): $2 \times 128 + 2 \times 128 + 2 \times 512 = 1{,}792$

Total for one block: $65{,}536 + 147{,}456 + 65{,}536 + 1{,}792 = 280{,}320$ parameters.

This stage has 4 such blocks in ResNet-50 (the second entry of [3, 4, 6, 3]), so it holds roughly $4 \times 280{,}320 \approx 1.12\text{M}$ parameters. This is an approximation: the first block of the stage receives only 256 input channels and adds a projection shortcut.

Compare this to 4 basic blocks at 512 channels: $4 \times 2 \times (512 \times 512 \times 3 \times 3) = 4 \times 4{,}718{,}592 \approx 18.9\text{M}$ parameters, about 17× more.


Why ResNet works: multiple perspectives

1. Ensemble of shallow networks

Veit et al. (2016) showed that a ResNet can be viewed as an ensemble of many paths of different lengths. Unrolling a network with $n$ residual blocks produces $2^n$ paths (at each block, the signal can go through the residual function or skip it). Experiments showed that:

  • Most of the gradient flows through short paths (5–17 blocks in a 110-layer network)
  • Deleting a single residual block (so the signal can only take the skip connection at that point) causes only a small accuracy drop, unlike plain networks, where removing a layer is catastrophic
  • The network behaves less like a single deep pipeline and more like an ensemble of many moderately deep networks
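
The path count is a binomial: with $n$ blocks there are $\binom{n}{k}$ paths that pass through exactly $k$ residual branches. The sketch below tabulates this for 54 blocks, which is an assumption derived from a 110-layer network built from two-layer blocks, (110 − 2) / 2 = 54.

```python
import math


def path_length_counts(n_blocks: int) -> list:
    """counts[k] = number of paths that use exactly k residual branches."""
    return [math.comb(n_blocks, k) for k in range(n_blocks + 1)]


n_blocks = 54  # assumed: 110 layers = 54 two-layer blocks + first conv + FC
counts = path_length_counts(n_blocks)
total = sum(counts)

most_common_length = max(range(n_blocks + 1), key=lambda k: counts[k])
share_short = sum(counts[5:18]) / total  # paths through 5 to 17 blocks

print(total == 2 ** n_blocks)   # True
print(most_common_length)       # 27
print(f"{share_short:.4f}")
```

Paths of 5 to 17 blocks make up only a small share of all paths, which is what makes the finding that they carry most of the gradient notable.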

2. Smooth loss landscape

Li et al. (2018) visualized the loss landscapes of plain and residual networks. Plain deep networks have chaotic, non-convex landscapes full of sharp minima. ResNets have much smoother landscapes — the skip connections prevent the loss surface from becoming too rough, making optimization easier.

3. Feature refinement

Each residual block refines the representation incrementally. Early blocks handle low-level features (edges, textures), and each subsequent block adds a small correction. This is more stable than asking each layer to transform the representation completely.


Common ResNet variants

ResNeXt

Instead of a single $3 \times 3$ conv in the bottleneck, ResNeXt uses grouped convolutions: $C$ parallel paths (the number of paths is called the “cardinality”) that each process a subset of channels:

$$F(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})$$

ResNeXt-50 (32×4d) uses 32 groups, each with 4 channels in the bottleneck. This gives better accuracy than ResNet-50 at similar computational cost. The ResNeXt authors report that increasing cardinality (the number of groups) is more effective than making the network deeper or wider.
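
A grouped convolution splits the channels into independent groups, so its weight count drops by a factor equal to the number of groups. The sketch below compares a ResNet bottleneck (256 → 64 → 256) with a 32×4d ResNeXt block; the bottleneck width of 128 comes from 32 groups × 4 channels, while the 256 outer channels are assumed to match the earlier bottleneck example. Biases and batch norm are ignored.

```python
def grouped_conv_weights(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Each group maps c_in/groups input channels to c_out/groups outputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * kernel * kernel * groups


resnet_block = (
    grouped_conv_weights(256, 64, 1)
    + grouped_conv_weights(64, 64, 3)
    + grouped_conv_weights(64, 256, 1)
)
resnext_block = (
    grouped_conv_weights(256, 128, 1)
    + grouped_conv_weights(128, 128, 3, groups=32)  # 32 groups of 4 channels
    + grouped_conv_weights(128, 256, 1)
)

print(resnet_block)   # 69632
print(resnext_block)  # 70144
```

The two blocks land within about 1% of each other in weight count, even though the ResNeXt block is twice as wide in the middle.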

Wide ResNets (WRN)

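Zagoruyko & Komodakis (2016) found that making residual blocks wider (more channels per block) is more efficient than making the network deeper. On CIFAR, they report that a WRN-28-10 (28 layers, 10× width multiplier) outperforms the 1001-layer ResNet with far fewer layers, and that wide networks of comparable accuracy train several times faster than their thin counterparts.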

SE-ResNet (Squeeze-and-Excitation)

Adds a lightweight channel attention mechanism to each residual block, applied to the output of the residual branch before it is added to the shortcut. A squeeze-and-excitation module:

  1. Squeeze: global average pool to get one number per channel
  2. Excitation: two FC layers that output a weight per channel
  3. Scale: multiply each channel by its weight

This lets the network learn to emphasize informative channels and suppress less useful ones. SE-ResNet-50 beats ResNet-50 with only ~10% more parameters.
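
The three steps can be sketched on a toy feature map in plain Python. Several details here are assumptions rather than something stated above: the ReLU between the two FC layers and the sigmoid at the end follow the usual squeeze-and-excitation design, biases are omitted, and the weight matrices are arbitrary hand-picked values.

```python
import math


def squeeze(feature_map):
    """Global average pool: one number per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def dense(v, weights):
    """Bias-free fully connected layer."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def excitation(z, w_reduce, w_expand):
    """FC -> ReLU -> FC -> sigmoid, giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(z, w_reduce)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w_expand)]


def scale(feature_map, gates):
    """Multiply every value in a channel by that channel's gate."""
    return [[g * v for v in channel] for channel, g in zip(feature_map, gates)]


# Toy input: 4 channels, each flattened to 4 spatial positions.
features = [[1.0] * 4, [4.0] * 4, [0.0] * 4, [2.0] * 4]
w_reduce = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]        # 4 -> 2, hypothetical
w_expand = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]   # 2 -> 4, hypothetical

gates = excitation(squeeze(features), w_reduce, w_expand)
rescaled = scale(features, gates)
print([round(g, 3) for g in gates])
```

Channels with gates near 1 pass through almost unchanged, while channels with small gates are suppressed.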


ImageNet results

ResNet’s impact on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

| Year | Model | Top-5 error | Layers |
|---|---|---|---|
| 2012 | AlexNet | 16.4% | 8 |
| 2014 | VGGNet-19 | 7.3% | 19 |
| 2014 | GoogLeNet | 6.7% | 22 |
| 2015 | ResNet-152 | 3.6% | 152 |
| 2015 | Human performance | ~5.1% | |

ImageNet classification — top-5 error rate over the years

ResNet’s winning ILSVRC 2015 entry reached 3.57% top-5 error, below the commonly cited human estimate of about 5.1%. The winning entry was an ensemble of residual networks that included 152-layer models.


Practical usage

When to use which variant

  • ResNet-18 / 34: good starting points for smaller datasets, fast training. Use the basic block.
  • ResNet-50: the standard workhorse. Most pretrained model libraries default to this. Bottleneck blocks give a good accuracy-efficiency tradeoff.
  • ResNet-101 / 152: for problems where you need maximum accuracy and have enough data and compute.

Transfer learning

ResNets are among the most common backbones for transfer learning. A typical workflow:

  1. Load a ResNet-50 pretrained on ImageNet
  2. Remove the final FC layer (the 1000-class classifier)
  3. Add your own classifier head (e.g., FC → softmax over your number of classes)
  4. Fine-tune: freeze early layers, train later layers + your head, then optionally unfreeze everything with a small learning rate

ResNet features transfer well because the residual blocks learn hierarchical representations: edges → textures → parts → objects. The early features are generic enough to be useful across very different domains (medical images, satellite imagery, industrial defect detection).

Implementation notes

  • Initialization: use He initialization (fan-in, normal) for conv layers. The residual path benefits from starting with small outputs.
  • Learning rate schedule: warm up for a few epochs, then use cosine decay or step decay (divide by 10 at epochs 30, 60, 90 for ImageNet).
  • Data augmentation: random crop, horizontal flip, and color jitter are standard. Deeper ResNets benefit more from stronger augmentation (Cutout, MixUp, CutMix).
  • Regularization: a small weight decay is standard (the original ResNet paper uses $1 \times 10^{-4}$; $5 \times 10^{-4}$ is also common, for example in Wide ResNets). Deeper variants may benefit from dropout or stochastic depth (randomly dropping entire residual blocks during training).
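
The two learning-rate schedules mentioned above can be written as plain functions of the epoch. This is a minimal sketch: the base rate of 0.1, the 5 warmup epochs, and the 100-epoch budget are hypothetical values, and real training code would normally use the scheduler classes of the chosen framework instead.

```python
import math


def warmup_cosine_lr(epoch: int, base_lr: float, warmup_epochs: int, total_epochs: int) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def step_lr(epoch: int, base_lr: float, milestones=(30, 60, 90), gamma: float = 0.1) -> float:
    """Multiply the learning rate by gamma at each milestone epoch."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 4, 5, 30, 60, 99):
    cosine = warmup_cosine_lr(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=100)
    step = step_lr(epoch, base_lr=0.1)
    print(f"epoch {epoch:2d}  cosine={cosine:.5f}  step={step:.5f}")
```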

Key takeaways

  1. Deeper plain networks degrade — they have worse training accuracy, not just test accuracy. This is an optimization problem, not an overfitting problem.
  2. Residual learning reformulates the problem: instead of learning $H(\mathbf{x})$, learn $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$. This makes “do nothing” the easy default.
  3. Skip connections provide a direct gradient highway through arbitrarily deep networks. The identity term in $\frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ guarantees gradient flow.
  4. Bottleneck blocks (1×1 → 3×3 → 1×1) make very deep networks computationally feasible by reducing channel dimensions.
  5. ResNet-50 is the standard workhorse for transfer learning and is the most commonly used pretrained backbone.
  6. ResNets behave like ensembles of many shallow paths, which explains their robustness and trainability.