ResNet: deep residual learning
Prerequisites
You should understand:
- Convolutional neural networks: convolution, pooling, feature maps, and the basics of skip connections
- Training neural networks: backpropagation, gradient flow, and batch normalization
The problem: deeper is not always better
In theory, a deeper network should perform at least as well as a shallower one. If the extra layers learned the identity mapping (just pass the input through unchanged), the deeper network would behave exactly like the shallower one. In practice, this does not happen.
When researchers stacked more layers onto standard CNNs, they observed something unexpected: training accuracy got worse, not just test accuracy. This is not overfitting — an overfit model would have high training accuracy but low test accuracy. This is the degradation problem: the optimization itself struggles when networks get deep.
In the CIFAR-10 experiment reported in the original ResNet paper, the 56-layer plain network has higher training error than the 20-layer network, and its test error is higher as well. The deeper network cannot even learn what the shallower network learned. This is not a capacity problem: the deeper network has strictly more capacity. The problem is that standard gradient-based optimization fails to find good weights when a plain network is too deep.
Why does this happen?
Consider a 50-layer plain network. During backpropagation, the gradient must flow back through all 50 layers. At each layer, the gradient is multiplied by the layer’s weight matrix and the derivative of the activation function. If these multiplications consistently produce values less than 1, the gradient shrinks exponentially — this is the vanishing gradient problem. By the time the gradient reaches the early layers, it is essentially zero. Those layers stop learning.
The opposite can also happen: if the multiplications produce values greater than 1, the gradient grows exponentially (exploding gradients). While techniques like batch normalization and careful initialization help, they don’t fully solve the problem for very deep networks.
The residual learning idea
The CNN post introduced skip connections briefly. Here we go deeper into why they work and how they are designed.
The core insight
Instead of asking a block of layers to learn the desired output directly, let them learn only the residual: the difference between the desired output and the input.
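The block’s output becomes:
$$y = F(x) + x$$
where $F(x)$ is whatever the stacked layers compute, and $x$ is added back via a shortcut connection (also called a skip connection). If $H(x)$ denotes the desired mapping, the layers are asked to fit the residual $F(x) = H(x) - x$.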
Why is this easier to optimize?
Think about it this way. If the optimal transformation for a block is close to the identity (just pass the input through), then:
- A plain network must learn weights such that its output $H(x) \approx x$. This requires the weight matrices to converge to an identity-like configuration, which is not trivial.
- A residual network only needs $F(x) \approx 0$. Driving weights toward zero is much easier: it is the natural tendency of weight decay and initialization near zero.
In practice, most layers in a deep network need to make only small modifications to their input. Residual learning makes “do nothing” the default, and the network only has to learn the small deviations.
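A toy illustration of the “do nothing” default, in plain Python. The tiny fully connected layers and the all-zero weights are assumptions made for this sketch rather than anything from a real ResNet: with weights at zero, a plain block outputs zeros, while a residual block already reproduces its input.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    # weights is a list of rows; computes the matrix-vector product
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(w1, w2, x):
    # The stacked layers must produce the full output on their own
    return linear(w2, relu(linear(w1, x)))

def residual_block(w1, w2, x):
    # Same layers, but their output is added to the input: y = F(x) + x
    fx = plain_block(w1, w2, x)
    return [f + a for f, a in zip(fx, x)]

x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(zeros, zeros, x)
residual_out = residual_block(zeros, zeros, x)
print(plain_out)     # [0.0, 0.0, 0.0]
print(residual_out)  # [1.0, -2.0, 0.5]
```

The plain block would need carefully tuned weights to pass `x` through, whereas the residual block does so with no learning at all.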
Gradient flow through residual blocks
This is the mathematical reason ResNets train so well. Consider backpropagation through a residual block:
$$y = F(x) + x$$
The gradient of the loss $\mathcal{L}$ with respect to $x$ is:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)$$
where $I$ is the identity matrix. The gradient has two paths:
- Through the residual function: $\frac{\partial \mathcal{L}}{\partial y}\frac{\partial F}{\partial x}$, which can vanish if $F$ has small gradients
- Through the skip connection: $\frac{\partial \mathcal{L}}{\partial y} \cdot I$, which always passes the gradient through unchanged
Even if the residual path vanishes completely, the skip connection guarantees gradient flow. With $n$ residual blocks stacked, the gradient at the input of block 1 still has a direct path through all skip connections. This is why ResNets can be 152 layers deep without degradation.
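The two-path gradient can be checked numerically in the scalar case. The residual function `w * tanh(x)` below is a hypothetical stand-in chosen only because its derivative is easy to write down; the sketch compares a finite-difference gradient of the whole block with "residual derivative plus one".

```python
import math

def residual_fn(x, w):
    # Hypothetical scalar residual function F(x) = w * tanh(x)
    return w * math.tanh(x)

def block(x, w):
    # Residual block output: y = F(x) + x
    return residual_fn(x, w) + x

def numeric_grad(f, x, eps=1e-6):
    # Central finite-difference approximation of df/dx
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0, w = 0.7, 0.3
dF_dx = w * (1.0 - math.tanh(x0) ** 2)             # residual path
dy_dx = numeric_grad(lambda x: block(x, w), x0)    # whole block
print(abs(dy_dx - (dF_dx + 1.0)) < 1e-6)           # True

# With a dead residual branch (w = 0) the identity path still passes gradient 1
dead_grad = numeric_grad(lambda x: block(x, 0.0), x0)
print(abs(dead_grad - 1.0) < 1e-6)                 # True
```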
The residual block
A residual block is the fundamental building unit of ResNet. There are two types.
Basic block (used in ResNet-18 and ResNet-34)
The basic block has two convolution layers with batch normalization and ReLU activation:
```mermaid
graph LR
    X["x (input)"] --> Conv1["3×3 Conv, BN, ReLU"]
    Conv1 --> Conv2["3×3 Conv, BN"]
    X --> Add["⊕"]
    Conv2 --> Add
    Add --> ReLU2["ReLU"]
    ReLU2 --> Y["y (output)"]
    style X fill:#e8f4f8,stroke:#2196F3
    style Y fill:#e8f4f8,stroke:#2196F3
    style Add fill:#fff3e0,stroke:#FF9800
```
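In equations:
$$F(x) = \text{BN}(W_2 * \text{ReLU}(\text{BN}(W_1 * x)))$$
$$y = \text{ReLU}(F(x) + x)$$
where $W_1$ and $W_2$ are the two convolution kernels and $*$ denotes convolution. Notice that ReLU is applied after the addition. This is important: applying ReLU to the residual branch before the addition would force $F(x) \geq 0$, so the block could only add to its input and never subtract from it.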
Bottleneck block (used in ResNet-50, 101, 152)
For deeper networks, the basic block is too expensive. A $3 \times 3$ convolution on 256 channels involves $3 \times 3 \times 256 \times 256 = 589{,}824$ parameters per layer. The bottleneck block reduces this by using three layers:
- 1×1 Conv: reduce channels (e.g., 256 → 64) — the “bottleneck”
- 3×3 Conv: process at reduced dimensionality (64 channels)
- 1×1 Conv: expand back (64 → 256)
```mermaid
graph LR
    X["x (256-ch)"] --> Conv1["1×1 Conv, BN, ReLU<br/>256 → 64"]
    Conv1 --> Conv2["3×3 Conv, BN, ReLU<br/>64 → 64"]
    Conv2 --> Conv3["1×1 Conv, BN<br/>64 → 256"]
    X --> Add["⊕"]
    Conv3 --> Add
    Add --> ReLU3["ReLU"]
    ReLU3 --> Y["y (256-ch)"]
    style X fill:#e8f4f8,stroke:#2196F3
    style Y fill:#e8f4f8,stroke:#2196F3
    style Add fill:#fff3e0,stroke:#FF9800
```
Parameter comparison for layers operating on 256 channels:
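- Basic block: $2 \times (3 \times 3 \times 256 \times 256) = 1{,}179{,}648$ parameters
- Bottleneck block: $(1 \times 1 \times 256 \times 64) + (3 \times 3 \times 64 \times 64) + (1 \times 1 \times 64 \times 256) = 69{,}632$ parameters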
The bottleneck block uses 17× fewer parameters while having three layers instead of two. This is how ResNet-50 can have 50 layers and still be computationally tractable.
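The comparison above is plain arithmetic and can be reproduced in a few lines. The helper name `conv_params` is made up for this sketch, and it counts convolution weights only (no biases, no batch norm).

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution, biases omitted."""
    return kernel * kernel * c_in * c_out

# Basic block: two 3x3 convolutions at 256 channels
basic = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduce, 3x3 at reduced width, 1x1 expand
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic)                         # 1179648
print(bottleneck)                    # 69632
print(round(basic / bottleneck, 1))  # 16.9
```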
Handling dimension mismatches
The addition $F(x) + x$ requires that $F(x)$ and $x$ have the same dimensions. This works fine when the input and output have the same number of channels and spatial size. But when the block changes the number of channels or uses a stride to reduce spatial resolution, the shortcut needs a projection:
$$y = F(x) + W_s x$$
where $W_s$ is a $1 \times 1$ convolution that matches the dimensions. For example, when going from 64 channels to 128 channels with stride 2, the projection is a $1 \times 1$ conv with 128 filters and stride 2.
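A quick shape check for the 64 → 128 channel, stride-2 case. The sketch assumes a 56×56 input and padding of 1 on the 3×3 convolutions (a common choice, not stated in the text); `conv_out` is a hypothetical helper implementing the usual output-size formula.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for a convolution along one spatial axis
    return (size + 2 * padding - kernel) // stride + 1

c_in, c_out, size = 64, 128, 56

# Main path: first 3x3 conv has stride 2, second has stride 1 (padding 1 assumed)
main_size = conv_out(conv_out(size, 3, 2, 1), 3, 1, 1)
main_shape = (c_out, main_size, main_size)

# An identity shortcut would keep the input shape, which no longer matches
identity_shape = (c_in, size, size)

# Projection shortcut: 1x1 conv with 128 filters and stride 2, no padding
proj_size = conv_out(size, 1, 2, 0)
projection_shape = (c_out, proj_size, proj_size)

print(main_shape)        # (128, 28, 28)
print(identity_shape)    # (64, 56, 56)
print(projection_shape)  # (128, 28, 28)
```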
ResNet architectures
All ResNet variants share the same overall structure:
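- Initial convolution: $7 \times 7$ conv with 64 filters, stride 2, followed by batch norm, ReLU, and $3 \times 3$ max pool with stride 2. This reduces a $224 \times 224$ input to $56 \times 56$.
- Four stages of residual blocks, doubling channels and halving spatial resolution at each stage transition (64 → 128 → 256 → 512 channels and 56 → 28 → 14 → 7 spatial size for the basic-block variants)
- Global average pooling: reduces the final $7 \times 7$ feature map to a 512-dimensional vector (or 2048 for bottleneck variants)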
- Fully connected layer: 512 (or 2048) → number of classes
| Architecture | Block type | Blocks per stage | Total layers | Parameters |
|---|---|---|---|---|
| ResNet-18 | Basic | [2, 2, 2, 2] | 18 | 11.7M |
| ResNet-34 | Basic | [3, 4, 6, 3] | 34 | 21.8M |
| ResNet-50 | Bottleneck | [3, 4, 6, 3] | 50 | 25.6M |
| ResNet-101 | Bottleneck | [3, 4, 23, 3] | 101 | 44.5M |
| ResNet-152 | Bottleneck | [3, 8, 36, 3] | 152 | 60.2M |
ResNet family — all trained on ImageNet
How to count layers: each basic block has 2 conv layers, each bottleneck block has 3. For ResNet-50: $(3 + 4 + 6 + 3) \times 3 = 48$ conv layers in residual blocks $+ 1$ initial conv $+ 1$ FC layer $= 50$.
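The same counting rule applied to every row of the table, as a small sketch. The dictionary and function names are illustrative; projection shortcuts are not counted, following the convention in the text.

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

VARIANTS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}

def count_layers(block_type, blocks_per_stage):
    # Conv layers inside residual blocks + initial conv + final FC layer
    return sum(blocks_per_stage) * CONVS_PER_BLOCK[block_type] + 2

depths = {name: count_layers(*spec) for name, spec in VARIANTS.items()}
for name, depth in depths.items():
    print(name, depth)
```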
ResNet-18 detailed architecture
Let’s trace the full structure of ResNet-18 to make it concrete:
| Stage | Output size | Layer details | Blocks |
|---|---|---|---|
| Input | 224 × 224 × 3 | — | — |
| Conv1 | 112 × 112 × 64 | 7×7 conv, stride 2, BN, ReLU | — |
| Pool | 56 × 56 × 64 | 3×3 max pool, stride 2 | — |
| Stage 1 | 56 × 56 × 64 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 2 | 28 × 28 × 128 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 3 | 14 × 14 × 256 | 3×3 conv → 3×3 conv (×2) | 2 |
| Stage 4 | 7 × 7 × 512 | 3×3 conv → 3×3 conv (×2) | 2 |
| Avg Pool | 1 × 1 × 512 | Global average pooling | — |
| FC | 1000 | Fully connected | — |
ResNet-18 layer-by-layer breakdown
The first block in stages 2, 3, and 4 uses stride 2 in its first convolution to halve the spatial resolution, and a projection shortcut to match the increased channel count.
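The output sizes in the table can be traced with the standard convolution size formula. The padding values (3 for the 7×7 conv, 1 for the max pool and the 3×3 convs) are assumptions that match common implementations; they are not given in the table itself.

```python
def conv_out(size, kernel, stride, padding):
    # Output size of a conv or pooling layer along one spatial axis
    return (size + 2 * padding - kernel) // stride + 1

size = 224
trace = [("Input", size, 3)]

size = conv_out(size, 7, 2, 3)           # 7x7 conv, stride 2
trace.append(("Conv1", size, 64))

size = conv_out(size, 3, 2, 1)           # 3x3 max pool, stride 2
trace.append(("Pool", size, 64))

for stage, channels in enumerate([64, 128, 256, 512], start=1):
    stride = 1 if stage == 1 else 2      # stages 2 to 4 downsample in their first block
    size = conv_out(size, 3, stride, 1)  # remaining 3x3 convs keep the size
    trace.append((f"Stage {stage}", size, channels))

for name, s, c in trace:
    print(f"{name}: {s} x {s} x {c}")
```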
Example 1: tracing a forward pass through a basic block
Let’s work through concrete numbers. Consider a basic residual block with input $x$ (a simplified 4-dimensional input for clarity).
Layer 1: Conv → BN → ReLU
Suppose after convolution and batch normalization, we get:
After ReLU (clamp negatives to zero):
Layer 2: Conv → BN (no ReLU yet)
This is $F(x)$, the residual.
Add the skip connection:
Final ReLU:
Notice how the residual is small (values around 0.1 to 0.3). The block makes a minor adjustment to the input rather than computing the output from scratch. The skip connection preserves the input’s magnitude even when the learned residual is tiny.
Without the skip connection, the output would just be $F(x)$, and most of the input information would be lost.
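The same sequence of steps can be followed in code. All numbers below are illustrative assumptions chosen for this sketch (they are not the values from the worked example), and the outputs of the two Conv → BN stages are simply given rather than computed from weights.

```python
def relu(v):
    return [max(0.0, a) for a in v]

x = [1.0, 2.0, -1.0, 0.5]      # block input (illustrative values)

# Layer 1: assumed output of Conv -> BN, followed by ReLU
z1 = [0.5, -0.3, 0.8, -0.1]
a1 = relu(z1)

# Layer 2: assumed output of Conv -> BN applied to a1; this is the residual F(x)
fx = [0.1, -0.2, 0.3, 0.1]

# Add the skip connection, then apply the final ReLU
summed = [f + a for f, a in zip(fx, x)]
y = relu(summed)

print(a1)
print([round(v, 3) for v in summed])
print([round(v, 3) for v in y])
```

With these values the residual stays within a few tenths of zero, so `y` remains close to `x` wherever `x` is positive, while `fx` alone would carry almost none of the input's magnitude.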
Example 2: gradient flow comparison
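Consider 20 layers stacked together. Let $x_l$ be the input to layer $l$, so that $x_{21}$ is the final output. For each layer $l$, the gradient is multiplied by $\frac{\partial x_{l+1}}{\partial x_l}$.
Plain network: the gradient through 20 layers is:
$$\frac{\partial \mathcal{L}}{\partial x_1} = \frac{\partial \mathcal{L}}{\partial x_{21}} \prod_{l=1}^{20} \frac{\partial x_{l+1}}{\partial x_l}$$
If each factor has magnitude around 0.8 (less than 1), the gradient shrinks by $0.8^{20} \approx 0.012$. Only 1.2% of the gradient survives. At 50 layers: $0.8^{50} \approx 1.4 \times 10^{-5}$. The early layers are effectively frozen.
Residual network: each block contributes $\left(\frac{\partial F_l}{\partial x_l} + I\right)$. The gradient through 20 blocks expands to a sum of $2^{20}$ terms (each term corresponds to a path that either goes through or skips each block). Even if many terms are small, the term that skips all residual paths, the product of all identity shortcuts, contributes:
$$\frac{\partial \mathcal{L}}{\partial x_{21}} \cdot I^{20} = \frac{\partial \mathcal{L}}{\partial x_{21}}$$
This means the gradient has a direct path from the loss to the earliest layers with no attenuation. In practice, the gradient is a mix of short and long paths, providing a rich signal at every depth.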
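A scalar sketch of both cases. The plain-network part reproduces the 0.8-per-layer arithmetic; the residual part uses four blocks with made-up residual derivatives `r` (an assumption for illustration) and expands the product of `(r + 1)` factors into its individual paths.

```python
import itertools
import math

# Plain network: every layer scales the gradient by 0.8
plain_20 = 0.8 ** 20
plain_50 = 0.8 ** 50
print(round(plain_20, 4))   # 0.0115
print(plain_50 < 2e-5)      # True

# Residual network, scalar case: each block contributes (r_l + 1)
r = [0.05, -0.1, 0.2, -0.05]          # hypothetical residual derivatives
product = math.prod(1.0 + r_l for r_l in r)

# Expand into 2^n paths: each path either goes through a block's residual or skips it
paths = []
for choice in itertools.product([False, True], repeat=len(r)):
    paths.append(math.prod(r_l for r_l, through in zip(r, choice) if through))

print(len(paths))                           # 16
print(paths[0])                             # 1, the path that skips every block
print(abs(sum(paths) - product) < 1e-12)    # True
```

However small the residual derivatives are, the all-skip path contributes exactly 1 to the sum.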
Batch normalization placement
The original ResNet uses post-activation batch normalization: Conv → BN → ReLU. Later research (He et al., 2016, “Identity Mappings in Deep Residual Networks”) showed that pre-activation works better for very deep networks:
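Original (post-activation): $y = \text{ReLU}(F(x) + x)$, where $F$ is Conv → BN → ReLU → Conv → BN
Pre-activation: $y = F(x) + x$, where $F$ is BN → ReLU → Conv → BN → ReLU → Conv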
In the pre-activation design, the shortcut path is a clean identity — nothing (no BN, no ReLU) sits on the shortcut. This makes the gradient flow even cleaner. Pre-activation ResNets showed improvements especially for networks deeper than 100 layers (ResNet-200 and beyond).
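One consequence of a clean identity shortcut is that the output of a stack of blocks is the input plus the sum of all residuals. The toy below checks this with scalar "blocks"; batch norm is omitted and the weights are hypothetical, so it only illustrates the ordering of activation and addition.

```python
def relu(a):
    return max(0.0, a)

def pre_activation_stack(x, weights):
    # Each toy block applies the activation first, then a scalar "weight layer"
    residuals = []
    for w in weights:
        f = w * relu(x)      # F(x): activation comes before the weight layer
        residuals.append(f)
        x = x + f            # nothing sits on the shortcut or after the addition
    return x, residuals

x0 = 2.0
weights = [0.5, -0.8, 0.3, -0.6]   # hypothetical scalar weights
x_out, residuals = pre_activation_stack(x0, weights)

# Unrolled form: output = input + sum of the residuals of every block
gap = abs(x_out - (x0 + sum(residuals)))
print(gap < 1e-12)   # True
```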
Example 3: parameter count of a bottleneck block
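Let’s count parameters for one bottleneck block in ResNet-50’s Stage 2 (conv3_x in the original paper), where the input has 512 channels and the bottleneck narrows to 128 channels, with output of 512 channels:
1×1 Conv (reduce): $1 \times 1 \times 512 \times 128 = 65{,}536$ weights + 128 biases (though biases are often omitted when using BN)
3×3 Conv (process): $3 \times 3 \times 128 \times 128 = 147{,}456$ weights
1×1 Conv (expand): $1 \times 1 \times 128 \times 512 = 65{,}536$ weights
Batch norm (per layer): 2 parameters per channel (scale $\gamma$ and shift $\beta$): $2 \times (128 + 128 + 512) = 1{,}536$
Total for one block (conv biases omitted): $65{,}536 + 147{,}456 + 65{,}536 + 1{,}536 = 280{,}064$ parameters.
This stage has 4 such blocks in ResNet-50, so it holds roughly $4 \times 280{,}064 = 1{,}120{,}256$ parameters (ignoring the projection shortcut and the narrower input of the first block).
Compare this to 4 basic blocks at 512 channels: $4 \times 2 \times (3 \times 3 \times 512 \times 512) = 18{,}874{,}368$ conv weights, about 17× more expensive.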
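The per-block arithmetic for a bottleneck with 512 input channels, a 128-channel bottleneck, and 512 output channels can be reproduced as follows. The helper names are made up for the sketch, and conv biases are left out on the assumption that batch norm follows every convolution.

```python
def conv_weights(kernel, c_in, c_out):
    # Weights of a kernel x kernel convolution, biases omitted
    return kernel * kernel * c_in * c_out

def bn_params(channels):
    # One scale and one shift parameter per channel
    return 2 * channels

c_in, c_mid, c_out = 512, 128, 512

convs = (conv_weights(1, c_in, c_mid)       # 1x1 reduce
         + conv_weights(3, c_mid, c_mid)    # 3x3 process
         + conv_weights(1, c_mid, c_out))   # 1x1 expand
bns = bn_params(c_mid) + bn_params(c_mid) + bn_params(c_out)
bottleneck_total = convs + bns

# A basic block at the same width: two 3x3 convolutions on 512 channels
basic_convs = 2 * conv_weights(3, 512, 512)

print(convs)                                      # 278528
print(bns)                                        # 1536
print(bottleneck_total)                           # 280064
print(round(basic_convs / bottleneck_total, 1))   # 16.8
```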
Why ResNet works: multiple perspectives
1. Ensemble of shallow networks
Veit et al. (2016) showed that a ResNet can be viewed as an ensemble of many paths of different lengths. Unrolling a network with $n$ residual blocks produces $2^n$ paths (at each block, the signal can go through the residual function or skip it). Experiments showed that:
- Most of the gradient flows through short paths (5–17 blocks in a 110-layer network)
- Deleting a single residual block (forcing one path to always skip) causes only a small accuracy drop — unlike plain networks, where removing a layer is catastrophic
- The network behaves less like a single deep pipeline and more like an ensemble of many moderately deep networks
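The path count behind this view is a binomial coefficient: with `n` blocks there are `C(n, k)` paths that pass through exactly `k` residual functions. The sketch assumes the 110-layer network is built from 54 two-layer blocks, i.e. (110 − 2) / 2; that figure is inferred, not stated in the text.

```python
from math import comb

n_blocks = 54   # assumed block count for a 110-layer network of two-layer blocks

# Number of paths that go through exactly k residual functions
paths_by_length = [comb(n_blocks, k) for k in range(n_blocks + 1)]
total_paths = sum(paths_by_length)
print(total_paths == 2 ** n_blocks)   # True

# The most common path length is half the number of blocks
most_common = max(range(n_blocks + 1), key=lambda k: paths_by_length[k])
print(most_common)                    # 27

# Paths through only 5 to 17 blocks are a small share of all paths
short_share = sum(paths_by_length[5:18]) / total_paths
print(short_share < 0.01)             # True
```

The short paths are rare by count, which makes the observation that they carry most of the gradient more striking: long paths are numerous but contribute little gradient each.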
2. Smooth loss landscape
Li et al. (2018) visualized the loss landscapes of plain and residual networks. Plain deep networks have chaotic, non-convex landscapes full of sharp minima. ResNets have much smoother landscapes — the skip connections prevent the loss surface from becoming too rough, making optimization easier.
3. Feature refinement
Each residual block refines the representation incrementally. Early blocks handle low-level features (edges, textures), and each subsequent block adds a small correction. This is more stable than asking each layer to transform the representation completely.
Common ResNet variants
ResNeXt
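Instead of a single $3 \times 3$ conv in the bottleneck, ResNeXt uses grouped convolutions: $C$ parallel paths (the number of paths is called the “cardinality”) that each process a subset of channels:
$$y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)$$
where $\mathcal{T}_i$ is the transformation computed by the $i$-th path.
ResNeXt-50 (32×4d) uses 32 groups, each with 4 channels in the bottleneck. This gives better accuracy than ResNet-50 with similar computational cost. The ResNeXt authors report that increasing cardinality (number of groups) is more effective than making the network deeper or wider.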
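A parameter-count sketch of why grouping keeps the cost similar. It compares a ResNet bottleneck (256 → 64 → 256) with a 32×4d ResNeXt bottleneck (256 → 128 → 256, 32 groups) at the first stage; the helper name is hypothetical, and biases and batch norm are ignored.

```python
def grouped_conv_weights(kernel, c_in, c_out, groups=1):
    # Each group connects c_in/groups input channels to c_out/groups output channels
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * (c_out // groups) * groups

# ResNet-50 bottleneck at 256 channels: ordinary 3x3 conv on 64 channels
resnet_block = (grouped_conv_weights(1, 256, 64)
                + grouped_conv_weights(3, 64, 64)
                + grouped_conv_weights(1, 64, 256))

# ResNeXt-50 (32x4d): wider bottleneck (32 * 4 = 128 channels), 3x3 conv in 32 groups
resnext_block = (grouped_conv_weights(1, 256, 128)
                 + grouped_conv_weights(3, 128, 128, groups=32)
                 + grouped_conv_weights(1, 128, 256))

print(resnet_block)    # 69632
print(resnext_block)   # 70144
```

The grouped 3×3 convolution is cheap enough that the bottleneck can be twice as wide for nearly the same weight count.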
Wide ResNets (WRN)
Zagoruyko & Komodakis (2016) found that making residual blocks wider (more channels per block) is more efficient than making the network deeper. A WRN-28-10 (28 layers, 10× width multiplier) outperforms ResNet-1001 while being 8× faster.
SE-ResNet (Squeeze-and-Excitation)
Adds a lightweight channel attention mechanism after each residual block. A squeeze-and-excitation module:
- Squeeze: global average pool to get one number per channel
- Excitation: two FC layers that output a weight per channel
- Scale: multiply each channel by its weight
This lets the network learn to emphasize informative channels and suppress less useful ones. SE-ResNet-50 beats ResNet-50 with only ~10% more parameters.
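The three steps can be sketched in plain Python on a tiny feature map. The shapes (4 channels, 4 spatial positions, hidden width 2), the weights, and the choice of ReLU followed by sigmoid in the excitation step are assumptions for illustration; real implementations operate on tensors.

```python
import math

def fc(weights, v):
    # Fully connected layer without bias: matrix-vector product
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def se_module(fmap, w_reduce, w_expand):
    # Squeeze: global average pool, one number per channel
    squeezed = [sum(ch) / len(ch) for ch in fmap]
    # Excitation: FC -> ReLU -> FC -> sigmoid, one gate per channel
    hidden = [max(0.0, h) for h in fc(w_reduce, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in fc(w_expand, hidden)]
    # Scale: multiply every value in a channel by that channel's gate
    scaled = [[g * v for v in ch] for g, ch in zip(gates, fmap)]
    return scaled, gates

# 4 channels, each a flattened 2x2 spatial map (illustrative values)
fmap = [[1.0, 2.0, 3.0, 4.0],
        [0.0, 0.0, 0.0, 0.0],
        [-1.0, 1.0, -1.0, 1.0],
        [2.0, 2.0, 2.0, 2.0]]
w_reduce = [[0.5, 0.1, -0.2, 0.3], [-0.4, 0.2, 0.1, 0.6]]    # 4 -> 2
w_expand = [[1.0, -1.0], [0.2, 0.3], [-0.5, 0.5], [0.0, 2.0]]  # 2 -> 4

scaled, gates = se_module(fmap, w_reduce, w_expand)
print([round(g, 3) for g in gates])
```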
ImageNet results
ResNet’s impact on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC):
| Year | Model | Top-5 error | Layers |
|---|---|---|---|
| 2012 | AlexNet | 16.4% | 8 |
| 2014 | VGGNet-19 | 7.3% | 19 |
| 2014 | GoogLeNet | 6.7% | 22 |
| 2015 | ResNet-152 | 3.6% | 152 |
| 2015 | Human performance | ~5.1% | — |
ImageNet classification — top-5 error rate over the years
ResNet’s 3.6% top-5 error is below the roughly 5.1% estimated for a human annotator on the same task. The winning entry was an ensemble of ResNets of different depths, including 152-layer models.
Practical usage
When to use which variant
- ResNet-18 / 34: good starting points for smaller datasets, fast training. Use the basic block.
- ResNet-50: the standard workhorse. Most pretrained model libraries default to this. Bottleneck blocks give a good accuracy-efficiency tradeoff.
- ResNet-101 / 152: for problems where you need maximum accuracy and have enough data and compute.
Transfer learning
ResNets are among the most common backbones for transfer learning. A typical workflow:
- Load a ResNet-50 pretrained on ImageNet
- Remove the final FC layer (the 1000-class classifier)
- Add your own classifier head (e.g., FC → softmax over your number of classes)
- Fine-tune: freeze early layers, train later layers + your head, then optionally unfreeze everything with a small learning rate
ResNet features transfer well because the residual blocks learn hierarchical representations: edges → textures → parts → objects. The early features are generic enough to be useful across very different domains (medical images, satellite imagery, industrial defect detection).
Implementation notes
- Initialization: use He initialization (fan-in, normal) for conv layers. The residual path benefits from starting with small outputs.
- Learning rate schedule: warm up for a few epochs, then use cosine decay or step decay (divide by 10 at epochs 30, 60, 90 for ImageNet).
- Data augmentation: random crop, horizontal flip, and color jitter are standard. Deeper ResNets benefit more from stronger augmentation (Cutout, MixUp, CutMix).
- Regularization: weight decay of $10^{-4}$ (the value used in the original ResNet paper) is standard. Deeper variants may benefit from dropout or stochastic depth (randomly dropping entire residual blocks during training).
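The learning rate schedules mentioned above can be written as small functions of the epoch. The base rate of 0.1, the 5 warmup epochs, and the 100-epoch budget are assumptions for this sketch, not values from the text; only the milestones (30, 60, 90) and the divide-by-10 rule come from it.

```python
import math

def step_lr(epoch, base_lr=0.1, warmup_epochs=5,
            milestones=(30, 60, 90), gamma=0.1):
    # Linear warmup, then divide by 10 at each milestone
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

def cosine_lr(epoch, total_epochs=100, base_lr=0.1, warmup_epochs=5):
    # Linear warmup, then cosine decay from base_lr down to 0
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 4, 29, 30, 60, 90):
    print(epoch, round(step_lr(epoch), 6), round(cosine_lr(epoch), 6))
```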
Key takeaways
- Deeper plain networks degrade — they have worse training accuracy, not just test accuracy. This is an optimization problem, not an overfitting problem.
- Residual learning reformulates the problem: instead of learning the full mapping $H(x)$, learn the residual $F(x) = H(x) - x$. This makes “do nothing” the easy default.
- Skip connections provide a direct gradient highway through arbitrarily deep networks. The identity term in $\frac{\partial F}{\partial x} + I$ guarantees gradient flow.
- Bottleneck blocks (1×1 → 3×3 → 1×1) make very deep networks computationally feasible by reducing channel dimensions.
- ResNet-50 is the standard workhorse for transfer learning and is the most commonly used pretrained backbone.
- ResNets behave like ensembles of many shallow paths, which explains their robustness and trainability.