Deep Learning from Scratch · Part 26

Jun 27, 2026 · 20 min read · Deep Learning

ResNet: deep residual learning


Prerequisites

You should understand:

Convolutional neural networks: convolution, pooling, feature maps, and the basics of skip connections
Training neural networks: backpropagation, gradient flow, and batch normalization

The problem: deeper is not always better

In theory, a deeper network should perform at least as well as a shallower one. If the extra layers learned the identity mapping (just pass the input through unchanged), the deeper network would behave exactly like the shallower one. In practice, this does not happen.

When researchers stacked more layers onto standard CNNs, they observed something unexpected: training accuracy got worse, not just test accuracy. This is not overfitting — an overfit model would have high training accuracy but low test accuracy. This is the degradation problem: the optimization itself struggles when networks get deep.

In the original ResNet paper’s CIFAR-10 experiment, a 56-layer plain network has higher training error than a 20-layer one, and its test error is worse too. The deeper network cannot even learn what the shallower network learned. This is not a capacity problem: the deeper network has strictly more capacity. The problem is that standard gradient-based optimization struggles to find good weights when the network is too deep.

Why does this happen?

Consider a 50-layer plain network. During backpropagation, the gradient must flow back through all 50 layers. At each layer, the gradient is multiplied by the layer’s weight matrix and the derivative of the activation function. If these multiplications consistently produce values less than 1, the gradient shrinks exponentially — this is the vanishing gradient problem. By the time the gradient reaches the early layers, it is essentially zero. Those layers stop learning.

The opposite can also happen: if the multiplications produce values greater than 1, the gradient grows exponentially (exploding gradients). While techniques like batch normalization and careful initialization help, they don’t fully solve the problem for very deep networks.

The residual learning idea

The CNN post introduced skip connections briefly. Here we go deeper into why they work and how they are designed.

The core insight

Instead of asking a block of layers to learn the desired output $H(\mathbf{x})$ directly, let them learn only the residual: the difference between the desired output and the input.

$F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$

The block’s output becomes:

$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$

where $F(\mathbf{x})$ is whatever the stacked layers compute, and $\mathbf{x}$ is added back via a shortcut connection (also called a skip connection).

Why is this easier to optimize?

Think about it this way. If the optimal transformation for a block is close to the identity (just pass the input through), then:

A plain network must learn weights such that $H(\mathbf{x}) = \mathbf{x}$. This requires the weight matrices to converge to an identity-like configuration, which is not trivial.
A residual network only needs $F(\mathbf{x}) = 0$. Driving weights toward zero is much easier: it is the natural tendency of weight decay and of initialization near zero.

In practice, most layers in a deep network need to make only small modifications to their input. Residual learning makes “do nothing” the default, and the network only has to learn the small deviations.
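A minimal sketch of the "do nothing by default" idea, using a tiny fully connected layer in place of the convolutions. The helper names (`linear`, `plain_block`, `residual_block`) and the all-zero weights are illustrative assumptions, not part of any library:

```python
def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(weights, x):
    # The layer has to produce H(x) directly.
    return linear(weights, x)

def residual_block(weights, x):
    # The layer produces only F(x); the input is added back.
    fx = linear(weights, x)
    return [f + a for f, a in zip(fx, x)]

x = [2.0, -1.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(zero_weights, x)        # all zeros: the input is lost
residual_out = residual_block(zero_weights, x)  # equals x: identity with zero weights

print(plain_out, residual_out)
```

With the same all-zero weights, the plain block wipes out its input while the residual block passes it through untouched.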

Gradient flow through residual blocks

This is the mathematical reason ResNets train so well. Consider backpropagation through a residual block:

$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$

The gradient of the loss $\mathcal{L}$ with respect to $\mathbf{x}$ is:

$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left( \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + \mathbf{I} \right)$

where $\mathbf{I}$ is the identity matrix. The gradient has two paths:

Through the residual function: $\frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}}$ — this can vanish if $F$ has small gradients
Through the skip connection: $\frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \mathbf{I}$ — this always passes the gradient through unchanged

Even if the residual path vanishes completely, the skip connection guarantees gradient flow. With $L$ residual blocks stacked, the gradient at the input of block 1 still has a direct path through all $L$ skip connections. This is why ResNets can be 152 layers deep without degradation.
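The formula can be checked numerically in the scalar case, where $\mathbf{I}$ is just 1. This is a rough sketch: the residual function `f` below is a made-up example with a deliberately tiny slope, and `stacked_gradient` is a hypothetical helper that multiplies the per-block derivatives along a stack of identical blocks.

```python
import math

def f(x):
    # Hypothetical residual function whose slope is at most 0.01.
    return 0.01 * math.tanh(x)

def f_prime(x):
    return 0.01 * (1.0 - math.tanh(x) ** 2)

def residual_block(x):
    return f(x) + x

def numerical_derivative(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x0 = 0.7
analytic = f_prime(x0) + 1.0                      # dF/dx + I, scalar case
numeric = numerical_derivative(residual_block, x0)

def stacked_gradient(x, num_blocks, use_skip):
    """Product of per-block derivatives along the forward pass."""
    grad = 1.0
    for _ in range(num_blocks):
        grad *= f_prime(x) + (1.0 if use_skip else 0.0)
        x = f(x) + (x if use_skip else 0.0)
    return grad

plain_grad = stacked_gradient(x0, 50, use_skip=False)
resnet_grad = stacked_gradient(x0, 50, use_skip=True)
print(f"plain: {plain_grad:.3e}, residual: {resnet_grad:.3f}")
```

Without the skip term the 50-block product collapses toward zero, while with it every factor stays at or above 1 for this particular `f`.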

The residual block

A residual block is the fundamental building unit of ResNet. There are two types.

Basic block (used in ResNet-18 and ResNet-34)

The basic block has two $3 \times 3$ convolution layers with batch normalization and ReLU activation:

graph LR
  X["x (input)"] --> Conv1["3×3 Conv, BN, ReLU"]
  Conv1 --> Conv2["3×3 Conv, BN"]
  X --> Add["⊕"]
  Conv2 --> Add
  Add --> ReLU2["ReLU"]
  ReLU2 --> Y["y (output)"]
  style X fill:#e8f4f8,stroke:#2196F3
  style Y fill:#e8f4f8,stroke:#2196F3
  style Add fill:#fff3e0,stroke:#FF9800

In equations:

$\mathbf{y} = \text{ReLU}\big(\text{BN}(\text{Conv}_2(\text{ReLU}(\text{BN}(\text{Conv}_1(\mathbf{x}))))) + \mathbf{x}\big)$

Notice that ReLU is applied after the addition, not to the residual branch itself. This is important: if ReLU were applied to $F(\mathbf{x})$ before the addition, the residual could never be negative, so the block could only increase its input and never decrease it.

Bottleneck block (used in ResNet-50, 101, 152)

For deeper networks, the basic block is too expensive. A $3 \times 3$ convolution on 256 channels involves $256 \times 256 \times 3 \times 3 \approx 590K$ parameters per layer. The bottleneck block reduces this by using three layers:

1×1 Conv: reduce channels (e.g., 256 → 64) — the “bottleneck”
3×3 Conv: process at reduced dimensionality (64 channels)
1×1 Conv: expand back (64 → 256)

graph LR
  X["x (256-ch)"] --> Conv1["1×1 Conv, BN, ReLU<br/>256 → 64"]
  Conv1 --> Conv2["3×3 Conv, BN, ReLU<br/>64 → 64"]
  Conv2 --> Conv3["1×1 Conv, BN<br/>64 → 256"]
  X --> Add["⊕"]
  Conv3 --> Add
  Add --> ReLU3["ReLU"]
  ReLU3 --> Y["y (256-ch)"]
  style X fill:#e8f4f8,stroke:#2196F3
  style Y fill:#e8f4f8,stroke:#2196F3
  style Add fill:#fff3e0,stroke:#FF9800

Parameter comparison for layers operating on 256 channels:

Basic block: $2 \times (256 \times 256 \times 3 \times 3) = 1,179,648$ parameters
Bottleneck block: $(256 \times 64 \times 1 \times 1) + (64 \times 64 \times 3 \times 3) + (64 \times 256 \times 1 \times 1) = 16,384 + 36,864 + 16,384 = 69,632$ parameters

The bottleneck block uses 17× fewer parameters while having three layers instead of two. This is how ResNet-50 can have 50 layers and still be computationally tractable.
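The comparison is easy to reproduce. A small sketch, where `conv_params` is a helper defined here for illustration and biases are left out on the assumption that a BN layer follows each convolution:

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights in a square-kernel conv layer (no biases)."""
    return in_channels * out_channels * kernel_size * kernel_size

basic = 2 * conv_params(256, 256, 3)
bottleneck = (
    conv_params(256, 64, 1)      # reduce
    + conv_params(64, 64, 3)     # process
    + conv_params(64, 256, 1)    # expand
)
ratio = basic / bottleneck

print(basic, bottleneck, round(ratio, 1))  # 1179648 69632 16.9
```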

Handling dimension mismatches

The addition $F(\mathbf{x}) + \mathbf{x}$ requires that $F(\mathbf{x})$ and $\mathbf{x}$ have the same dimensions. This works fine when the input and output have the same number of channels and spatial size. But when the block changes the number of channels or uses a stride to reduce spatial resolution, the shortcut needs a projection:

$\mathbf{y} = F(\mathbf{x}) + W_s \mathbf{x}$

where $W_s$ is a $1 \times 1$ convolution that matches the dimensions. For example, when going from 64 channels to 128 channels with stride 2, the projection is a $1 \times 1$ conv with 128 filters and stride 2.
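To see that the two paths line up, we can track shapes through both. This is only a sketch: the helpers are defined here for illustration, and the padding of 1 on the 3×3 convolutions is an assumption (the usual choice, but not stated above).

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_shape(shape, out_channels, kernel, stride, padding):
    _, height, width = shape
    return (
        out_channels,
        conv_out_size(height, kernel, stride, padding),
        conv_out_size(width, kernel, stride, padding),
    )

x = (64, 56, 56)  # (channels, height, width)

# Main path F(x): 3x3 stride-2 conv, then 3x3 stride-1 conv (padding 1 assumed).
main = conv_shape(x, 128, kernel=3, stride=2, padding=1)
main = conv_shape(main, 128, kernel=3, stride=1, padding=1)

# Shortcut W_s x: 1x1 conv with 128 filters and stride 2, no padding.
shortcut = conv_shape(x, 128, kernel=1, stride=2, padding=0)

print(main, shortcut)  # (128, 28, 28) (128, 28, 28)
```

Both paths end at 128 channels and 28 × 28, so the element-wise addition is well defined.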

ResNet architectures

All ResNet variants share the same overall structure:

Initial convolution: $7 \times 7$ conv with 64 filters, stride 2, followed by batch norm, ReLU, and $3 \times 3$ max pool with stride 2. This reduces a $224 \times 224$ input to $56 \times 56$.
Four stages of residual blocks, doubling channels and halving spatial resolution at each stage transition: $56 \times 56 \times 64 \to 28 \times 28 \times 128 \to 14 \times 14 \times 256 \to 7 \times 7 \times 512$
Global average pooling: reduces $7 \times 7 \times 512$ to a 512-dimensional vector (or 2048 for bottleneck variants)
Fully connected layer: 512 (or 2048) → number of classes

Architecture	Block type	Blocks per stage	Total layers	Parameters
ResNet-18	Basic	[2, 2, 2, 2]	18	11.7M
ResNet-34	Basic	[3, 4, 6, 3]	34	21.8M
ResNet-50	Bottleneck	[3, 4, 6, 3]	50	25.6M
ResNet-101	Bottleneck	[3, 4, 23, 3]	101	44.5M
ResNet-152	Bottleneck	[3, 8, 36, 3]	152	60.2M

ResNet family — all trained on ImageNet

How to count layers: each basic block has 2 conv layers, each bottleneck block has 3. For ResNet-50: $(3+4+6+3) \times 3 = 48$ conv layers in residual blocks $+ 1$ initial conv $+ 1$ FC layer $= 50$.
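The same counting rule can be applied to every row of the table. A short sketch (the dictionary and function names are ours; the $1 \times 1$ projection shortcuts are not counted, following the convention above):

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

VARIANTS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}

def count_layers(block_type, blocks_per_stage):
    # Convs inside residual blocks, plus the initial conv and the final FC layer.
    return sum(blocks_per_stage) * CONVS_PER_BLOCK[block_type] + 2

depths = {name: count_layers(*spec) for name, spec in VARIANTS.items()}
print(depths)
```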

ResNet-18 detailed architecture

Let’s trace the full structure of ResNet-18 to make it concrete:

Stage	Output size	Layer details	Blocks
Input	224 × 224 × 3	—	—
Conv1	112 × 112 × 64	7×7 conv, stride 2, BN, ReLU	—
Pool	56 × 56 × 64	3×3 max pool, stride 2	—
Stage 1	56 × 56 × 64	3×3 conv → 3×3 conv (×2)	2
Stage 2	28 × 28 × 128	3×3 conv → 3×3 conv (×2)	2
Stage 3	14 × 14 × 256	3×3 conv → 3×3 conv (×2)	2
Stage 4	7 × 7 × 512	3×3 conv → 3×3 conv (×2)	2
Avg Pool	1 × 1 × 512	Global average pooling	—
FC	1000	Fully connected	—

ResNet-18 layer-by-layer breakdown

The first block in stages 2, 3, and 4 uses stride 2 in its first convolution to halve the spatial resolution, and a $1 \times 1$ projection shortcut to match the increased channel count.
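As a sanity check on the table, here is a sketch that adds up the learnable parameters of ResNet-18 from this description. It assumes no conv biases, a BN layer after every convolution including the projection shortcuts, and an FC layer with biases; these match common implementations but are assumptions here, and the helper names are ours.

```python
def conv(cin, cout, k):
    return cin * cout * k * k            # no bias: BN follows every conv

def bn(channels):
    return 2 * channels                  # scale and shift

def basic_block(cin, cout, downsample):
    total = conv(cin, cout, 3) + bn(cout) + conv(cout, cout, 3) + bn(cout)
    if downsample:
        total += conv(cin, cout, 1) + bn(cout)   # 1x1 projection shortcut
    return total

def resnet18_params(num_classes=1000):
    total = conv(3, 64, 7) + bn(64)      # initial 7x7 conv
    cin = 64
    for stage, cout in enumerate([64, 128, 256, 512]):
        # First block of stages 2-4 changes channels and uses a projection.
        total += basic_block(cin, cout, downsample=(stage > 0))
        total += basic_block(cout, cout, downsample=False)
        cin = cout
    total += 512 * num_classes + num_classes     # FC weights and biases
    return total

total_params = resnet18_params()
print(total_params)  # 11689512
```

That is about 11.7M, in line with the ResNet-18 row of the architecture table.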

Example 1: tracing a forward pass through a basic block

Let’s work through concrete numbers. Consider a basic residual block with input $\mathbf{x} = [2.0, -1.0, 0.5, 1.5]$ (a simplified 4-dimensional input for clarity).

Layer 1: $3 \times 3$ Conv → BN → ReLU

Suppose after convolution and batch normalization, we get:

$\text{BN}(\text{Conv}_1(\mathbf{x})) = [0.8, -0.3, 1.2, -0.5]$

After ReLU (clamp negatives to zero):

$\text{ReLU}([0.8, -0.3, 1.2, -0.5]) = [0.8, 0, 1.2, 0]$

Layer 2: $3 \times 3$ Conv → BN (no ReLU yet)

$\text{BN}(\text{Conv}_2([0.8, 0, 1.2, 0])) = [0.1, -0.2, 0.3, -0.1]$

This is $F(\mathbf{x})$, the residual.

Add the skip connection:

$F(\mathbf{x}) + \mathbf{x} = [0.1, -0.2, 0.3, -0.1] + [2.0, -1.0, 0.5, 1.5] = [2.1, -1.2, 0.8, 1.4]$

Final ReLU:

$\mathbf{y} = \text{ReLU}([2.1, -1.2, 0.8, 1.4]) = [2.1, 0, 0.8, 1.4]$

Notice how the residual $F(\mathbf{x})$ is small (values around 0.1 to 0.3). The block makes a minor adjustment to the input rather than computing the output from scratch. The skip connection preserves the input’s magnitude even when the learned residual is tiny.

Without the skip connection, the output would just be $\text{ReLU}([0.1, -0.2, 0.3, -0.1]) = [0.1, 0, 0.3, 0]$ — most of the input information would be lost.
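The arithmetic above can be replayed in a few lines. This is only a sketch: the BN/conv outputs are copied from the example as given values rather than computed from real weights, and the sums are rounded to hide floating-point noise.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def add(u, v):
    # Rounded so the printed values match the hand calculation.
    return [round(a + b, 6) for a, b in zip(u, v)]

x = [2.0, -1.0, 0.5, 1.5]
bn1_out = [0.8, -0.3, 1.2, -0.5]    # BN(Conv1(x)), taken from the example
residual = [0.1, -0.2, 0.3, -0.1]   # BN(Conv2(...)), i.e. F(x), also given

hidden = relu(bn1_out)              # activation after layer 1
y = relu(add(residual, x))          # skip connection, then final ReLU
y_no_skip = relu(residual)          # what a plain block would output

print(hidden)
print(y)
print(y_no_skip)
```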

Example 2: gradient flow comparison

Consider 20 layers stacked together. For each layer $l$, the gradient is multiplied by $\frac{\partial \mathbf{y}_l}{\partial \mathbf{x}_l}$.

Plain network: the gradient through 20 layers is:

$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_{20}} \prod_{l=1}^{20} \frac{\partial H_l}{\partial \mathbf{x}_l}$

If each factor has magnitude around 0.8 (less than 1), the gradient shrinks by $0.8^{20} \approx 0.012$. Only 1.2% of the gradient survives. At 50 layers: $0.8^{50} \approx 1.4 \times 10^{-5}$. The early layers are effectively frozen.

Residual network: each block contributes $\frac{\partial F_l}{\partial \mathbf{x}_l} + \mathbf{I}$ . The gradient through 20 blocks expands to a sum of $2^{20}$ terms (each term corresponds to a path that either goes through or skips each block). Even if many terms are small, the term that skips all residual paths — the product of all identity shortcuts — contributes:

$\prod_{l=1}^{20} \mathbf{I} = \mathbf{I}$

This means the gradient has a direct path from the loss to the earliest layers with no attenuation. In practice, the gradient is a mix of short and long paths, providing a rich signal at every depth.
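The path expansion can be made concrete in the scalar case. In this sketch each block's residual derivative is a single made-up number (0.8, reused from the plain-network example), and we use 10 blocks instead of 20 to keep the enumeration small:

```python
from itertools import product
from math import prod

def residual_gradient_paths(factors):
    """Expand prod(1 + a_l) into one term per path through the blocks."""
    terms = []
    for choices in product([False, True], repeat=len(factors)):
        # True: the path goes through that block's residual branch (factor a_l).
        # False: the path takes the skip connection (factor 1).
        terms.append(prod(a if through else 1.0
                          for a, through in zip(factors, choices)))
    return terms

factors = [0.8] * 10                  # hypothetical per-block derivative of F
terms = residual_gradient_paths(factors)

num_paths = len(terms)                # 2**10 paths
identity_term = max(terms)            # the all-skip path contributes exactly 1.0
plain_20 = 0.8 ** 20                  # plain-network attenuation over 20 layers

print(num_paths, identity_term, plain_20)
```

The all-skip path is the largest single term and is untouched by depth, whereas the plain network's only path is the fully attenuated one.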

Batch normalization placement

The original ResNet uses post-activation batch normalization: Conv → BN → ReLU. Later research (He et al., 2016, “Identity Mappings in Deep Residual Networks”) showed that pre-activation works better for very deep networks:

Original (post-activation): $\mathbf{y} = \text{ReLU}(\text{BN}(\text{Conv}(\mathbf{x})) + \mathbf{x})$

Pre-activation: $\mathbf{y} = \text{Conv}(\text{ReLU}(\text{BN}(\mathbf{x}))) + \mathbf{x}$

In the pre-activation design, the shortcut path is a clean identity — nothing (no BN, no ReLU) sits on the shortcut. This makes the gradient flow even cleaner. Pre-activation ResNets showed improvements especially for networks deeper than 100 layers (ResNet-200 and beyond).
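A toy comparison shows what "clean identity" means. In this sketch BN is omitted and the conv layers are replaced by a stand-in branch that outputs zeros (as if all weights were zero), so only the placement of ReLU differs; the function names are ours.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def zero_branch(v):
    # Stand-in for the conv layers with all-zero weights: F(x) = 0.
    return [0.0 for _ in v]

def post_activation_block(x, branch):
    # ReLU comes after the addition, so it also acts on the shortcut signal.
    return relu([f + a for f, a in zip(branch(x), x)])

def pre_activation_block(x, branch):
    # The activation lives inside the branch; nothing follows the addition.
    return [f + a for f, a in zip(branch(relu(x)), x)]

x = [2.0, -1.0, 0.5, -3.0]
post = post_activation_block(x, zero_branch)
pre = pre_activation_block(x, zero_branch)

print(post)  # [2.0, 0.0, 0.5, 0.0]
print(pre)   # [2.0, -1.0, 0.5, -3.0]
```

Even with a zero residual, the post-activation block clips negative inputs, while the pre-activation block is an exact identity.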

Example 3: parameter count of a bottleneck block

Let’s count parameters for one bottleneck block in ResNet-50’s Stage 2, where the input has 512 channels, the bottleneck narrows to 128 channels, and the output has 512 channels:

1×1 Conv (reduce): $512 \times 128 \times 1 \times 1 = 65,536$ weights (plus $128$ biases if they are used; biases are often omitted when a BN layer follows)

3×3 Conv (process): $128 \times 128 \times 3 \times 3 = 147,456$ weights

1×1 Conv (expand): $128 \times 512 \times 1 \times 1 = 65,536$ weights

Batch norm (per layer): $2 \times C$ parameters (scale $\gamma$ and shift $\beta$): $2 \times 128 + 2 \times 128 + 2 \times 512 = 1,792$

Total for one block: $65,536 + 147,456 + 65,536 + 1,792 = 280,320$ parameters.

Stage 2 of ResNet-50 has 4 blocks. Treating all of them like this one (the first block actually takes 256 input channels and adds a projection shortcut), that is $4 \times 280,320 \approx 1.12M$ parameters for the stage.

Compare this to 4 basic blocks at 512 channels: $4 \times 2 \times (512 \times 512 \times 3 \times 3) = 4 \times 4,718,592 \approx 18.9M$ parameters, about 17× more.

Why ResNet works: multiple perspectives

1. Ensemble of shallow networks

Veit et al. (2016) showed that a ResNet can be viewed as an ensemble of many paths of different lengths. Unrolling a network with $n$ residual blocks produces $2^n$ paths (at each block, the signal can go through the residual function or skip it). Experiments showed that:

Most of the gradient flows through short paths (5–17 blocks in a 110-layer network)
Deleting a single residual block (forcing one path to always skip) causes only a small accuracy drop — unlike plain networks, where removing a layer is catastrophic
The network behaves less like a single deep pipeline and more like an ensemble of many moderately deep networks
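If every path is counted once, the number of residual branches a path passes through follows a binomial distribution. A small sketch for the 110-layer case, assuming the standard CIFAR layout of 54 residual blocks (3 stages of 18 basic blocks); the function name is ours:

```python
from math import comb

def path_length_distribution(num_blocks):
    """Fraction of the 2**n paths that pass through exactly k residual branches."""
    total_paths = 2 ** num_blocks
    return [comb(num_blocks, k) / total_paths for k in range(num_blocks + 1)]

num_blocks = 54  # assumed: residual blocks in a 110-layer CIFAR ResNet
dist = path_length_distribution(num_blocks)

mean_length = sum(k * p for k, p in enumerate(dist))
share_19_to_35 = sum(dist[19:36])   # paths through 19 to 35 blocks

print(mean_length, round(share_19_to_35, 3))
```

Very few paths traverse all 54 blocks. Note that this counts paths only; the gradient-carrying paths reported in the experiments are shorter still, because long paths contribute smaller gradients.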

2. Smooth loss landscape

Li et al. (2018) visualized the loss landscapes of plain and residual networks. Plain deep networks have chaotic, non-convex landscapes full of sharp minima. ResNets have much smoother landscapes — the skip connections prevent the loss surface from becoming too rough, making optimization easier.

3. Iterative refinement

Each residual block refines the representation incrementally. Early blocks handle low-level features (edges, textures), and each subsequent block adds a small correction. This is more stable than asking each layer to transform the representation completely.

Common ResNet variants

ResNeXt

Instead of a single $3 \times 3$ conv in the bottleneck, ResNeXt uses grouped convolutions — $C$ parallel paths (called “cardinality”) that each process a subset of channels:

$F(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})$

ResNeXt-50 (32×4d) uses 32 groups, each with 4 channels in the bottleneck of the first stage. This gives better accuracy than ResNet-50 at similar computational cost. The ResNeXt authors report that increasing cardinality (the number of groups) is a more effective way to gain accuracy than making the network deeper or wider.
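Grouping is what keeps the cost similar. A sketch of the weight count for one first-stage block of each kind, assuming 256 input and output channels and a ResNeXt bottleneck width of 32 × 4 = 128 channels (`grouped_conv_params` is a helper defined here, not a library function):

```python
def grouped_conv_params(in_channels, out_channels, kernel_size, groups=1):
    """Weights in a grouped conv: each group only sees its own channel slice."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = (in_channels // groups) * (out_channels // groups) * kernel_size ** 2
    return per_group * groups

# ResNet-50 bottleneck: 256 -> 64 -> 64 -> 256, no grouping.
resnet_block = (
    grouped_conv_params(256, 64, 1)
    + grouped_conv_params(64, 64, 3)
    + grouped_conv_params(64, 256, 1)
)

# ResNeXt-50 (32x4d) bottleneck: 256 -> 128 -> 128 -> 256, 3x3 conv in 32 groups.
resnext_block = (
    grouped_conv_params(256, 128, 1)
    + grouped_conv_params(128, 128, 3, groups=32)
    + grouped_conv_params(128, 256, 1)
)

print(resnet_block, resnext_block)  # 69632 70144
```

The ResNeXt block is twice as wide in the middle, yet the grouped 3×3 convolution keeps the total weight count within about 1% of the ResNet block.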

Wide ResNets (WRN)

Zagoruyko & Komodakis (2016) found that making residual blocks wider (more channels per block) can be more efficient than making the network deeper. On CIFAR, a WRN-28-10 (28 layers, 10× width multiplier) outperforms the 1001-layer ResNet, and the authors report that wide networks are several times faster to train.

SE-ResNet (Squeeze-and-Excitation)

Adds a lightweight channel attention mechanism to each residual block, applied to the output of the residual branch before it is added to the shortcut. A squeeze-and-excitation module:

Squeeze: global average pool to get one number per channel
Excitation: two FC layers that output a weight per channel
Scale: multiply each channel by its weight

This lets the network learn to emphasize informative channels and suppress less useful ones. SE-ResNet-50 beats ResNet-50 with only ~10% more parameters.
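The three steps can be sketched on a tiny feature map. This is an illustration under assumptions: the ReLU between the two FC layers and the sigmoid at the end follow the usual SE design, and every weight below is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, biases, v):
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def squeeze_excite(feature_map, w1, b1, w2, b2):
    """feature_map: list of channels, each a 2D list (H x W)."""
    # Squeeze: one average per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives a weight in (0, 1) per channel.
    hidden = [max(0.0, h) for h in dense(w1, b1, squeezed)]
    gates = [sigmoid(g) for g in dense(w2, b2, hidden)]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates

# Two channels of a 2x2 feature map; weights are arbitrary illustration values.
fmap = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 0.0]]]
w1, b1 = [[0.5, -0.5]], [0.0]          # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]   # 1 hidden unit -> 2 channels

scaled, gates = squeeze_excite(fmap, w1, b1, w2, b2)
print(gates)
```

With these weights the first channel gets the larger gate, so it is emphasized while the second is damped.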

ImageNet results

ResNet’s impact on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

Year	Model	Top-5 error	Layers
2012	AlexNet	16.4%	8
2014	VGGNet-19	7.3%	19
2014	GoogLeNet	6.7%	22
2015	ResNet-152	3.6%	152
2015	Human performance	~5.1%	—

ImageNet classification — top-5 error rate over the years

ResNet’s 3.6% top-5 error is well below the roughly 5.1% estimated for a human annotator. The winning entry was an ensemble of ResNets of different depths, including 152-layer models.

Practical usage

When to use which variant

ResNet-18 / 34: good starting points for smaller datasets, fast training. Use the basic block.
ResNet-50: the standard workhorse. Most pretrained model libraries default to this. Bottleneck blocks give a good accuracy-efficiency tradeoff.
ResNet-101 / 152: for problems where you need maximum accuracy and have enough data and compute.

Transfer learning

ResNets are among the most common backbones for transfer learning. A typical workflow:

Load a ResNet-50 pretrained on ImageNet
Remove the final FC layer (the 1000-class classifier)
Add your own classifier head (e.g., FC → softmax over your number of classes)
Fine-tune: freeze early layers, train later layers + your head, then optionally unfreeze everything with a small learning rate

ResNet features transfer well because the residual blocks learn hierarchical representations: edges → textures → parts → objects. The early features are generic enough to be useful across very different domains (medical images, satellite imagery, industrial defect detection).

Implementation notes

Initialization: use He initialization (fan-in, normal) for conv layers. The residual path benefits from starting with small outputs.
Learning rate schedule: warm up for a few epochs, then use cosine decay or step decay (divide by 10 at epochs 30, 60, 90 for ImageNet).
Data augmentation: random crop, horizontal flip, and color jitter are standard. Deeper ResNets benefit more from stronger augmentation (Cutout, MixUp, CutMix).
Regularization: the original ResNet paper uses weight decay of $1 \times 10^{-4}$; $5 \times 10^{-4}$ is also common, especially on CIFAR. Deeper variants may benefit from dropout or stochastic depth (randomly dropping entire residual blocks during training).
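The two learning rate schedules mentioned above can be written as plain functions of the epoch. A sketch only: the base learning rate of 0.1, the 5 warm-up epochs, and the function names are assumptions for illustration.

```python
import math

def step_decay(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Divide the learning rate by 10 at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

def warmup_cosine(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up to base_lr, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 4, 29, 30, 60, 90):
    print(epoch, step_decay(epoch), warmup_cosine(epoch, total_epochs=100))
```

Step decay is simpler to reason about; cosine decay avoids the abrupt drops and needs only the total epoch count as an extra setting.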

Key takeaways

Deeper plain networks degrade — they have worse training accuracy, not just test accuracy. This is an optimization problem, not an overfitting problem.
Residual learning reformulates the problem: instead of learning $H(\mathbf{x})$, learn $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$. This makes “do nothing” the easy default.
Skip connections provide a direct gradient highway through arbitrarily deep networks. The identity term in $\frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ guarantees gradient flow.
Bottleneck blocks (1×1 → 3×3 → 1×1) make very deep networks computationally feasible by reducing channel dimensions.
ResNet-50 is the standard workhorse for transfer learning and one of the most commonly used pretrained backbones.
ResNets behave like ensembles of many shallow paths, which explains their robustness and trainability.
