
Transfer learning and fine-tuning

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Convolutional neural networks and Training neural networks.

Most deep learning models are not trained from scratch. They start from weights that someone else already trained on a large dataset, then adapt those weights to a new task. This is transfer learning, and it works because early layers in a network learn features that are useful across many problems. You get the benefit of large-scale training without paying the cost yourself.

Why training from scratch is expensive

A ResNet-50 has roughly 25 million parameters. Training it on ImageNet (1.4 million images, 1000 classes) takes days on multiple GPUs. Most teams don’t have that kind of data or compute for their specific problem. A hospital classifying X-ray images might have 5,000 labeled samples. A startup detecting defects on a factory line might have 500.

Training a large model on a small dataset leads to severe overfitting. The model memorizes the training data instead of learning general patterns. You could use a smaller model, but then you lose the ability to learn rich features. This is the core tension of the bias-variance tradeoff: you need enough capacity to capture complexity, but enough data to constrain that capacity.

Transfer learning breaks this tradeoff. Take a model already trained on a large dataset (the source task), then reuse its learned features for your problem (the target task). The pretrained weights give you the capacity of a large model with the data efficiency of a small one.

What pretrained models learn

Neural networks build up features in layers. In a CNN trained on natural images:

  • Layer 1 learns edge detectors: horizontal, vertical, diagonal edges.
  • Layer 2 combines edges into textures and corners.
  • Layer 3 builds object parts: wheels, eyes, fur patches.
  • Layer 4+ composes parts into whole objects: faces, cars, dogs.

The key insight is that early layers learn general features. Edges and textures are useful whether you’re classifying cats, diagnosing tumors, or detecting cracks in concrete. Later layers learn task-specific features tied to the original training objective.

This hierarchy means you can take a pretrained model, keep the general early layers, and replace or adapt the task-specific later layers. Two main strategies do this: feature extraction and fine-tuning.

Feature extraction: freeze the backbone

The simplest transfer approach. Take a pretrained model, remove its final classification layer (the “head”), and replace it with a new head for your task. Freeze all the backbone weights so they don’t update during training. Only train the new head.

The backbone acts as a fixed feature extractor. It transforms your input into a rich feature vector, and your small head learns to map those features to your target classes. Since you’re only training a few layers, this is fast and needs little data. You can often train with basic gradient descent or Adam on a single GPU in minutes.

When to use it: your target data is small and your target domain is similar to the source domain. If you’re classifying dog breeds and the backbone was trained on ImageNet (which includes many dog breeds), the frozen features will already be highly relevant.

Limitation: the backbone features are locked. If your target domain differs a lot from the source domain, frozen features may not capture what matters in your data.
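The feature-extraction idea can be shown numerically without any framework. Below is a minimal sketch under toy assumptions: `frozen_backbone` stands in for a pretrained network (its weights are a fixed, made-up function and never update), and only a small logistic-regression head is trained with plain SGD.

```python
import math

# Toy stand-in for a pretrained backbone. Its "weights" are fixed:
# nothing in the training loop below ever modifies this function.
def frozen_backbone(x):
    return [x[0] + x[1], x[0] - x[1]]  # 2-dim "feature vector"

# Only the head's parameters are trainable.
w = [0.0, 0.0]
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy binary dataset: label 1 when x0 + x1 > 1.
data = [([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.1, 0.3], 0), ([0.7, 0.9], 1)]

lr = 0.5
for _ in range(200):
    for x, y in data:
        f = frozen_backbone(x)               # fixed features
        p = sigmoid(w[0]*f[0] + w[1]*f[1] + b)
        g = p - y                            # dL/dz for logistic loss
        w[0] -= lr * g * f[0]                # update the head only
        w[1] -= lr * g * f[1]
        b    -= lr * g

# The head learns to separate the classes using frozen features.
preds = [int(sigmoid(w[0]*frozen_backbone(x)[0] + w[1]*frozen_backbone(x)[1] + b) > 0.5)
         for x, _ in data]
print(preds)  # [0, 1, 0, 1]
```

In a real framework the same effect comes from marking backbone parameters as non-trainable (e.g. `requires_grad = False` in PyTorch) and passing only the head's parameters to the optimizer.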

Fine-tuning: unfreeze and adapt

Fine-tuning goes further. After attaching your new head, you unfreeze some (or all) of the backbone layers and train them alongside the head. The pretrained weights serve as a strong initialization, and training nudges them toward your target task.

The standard recipe:

  1. Replace the head and train only the head for a few epochs with the backbone frozen. This avoids destroying pretrained features with random gradients from the untrained head.
  2. Unfreeze some backbone layers, typically starting from the top (closest to the output).
  3. Train with a small learning rate. You want to adjust the weights gently, not overwrite them.

When to use it: your target dataset is medium-sized, or your target domain differs enough that frozen features aren’t sufficient. Fine-tuning gives the model flexibility to adapt while still benefiting from the pretrained initialization.

Risk: with a small dataset and too many unfrozen layers, you can overfit. The model has enough capacity to memorize your training data. Regularization and dropout help, but the best defense is unfreezing only as many layers as your data can support.
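The three-step recipe can be sketched with a toy one-dimensional model (all names and numbers here are illustrative assumptions, not a real training setup): the prediction is `w_head * (w_backbone * x)`, the backbone weight starts from a "pretrained" value, and unfreezing is just giving the backbone a nonzero (and much smaller) learning rate.

```python
# Toy sketch of the recipe: y_hat = w_head * (w_backbone * x), squared-error loss.
w_backbone = 2.0                  # "pretrained" weight
w_head = 0.1                      # freshly initialized head
data = [(1.0, 3.0), (2.0, 6.0)]   # target function: y = 3x

def step(lr_head, lr_backbone):
    global w_backbone, w_head
    for x, y in data:
        h = w_backbone * x
        g = 2 * (w_head * h - y)                     # dL/dy_hat
        w_head -= lr_head * g * h                    # head gradient
        w_backbone -= lr_backbone * g * w_head * x   # backbone gradient

# Step 1: backbone frozen (learning rate 0), train only the head.
for _ in range(100):
    step(lr_head=0.05, lr_backbone=0.0)

# Steps 2-3: unfreeze the backbone with a much smaller learning rate.
for _ in range(100):
    step(lr_head=0.05, lr_backbone=0.005)

print(round(w_head * w_backbone, 2))  # the product approaches 3.0
```

The frozen phase lets the head settle before any gradient reaches the backbone, which is exactly the point of step 1 in the recipe.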

The diagram below shows a 4-layer model during fine-tuning. Layers 1 and 2 stay frozen (they hold general features). Layers 3, 4, and the classification head are trainable.

graph TD
  subgraph "Frozen Layers"
      I["Input"] --> L1["Layer 1 (frozen)"]
      L1 --> L2["Layer 2 (frozen)"]
  end
  subgraph "Unfrozen Layers"
      L2 --> L3["Layer 3 (trainable)"]
      L3 --> L4["Layer 4 (trainable)"]
  end
  subgraph "New Head"
      L4 --> H["Classification Head (trainable)"]
      H --> O["Output"]
  end
  style L1 fill:#d4e6f1,stroke:#2c3e50
  style L2 fill:#d4e6f1,stroke:#2c3e50
  style L3 fill:#d5f5e3,stroke:#27ae60
  style L4 fill:#d5f5e3,stroke:#27ae60
  style H fill:#fdebd0,stroke:#e67e22

In feature extraction mode, all four backbone layers would be frozen, and only the orange classification head trains. In fine-tuning mode, you choose how many layers to unfreeze based on your data and domain.

Choosing a strategy

[Chart: transfer learning accuracy by strategy]

The decision depends on two factors: how much target data you have, and how similar the target domain is to the source domain.

graph TD
  Start["How much labeled target data?"] -->|Small| SmallData["Similar to source domain?"]
  Start -->|Large| LargeData["Similar to source domain?"]
  SmallData -->|Yes| FE["Feature extraction"]
  SmallData -->|No| FTTL["Fine-tune top layers carefully"]
  LargeData -->|Yes| FT["Fine-tune all layers"]
  LargeData -->|No| Scratch["Fine-tune aggressively or train from scratch"]
  style FE fill:#d5f5e3,stroke:#27ae60
  style FTTL fill:#fdebd0,stroke:#e67e22
  style FT fill:#d5f5e3,stroke:#27ae60
  style Scratch fill:#fadbd8,stroke:#e74c3c

The table below summarizes each strategy, when to pick it, and what to watch for.

| Strategy | Layers trained | LR for backbone | When to use | Risk |
| --- | --- | --- | --- | --- |
| Feature extraction | Head only | N/A (frozen) | Small data, similar domain | Underfitting if domains differ |
| Fine-tune top layers | Head + top 1-2 layers | Very low (1e-5) | Small-medium data, moderate domain gap | Overfitting on small data |
| Fine-tune all layers | All layers | Low (1e-4 to 1e-5) | Large data, similar or moderate domain | Slower training, higher compute cost |
| Train from scratch | All layers | Normal (1e-3) | Large data, very different domain | Needs massive data and compute |

Layer-wise learning rate scaling

When you fine-tune multiple layers, a single learning rate for the whole network is not ideal. Lower layers hold general features that should change slowly. Upper layers hold task-specific features that need more adjustment. Layer-wise learning rate scaling assigns different learning rates to different layers.

The common approach: pick a base learning rate for the top layer, then multiply by a decay factor $\gamma$ for each layer going down. For layer $l$ (counting from 1 at the bottom to $L$ at the top), the learning rate is:

\text{lr}_l = \text{lr}_{\text{base}} \times \gamma^{(L - l)}

Worked example: layer-wise LR scaling

Setup: base learning rate $\text{lr}_{\text{base}} = 0.001$, four layers, decay factor $\gamma = 0.1$.

Layer 4 is closest to the output. Layer 1 is closest to the input.

Layer 4 (top):

\text{lr}_4 = 0.001 \times 0.1^{(4-4)} = 0.001 \times 0.1^0 = 0.001 \times 1 = 0.001

Layer 3:

\text{lr}_3 = 0.001 \times 0.1^{(4-3)} = 0.001 \times 0.1^1 = 0.001 \times 0.1 = 0.0001

Layer 2:

\text{lr}_2 = 0.001 \times 0.1^{(4-2)} = 0.001 \times 0.1^2 = 0.001 \times 0.01 = 0.00001

Layer 1 (bottom):

\text{lr}_1 = 0.001 \times 0.1^{(4-1)} = 0.001 \times 0.1^3 = 0.001 \times 0.001 = 0.000001

Layer 1 trains at 1/1000th the rate of layer 4. The general features in early layers barely change, while task-specific features near the output adapt quickly. In practice, $\gamma = 0.1$ is aggressive. Values like $\gamma = 0.5$ or $\gamma = 0.7$ are more common, giving a gentler decay that still preserves the principle: lower layers change less.
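The schedule is a one-liner in code. This sketch (the function name is ours, not a library API) returns a per-layer learning-rate map:

```python
# Layer-wise LR decay: layers numbered 1 (bottom) to num_layers (top),
# lr_l = lr_base * gamma ** (num_layers - l).
def layerwise_lrs(lr_base, num_layers, gamma):
    return {l: lr_base * gamma ** (num_layers - l)
            for l in range(1, num_layers + 1)}

# The worked example's aggressive decay:
print(layerwise_lrs(0.001, 4, 0.1))
# A gentler, more typical decay:
print(layerwise_lrs(0.001, 4, 0.5))
```

In PyTorch, the same effect is achieved by passing per-layer parameter groups with different `lr` values to the optimizer.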

Counting trainable parameters

Understanding how many parameters you’re actually training helps you gauge overfitting risk. More trainable parameters relative to your dataset size means higher risk. Let’s compare feature extraction and fine-tuning on a concrete architecture.

Worked example: feature extraction vs. fine-tuning parameter counts

Backbone: 4 layers, each with a $512 \times 512$ weight matrix and a 512-dimensional bias vector.

Parameters per layer:

512 \times 512 + 512 = 262{,}144 + 512 = 262{,}656

Total backbone parameters:

4 \times 262{,}656 = 1{,}050{,}624

Classification head: two layers mapping the backbone output to 2 classes.

  • Dense layer: $512 \to 128$. Parameters: $512 \times 128 + 128 = 65{,}536 + 128 = 65{,}664$.
  • Output layer: $128 \to 2$. Parameters: $128 \times 2 + 2 = 256 + 2 = 258$.
  • Total head: $65{,}664 + 258 = 65{,}922$.

Feature extraction (entire backbone frozen):

\text{Trainable parameters} = 65{,}922

That is roughly 6% of the full model. Even with a few hundred training samples, this is manageable. The frozen backbone contributes zero trainable parameters.

Fine-tuning last 2 layers (layers 3 and 4 unfrozen):

\text{Unfrozen backbone} = 2 \times 262{,}656 = 525{,}312

\text{Total trainable} = 525{,}312 + 65{,}922 = 591{,}234

Now you're training about 53% of the model. You need a proportionally larger dataset to avoid overfitting. If you have 1,000 training samples and 591,234 trainable parameters, the model can easily memorize the data. This is exactly why the choice between feature extraction and fine-tuning matters: it controls how many parameters compete for your limited data.
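The counts above are easy to verify in a few lines. This sketch uses the example's assumed layer sizes (four 512-by-512 backbone layers, a 512-to-128-to-2 head):

```python
# Parameters in a dense layer: weight matrix plus bias vector.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

backbone = [dense_params(512, 512) for _ in range(4)]
head = dense_params(512, 128) + dense_params(128, 2)

feature_extraction = head                  # backbone fully frozen
fine_tune_top2 = sum(backbone[2:]) + head  # layers 3 and 4 unfrozen

print(feature_extraction)  # 65922
print(fine_tune_top2)      # 591234
total = sum(backbone) + head
print(round(100 * fine_tune_top2 / total))  # 53 (% of all parameters)
```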

Domain shift: when transfer works and when it doesn’t

Transfer learning assumes that features learned on the source domain are useful for the target domain. This assumption breaks down when the domains are very different.

Domain similarity refers to how alike the source and target data distributions are. ImageNet features transfer well to other natural image tasks (pets, food, landscapes) because the visual patterns overlap. They transfer less well to medical imaging (X-rays, MRIs) or satellite imagery, where the textures and structures look very different from everyday photographs.

| Target scenario | Similarity to ImageNet | Recommended approach | Layers to unfreeze |
| --- | --- | --- | --- |
| Pet breed classification | High | Feature extraction or light fine-tuning | Head only, or top 1 layer |
| Food image recognition | Medium-high | Fine-tune top layers | Top 2-3 layers |
| Chest X-ray diagnosis | Low-medium | Fine-tune most layers with small LR | Most or all layers |
| Satellite land-use mapping | Low | Fine-tune all layers or train from scratch | All layers |
| Microscopy cell counting | Low | Fine-tune aggressively, consider domain-specific pretraining | All layers |

When the domain gap is large, two things help. First, fine-tune more aggressively: unfreeze more layers, train longer, and possibly use a slightly higher learning rate. Second, if possible, find a model pretrained on a domain closer to yours. Medical imaging researchers often pretrain on RadImageNet (a dataset of medical images) rather than standard ImageNet. This gives a better starting point because the low-level features (tissue textures, contrast patterns) are already relevant.

Transfer learning for NLP: BERT

Transfer learning transformed computer vision first, then revolutionized natural language processing. The BERT model (Bidirectional Encoder Representations from Transformers) showed that pretraining a deep attention-based network on large text corpora, then fine-tuning for specific tasks, dramatically outperformed training task-specific models from scratch.

How BERT pretrains

BERT learns from two self-supervised objectives on unlabeled text:

  1. Masked language modeling: randomly mask 15% of input tokens and predict them from context. This forces the model to learn bidirectional representations. Unlike static word embeddings that assign one fixed vector per word, BERT produces context-dependent representations where the same word gets different vectors in different sentences.
  2. Next sentence prediction: given two sentences, predict whether the second follows the first in the original text. This teaches the model about sentence-level relationships.

The result is a model with deep knowledge of language structure, syntax, and semantics, trained on billions of words from books and Wikipedia.
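The masking step in objective 1 can be sketched in a few lines. This is a toy illustration, not BERT's actual implementation: it only selects ~15% of positions and replaces them with a [MASK] symbol (real BERT additionally leaves some selected tokens unchanged or swaps in random tokens).

```python
import random

# Toy BERT-style masking: choose ~15% of token positions and mask them.
# The model's pretraining job is to predict the originals at those positions.
def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)                      # fixed seed for repeatability
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

tokens = ("the cat sat on the mat because it was tired "
          "after a long day of chasing mice in the garden").split()
masked, positions = mask_tokens(tokens)  # 20 tokens -> 3 masked positions
print(masked)
```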

Fine-tuning BERT for downstream tasks

For classification, you take the output embedding of the special [CLS] token (a 768-dimensional vector in BERT-base), pass it through a linear layer, and apply softmax. You then fine-tune the entire model on your labeled data with a small learning rate, typically $2 \times 10^{-5}$ to $5 \times 10^{-5}$.

The same pretrained BERT can be fine-tuned for sentiment analysis, named entity recognition, question answering, and dozens of other tasks. You just swap the head and train on task-specific labeled data. This is the same idea as vision transfer learning: pretrain once on a huge general corpus, then adapt cheaply to any downstream problem.

Worked example: BERT classification head

We use 4 dimensions instead of 768 to keep the math clear. The approach is identical at full scale.

Setup: binary classification (class 0 or class 1).

The [CLS] token output embedding:

h = \begin{bmatrix} 0.3 & -0.1 & 0.5 & 0.2 \end{bmatrix}

Weight matrix $W \in \mathbb{R}^{4 \times 2}$ and bias $b \in \mathbb{R}^2$ for the classification head:

W = \begin{bmatrix} 0.4 & -0.2 \\ 0.1 & 0.3 \\ -0.3 & 0.5 \\ 0.2 & -0.1 \end{bmatrix}, \quad b = \begin{bmatrix} 0.1 & -0.1 \end{bmatrix}

Step 1: compute logits.

The logits are $z = hW + b$. We compute each dot product separately:

z_0 = (0.3)(0.4) + (-0.1)(0.1) + (0.5)(-0.3) + (0.2)(0.2) + 0.1 = 0.12 - 0.01 - 0.15 + 0.04 + 0.1 = 0.10

z_1 = (0.3)(-0.2) + (-0.1)(0.3) + (0.5)(0.5) + (0.2)(-0.1) + (-0.1) = -0.06 - 0.03 + 0.25 - 0.02 - 0.1 = 0.04

z = \begin{bmatrix} 0.10 & 0.04 \end{bmatrix}

Step 2: apply softmax.

Softmax converts logits into probabilities:

p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}

e^{0.10} \approx 1.1052, \quad e^{0.04} \approx 1.0408

\text{sum} = 1.1052 + 1.0408 = 2.1460

p_0 = \frac{1.1052}{2.1460} \approx 0.5150, \quad p_1 = \frac{1.0408}{2.1460} \approx 0.4850

The model gives a slight preference to class 0, but it is far from confident.

Step 3: compute cross-entropy loss.

The true label is class 0. Cross-entropy loss for the correct class:

L = -\log(p_{\text{true}}) = -\log(0.5150) \approx 0.664

A loss of 0.664 is high (for reference, a perfectly confident correct prediction gives $L = 0$). This loss signal feeds back through the entire model via backpropagation, using the chain rule to compute gradients for every layer. With a small learning rate, the pretrained BERT weights shift gradually toward better classification performance. After a few epochs on your labeled dataset, the loss drops and the model becomes confident on your task.
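The three steps above can be checked end to end in plain Python, using the same toy 4-dimensional embedding, weights, and bias:

```python
import math

# The toy [CLS] embedding and head parameters from the worked example.
h = [0.3, -0.1, 0.5, 0.2]
W = [[0.4, -0.2],
     [0.1,  0.3],
     [-0.3, 0.5],
     [0.2, -0.1]]
b = [0.1, -0.1]

# Step 1: logits z = hW + b
z = [sum(h[i] * W[i][k] for i in range(4)) + b[k] for k in range(2)]

# Step 2: softmax over the two logits
exps = [math.exp(zk) for zk in z]
p = [e / sum(exps) for e in exps]

# Step 3: cross-entropy loss for true class 0
loss = -math.log(p[0])

print([round(zk, 2) for zk in z])  # [0.1, 0.04]
print([round(pk, 3) for pk in p])  # [0.515, 0.485]
print(round(loss, 3))              # 0.664
```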

Summary

Transfer learning lets you build on work already done. Pretrained models capture general features in their early layers and task-specific features in their later layers. Feature extraction freezes the backbone and trains only a new head, which is fast and data-efficient. Fine-tuning unfreezes some or all layers, adapting them with a small learning rate for greater flexibility. Layer-wise learning rate scaling preserves general features by updating lower layers more slowly. The right strategy depends on your dataset size and how similar your domain is to the pretraining domain. In NLP, models like BERT brought the same idea to language: pretrain once on a massive corpus, fine-tune cheaply for any downstream task.

What comes next

With transfer learning in your toolkit, you can train strong models even with limited data. The next article covers optimization techniques for deep networks, including batch normalization, weight initialization strategies, and learning rate schedules that make training faster and more stable.
