Transfer learning and fine-tuning
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Convolutional neural networks and Training neural networks.
Most deep learning models are not trained from scratch. They start from weights that someone else already trained on a large dataset, then adapt those weights to a new task. This is transfer learning, and it works because early layers in a network learn features that are useful across many problems. You get the benefit of large-scale training without paying the cost yourself.
Why training from scratch is expensive
A ResNet-50 has roughly 25 million parameters. Training it on ImageNet (1.4 million images, 1000 classes) takes days on multiple GPUs. Most teams don’t have that kind of data or compute for their specific problem. A hospital classifying X-ray images might have 5,000 labeled samples. A startup detecting defects on a factory line might have 500.
Training a large model on a small dataset leads to severe overfitting. The model memorizes the training data instead of learning general patterns. You could use a smaller model, but then you lose the ability to learn rich features. This is the core tension of the bias-variance tradeoff: you need enough capacity to capture complexity, but enough data to constrain that capacity.
Transfer learning breaks this tradeoff. Take a model already trained on a large dataset (the source task), then reuse its learned features for your problem (the target task). The pretrained weights give you the capacity of a large model with the data efficiency of a small one.
What pretrained models learn
Neural networks build up features in layers. In a CNN trained on natural images:
- Layer 1 learns edge detectors: horizontal, vertical, diagonal edges.
- Layer 2 combines edges into textures and corners.
- Layer 3 builds object parts: wheels, eyes, fur patches.
- Layer 4+ composes parts into whole objects: faces, cars, dogs.
The key insight is that early layers learn general features. Edges and textures are useful whether you’re classifying cats, diagnosing tumors, or detecting cracks in concrete. Later layers learn task-specific features tied to the original training objective.
This hierarchy means you can take a pretrained model, keep the general early layers, and replace or adapt the task-specific later layers. Two main strategies do this: feature extraction and fine-tuning.
Feature extraction: freeze the backbone
The simplest transfer approach. Take a pretrained model, remove its final classification layer (the “head”), and replace it with a new head for your task. Freeze all the backbone weights so they don’t update during training. Only train the new head.
The backbone acts as a fixed feature extractor. It transforms your input into a rich feature vector, and your small head learns to map those features to your target classes. Since you’re only training a few layers, this is fast and needs little data. You can often train with basic gradient descent or Adam on a single GPU in minutes.
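The freeze-and-replace pattern takes only a few lines in a framework like PyTorch. This sketch uses a small stand-in backbone to stay self-contained; with a real pretrained model (say, a torchvision ResNet) you would load its weights and swap its final layer the same way:

```python
import torch.nn as nn

# Stand-in backbone; in practice, load a model with pretrained weights.
backbone = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
)

# Freeze every backbone parameter so no gradients are computed for them.
for param in backbone.parameters():
    param.requires_grad = False

# New head for a 2-class target task; its parameters are trainable by default.
head = nn.Linear(512, 2)
model = nn.Sequential(backbone, head)

# Only the head's parameters should reach the optimizer.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Here `trainable` counts only the head's 512 × 2 + 2 = 1,026 parameters; the frozen backbone contributes none.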
When to use it: your target data is small and your target domain is similar to the source domain. If you’re classifying dog breeds and the backbone was trained on ImageNet (which includes many dog breeds), the frozen features will already be highly relevant.
Limitation: the backbone features are locked. If your target domain differs a lot from the source domain, frozen features may not capture what matters in your data.
Fine-tuning: unfreeze and adapt
Fine-tuning goes further. After attaching your new head, you unfreeze some (or all) of the backbone layers and train them alongside the head. The pretrained weights serve as a strong initialization, and training nudges them toward your target task.
The standard recipe:
- Replace the head and train only the head for a few epochs with the backbone frozen. This avoids destroying pretrained features with random gradients from the untrained head.
- Unfreeze some backbone layers, typically starting from the top (closest to the output).
- Train with a small learning rate. You want to adjust the weights gently, not overwrite them.
When to use it: your target dataset is medium-sized, or your target domain differs enough that frozen features aren’t sufficient. Fine-tuning gives the model flexibility to adapt while still benefiting from the pretrained initialization.
Risk: with a small dataset and too many unfrozen layers, you can overfit. The model has enough capacity to memorize your training data. Regularization and dropout help, but the best defense is unfreezing only as many layers as your data can support.
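The recipe above can be sketched in PyTorch. A stand-in 4-layer backbone keeps the example self-contained; the freezing and optimizer logic is identical for a real pretrained model:

```python
import torch
import torch.nn as nn

# Stand-in 4-layer backbone plus a fresh 2-class head; in practice the
# backbone would carry pretrained weights.
layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(4)])
head = nn.Linear(512, 2)

# Phase 1: freeze the whole backbone and train only the head for a few epochs.
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False
head_optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Phase 2: unfreeze the top two backbone layers, then continue training
# everything that is unfrozen with a much smaller learning rate.
for layer in layers[2:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = [p for m in [*layers, head]
             for p in m.parameters() if p.requires_grad]
finetune_optimizer = torch.optim.Adam(trainable, lr=1e-5)
```

The two phases use separate optimizers: a normal learning rate while only the head trains, then a much smaller one once pretrained layers are unfrozen.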
The diagram below shows a 4-layer model during fine-tuning. Layers 1 and 2 stay frozen (they hold general features). Layers 3, 4, and the classification head are trainable.
```mermaid
graph TD
    subgraph "Frozen Layers"
        I["Input"] --> L1["Layer 1 (frozen)"]
        L1 --> L2["Layer 2 (frozen)"]
    end
    subgraph "Unfrozen Layers"
        L2 --> L3["Layer 3 (trainable)"]
        L3 --> L4["Layer 4 (trainable)"]
    end
    subgraph "New Head"
        L4 --> H["Classification Head (trainable)"]
        H --> O["Output"]
    end
    style L1 fill:#d4e6f1,stroke:#2c3e50
    style L2 fill:#d4e6f1,stroke:#2c3e50
    style L3 fill:#d5f5e3,stroke:#27ae60
    style L4 fill:#d5f5e3,stroke:#27ae60
    style H fill:#fdebd0,stroke:#e67e22
```
In feature extraction mode, all four backbone layers would be frozen, and only the orange classification head trains. In fine-tuning mode, you choose how many layers to unfreeze based on your data and domain.
Choosing a strategy
*Figure: transfer learning accuracy by strategy.*
The decision depends on two factors: how much target data you have, and how similar the target domain is to the source domain.
```mermaid
graph TD
    Start["How much labeled target data?"] -->|Small| SmallData["Similar to source domain?"]
    Start -->|Large| LargeData["Similar to source domain?"]
    SmallData -->|Yes| FE["Feature extraction"]
    SmallData -->|No| FTTL["Fine-tune top layers carefully"]
    LargeData -->|Yes| FT["Fine-tune all layers"]
    LargeData -->|No| Scratch["Fine-tune aggressively or train from scratch"]
    style FE fill:#d5f5e3,stroke:#27ae60
    style FTTL fill:#fdebd0,stroke:#e67e22
    style FT fill:#d5f5e3,stroke:#27ae60
    style Scratch fill:#fadbd8,stroke:#e74c3c
```
The table below summarizes each strategy, when to pick it, and what to watch for.
| Strategy | Layers trained | LR for backbone | When to use | Risk |
|---|---|---|---|---|
| Feature extraction | Head only | N/A (frozen) | Small data, similar domain | Underfitting if domains differ |
| Fine-tune top layers | Head + top 1-2 layers | Very low (1e-5) | Small-medium data, moderate domain gap | Overfitting on small data |
| Fine-tune all layers | All layers | Low (1e-4 to 1e-5) | Large data, similar or moderate domain | Slower training, higher compute cost |
| Train from scratch | All layers | Normal (1e-3) | Large data, very different domain | Needs massive data and compute |
Layer-wise learning rate scaling
When you fine-tune multiple layers, a single learning rate for the whole network is not ideal. Lower layers hold general features that should change slowly. Upper layers hold task-specific features that need more adjustment. Layer-wise learning rate scaling assigns different learning rates to different layers.
The common approach: pick a base learning rate for the top layer, then multiply by a decay factor for each layer going down. For layer $l$ (counting from $1$ at the bottom to $L$ at the top), the learning rate is:

$$\eta_l = \eta_{\text{base}} \cdot \gamma^{\,L - l}$$

where $\eta_{\text{base}}$ is the top-layer learning rate and $\gamma < 1$ is the decay factor.
Worked example: layer-wise LR scaling
Setup: base learning rate $\eta_{\text{base}} = 10^{-3}$, four layers ($L = 4$), decay factor $\gamma = 0.1$.

Layer 4 is closest to the output. Layer 1 is closest to the input.

Layer 4 (top): $\eta_4 = 10^{-3} \cdot 0.1^{0} = 10^{-3}$

Layer 3: $\eta_3 = 10^{-3} \cdot 0.1^{1} = 10^{-4}$

Layer 2: $\eta_2 = 10^{-3} \cdot 0.1^{2} = 10^{-5}$

Layer 1 (bottom): $\eta_1 = 10^{-3} \cdot 0.1^{3} = 10^{-6}$

Layer 1 trains at 1/1000th the rate of layer 4. The general features in early layers barely change, while task-specific features near the output adapt quickly. In practice, $\gamma = 0.1$ is aggressive. Values closer to 1, such as $\gamma = 0.8$ or $\gamma = 0.9$, are more common, giving a gentler decay that still preserves the principle: lower layers change less.
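The schedule is a one-liner to compute. The sketch below uses a base rate of $10^{-3}$ and a decay factor of 0.1, so the bottom layer trains at 1/1000th the rate of the top layer:

```python
def layerwise_lrs(base_lr, gamma, num_layers):
    """Learning rate per layer: lr_l = base_lr * gamma ** (num_layers - l),
    with layer 1 closest to the input and layer num_layers at the output."""
    return {l: base_lr * gamma ** (num_layers - l)
            for l in range(1, num_layers + 1)}

# Four layers, base rate 1e-3 at the top, decay factor 0.1.
lrs = layerwise_lrs(base_lr=1e-3, gamma=0.1, num_layers=4)
# lrs[4] == 1e-3 (top), lrs[1] == 1e-6 (bottom)
```

In a framework like PyTorch, these per-layer rates would become separate parameter groups passed to the optimizer.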
Counting trainable parameters
Understanding how many parameters you’re actually training helps you gauge overfitting risk. More trainable parameters relative to your dataset size means higher risk. Let’s compare feature extraction and fine-tuning on a concrete architecture.
Worked example: feature extraction vs. fine-tuning parameter counts
Backbone: 4 layers, each with a $512 \times 512$ weight matrix and a 512-dimensional bias vector.

Parameters per layer: $512 \times 512 + 512 = 262{,}656$

Total backbone parameters: $4 \times 262{,}656 = 1{,}050{,}624$

Classification head: two layers mapping the 512-dimensional backbone output to 2 classes.
- Dense layer: $512 \to 128$. Parameters: $512 \times 128 + 128 = 65{,}664$.
- Output layer: $128 \to 2$. Parameters: $128 \times 2 + 2 = 258$.
- Total head: $65{,}664 + 258 = 65{,}922$.

Feature extraction (entire backbone frozen): trainable parameters $= 65{,}922$.

That is roughly 6% of the full model ($1{,}050{,}624 + 65{,}922 = 1{,}116{,}546$ parameters in total). Even with a few hundred training samples, this is manageable. The frozen backbone contributes zero trainable parameters.

Fine-tuning last 2 layers (layers 3 and 4 unfrozen): trainable parameters $= 2 \times 262{,}656 + 65{,}922 = 591{,}234$.

Now you're training about 53% of the model. You need a proportionally larger dataset to avoid overfitting. If you have 1,000 training samples and 591,234 trainable parameters, the model can easily memorize the data. This is exactly why the choice between feature extraction and fine-tuning matters: it controls how many parameters compete for your limited data.
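These tallies are easy to verify in a few lines of Python. The script below assumes the same shapes as the example: four 512 × 512 backbone layers and a head with a 128-unit hidden layer:

```python
# Backbone: four layers, each a 512x512 weight matrix plus a 512-dim bias.
per_layer = 512 * 512 + 512                 # 262,656 parameters per layer
backbone_total = 4 * per_layer              # 1,050,624

# Head: a 512 -> 128 dense layer followed by a 128 -> 2 output layer.
head_total = (512 * 128 + 128) + (128 * 2 + 2)   # 65,922

model_total = backbone_total + head_total        # 1,116,546

feature_extraction = head_total                  # backbone frozen
fine_tune_top2 = 2 * per_layer + head_total      # layers 3-4 plus the head
```

Dividing each trainable count by `model_total` recovers the roughly 6% and 53% figures above.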
Domain shift: when transfer works and when it doesn’t
Transfer learning assumes that features learned on the source domain are useful for the target domain. This assumption breaks down when the domains are very different.
Domain similarity refers to how alike the source and target data distributions are. ImageNet features transfer well to other natural image tasks (pets, food, landscapes) because the visual patterns overlap. They transfer less well to medical imaging (X-rays, MRIs) or satellite imagery, where the textures and structures look very different from everyday photographs.
| Target scenario | Similarity to ImageNet | Recommended approach | Layers to unfreeze |
|---|---|---|---|
| Pet breed classification | High | Feature extraction or light fine-tuning | Head only, or top 1 layer |
| Food image recognition | Medium-high | Fine-tune top layers | Top 2-3 layers |
| Chest X-ray diagnosis | Low-medium | Fine-tune most layers with small LR | Most or all layers |
| Satellite land-use mapping | Low | Fine-tune all layers or train from scratch | All layers |
| Microscopy cell counting | Low | Fine-tune aggressively, consider domain-specific pretraining | All layers |
When the domain gap is large, two things help. First, fine-tune more aggressively: unfreeze more layers, train longer, and possibly use a slightly higher learning rate. Second, if possible, find a model pretrained on a domain closer to yours. Medical imaging researchers often pretrain on RadImageNet (a dataset of medical images) rather than standard ImageNet. This gives a better starting point because the low-level features (tissue textures, contrast patterns) are already relevant.
Transfer learning for NLP: BERT
Transfer learning transformed computer vision first, then revolutionized natural language processing. The BERT model (Bidirectional Encoder Representations from Transformers) showed that pretraining a deep attention-based network on large text corpora, then fine-tuning for specific tasks, dramatically outperformed training task-specific models from scratch.
How BERT pretrains
BERT learns from two self-supervised objectives on unlabeled text:
- Masked language modeling: randomly mask 15% of input tokens and predict them from context. This forces the model to learn bidirectional representations. Unlike static word embeddings that assign one fixed vector per word, BERT produces context-dependent representations where the same word gets different vectors in different sentences.
- Next sentence prediction: given two sentences, predict whether the second follows the first in the original text. This teaches the model about sentence-level relationships.
The result is a model with deep knowledge of language structure, syntax, and semantics, trained on billions of words from books and Wikipedia.
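The masked language modeling objective is easy to sketch. This simplified version only replaces selected tokens with [MASK]; real BERT additionally leaves some selected tokens unchanged or swaps in random ones (the 80/10/10 rule):

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Corrupt a token sequence for masked language modeling: replace ~15%
    of tokens with [MASK] and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            targets[i] = tok   # the model must predict this from context
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_for_mlm(tokens)
```

The model sees `masked` as input and is trained to predict the entries of `targets` from the surrounding context, which is what forces the bidirectional representations.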
Fine-tuning BERT for downstream tasks
For classification, you take the output embedding of the special [CLS] token (a 768-dimensional vector in BERT-base), pass it through a linear layer, and apply softmax. You then fine-tune the entire model on your labeled data with a small learning rate, typically $2 \times 10^{-5}$ to $5 \times 10^{-5}$.
The same pretrained BERT can be fine-tuned for sentiment analysis, named entity recognition, question answering, and dozens of other tasks. You just swap the head and train on task-specific labeled data. This is the same idea as vision transfer learning: pretrain once on a huge general corpus, then adapt cheaply to any downstream problem.
Worked example: BERT classification head
We use 4 dimensions instead of 768 to keep the math clear. The approach is identical at full scale.
Setup: binary classification (class 0 or class 1).
The [CLS] token output embedding:

$$x = [0.5, -0.2, 0.3, 0.1]$$

Weight matrix and bias for the classification head:

$$W = \begin{bmatrix} 0.4 & 0.1 & 0.2 & -0.1 \\ 0.2 & -0.3 & 0.1 & 0.5 \end{bmatrix}, \qquad b = [-0.03, -0.10]$$

Step 1: compute logits.

The logits are $z = Wx + b$. We compute each dot product separately:

$$z_0 = 0.4(0.5) + 0.1(-0.2) + 0.2(0.3) + (-0.1)(0.1) - 0.03 = 0.20$$

$$z_1 = 0.2(0.5) + (-0.3)(-0.2) + 0.1(0.3) + 0.5(0.1) - 0.10 = 0.14$$

Step 2: apply softmax.

Softmax converts logits into probabilities:

$$p_0 = \frac{e^{0.20}}{e^{0.20} + e^{0.14}} = \frac{1.221}{1.221 + 1.150} \approx 0.515, \qquad p_1 \approx 0.485$$

The model gives a slight preference to class 0, but it is far from confident.

Step 3: compute cross-entropy loss.

The true label is class 0. Cross-entropy loss for the correct class:

$$\mathcal{L} = -\ln(p_0) = -\ln(0.515) \approx 0.664$$

A loss of 0.664 is high (for reference, a perfectly confident correct prediction gives $-\ln(1) = 0$). This loss signal feeds back through the entire model via backpropagation, using the chain rule to compute gradients for every layer. With a small learning rate, the pretrained BERT weights shift gradually toward better classification performance. After a few epochs on your labeled dataset, the loss drops and the model becomes confident on your task.
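The three steps can be checked with a few lines of NumPy, using illustrative 4-dimensional values in place of BERT's 768 dimensions:

```python
import numpy as np

# Illustrative values standing in for the real 768-dim [CLS] embedding
# and head parameters.
x = np.array([0.5, -0.2, 0.3, 0.1])          # [CLS] embedding
W = np.array([[0.4, 0.1, 0.2, -0.1],
              [0.2, -0.3, 0.1, 0.5]])        # head weights (2 classes x 4 dims)
b = np.array([-0.03, -0.10])                 # head bias

z = W @ x + b                                # logits: [0.20, 0.14]
p = np.exp(z) / np.exp(z).sum()              # softmax: ~[0.515, 0.485]
loss = -np.log(p[0])                         # cross-entropy, true class 0: ~0.664
```

At full scale the only change is the shapes: `x` is 768-dimensional and `W` is 2 × 768.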
Summary
Transfer learning lets you build on work already done. Pretrained models capture general features in their early layers and task-specific features in their later layers. Feature extraction freezes the backbone and trains only a new head, which is fast and data-efficient. Fine-tuning unfreezes some or all layers, adapting them with a small learning rate for greater flexibility. Layer-wise learning rate scaling preserves general features by updating lower layers more slowly. The right strategy depends on your dataset size and how similar your domain is to the pretraining domain. In NLP, models like BERT brought the same idea to language: pretrain once on a massive corpus, fine-tune cheaply for any downstream task.
What comes next
With transfer learning in your toolkit, you can train strong models even with limited data. The next article covers optimization techniques for deep networks, including batch normalization, weight initialization strategies, and learning rate schedules that make training faster and more stable.