
Representation learning and self-supervised learning

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

What makes a good representation?

A representation is a transformation of raw data into a form that makes useful information easy to extract. Your raw data might be a $224 \times 224 \times 3$ image (150,528 numbers). A good representation compresses this into, say, a 512-dimensional vector that captures the important structure and discards noise.

Three properties define a good representation:

Compact: It uses far fewer dimensions than the raw data, keeping only what matters. This is not just about storage. A lower-dimensional representation has a smaller hypothesis space, which means simpler downstream classifiers can work with it.

Disentangled: Different dimensions correspond to different factors of variation. One dimension might capture lighting, another object shape, another texture. When factors are entangled, you need more data and more complex models to tease them apart.

Transferable: A representation learned on one task works well on other tasks. If you learn features on ImageNet that also help with medical image classification, those features capture something genuinely useful about visual structure.

The key question is: how do you learn these representations? Supervised learning works, but it requires labels. Labels are expensive. Self-supervised learning aims to learn representations from the data itself.

[Figure: learned 2D features showing class separation]

Autoencoders as representation learners

An autoencoder has two parts: an encoder $f$ maps input $x$ to a lower-dimensional code $z = f(x)$, and a decoder $g$ reconstructs the input as $\hat{x} = g(z)$. Training minimizes reconstruction error:

$$\mathcal{L} = \|x - g(f(x))\|^2$$

The bottleneck forces the encoder to keep only the most important information. If $z$ has 64 dimensions and $x$ has 150,528, the encoder must compress by a factor of ~2,350x. It can only do this by learning the structure in the data.

The learned $z$ is a representation. But plain autoencoders have a problem: they tend to learn identity-like mappings through the bottleneck if the capacity is high enough. The representation may encode pixel patterns rather than semantic content.
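
Here is a minimal PyTorch sketch of this encoder-decoder structure. The layer widths, the 784-dimensional input, and the 64-dimensional code are illustrative assumptions, not prescribed values:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=64):
        super().__init__()
        # Encoder f: compress the input down to the bottleneck code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder g: reconstruct x_hat from z
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
x = torch.rand(32, 784)                  # a batch of flattened inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction error ||x - g(f(x))||^2
```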

Denoising autoencoders

A denoising autoencoder (DAE) addresses this by corrupting the input and training the network to recover the clean version:

$$\mathcal{L}_{\text{DAE}} = \|x - g(f(\tilde{x}))\|^2$$

where $\tilde{x}$ is a corrupted version of $x$ (e.g., adding Gaussian noise, randomly zeroing pixels, or applying random masks).

Why does this help? To denoise, the encoder must learn the underlying structure of clean data. It can’t simply memorize pixel values because the noise changes every time. The representation must capture what’s typical about the data, exactly what we want for downstream tasks.
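
In code, the only change from the plain autoencoder is the corruption step and the target of the loss. A sketch, reusing the `Autoencoder` class from above and assuming Gaussian corruption:

```python
import torch
import torch.nn.functional as F

def dae_loss(model, x, noise_std=0.15):
    # Corrupt the input with fresh Gaussian noise on every call...
    x_tilde = x + noise_std * torch.randn_like(x)
    x_hat, _ = model(x_tilde)
    # ...but measure reconstruction against the CLEAN input x
    return F.mse_loss(x_hat, x)

loss = dae_loss(model, x)  # model and x from the sketch above
```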

This idea, corrupting inputs and training to recover the original, is a precursor to the masked autoencoder approach we’ll see later.

Contrastive learning: pull similar, push dissimilar

Contrastive learning takes a different approach. Instead of reconstruction, it learns representations by comparing pairs of samples. The core idea: representations of similar items should be close together, and representations of dissimilar items should be far apart.

For each sample, you create a positive pair (two views of the same thing) and negative pairs (views of different things). The loss pulls positives together in embedding space and pushes negatives apart.

Where do the pairs come from? In self-supervised learning, you create them through data augmentation. Take an image, apply two different random augmentations (crop, flip, color jitter, blur), and call those the positive pair. Any other image in the batch is a negative.

This is powerful because the network must learn what’s invariant across augmentations. Color jitter forces it to not rely on exact colors. Cropping forces it to not rely on absolute position. What’s left? The actual content, shape, texture, and identity of objects.
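
As a concrete sketch, here is how such a positive pair might be generated with torchvision. The augmentation menu mirrors the list above; the parameter values are illustrative assumptions, not a specific paper's settings:

```python
from torchvision import transforms

# One stochastic pipeline; two independent draws give two different views.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(image):
    # Two independent augmentations of the same image form the positive pair
    return augment(image), augment(image)
```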

```mermaid
flowchart TD
  IMG["Original image x"] --> AUG1["Augmentation 1<br/>(crop + color jitter)"]
  IMG --> AUG2["Augmentation 2<br/>(crop + blur)"]
  AUG1 --> ENC1["Encoder f"]
  AUG2 --> ENC2["Encoder f<br/>(shared weights)"]
  ENC1 --> Z1["z₁"]
  ENC2 --> Z2["z₂"]
  Z1 -->|"Pull together"| LOSS["Contrastive loss"]
  Z2 -->|"Pull together"| LOSS
  NEG["z₃, z₄, ... (other images)"] -->|"Push apart"| LOSS

  style IMG fill:#9775fa,color:#fff
  style LOSS fill:#ff6b6b,color:#fff
```

SimCLR: a simple contrastive framework

SimCLR (2020) is a clean, effective contrastive learning framework. The pipeline:

  1. Sample a batch of $N$ images
  2. Apply two random augmentations to each, creating $2N$ augmented views
  3. Encode all views with a shared encoder $f$ (e.g., ResNet-50)
  4. Project through a small MLP head $g$ to get $z_i = g(f(\tilde{x}_i))$
  5. Compute the NT-Xent loss over all pairs

The NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss for a positive pair $(i, j)$ is:

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}$$

where $\text{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\| \|z_j\|}$ is cosine similarity and $\tau$ is a temperature parameter.

The temperature $\tau$ controls how sharp the distribution is. Small $\tau$ makes the loss focus heavily on hard negatives. Large $\tau$ treats all negatives more equally. SimCLR uses $\tau = 0.5$.
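
A compact implementation of this loss might look as follows. This is a sketch that assumes the two views arrive as aligned batches `z1` and `z2` (row $i$ of one paired with row $i$ of the other); real implementations differ in the details:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two augmented views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit vectors
    sim = z @ z.t() / tau              # temperature-scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))  # the indicator k != i: drop self-similarity
    N = z1.shape[0]
    # Row i (first view) is positive with row i+N (second view), and vice versa
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(N)])
    # Cross-entropy over the 2N-1 others = -log(exp(pos) / sum of all exp)
    return F.cross_entropy(sim, targets)
```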

A key finding: the projection head $g$ matters. Representations for downstream tasks are taken from before the projection head (from $f$, not from $g$). The contrastive loss forces $g$'s output to be invariant to the augmentations, so the head learns to throw away augmentation-related information (like color) that is often still useful downstream.

```mermaid
flowchart LR
  X["Image x"] --> T1["Aug t₁"] --> E1["Encoder f"] --> H1["h₁"]
  X --> T2["Aug t₂"] --> E2["Encoder f"] --> H2["h₂"]
  H1 --> P1["Projection g"] --> Z1["z₁"]
  H2 --> P2["Projection g"] --> Z2["z₂"]
  Z1 --> NT["NT-Xent Loss"]
  Z2 --> NT

  H1 -.->|"Use for<br/>downstream"| DS["Downstream task"]

  style X fill:#9775fa,color:#fff
  style NT fill:#ff6b6b,color:#fff
  style DS fill:#51cf66,color:#fff
```

Masked autoencoders (MAE)

Masked Autoencoders (2021) brought the “predict the missing part” idea from NLP (like BERT’s masked language modeling) to vision. The approach is simple:

  1. Divide the image into patches (e.g., $16 \times 16$ patches for a $224 \times 224$ image = 196 patches)
  2. Randomly mask a large fraction (75%) of patches
  3. Encode only the visible patches with a Vision Transformer
  4. Decode to reconstruct the masked patches

Why mask 75%? With fewer masked patches the task is too easy: the model can interpolate from nearby visible patches. With 75% masking, the model must understand global structure to fill in the gaps. This is a much harder pretext task, so it produces better representations.

MAE is computationally efficient because the encoder only processes 25% of patches. The decoder is lightweight and only used during pretraining. For downstream tasks, you keep just the encoder.
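
The random masking step can be sketched in a few lines. This simplified version assumes the patches are already embedded as an `(N, L, D)` tensor, and it omits the index bookkeeping the MAE decoder needs to restore patch order:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (N, L, D) patch embeddings. Returns only the visible patches."""
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L)                       # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :len_keep]  # a random 25% of indices
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

x = torch.randn(8, 196, 768)       # 196 patches per image, 768-dim embeddings
visible, kept = random_masking(x)
print(visible.shape)               # torch.Size([8, 49, 768]): encoder sees 25%
```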

Compared to contrastive learning, MAE doesn’t need negative pairs, large batch sizes, or momentum encoders. It’s simpler and scales well.

Self-supervised vs supervised: label efficiency

The practical value of self-supervised learning shows up in label efficiency. Here’s the typical workflow:

  1. Pretrain on a large unlabeled dataset using self-supervised learning
  2. Fine-tune on a small labeled dataset for your specific task

Compared to training from scratch on the small labeled dataset, self-supervised pretraining gives dramatically better results. The pretrained model has already learned low-level features (edges, textures), mid-level features (parts, shapes), and some high-level features (object structure). Fine-tuning just adapts these to your specific classification task.
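
A sketch of both stages in PyTorch. The names `encoder` and `num_classes` are hypothetical placeholders for a pretrained backbone (assumed here to output 512-dimensional features) and your task's class count; the learning rates are illustrative:

```python
import torch
import torch.nn as nn

# Stage 1: linear evaluation. Freeze the pretrained encoder and train
# only a linear classifier on its features.
for p in encoder.parameters():      # `encoder`: your pretrained backbone
    p.requires_grad = False
probe = nn.Linear(512, num_classes)  # `num_classes`: your task's classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Stage 2: fine-tune end-to-end, with a smaller learning rate on the
# pretrained weights so they are adapted gently rather than overwritten.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {'params': encoder.parameters(), 'lr': 1e-5},
    {'params': probe.parameters(), 'lr': 1e-4},
])
```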

This matters most when labeled data is scarce. In medical imaging, for example, you might have millions of unlabeled X-rays but only a few hundred expert-labeled examples of a rare condition. Self-supervised pretraining on the unlabeled data, followed by fine-tuning on the labeled examples, can outperform a supervised model trained on 10x or even 100x more labeled data.

Self-supervised methods comparison

| Method | Modality | Pretext task | Year | Key result vs supervised |
| --- | --- | --- | --- | --- |
| Denoising AE | Images | Reconstruct clean from noisy | 2008 | Early representation learning baseline |
| SimCLR | Images | Contrastive (augmented views) | 2020 | 76.5% top-1 on ImageNet (linear eval) |
| MoCo v2 | Images | Contrastive (momentum encoder) | 2020 | Matched supervised on detection/segmentation |
| BYOL | Images | Predict one view from another (no negatives) | 2020 | 74.3% without negative pairs |
| MAE | Images | Reconstruct masked patches | 2021 | 87.8% top-1 fine-tuned (ViT-H) |
| BERT | Text | Masked language modeling | 2018 | Transformed NLP, state-of-the-art on 11 tasks |
| DINO | Images | Self-distillation, no labels | 2021 | Emergent segmentation in attention maps |

Example 1: NT-Xent loss

Let’s compute the NT-Xent loss for a simplified case with one positive pair and one negative.

Representations (after projection head):

  • Anchor: $z_1 = [1, 0]$
  • Positive: $z_1^+ = [0.9, 0.1]$
  • Negative: $z_2 = [-0.8, 0.6]$

Temperature: $\tau = 0.5$

Step 1: Cosine similarities

$$\text{sim}(z_1, z_1^+) = \frac{z_1 \cdot z_1^+}{\|z_1\| \|z_1^+\|} = \frac{(1)(0.9) + (0)(0.1)}{\sqrt{1} \cdot \sqrt{0.81 + 0.01}} = \frac{0.9}{1.0 \times 0.9055} = 0.9939$$

$$\text{sim}(z_1, z_2) = \frac{(1)(-0.8) + (0)(0.6)}{\sqrt{1} \cdot \sqrt{0.64 + 0.36}} = \frac{-0.8}{1.0 \times 1.0} = -0.8$$

Step 2: Scale by temperature

$$\frac{\text{sim}(z_1, z_1^+)}{\tau} = \frac{0.9939}{0.5} = 1.9878 \qquad \frac{\text{sim}(z_1, z_2)}{\tau} = \frac{-0.8}{0.5} = -1.6$$

Step 3: NT-Xent loss

$$\ell = -\log \frac{\exp(1.9878)}{\exp(1.9878) + \exp(-1.6)} = -\log \frac{7.298}{7.298 + 0.2019} = -\log \frac{7.298}{7.4999} = -\log(0.9731) = 0.0273$$

The loss is very low (0.0273) because the positive pair is very similar (cosine 0.99) and the negative is very dissimilar (cosine -0.8). The model has done a good job here.

Now suppose we had a harder negative, $z_3 = [0.7, 0.3]$ with $\text{sim}(z_1, z_3) = \frac{0.7}{\sqrt{0.58}} = 0.919$. The temperature-scaled value would be $0.919 / 0.5 = 1.838$, and:

$$\ell = -\log \frac{\exp(1.9878)}{\exp(1.9878) + \exp(1.838)} = -\log \frac{7.298}{7.298 + 6.284} = -\log(0.537) = 0.622$$

Much higher loss! Hard negatives (samples that are similar to the anchor but shouldn’t be) drive the most learning.
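
You can verify both cases with a few lines of NumPy. `nt_xent_single` is a helper written for this example, not a library function:

```python
import numpy as np

def nt_xent_single(anchor, positive, negatives, tau=0.5):
    # Single-anchor NT-Xent, mirroring the hand calculation above
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

z1 = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
print(nt_xent_single(z1, pos, [np.array([-0.8, 0.6])]))  # ~0.0273 (easy negative)
print(nt_xent_single(z1, pos, [np.array([0.7, 0.3])]))   # ~0.622  (hard negative)
```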

Example 2: denoising autoencoder

Input (4 features, normalized to [0,1]):

$$x = [0.8, 0.3, 0.9, 0.5]$$

Add Gaussian noise ($\sigma = 0.15$). Suppose the noise values are $[0.15, -0.12, -0.18, 0.11]$:

$$\tilde{x} = [0.95, 0.18, 0.72, 0.61]$$

The encoder maps this to a 2D latent code:

$$z = f(\tilde{x}) = [0.6, 0.4]$$

The decoder reconstructs:

$$\hat{x} = g(z) = [0.79, 0.28, 0.88, 0.51]$$

MSE loss (compared to clean $x$, not noisy $\tilde{x}$):

$$\mathcal{L} = \frac{1}{4} \sum_{i=1}^{4} (x_i - \hat{x}_i)^2 = \frac{1}{4}\left[(0.8 - 0.79)^2 + (0.3 - 0.28)^2 + (0.9 - 0.88)^2 + (0.5 - 0.51)^2\right]$$

$$= \frac{1}{4}[0.0001 + 0.0004 + 0.0004 + 0.0001] = \frac{0.001}{4} = 0.00025$$

That’s a very low loss. The network successfully denoised the input, recovering values close to the clean original. Note that reconstruction is measured against $x$, not $\tilde{x}$. This is what forces the encoder to learn the structure of clean data rather than memorizing the noise pattern.

For comparison, if we measured MSE against the noisy input $\tilde{x}$:

$$\frac{1}{4}\left[(0.95 - 0.79)^2 + (0.18 - 0.28)^2 + (0.72 - 0.88)^2 + (0.61 - 0.51)^2\right] = \frac{1}{4}[0.0256 + 0.01 + 0.0256 + 0.01] = \frac{0.0712}{4} = 0.0178$$

This is 71x larger, confirming the reconstruction is much closer to the clean signal.
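
Again, a quick NumPy check of the arithmetic:

```python
import numpy as np

x       = np.array([0.80, 0.30, 0.90, 0.50])  # clean input
x_tilde = np.array([0.95, 0.18, 0.72, 0.61])  # corrupted input
x_hat   = np.array([0.79, 0.28, 0.88, 0.51])  # reconstruction

print(np.mean((x - x_hat) ** 2))        # 0.00025: loss vs the clean target
print(np.mean((x_tilde - x_hat) ** 2))  # 0.0178: ~71x larger vs the noisy input
```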

Example 3: label efficiency comparison

Consider this experimental setup:

Dataset: 1,000,000 unlabeled medical images + 100 labeled examples across 5 disease classes.

Approach A (train from scratch): Train a ResNet-50 on the 100 labeled examples only.

  • 100 examples / 5 classes = 20 per class
  • With 23 million parameters, severe overfitting is almost guaranteed
  • Expected accuracy: ~35% (versus 20% chance for 5 classes)

Approach B (self-supervised pretrain + fine-tune):

  1. Pretrain with SimCLR on all 1,000,000 unlabeled images for 100 epochs
  2. Freeze the encoder, train a linear classifier on the 100 labeled examples
  3. Fine-tune end-to-end with a small learning rate
  • Step 2 (linear eval): ~60% accuracy. The pretrained features already capture useful medical image structure
  • Step 3 (fine-tuned): ~72% accuracy. Fine-tuning adapts the features to the specific disease classification task

Why the gap? The self-supervised model has seen a million images. It learned what normal anatomy looks like, how X-ray contrast works, common visual patterns. With only 100 labels, Approach A can’t learn any of this. It must learn everything from pixel patterns in 20 images per class.

This ratio matters: when you go from 100 to 1,000 labeled examples, the gap between approaches narrows. At 10,000+ labels, training from scratch becomes competitive. Self-supervised pretraining provides the most value when labels are scarce relative to model complexity.

Key takeaways

  1. Representations matter more than models. A good representation with a simple classifier often beats a complex model with raw features.

  2. Self-supervised learning unlocks unlabeled data. The vast majority of data in the world is unlabeled. SSL lets you use it.

  3. Contrastive methods learn by comparison: what should be similar vs different. They need careful negative sampling and large batches.

  4. Masked prediction methods learn by reconstruction. They’re simpler and scale well but need high masking ratios to work.

  5. The projection head trick: for contrastive methods, take representations from before the projection head. The head discards augmentation-related information that the contrastive loss doesn’t need but downstream tasks often do.

What comes next

Good representations only help if they transfer to your target domain. But what happens when the source and target domains are very different? The next article on domain adaptation and fine-tuning strategies covers how to handle distribution shift, from simple fine-tuning tricks like layer-wise learning rate decay to adversarial domain adaptation methods like DANN.
