
Representation learning and self-supervised learning

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

What makes a good representation?

A representation is a transformation of raw data into a form that makes useful information easy to extract. Your raw data might be a $224 \times 224 \times 3$ image (150,528 numbers). A good representation compresses this into, say, a 512-dimensional vector that captures the important structure and discards noise.

Three properties define a good representation:

Compact: It uses far fewer dimensions than the raw data, keeping only what matters. This is not just about storage. A lower-dimensional representation has a smaller hypothesis space, which means simpler downstream classifiers can work with it.

Disentangled: Different dimensions correspond to different factors of variation. One dimension might capture lighting, another object shape, another texture. When factors are entangled, you need more data and more complex models to tease them apart.

Transferable: A representation learned on one task works well on other tasks. If you learn features on ImageNet that also help with medical image classification, those features capture something genuinely useful about visual structure.

The key question is: how do you learn these representations? Supervised learning works, but it requires labels. Labels are expensive. Self-supervised learning aims to learn representations from the data itself.

[Figure: learned 2D features showing class separation]

Autoencoders as representation learners

An autoencoder has two parts: an encoder $f$ maps input $x$ to a lower-dimensional code $z = f(x)$, and a decoder $g$ reconstructs the input as $\hat{x} = g(z)$. Training minimizes reconstruction error:

$$\mathcal{L} = \|x - g(f(x))\|^2$$

The bottleneck forces the encoder to keep only the most important information. If $z$ has 64 dimensions and $x$ has 150,528, the encoder must compress by a factor of ~2,350x. It can only do this by learning the structure in the data.

The learned $z$ is a representation. But plain autoencoders have a problem: they tend to learn identity-like mappings through the bottleneck if the capacity is high enough. The representation may encode pixel patterns rather than semantic content.
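
Here is a minimal PyTorch sketch of this encoder-decoder structure. The layer widths, the 784-dimensional input, and the 64-dimensional code are illustrative assumptions, not prescribed values:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=64):
        super().__init__()
        # Encoder f: compress the input down to the bottleneck code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder g: reconstruct x_hat from z
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
x = torch.rand(32, 784)                  # a batch of flattened inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction error ||x - g(f(x))||^2
```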

Denoising autoencoders

A denoising autoencoder (DAE) addresses this by corrupting the input and training the network to recover the clean version:

$$\mathcal{L}_{\text{DAE}} = \|x - g(f(\tilde{x}))\|^2$$

where $\tilde{x}$ is a corrupted version of $x$ (e.g., adding Gaussian noise, randomly zeroing pixels, or applying random masks).

Why does this help? To denoise, the encoder must learn the underlying structure of clean data. It can’t simply memorize pixel values because the noise changes every time. The representation must capture what’s typical about the data, exactly what we want for downstream tasks.
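
In code, the only change from the plain autoencoder is the corruption step and the target of the loss. A sketch, reusing the `Autoencoder` class from above and assuming Gaussian corruption:

```python
import torch
import torch.nn.functional as F

def dae_loss(model, x, noise_std=0.15):
    # Corrupt the input with fresh Gaussian noise on every call...
    x_tilde = x + noise_std * torch.randn_like(x)
    x_hat, _ = model(x_tilde)
    # ...but measure reconstruction against the CLEAN input x
    return F.mse_loss(x_hat, x)

loss = dae_loss(model, x)  # model and x from the sketch above
```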

This idea, corrupting inputs and training to recover the original, is a precursor to the masked autoencoder approach we’ll see later.

Contrastive learning: pull similar, push dissimilar

Contrastive learning takes a different approach. Instead of reconstruction, it learns representations by comparing pairs of samples. The core idea: representations of similar items should be close together, and representations of dissimilar items should be far apart.

For each sample, you create a positive pair (two views of the same thing) and negative pairs (views of different things). The loss pulls positives together in embedding space and pushes negatives apart.

Where do the pairs come from? In self-supervised learning, you create them through data augmentation. Take an image, apply two different random augmentations (crop, flip, color jitter, blur), and call those the positive pair. Any other image in the batch is a negative.

This is powerful because the network must learn what’s invariant across augmentations. Color jitter forces it to not rely on exact colors. Cropping forces it to not rely on absolute position. What’s left? The actual content, shape, texture, and identity of objects.
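
As a concrete sketch, here is how such a positive pair might be generated with torchvision. The augmentation menu mirrors the list above; the parameter values are illustrative assumptions, not a specific paper's settings:

```python
from torchvision import transforms

# One stochastic pipeline; two independent draws give two different views.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(image):
    # Two independent augmentations of the same image form the positive pair
    return augment(image), augment(image)
```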

```mermaid
flowchart TD
  IMG["Original image x"] --> AUG1["Augmentation 1<br/>(crop + color jitter)"]
  IMG --> AUG2["Augmentation 2<br/>(crop + blur)"]
  AUG1 --> ENC1["Encoder f"]
  AUG2 --> ENC2["Encoder f<br/>(shared weights)"]
  ENC1 --> Z1["z₁"]
  ENC2 --> Z2["z₂"]
  Z1 -->|"Pull together"| LOSS["Contrastive loss"]
  Z2 -->|"Pull together"| LOSS
  NEG["z₃, z₄, ... (other images)"] -->|"Push apart"| LOSS

  style IMG fill:#9775fa,color:#fff
  style LOSS fill:#ff6b6b,color:#fff
```

SimCLR: a simple contrastive framework

SimCLR (2020) is a clean, effective contrastive learning framework. The pipeline:

  1. Sample a batch of $N$ images
  2. Apply two random augmentations to each, creating $2N$ augmented views
  3. Encode all views with a shared encoder $f$ (e.g., ResNet-50)
  4. Project through a small MLP head $g$ to get $z_i = g(f(\tilde{x}_i))$
  5. Compute the NT-Xent loss over all pairs

The NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss for a positive pair $(i, j)$ is:

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}$$

where $\text{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\| \|z_j\|}$ is cosine similarity and $\tau$ is a temperature parameter.

The temperature $\tau$ controls how sharp the distribution is. Small $\tau$ makes the loss focus heavily on hard negatives. Large $\tau$ treats all negatives more equally. SimCLR uses $\tau = 0.5$.
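
A compact implementation of this loss might look as follows. This is a sketch that assumes the two views arrive as aligned batches `z1` and `z2` (row $i$ of one paired with row $i$ of the other); real implementations differ in the details:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two augmented views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit vectors
    sim = z @ z.t() / tau              # temperature-scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))  # the indicator k != i: drop self-similarity
    N = z1.shape[0]
    # Row i (first view) is positive with row i+N (second view), and vice versa
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(N)])
    # Cross-entropy over the 2N-1 others = -log(exp(pos) / sum of all exp)
    return F.cross_entropy(sim, targets)
```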

A key finding: the projection head $g$ matters. Representations for downstream tasks are taken from before the projection head (from $f$, not from $g$). The contrastive loss forces $g$'s output to be invariant to the augmentations, so the head learns to throw away augmentation-related information (like color) that is often still useful downstream.

```mermaid
flowchart LR
  X["Image x"] --> T1["Aug t₁"] --> E1["Encoder f"] --> H1["h₁"]
  X --> T2["Aug t₂"] --> E2["Encoder f"] --> H2["h₂"]
  H1 --> P1["Projection g"] --> Z1["z₁"]
  H2 --> P2["Projection g"] --> Z2["z₂"]
  Z1 --> NT["NT-Xent Loss"]
  Z2 --> NT

  H1 -.->|"Use for<br/>downstream"| DS["Downstream task"]

  style X fill:#9775fa,color:#fff
  style NT fill:#ff6b6b,color:#fff
  style DS fill:#51cf66,color:#fff
```

Masked autoencoders (MAE)

Masked Autoencoders (2021) brought the “predict the missing part” idea from NLP (like BERT’s masked language modeling) to vision. The approach is simple:

  1. Divide the image into patches (e.g., $16 \times 16$ patches for a $224 \times 224$ image = 196 patches)
  2. Randomly mask a large fraction (75%) of patches
  3. Encode only the visible patches with a Vision Transformer
  4. Decode to reconstruct the masked patches

Why mask 75%? With fewer masked patches the task is too easy: the model can interpolate from nearby visible patches. With 75% masking, the model must understand global structure to fill in the gaps. This is a much harder pretext task, so it produces better representations.

MAE is computationally efficient because the encoder only processes 25% of patches. The decoder is lightweight and only used during pretraining. For downstream tasks, you keep just the encoder.
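
The random masking step can be sketched in a few lines. This simplified version assumes the patches are already embedded as an `(N, L, D)` tensor, and it omits the index bookkeeping the MAE decoder needs to restore patch order:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (N, L, D) patch embeddings. Returns only the visible patches."""
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L)                       # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :len_keep]  # a random 25% of indices
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

x = torch.randn(8, 196, 768)       # 196 patches per image, 768-dim embeddings
visible, kept = random_masking(x)
print(visible.shape)               # torch.Size([8, 49, 768]): encoder sees 25%
```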

Compared to contrastive learning, MAE doesn’t need negative pairs, large batch sizes, or momentum encoders. It’s simpler and scales well.

Self-supervised vs supervised: label efficiency

The practical value of self-supervised learning shows up in label efficiency. Here’s the typical workflow:

  1. Pretrain on a large unlabeled dataset using self-supervised learning
  2. Fine-tune on a small labeled dataset for your specific task

Compared to training from scratch on the small labeled dataset, self-supervised pretraining gives dramatically better results. The pretrained model has already learned low-level features (edges, textures), mid-level features (parts, shapes), and some high-level features (object structure). Fine-tuning just adapts these to your specific classification task.
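
A sketch of both stages in PyTorch. The names `encoder` and `num_classes` are hypothetical placeholders for a pretrained backbone (assumed here to output 512-dimensional features) and your task's class count; the learning rates are illustrative:

```python
import torch
import torch.nn as nn

# Stage 1: linear evaluation. Freeze the pretrained encoder and train
# only a linear classifier on its features.
for p in encoder.parameters():      # `encoder`: your pretrained backbone
    p.requires_grad = False
probe = nn.Linear(512, num_classes)  # `num_classes`: your task's classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Stage 2: fine-tune end-to-end, with a smaller learning rate on the
# pretrained weights so they are adapted gently rather than overwritten.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {'params': encoder.parameters(), 'lr': 1e-5},
    {'params': probe.parameters(), 'lr': 1e-4},
])
```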

This matters most when labeled data is scarce. In medical imaging, for example, you might have millions of unlabeled X-rays but only a few hundred expert-labeled examples of a rare condition. Self-supervised pretraining on the unlabeled data, followed by fine-tuning on the labeled examples, can outperform a supervised model trained on 10x or even 100x more labeled data.

Self-supervised methods comparison

| Method | Modality | Pretext task | Year | Key result vs supervised |
| --- | --- | --- | --- | --- |
| Denoising AE | Images | Reconstruct clean from noisy | 2008 | Early representation learning baseline |
| SimCLR | Images | Contrastive (augmented views) | 2020 | 76.5% top-1 on ImageNet (linear eval) |
| MoCo v2 | Images | Contrastive (momentum encoder) | 2020 | Matched supervised on detection/segmentation |
| BYOL | Images | Predict one view from another (no negatives) | 2020 | 74.3% without negative pairs |
| MAE | Images | Reconstruct masked patches | 2021 | 87.8% top-1 fine-tuned (ViT-H) |
| BERT | Text | Masked language modeling | 2018 | Transformed NLP, state-of-the-art on 11 tasks |
| DINO | Images | Self-distillation, no labels | 2021 | Emergent segmentation in attention maps |

Example 1: NT-Xent loss

Let’s compute the NT-Xent loss for a simplified case with one positive pair and one negative.

Representations (after projection head):

  • Anchor: $z_1 = [1, 0]$
  • Positive: $z_1^+ = [0.9, 0.1]$
  • Negative: $z_2 = [-0.8, 0.6]$

Temperature: $\tau = 0.5$

Step 1: Cosine similarities

$$\text{sim}(z_1, z_1^+) = \frac{z_1 \cdot z_1^+}{\|z_1\| \|z_1^+\|} = \frac{(1)(0.9) + (0)(0.1)}{\sqrt{1} \cdot \sqrt{0.81 + 0.01}} = \frac{0.9}{1.0 \times 0.9055} = 0.9939$$

$$\text{sim}(z_1, z_2) = \frac{(1)(-0.8) + (0)(0.6)}{\sqrt{1} \cdot \sqrt{0.64 + 0.36}} = \frac{-0.8}{1.0 \times 1.0} = -0.8$$

Step 2: Scale by temperature

$$\frac{\text{sim}(z_1, z_1^+)}{\tau} = \frac{0.9939}{0.5} = 1.9878 \qquad \frac{\text{sim}(z_1, z_2)}{\tau} = \frac{-0.8}{0.5} = -1.6$$

Step 3: NT-Xent loss

$$\ell = -\log \frac{\exp(1.9878)}{\exp(1.9878) + \exp(-1.6)} = -\log \frac{7.298}{7.298 + 0.2019} = -\log \frac{7.298}{7.4999} = -\log(0.9731) = 0.0273$$

The loss is very low (0.0273) because the positive pair is very similar (cosine 0.99) and the negative is very dissimilar (cosine -0.8). The model has done a good job here.

Now suppose we had a harder negative, $z_3 = [0.7, 0.3]$ with $\text{sim}(z_1, z_3) = \frac{0.7}{\sqrt{0.58}} = 0.919$. The temperature-scaled value would be $0.919 / 0.5 = 1.838$, and:

$$\ell = -\log \frac{\exp(1.9878)}{\exp(1.9878) + \exp(1.838)} = -\log \frac{7.298}{7.298 + 6.284} = -\log(0.537) = 0.622$$

Much higher loss! Hard negatives (samples that are similar to the anchor but shouldn’t be) drive the most learning.
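
You can verify both cases with a few lines of NumPy. `nt_xent_single` is a helper written for this example, not a library function:

```python
import numpy as np

def nt_xent_single(anchor, positive, negatives, tau=0.5):
    # Single-anchor NT-Xent, mirroring the hand calculation above
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

z1 = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
print(nt_xent_single(z1, pos, [np.array([-0.8, 0.6])]))  # ~0.0273 (easy negative)
print(nt_xent_single(z1, pos, [np.array([0.7, 0.3])]))   # ~0.622  (hard negative)
```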

Example 2: denoising autoencoder

Input (4 features, normalized to [0,1]):

$$x = [0.8, 0.3, 0.9, 0.5]$$

Add Gaussian noise ($\sigma = 0.15$). Suppose the noise values are $[0.15, -0.12, -0.18, 0.11]$:

$$\tilde{x} = [0.95, 0.18, 0.72, 0.61]$$

The encoder maps this to a 2D latent code:

$$z = f(\tilde{x}) = [0.6, 0.4]$$

The decoder reconstructs:

$$\hat{x} = g(z) = [0.79, 0.28, 0.88, 0.51]$$

MSE loss (compared to clean $x$, not noisy $\tilde{x}$):

$$\mathcal{L} = \frac{1}{4} \sum_{i=1}^{4} (x_i - \hat{x}_i)^2 = \frac{1}{4}\left[(0.8 - 0.79)^2 + (0.3 - 0.28)^2 + (0.9 - 0.88)^2 + (0.5 - 0.51)^2\right]$$

$$= \frac{1}{4}[0.0001 + 0.0004 + 0.0004 + 0.0001] = \frac{0.001}{4} = 0.00025$$

That’s a very low loss. The network successfully denoised the input, recovering values close to the clean original. Note that reconstruction is measured against $x$, not $\tilde{x}$. This is what forces the encoder to learn the structure of clean data rather than memorizing the noise pattern.

For comparison, if we measured MSE against the noisy input $\tilde{x}$:

$$\frac{1}{4}\left[(0.95 - 0.79)^2 + (0.18 - 0.28)^2 + (0.72 - 0.88)^2 + (0.61 - 0.51)^2\right] = \frac{1}{4}[0.0256 + 0.01 + 0.0256 + 0.01] = \frac{0.0712}{4} = 0.0178$$

This is 71x larger, confirming the reconstruction is much closer to the clean signal.
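
Again, a quick NumPy check of the arithmetic:

```python
import numpy as np

x       = np.array([0.80, 0.30, 0.90, 0.50])  # clean input
x_tilde = np.array([0.95, 0.18, 0.72, 0.61])  # corrupted input
x_hat   = np.array([0.79, 0.28, 0.88, 0.51])  # reconstruction

print(np.mean((x - x_hat) ** 2))        # 0.00025: loss vs the clean target
print(np.mean((x_tilde - x_hat) ** 2))  # 0.0178: ~71x larger vs the noisy input
```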

Example 3: label efficiency comparison

Consider this experimental setup:

Dataset: 1,000,000 unlabeled medical images + 100 labeled examples across 5 disease classes.

Approach A (train from scratch): Train a ResNet-50 on the 100 labeled examples only.

  • 100 examples / 5 classes = 20 per class
  • With 23 million parameters, severe overfitting is almost guaranteed
  • Expected accuracy: ~35% (versus 20% chance for 5 classes)

Approach B (self-supervised pretrain + fine-tune):

  1. Pretrain with SimCLR on all 1,000,000 unlabeled images for 100 epochs
  2. Freeze the encoder, train a linear classifier on the 100 labeled examples
  3. Fine-tune end-to-end with a small learning rate
  • Step 2 (linear eval): ~60% accuracy. The pretrained features already capture useful medical image structure
  • Step 3 (fine-tuned): ~72% accuracy. Fine-tuning adapts the features to the specific disease classification task

Why the gap? The self-supervised model has seen a million images. It learned what normal anatomy looks like, how X-ray contrast works, common visual patterns. With only 100 labels, Approach A can’t learn any of this. It must learn everything from pixel patterns in 20 images per class.

This ratio matters: when you go from 100 to 1,000 labeled examples, the gap between approaches narrows. At 10,000+ labels, training from scratch becomes competitive. Self-supervised pretraining provides the most value when labels are scarce relative to model complexity.

Key takeaways

  1. Representations matter more than models. A good representation with a simple classifier often beats a complex model with raw features.

  2. Self-supervised learning unlocks unlabeled data. The vast majority of data in the world is unlabeled. SSL lets you use it.

  3. Contrastive methods learn by comparison: what should be similar vs different. They need careful negative sampling and large batches.

  4. Masked prediction methods learn by reconstruction. They’re simpler and scale well but need high masking ratios to work.

  5. The projection head trick: for contrastive methods, take representations from before the projection head. The head discards augmentation-related information that the contrastive loss doesn’t need but downstream tasks often do.

What comes next

Good representations only help if they transfer to your target domain. But what happens when the source and target domains are very different? The next article on domain adaptation and fine-tuning strategies covers how to handle distribution shift, from simple fine-tuning tricks like layer-wise learning rate decay to adversarial domain adaptation methods like DANN.
