Representation learning and self-supervised learning
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this article, make sure you are comfortable with:
- Transfer learning: pretraining on one task and fine-tuning on another
- Information theory basics: entropy, KL divergence, mutual information
What makes a good representation?
A representation is a transformation of raw data into a form that makes useful information easy to extract. Your raw data might be an image: a 224×224 RGB image is 150,528 numbers. A good representation compresses this into, say, a 512-dimensional vector that captures the important structure and discards noise.
Three properties define a good representation:
Compact: It uses far fewer dimensions than the raw data, keeping only what matters. This is not just about storage. A lower-dimensional representation has a smaller hypothesis space, which means simpler downstream classifiers can work with it.
Disentangled: Different dimensions correspond to different factors of variation. One dimension might capture lighting, another object shape, another texture. When factors are entangled, you need more data and more complex models to tease them apart.
Transferable: A representation learned on one task works well on other tasks. If you learn features on ImageNet that also help with medical image classification, those features capture something genuinely useful about visual structure.
The key question is: how do you learn these representations? Supervised learning works, but it requires labels. Labels are expensive. Self-supervised learning aims to learn representations from the data itself.
Learned 2D features showing class separation
Autoencoders as representation learners
An autoencoder has two parts: an encoder $f$ maps input $x$ to a lower-dimensional code $z = f(x)$, and a decoder $g$ reconstructs the input as $\hat{x} = g(z)$. Training minimizes reconstruction error:

$$\mathcal{L}(x) = \lVert x - g(f(x)) \rVert^2$$
The bottleneck forces the encoder to keep only the most important information. If $z$ has 64 dimensions and $x$ has 150,528, the encoder must compress by a factor of roughly 2,350×. It can only do this by learning the structure in the data.
The learned $z$ is a representation. But plain autoencoders have a problem: they tend to learn identity-like mappings through the bottleneck if the capacity is high enough. The representation may encode pixel patterns rather than semantic content.
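The encoder–decoder structure can be sketched in PyTorch. This is a minimal fully connected version with illustrative layer sizes; a real image autoencoder would typically use convolutions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder with a 64-d bottleneck."""
    def __init__(self, input_dim=784, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),      # bottleneck code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),     # reconstruction x_hat
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(8, 784)                    # a batch of 8 flattened inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction error
```

The code `z` is the representation: 64 numbers that must summarize 784.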
Denoising autoencoders
A denoising autoencoder (DAE) addresses this by corrupting the input and training the network to recover the clean version:

$$\mathcal{L}(x) = \lVert x - g(f(\tilde{x})) \rVert^2$$

where $\tilde{x}$ is a corrupted version of $x$ (e.g., adding Gaussian noise, randomly zeroing pixels, or applying random masks).
Why does this help? To denoise, the encoder must learn the underlying structure of clean data. It can’t simply memorize pixel values because the noise changes every time. The representation must capture what’s typical about the data, exactly what we want for downstream tasks.
This idea, corrupting inputs and training to recover the original, is a precursor to the masked autoencoder approach we’ll see later.
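A corruption step can be sketched as follows; the noise level and masking probability are illustrative choices. The crucial detail is that the loss target is the clean input, not the corrupted one.

```python
import torch

def corrupt(x, noise_std=0.1, mask_prob=0.25):
    """Corrupt inputs with Gaussian noise plus random zero-masking."""
    noisy = x + noise_std * torch.randn_like(x)
    keep = (torch.rand_like(x) > mask_prob).float()
    return noisy * keep

x_clean = torch.rand(8, 784)
x_tilde = corrupt(x_clean)

# The network only ever sees x_tilde, but the loss compares its
# reconstruction to x_clean:
#   loss = F.mse_loss(decoder(encoder(x_tilde)), x_clean)
```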
Contrastive learning: pull similar, push dissimilar
Contrastive learning takes a different approach. Instead of reconstruction, it learns representations by comparing pairs of samples. The core idea: representations of similar items should be close together, and representations of dissimilar items should be far apart.
For each sample, you create a positive pair (two views of the same thing) and negative pairs (views of different things). The loss pulls positives together in embedding space and pushes negatives apart.
Where do the pairs come from? In self-supervised learning, you create them through data augmentation. Take an image, apply two different random augmentations (crop, flip, color jitter, blur), and call those the positive pair. Any other image in the batch is a negative.
This is powerful because the network must learn what’s invariant across augmentations. Color jitter forces it to not rely on exact colors. Cropping forces it to not rely on absolute position. What’s left? The actual content, shape, texture, and identity of objects.
```mermaid
flowchart TD
    IMG["Original image x"] --> AUG1["Augmentation 1 (crop + color jitter)"]
    IMG --> AUG2["Augmentation 2 (crop + blur)"]
    AUG1 --> ENC1["Encoder f"]
    AUG2 --> ENC2["Encoder f (shared weights)"]
    ENC1 --> Z1["z₁"]
    ENC2 --> Z2["z₂"]
    Z1 -->|"Pull together"| LOSS["Contrastive loss"]
    Z2 -->|"Pull together"| LOSS
    NEG["z₃, z₄, ... (other images)"] -->|"Push apart"| LOSS
    style IMG fill:#9775fa,color:#fff
    style LOSS fill:#ff6b6b,color:#fff
```
SimCLR: a simple contrastive framework
SimCLR (2020) is a clean, effective contrastive learning framework. The pipeline:
- Sample a batch of $N$ images
- Apply two random augmentations to each, creating $2N$ augmented views
- Encode all views with a shared encoder $f$ (e.g., ResNet-50) to get $h_i = f(\tilde{x}_i)$
- Project through a small MLP head $g$ to get $z_i = g(h_i)$
- Compute the NT-Xent loss over all positive pairs
The NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss for a positive pair $(i, j)$ is:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert\, \lVert v \rVert)$ is cosine similarity and $\tau$ is a temperature parameter.
The temperature $\tau$ controls how sharp the distribution is. Small $\tau$ makes the loss focus heavily on hard negatives. Large $\tau$ treats all negatives more equally. SimCLR uses small values, typically in the range $\tau = 0.1$ to $0.5$.
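The loss for a single anchor can be sketched in NumPy. The vectors and $\tau = 0.5$ here are illustrative choices, not values from SimCLR itself.

```python
import numpy as np

def nt_xent(z_anchor, z_pos, z_negs, tau=0.5):
    """NT-Xent loss for one anchor: one positive vs. a set of negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cos(z_anchor, z_pos)] + [cos(z_anchor, n) for n in z_negs]
    logits = np.array(sims) / tau
    # Softmax cross-entropy with the positive as the "correct class".
    return -logits[0] + np.log(np.exp(logits).sum())

z1 = np.array([1.0, 0.0])         # anchor
z2 = np.array([0.99, 0.141])      # nearly aligned with the anchor
z3 = np.array([-0.8, 0.6])        # pointing away from the anchor

easy = nt_xent(z1, z2, [z3])                       # well-separated: small loss
hard = nt_xent(z1, np.array([0.9, 0.436]), [z2])   # hard negative: larger loss
```

Swapping in a negative that is close to the anchor makes the loss jump, which is exactly the "hard negatives drive learning" behavior described below.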
A key finding: the projection head matters. Representations for downstream tasks are taken from before the projection head (from $h$, not from $z$). The contrastive objective pushes $z$ to discard information, such as color and augmentation-specific details, that downstream tasks may still need; $h$ preserves more of it.
```mermaid
flowchart LR
    X["Image x"] --> T1["Aug t₁"] --> E1["Encoder f"] --> H1["h₁"]
    X --> T2["Aug t₂"] --> E2["Encoder f"] --> H2["h₂"]
    H1 --> P1["Projection g"] --> Z1["z₁"]
    H2 --> P2["Projection g"] --> Z2["z₂"]
    Z1 --> NT["NT-Xent Loss"]
    Z2 --> NT
    H1 -.->|"Use for downstream"| DS["Downstream task"]
    style X fill:#9775fa,color:#fff
    style NT fill:#ff6b6b,color:#fff
    style DS fill:#51cf66,color:#fff
```
Masked autoencoders (MAE)
Masked Autoencoders (2021) brought the “predict the missing part” idea from NLP (like BERT’s masked language modeling) to vision. The approach is simple:
- Divide the image into patches (e.g., 16×16 patches for a 224×224 image = 196 patches)
- Randomly mask a large fraction (75%) of patches
- Encode only the visible patches with a Vision Transformer
- Decode to reconstruct the masked patches
Why mask 75%? With fewer masked patches, the task is too easy: the model can interpolate from nearby visible patches. With 75% masking, the model must understand global structure to fill in the gaps. This is a much harder pretext task, so it produces better representations.
MAE is computationally efficient because the encoder only processes 25% of patches. The decoder is lightweight and only used during pretraining. For downstream tasks, you keep just the encoder.
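The masking step can be sketched in NumPy. The patch count and mask ratio match the example above; everything else is an illustrative stand-in for the real ViT pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_masking(patches, mask_ratio=0.75):
    """Split a patch sequence into visible and masked subsets, MAE-style."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:n_keep])   # patches the encoder sees
    masked_idx = np.sort(perm[n_keep:])    # patches the decoder must reconstruct
    return patches[visible_idx], visible_idx, masked_idx

# 196 patches of a 224x224 image, each 16x16x3 = 768 values
patches = rng.random((196, 768))
visible, vis_idx, mask_idx = random_masking(patches)
```

With a 75% mask ratio, only 49 of 196 patches ever enter the encoder, which is where the compute savings come from.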
Compared to contrastive learning, MAE doesn’t need negative pairs, large batch sizes, or momentum encoders. It’s simpler and scales well.
Self-supervised vs supervised: label efficiency
The practical value of self-supervised learning shows up in label efficiency. Here’s the typical workflow:
- Pretrain on a large unlabeled dataset using self-supervised learning
- Fine-tune on a small labeled dataset for your specific task
Compared to training from scratch on the small labeled dataset, self-supervised pretraining gives dramatically better results. The pretrained model has already learned low-level features (edges, textures), mid-level features (parts, shapes), and some high-level features (object structure). Fine-tuning just adapts these to your specific classification task.
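The pretrain-then-adapt workflow can be sketched in PyTorch. The encoder here is a hypothetical stand-in MLP, not an actual pretrained network, and the learning rates are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained encoder, standing in for e.g. a SimCLR-pretrained ResNet.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 64))
head = nn.Linear(64, 5)   # e.g., 5 disease classes

# Stage 1, linear evaluation: freeze the encoder, train only the linear head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Stage 2, fine-tuning: unfreeze everything, but give the pretrained encoder
# a much smaller learning rate than the fresh head.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

x = torch.randn(4, 784)
logits = head(encoder(x))
```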
This matters most when labeled data is scarce. In medical imaging, for example, you might have millions of unlabeled X-rays but only a few hundred expert-labeled examples of a rare condition. Self-supervised pretraining on the unlabeled data, followed by fine-tuning on the labeled examples, can outperform a supervised model trained on 10x or even 100x more labeled data.
Self-supervised methods comparison
| Method | Modality | Pretext task | Year | Key result vs supervised |
|---|---|---|---|---|
| Denoising AE | Images | Reconstruct clean from noisy | 2008 | Early representation learning baseline |
| SimCLR | Images | Contrastive (augmented views) | 2020 | 76.5% top-1 on ImageNet (linear eval) |
| MoCo v2 | Images | Contrastive (momentum encoder) | 2020 | Matched supervised on detection/segmentation |
| BYOL | Images | Predict one view from another (no negatives) | 2020 | 74.3% without negative pairs |
| MAE | Images | Reconstruct masked patches | 2021 | 87.8% top-1 fine-tuned (ViT-H) |
| BERT | Text | Masked language modeling | 2018 | Transformed NLP, state-of-the-art on 11 tasks |
| DINO | Images | Self-distillation, no labels | 2021 | Emergent segmentation in attention maps |
Example 1: NT-Xent loss
Let’s compute the NT-Xent loss for a simplified case with one positive pair and one negative.
Representations (after projection head; unit-length 2D vectors for simplicity):
- Anchor: $z_1 = (1.0,\ 0.0)$
- Positive: $z_2 = (0.99,\ 0.141)$
- Negative: $z_3 = (-0.8,\ 0.6)$

Temperature: $\tau = 0.5$
Step 1: Cosine similarities

$$\mathrm{sim}(z_1, z_2) = 0.99 \qquad \mathrm{sim}(z_1, z_3) = -0.8$$

Step 2: Scale by temperature

$$\frac{0.99}{0.5} = 1.98 \qquad \frac{-0.8}{0.5} = -1.6$$

Step 3: NT-Xent loss

$$\ell = -\log\frac{e^{1.98}}{e^{1.98} + e^{-1.6}} = -\log\frac{7.243}{7.445} \approx 0.027$$
The loss is very low (≈ 0.027) because the positive pair is very similar (cosine 0.99) and the negative is very dissimilar (cosine −0.8). The model has done a good job here.
Now suppose we had a harder negative, with $\mathrm{sim}(z_1, z_3) = 0.8$. The temperature-scaled value would be $0.8 / 0.5 = 1.6$, and:

$$\ell = -\log\frac{e^{1.98}}{e^{1.98} + e^{1.6}} = -\log\frac{7.243}{12.196} \approx 0.52$$
Much higher loss! Hard negatives (samples that are similar to the anchor but shouldn’t be) drive the most learning.
Example 2: denoising autoencoder
Input $x$ (4 features, normalized to $[0, 1]$):

$$x = (0.80,\ 0.20,\ 0.60,\ 0.40)$$

Add Gaussian noise ($\sigma = 0.1$). Suppose the sampled noise values are $(0.10,\ -0.08,\ 0.05,\ -0.09)$:

$$\tilde{x} = x + \varepsilon = (0.90,\ 0.12,\ 0.65,\ 0.31)$$
The encoder maps this to a 2D latent code $z = f(\tilde{x}) \in \mathbb{R}^2$.

The decoder reconstructs:

$$\hat{x} = g(z) = (0.79,\ 0.21,\ 0.61,\ 0.39)$$

MSE loss (compared to clean $x$, not noisy $\tilde{x}$):

$$\mathcal{L} = \frac{1}{4}\sum_{i=1}^{4}(\hat{x}_i - x_i)^2 = \frac{1}{4}(0.0001 + 0.0001 + 0.0001 + 0.0001) = 0.0001$$
That's a very low loss. The network successfully denoised the input, recovering values close to the clean original. Note that reconstruction is measured against $x$, not $\tilde{x}$. This is what forces the encoder to learn the structure of clean data rather than memorizing the noise pattern.
For comparison, if we measured MSE against the noisy input $\tilde{x}$:

$$\frac{1}{4}\left((-0.11)^2 + 0.09^2 + (-0.04)^2 + 0.08^2\right) = \frac{0.0282}{4} \approx 0.0071$$

This is about 71× larger, confirming the reconstruction is much closer to the clean signal.
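The comparison can be checked in a few lines of NumPy. The vectors here are illustrative assumptions chosen to match the qualitative setup (a reconstruction that lands near the clean signal, far from the noisy one).

```python
import numpy as np

# Illustrative values: clean input, a noisy corruption, and a reconstruction
# close to the clean signal.
x       = np.array([0.80, 0.20, 0.60, 0.40])   # clean input
x_tilde = np.array([0.90, 0.12, 0.65, 0.31])   # corrupted input
x_hat   = np.array([0.79, 0.21, 0.61, 0.39])   # reconstruction

mse_clean = np.mean((x_hat - x) ** 2)        # the target the DAE trains on
mse_noisy = np.mean((x_hat - x_tilde) ** 2)  # what it is NOT reproducing
ratio = mse_noisy / mse_clean                # how much closer to clean
```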
Example 3: label efficiency comparison
Consider this experimental setup:
Dataset: 1,000,000 unlabeled medical images + 100 labeled examples across 5 disease classes.
Approach A (train from scratch): Train a ResNet-50 on the 100 labeled examples only.
- 100 examples / 5 classes = 20 per class
- With 23 million parameters, severe overfitting is almost guaranteed
- Expected accuracy: ~35% (barely above random for 5 classes)
Approach B (self-supervised pretrain + fine-tune):
- Pretrain with SimCLR on all 1,000,000 unlabeled images for 100 epochs
- Freeze the encoder, train a linear classifier on the 100 labeled examples
- Fine-tune end-to-end with a small learning rate
- Step 2 (linear eval): ~60% accuracy. The pretrained features already capture useful medical image structure
- Step 3 (fine-tuned): ~72% accuracy. Fine-tuning adapts the features to the specific disease classification task
Why the gap? The self-supervised model has seen a million images. It learned what normal anatomy looks like, how X-ray contrast works, common visual patterns. With only 100 labels, Approach A can’t learn any of this. It must learn everything from pixel patterns in 20 images per class.
This ratio matters: when you go from 100 to 1,000 labeled examples, the gap between approaches narrows. At 10,000+ labels, training from scratch becomes competitive. Self-supervised pretraining provides the most value when labels are scarce relative to model complexity.
Key takeaways
- Representations matter more than models. A good representation with a simple classifier often beats a complex model with raw features.
- Self-supervised learning unlocks unlabeled data. The vast majority of data in the world is unlabeled. SSL lets you use it.
- Contrastive methods learn by comparison: what should be similar vs. different. They need careful negative sampling and large batches.
- Masked prediction methods learn by reconstruction. They're simpler and scale well but need high masking ratios to work.
- The projection head trick: for contrastive methods, take representations from before the projection head. The head discards information, such as color and augmentation details, that downstream tasks may still need.
What comes next
Good representations only help if they transfer to your target domain. But what happens when the source and target domains are very different? The next article on domain adaptation and fine-tuning strategies covers how to handle distribution shift, from simple fine-tuning tricks like layer-wise learning rate decay to adversarial domain adaptation methods like DANN.