Distributed representations and latent spaces
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this article, make sure you are comfortable with:
- Word embeddings: how words are mapped to dense vectors and why that’s useful
- Variational autoencoders: the ELBO, encoder-decoder structure, and sampling from latent spaces
- Representation learning: what makes a good representation, contrastive and self-supervised approaches
Local vs distributed representations
There are two fundamentally different ways to represent concepts as vectors.
Local representation (one-hot): Each concept gets its own dedicated dimension. “Cat” might be [1, 0, 0, 0], “dog” is [0, 1, 0, 0], “car” is [0, 0, 1, 0]. One neuron fires per concept. Simple, but wasteful. With 50,000 concepts, you need 50,000 dimensions. And the representation says nothing about relationships: cat and dog are the same distance apart as cat and car.
Distributed representation: Each concept is a pattern of activation across many dimensions. “Cat” might be [0.8, −0.2, 0.5, 0.1]. Multiple dimensions contribute to each concept, and each dimension participates in representing many concepts. This is what neural networks learn.
Why is distributed better? Three reasons:
- Efficiency: n binary dimensions can represent 2ⁿ concepts. With 100 dimensions, that is 2¹⁰⁰ ≈ 10³⁰ distinct patterns. A local representation can only represent n concepts with n dimensions.
- Generalization: Similar concepts get similar representations. A classifier that learns to recognize cats gets some ability to recognize dogs for free, because their representations overlap.
- Compositionality: Dimensions can combine to express new concepts never seen in training. A model that has seen “red car” and “blue truck” might generalize to “red truck” because the color and vehicle-type features are partially separable.
```mermaid
flowchart LR
    subgraph Local["Local (one-hot)"]
        direction TB
        C1["cat = [1,0,0,0]"]
        C2["dog = [0,1,0,0]"]
        C3["car = [0,0,1,0]"]
        C4["truck = [0,0,0,1]"]
    end
    subgraph Distributed["Distributed"]
        direction TB
        D1["cat = [0.8, −0.2, 0.5, 0.1]"]
        D2["dog = [0.7, −0.1, 0.4, 0.2]"]
        D3["car = [−0.3, 0.9, −0.1, 0.6]"]
        D4["truck = [−0.2, 0.8, −0.2, 0.7]"]
    end
    Local -->|"No similarity<br/>structure"| NOTE1["d(cat,dog) = d(cat,car) = √2"]
    Distributed -->|"Similar things<br/>are close"| NOTE2["d(cat,dog) < d(cat,car)"]
    style NOTE1 fill:#ff6b6b,color:#fff
    style NOTE2 fill:#51cf66,color:#fff
```
Representation capacity: local vs distributed encoding (N=5 neurons)
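The similarity claim can be checked directly with cosine similarity, using the vectors from the diagram above. A minimal sketch in plain Python:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Local (one-hot): every pair of distinct concepts is equally dissimilar.
cat_1h, dog_1h, car_1h = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
print(cosine(cat_1h, dog_1h), cosine(cat_1h, car_1h))  # 0.0 0.0

# Distributed: related concepts share activation patterns.
cat = [0.8, -0.2, 0.5, 0.1]
dog = [0.7, -0.1, 0.4, 0.2]
car = [-0.3, 0.9, -0.1, 0.6]
print(round(cosine(cat, dog), 2))  # high: ~0.99
print(round(cosine(cat, car), 2))  # low: negative
```

Every one-hot pair scores exactly zero, so similarity carries no information; the distributed vectors separate cat/dog (high) from cat/car (negative).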
Geometry of learned latent spaces
When neural networks learn representations, the resulting latent spaces develop geometric structure. This structure is not imposed by the architecture. It emerges from the data and the training objective.
Clustering: Samples from the same class cluster together. In the latent space of an image classifier, all cat images end up near each other, far from car images. This is what makes classification easy: a linear boundary can separate the clusters.
Linear structure: Many latent spaces exhibit approximate linear relationships between concepts. The most famous example is from word embeddings: vec(king) − vec(man) + vec(woman) ≈ vec(queen). This linearity means that semantic relationships correspond to geometric directions.
Smoothness: In well-trained generative models (VAEs, GANs), nearby points in latent space produce similar outputs. If you smoothly move from one point to another, the generated output changes gradually. There are no sudden jumps. This smoothness enables interpolation.
Manifold structure: Real data often lives on a low-dimensional manifold within the high-dimensional input space. A face image has millions of pixels but only varies along a few meaningful dimensions (pose, expression, lighting). Good representations learn to parameterize this manifold.
Linear structure in word embeddings: analogies
Word embedding models like Word2Vec and GloVe learn distributed representations where linear relationships encode semantic analogies.
The analogy “king is to queen as man is to woman” translates to:
vec(king) − vec(man) + vec(woman) ≈ vec(queen)
Why does this work? The difference vec(king) − vec(man) captures the “royalty” direction. Adding this direction to vec(woman) moves along the same semantic axis, landing near vec(queen).
This works for many relationship types:
- Gender: vec(king) − vec(queen) ≈ vec(man) − vec(woman)
- Country–capital: vec(Paris) − vec(France) ≈ vec(Berlin) − vec(Germany)
- Tense: vec(walking) − vec(walked) ≈ vec(swimming) − vec(swam)
The linear structure isn’t perfect. It works best for common, well-represented relationships. Rare words or complex multi-step analogies often fail.
Disentangled representations
A disentangled representation is one where each dimension (or a small group of dimensions) corresponds to a single, independent factor of variation in the data.
Consider images of faces. The underlying factors might include:
- Pose (left/right rotation)
- Expression (happy, sad, neutral)
- Lighting direction
- Hair color
- Age
In a perfectly disentangled representation, changing one latent dimension would change only one factor. Sliding the “pose” dimension rotates the face without changing expression. Sliding “expression” changes the smile without moving the head.
In an entangled representation, dimensions mix multiple factors. Changing one dimension might simultaneously rotate the face AND change the lighting. This makes the representation harder to interpret and control.
Why do we want disentanglement? It enables:
- Controllable generation: change one attribute without affecting others
- Interpretability: each dimension has a clear meaning
- Better transfer: disentangled features often transfer better to new tasks
- Data efficiency: downstream tasks need fewer examples when features are clean
Measuring disentanglement
Several metrics quantify how disentangled a representation is:
Mutual Information Gap (MIG): For each ground-truth factor v_k, find the latent dimension with the highest mutual information with that factor. MIG is the gap between the top-1 and top-2 mutual informations, normalized by the factor’s entropy:
MIG = (1/K) Σ_k [ I(z_{j_k}; v_k) − I(z_{j'_k}; v_k) ] / H(v_k)
where z_{j_k} is the latent dimension most informative about factor v_k, and z_{j'_k} is the second most informative. Higher MIG means each factor is captured by a single dimension (large gap between top-1 and top-2).
Separated Attribute Predictability (SAP): Train a simple classifier to predict each factor from each latent dimension. SAP measures the gap between the most predictive and second most predictive dimension for each factor. Similar idea to MIG but uses predictive accuracy instead of mutual information.
Informally, both metrics ask: “Is each factor clearly captured by one dimension, not smeared across many?” A high score means yes.
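As a concrete sketch of the MIG idea, the mutual informations can be computed from co-occurrence counts when both factors and latents are discrete. This is a simplified toy version (real implementations discretize continuous latents and estimate on held-out data):

```python
import itertools
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X; Y) in nats from two aligned lists of discrete values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())

def mig(factors, latents):
    """Average over factors of the normalized gap between the two
    latent dimensions most informative about each factor."""
    gaps = []
    for f_vals in factors.values():
        mis = sorted((mutual_information(z_vals, f_vals)
                      for z_vals in latents.values()), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(f_vals))
    return sum(gaps) / len(gaps)

# Toy check: 3 shapes x 4 colors, fully crossed.
combos = list(itertools.product(range(3), range(4)))
shape = [s for s, c in combos]
color = [c for s, c in combos]

# Disentangled: z1 copies shape, z2 copies color -> MIG = 1.
mig_dis = mig({"shape": shape, "color": color}, {"z1": shape, "z2": color})

# Entangled: both latents mix the two factors -> MIG near 0.
mig_ent = mig({"shape": shape, "color": color},
              {"z1": [s + c for s, c in combos],
               "z2": [s - c for s, c in combos]})
print(round(mig_dis, 3), round(mig_ent, 3))
```

The disentangled assignment scores the maximum (each factor is owned by one dimension), while the mixed assignment scores near zero because both dimensions are equally informative about each factor.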
Latent space interpolation
Given two points in latent space, z₁ and z₂, interpolation generates intermediate points z(t) for t ∈ [0, 1].
Linear interpolation (lerp):
z(t) = (1 − t)·z₁ + t·z₂
This is the simplest approach. For VAEs with Gaussian priors, spherical linear interpolation (slerp) often works better because it follows the surface of the hypersphere where most of the probability mass concentrates:
z(t) = [sin((1 − t)·Ω) / sin Ω]·z₁ + [sin(t·Ω) / sin Ω]·z₂
where Ω = arccos(z₁·z₂ / (‖z₁‖‖z₂‖)) is the angle between z₁ and z₂.
Good interpolation is a sign of a well-structured latent space. If intermediate points produce coherent outputs (e.g., smooth transitions between two faces), the model has learned a smooth manifold. If intermediate points produce garbage, the latent space has “holes” where no real data lives.
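Both schemes can be sketched in a few lines of plain Python, treating latent vectors as lists:

```python
from math import acos, sin, sqrt

def lerp(z1, z2, t):
    """Linear interpolation: z(t) = (1 - t) * z1 + t * z2."""
    return [(1 - t) * a + t * b for a, b in zip(z1, z2)]

def slerp(z1, z2, t):
    """Spherical linear interpolation along the arc from z1 to z2."""
    n1 = sqrt(sum(a * a for a in z1))
    n2 = sqrt(sum(b * b for b in z2))
    cos_omega = sum(a * b for a, b in zip(z1, z2)) / (n1 * n2)
    omega = acos(max(-1.0, min(1.0, cos_omega)))  # angle between z1 and z2
    if sin(omega) < 1e-12:  # nearly parallel: fall back to lerp
        return lerp(z1, z2, t)
    w1 = sin((1 - t) * omega) / sin(omega)
    w2 = sin(t * omega) / sin(omega)
    return [w1 * a + w2 * b for a, b in zip(z1, z2)]

# On unit vectors, lerp cuts through the interior of the sphere,
# while slerp stays on its surface.
print(lerp([1.0, 0.0], [0.0, 1.0], 0.5))   # [0.5, 0.5], norm ~0.71
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))  # [~0.707, ~0.707], norm ~1.0
```

The clamp on cos Ω guards against floating-point values slightly outside [−1, 1] before calling arccos.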
```mermaid
flowchart LR
    subgraph Latent["2D Latent Space"]
        Z1["z₁<br/>(smiling face)"]
        ZM1["z₀.₃₃"]
        ZM2["z₀.₆₇"]
        Z2["z₂<br/>(serious face)"]
        Z1 --> ZM1 --> ZM2 --> Z2
    end
    subgraph Outputs["Decoded Images"]
        I1["😊 Smiling"]
        I2["🙂 Slight smile"]
        I3["😐 Neutral"]
        I4["😑 Serious"]
    end
    Z1 -.-> I1
    ZM1 -.-> I2
    ZM2 -.-> I3
    Z2 -.-> I4
    style Z1 fill:#9775fa,color:#fff
    style Z2 fill:#ff6b6b,color:#fff
    style ZM1 fill:#b197fc,color:#fff
    style ZM2 fill:#ff8787,color:#fff
```
Latent arithmetic in GANs
GANs also learn latent spaces with linear structure, even though they’re trained very differently from word embedding models.
In the GAN latent space, you can find directions that correspond to visual attributes. For example, in a face GAN:
z_new = z + α·d_smile
where d_smile is a “smile direction” discovered by averaging the latent codes of smiling faces minus neutral faces.
StyleGAN makes this especially clean because its W space is more disentangled than the raw Z space. The mapping network “unfolds” the latent space, making linear directions correspond more cleanly to single attributes.
Common latent arithmetic operations:
- Attribute manipulation: add or subtract attribute directions
- Style mixing: combine different layers’ latent codes from different images
- Truncation: move latent codes toward the mean to increase quality at the cost of diversity
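Attribute manipulation can be sketched as follows. The latent codes and the smiling/neutral grouping below are hypothetical toy values; in a real pipeline you would collect them by labeling generator outputs:

```python
# Sketch of latent attribute arithmetic; all values here are illustrative.

def mean_vector(codes):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(codes)
    return [sum(c[i] for c in codes) / n for i in range(len(codes[0]))]

def attribute_direction(pos_codes, neg_codes):
    """Difference of class means, e.g. d_smile = mean(smiling) - mean(neutral)."""
    return [p - q for p, q in zip(mean_vector(pos_codes), mean_vector(neg_codes))]

def apply_direction(z, d, alpha):
    """z_new = z + alpha * d; larger alpha pushes the attribute harder."""
    return [zi + alpha * di for zi, di in zip(z, d)]

# Toy 2-D example: the first coordinate plays the role of "smile".
smiling = [[1.0, 1.0], [1.0, 3.0]]
neutral = [[-1.0, 1.0], [-1.0, 3.0]]
d_smile = attribute_direction(smiling, neutral)
print(d_smile)                                    # [2.0, 0.0]
print(apply_direction([0.0, 0.0], d_smile, 0.5))  # [1.0, 0.0]
```

Averaging over the two groups cancels the attributes they share, leaving only the direction along which they differ.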
Representation types comparison
| Type | Example | Dims needed for 50K concepts | Similarity built-in | Generalizes to unseen |
|---|---|---|---|---|
| One-hot (local) | Word indices | 50,000 | No (all equidistant) | No |
| Bag of words | Document vectors | 50,000 (sparse) | Partial (overlap-based) | Somewhat |
| Distributed (dense) | Word2Vec, GloVe | 100-300 | Yes (cosine similarity) | Yes |
| Contextual | BERT, GPT embeddings | 768-1024 | Yes (context-dependent) | Yes |
| VAE latent | Image features | 32-512 | Yes (Euclidean distance) | Yes (interpolation) |
| GAN latent | StyleGAN W space | 512 | Yes (linear structure) | Yes (arithmetic) |
Example 1: word embedding analogy
Given the following illustrative 3-dimensional word vectors (toy values chosen for the example, not taken from a trained model):
- king = [0.9, 0.8, 0.2], queen = [0.9, 0.2, 0.2]
- man = [0.2, 0.8, 0.6], woman = [0.2, 0.1, 0.5]
- prince = [0.7, 0.7, 0.3], castle = [0.6, 0.1, 0.9]
Analogy computation: king − man + woman = [0.9 − 0.2 + 0.2, 0.8 − 0.8 + 0.1, 0.2 − 0.6 + 0.5] = [0.9, 0.1, 0.1]
Now compare this result r = [0.9, 0.1, 0.1] to candidate vectors using cosine similarity:
Cosine with queen: 0.85 / (√0.83 · √0.89) ≈ 0.989
Distractor 1, prince: 0.73 / (√0.83 · √1.07) ≈ 0.775
Distractor 2, castle: 0.64 / (√0.83 · √1.18) ≈ 0.647
Results: queen (0.989), prince (0.775), castle (0.647). The analogy correctly identifies “queen” as the nearest neighbor. The difference vector king − man captures “royalty,” and adding it to “woman” lands closest to “queen.”
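This kind of analogy computation is easy to verify in code. The vectors below are illustrative toy values (hypothetical, not from a trained embedding model):

```python
from math import sqrt

# Illustrative 3-d toy vectors (hypothetical, not from a trained model).
vecs = {
    "king":   [0.9, 0.8, 0.2],
    "man":    [0.2, 0.8, 0.6],
    "woman":  [0.2, 0.1, 0.5],
    "queen":  [0.9, 0.2, 0.2],
    "prince": [0.7, 0.7, 0.3],
    "castle": [0.6, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# r = king - man + woman
r = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

scores = {w: round(cosine(r, vecs[w]), 3) for w in ("queen", "prince", "castle")}
print(scores)  # queen ranks first
```

With these toy vectors the ranking comes out queen > prince > castle, matching the intended analogy.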
Example 2: latent space interpolation
Given two illustrative latent codes (toy values chosen for the example): z₁ = [2.0, 1.0] and z₂ = [−1.0, −2.0].
Linear interpolation at 4 points (t = 0, 0.33, 0.67, 1):
t = 0 (start): z = z₁ = [2.00, 1.00]
t = 0.33: z = 0.67·z₁ + 0.33·z₂ = [1.01, 0.01]
t = 0.67: z = 0.33·z₁ + 0.67·z₂ = [−0.01, −1.01]
t = 1 (end): z = z₂ = [−1.00, −2.00]
Summary of the interpolation path:
| t | z(t) | Character |
|---|---|---|
| 0.00 | [2.00, 1.00] | Source (e.g., face A) |
| 0.33 | [1.01, 0.01] | Blend, mostly A |
| 0.67 | [−0.01, −1.01] | Blend, mostly B |
| 1.00 | [−1.00, −2.00] | Target (e.g., face B) |
Notice how the path crosses through the origin area at t ≈ 0.5, where z(0.5) = [0.5, −0.5]. In a well-structured latent space, this would produce a coherent intermediate output. In a poorly structured space, you might get artifacts here, especially if the path passes through low-density regions.
The norm of each point: ‖z(0)‖ ≈ 2.24, ‖z(0.33)‖ ≈ 1.01, ‖z(0.67)‖ ≈ 1.01, ‖z(1)‖ ≈ 2.24. The midpoints have smaller norms than the endpoints. This is why spherical interpolation can work better: it keeps the norm more constant.
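A quick numerical check of this norm behavior, using two equal-norm illustrative codes (assumed values): lerp lets the norm dip toward the origin, while slerp holds it near √5:

```python
from math import acos, sin, sqrt

# Illustrative equal-norm latent codes (assumed values, not from a model).
z1, z2 = [2.0, 1.0], [-1.0, -2.0]

def lerp(a, b, t):
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def norm(v):
    return sqrt(sum(x * x for x in v))

# Norms along the linear path dip toward the origin.
for t in (0.0, 0.33, 0.5, 0.67, 1.0):
    print(t, round(norm(lerp(z1, z2, t)), 2))

# Spherical interpolation at the midpoint keeps the norm near sqrt(5).
omega = acos(sum(a * b for a, b in zip(z1, z2)) / (norm(z1) * norm(z2)))
mid = [(sin(0.5 * omega) / sin(omega)) * (a + b) for a, b in zip(z1, z2)]
print(round(norm(mid), 2))  # ~2.24 vs ~0.71 for the lerp midpoint
```

Because the two codes have equal norm, the spherical midpoint lands back on the same shell they live on; the linear midpoint cuts through the low-norm interior.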
Example 3: disentanglement
Consider a 2D latent space encoding two factors: shape (3 values: circle, square, triangle) and color (4 values: red, green, blue, yellow). That’s 3 × 4 = 12 possible combinations.
Perfectly disentangled case:
Dimension 1 () encodes shape only:
- Circle:
- Square:
- Triangle:
Dimension 2 () encodes color only:
- Red:
- Green:
- Blue:
- Yellow:
A red circle is at . A blue triangle is at . Changing changes shape without affecting color. The 12 combinations form a clean grid in 2D.
MIG for this representation: For the shape factor, z₁ has very high mutual information (it perfectly predicts shape), while z₂ has near-zero mutual information with shape. The gap is large. Similarly for the color factor. MIG would be close to 1.0 (the maximum).
Entangled case:
Both dimensions contribute to both factors. Using the same shape and color codes as above, suppose z₁ = shape + color and z₂ = shape − color:
- Red circle: (0, 0)
- Red square: (1, 1)
- Green circle: (1, −1)
- Green square: (2, 0)
Now z₁ is correlated with both shape and color. To predict shape, you need to look at both dimensions. MIG would be low because the top-1 and top-2 informative dimensions have similar mutual information with each factor. There’s no single dimension that clearly owns a factor.
In practice, achieving perfect disentanglement requires either supervision (telling the model what the factors are) or strong inductive biases. The β-VAE encourages disentanglement by increasing the weight β > 1 on the KL term in the ELBO, which pushes the posterior closer to a factorial prior. This comes at the cost of reconstruction quality, because the model is forced to use a simpler (more independent) latent code.
Why this matters for deep learning
Distributed representations are the reason deep learning works at all. Consider the alternative: if a network used one-hot representations internally, it would need to see every possible combination of features during training. With distributed representations, it can generalize from seen combinations to unseen ones.
A face recognition model doesn’t need to see every person under every lighting condition. If it learns separate features for identity and lighting (distributed and partially disentangled), seeing person A under light 1 and person B under light 2 helps it recognize person A under light 2. This combinatorial generalization is what makes neural networks data-efficient despite having millions of parameters.
The latent spaces of generative models take this further. They don’t just classify; they parameterize the data manifold. This enables generation, interpolation, and manipulation. Every advance in GAN quality, VAE expressiveness, or self-supervised representation learning is fundamentally an advance in learning better distributed representations.
Summary
Local (one-hot) representations are simple but wasteful and capture no relationships. Distributed representations spread meaning across many dimensions, enabling efficiency, similarity, and generalization. Learned latent spaces develop geometric structure: clustering, linear relationships, and smoothness. Disentangled representations cleanly separate factors of variation, enabling controllable generation. Interpolation and arithmetic in latent space demonstrate that these models learn meaningful, structured representations of their data.
What comes next
This article completes the core deep learning series on representations, generative models, and learning strategies. The next article on AutoML and neural architecture search explores how to automate the design of neural network architectures themselves, moving from hand-crafted designs to learned ones.