
Distributed representations and latent spaces

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with:

Local vs distributed representations

There are two fundamentally different ways to represent concepts as vectors.

Local representation (one-hot): Each concept gets its own dedicated dimension. “Cat” might be [1, 0, 0, 0], “dog” is [0, 1, 0, 0], “car” is [0, 0, 1, 0]. One neuron fires per concept. Simple, but wasteful. With 50,000 concepts, you need 50,000 dimensions. And the representation says nothing about relationships: cat and dog are the same distance apart as cat and car.

Distributed representation: Each concept is a pattern of activation across many dimensions. “Cat” might be [0.8, -0.2, 0.5, 0.1]. Multiple dimensions contribute to each concept, and each dimension participates in representing many concepts. This is what neural networks learn.

Why is distributed better? Three reasons:

  1. Efficiency: n binary dimensions can represent 2^n concepts, while a local representation can only represent n concepts with n dimensions. With 300 binary dimensions, you can already distinguish more patterns than there are atoms in the observable universe.

  2. Generalization: Similar concepts get similar representations. A classifier that learns to recognize cats gets some ability to recognize dogs for free, because their representations overlap.

  3. Compositionality: Dimensions can combine to express new concepts never seen in training. A model that has seen “red car” and “blue truck” might generalize to “red truck” because the color and vehicle-type features are partially separable.
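The distance claims are easy to verify numerically. The snippet below uses toy 4-dimensional vectors (illustrative values made up for this comparison, not learned embeddings):

```python
import numpy as np

# Local (one-hot): every concept gets its own dimension.
one_hot = {
    "cat": np.array([1.0, 0.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0, 0.0]),
}

# Distributed: dense toy vectors with overlapping activations.
dense = {
    "cat": np.array([0.8, -0.2, 0.5, 0.1]),
    "dog": np.array([0.7, -0.1, 0.4, 0.2]),
    "car": np.array([-0.3, 0.9, -0.1, 0.6]),
}

def dist(vecs, a, b):
    """Euclidean distance between the vectors for concepts a and b."""
    return float(np.linalg.norm(vecs[a] - vecs[b]))

# One-hot: every pair is exactly sqrt(2) apart -- no similarity structure.
print(dist(one_hot, "cat", "dog"), dist(one_hot, "cat", "car"))
# Distributed: cat is much closer to dog than to car.
print(dist(dense, "cat", "dog"), dist(dense, "cat", "car"))
```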

flowchart LR
  subgraph Local["Local (one-hot)"]
      direction TB
      C1["cat = [1,0,0,0]"]
      C2["dog = [0,1,0,0]"]
      C3["car = [0,0,1,0]"]
      C4["truck = [0,0,0,1]"]
  end

  subgraph Distributed["Distributed"]
      direction TB
      D1["cat = [0.8, −0.2, 0.5, 0.1]"]
      D2["dog = [0.7, −0.1, 0.4, 0.2]"]
      D3["car = [−0.3, 0.9, −0.1, 0.6]"]
      D4["truck = [−0.2, 0.8, −0.2, 0.7]"]
  end

  Local -->|"No similarity structure"| NOTE1["d(cat,dog) = d(cat,car) = √2"]
  Distributed -->|"Similar things are close"| NOTE2["d(cat,dog) < d(cat,car)"]

  style NOTE1 fill:#ff6b6b,color:#fff
  style NOTE2 fill:#51cf66,color:#fff

Representation capacity: local vs distributed encoding (N=5 neurons)

Geometry of learned latent spaces

When neural networks learn representations, the resulting latent spaces develop geometric structure. This structure is not imposed by the architecture. It emerges from the data and the training objective.

Clustering: Samples from the same class cluster together. In the latent space of an image classifier, all cat images end up near each other, far from car images. This is what makes classification easy: a linear boundary can separate the clusters.

Linear structure: Many latent spaces exhibit approximate linear relationships between concepts. The most famous example is from word embeddings: king - man + woman ≈ queen, as vectors. This linearity means that semantic relationships correspond to geometric directions.

Smoothness: In well-trained generative models (VAEs, GANs), nearby points in latent space produce similar outputs. If you smoothly move from one point to another, the generated output changes gradually. There are no sudden jumps. This smoothness enables interpolation.

Manifold structure: Real data often lives on a low-dimensional manifold within the high-dimensional input space. A face image has millions of pixels but only varies along a few meaningful dimensions (pose, expression, lighting). Good representations learn to parameterize this manifold.

Linear structure in word embeddings: analogies

Word embedding models like Word2Vec and GloVe learn distributed representations where linear relationships encode semantic analogies.

The analogy “king is to queen as man is to woman” translates to:

v(king) - v(man) + v(woman) ≈ v(queen)

Why does this work? The difference v(king) - v(man) captures the “royalty” direction. Adding this direction to v(woman) moves along the same semantic axis, landing near v(queen).

This works for many relationship types:

  • Gender: v(brother) - v(sister) ≈ v(king) - v(queen)
  • Country-capital: v(Paris) - v(France) ≈ v(Tokyo) - v(Japan)
  • Tense: v(walking) - v(walked) ≈ v(swimming) - v(swam)

The linear structure isn’t perfect. It works best for common, well-represented relationships. Rare words or complex multi-step analogies often fail.

Disentangled representations

A disentangled representation is one where each dimension (or a small group of dimensions) corresponds to a single, independent factor of variation in the data.

Consider images of faces. The underlying factors might include:

  • Pose (left/right rotation)
  • Expression (happy, sad, neutral)
  • Lighting direction
  • Hair color
  • Age

In a perfectly disentangled representation, changing one latent dimension would change only one factor. Sliding the “pose” dimension rotates the face without changing expression. Sliding “expression” changes the smile without moving the head.

In an entangled representation, dimensions mix multiple factors. Changing one dimension might simultaneously rotate the face AND change the lighting. This makes the representation harder to interpret and control.

Why do we want disentanglement? It enables:

  • Controllable generation: change one attribute without affecting others
  • Interpretability: each dimension has a clear meaning
  • Better transfer: disentangled features often transfer better to new tasks
  • Data efficiency: downstream tasks need fewer examples when features are clean

Measuring disentanglement

Several metrics quantify how disentangled a representation is:

Mutual Information Gap (MIG): For each ground-truth factor k, find the latent dimension with the highest mutual information with that factor. MIG is the gap between the top-1 and top-2 mutual informations, normalized:

MIG = (1/K) Σ_{k=1}^{K} [ I(z_{j(1)}; v_k) - I(z_{j(2)}; v_k) ] / H(v_k)

where j(1) is the latent dimension most informative about factor k, and j(2) is the second most informative. Higher MIG means each factor is captured by a single dimension (large gap between top-1 and top-2).

Separated Attribute Predictability (SAP): Train a simple classifier to predict each factor from each latent dimension. SAP measures the gap between the most predictive and second most predictive dimension for each factor. Similar idea to MIG but uses predictive accuracy instead of mutual information.

Informally, both metrics ask: “Is each factor clearly captured by one dimension, not smeared across many?” A high score means yes.

Latent space interpolation

Given two points in latent space, z_1 and z_2, interpolation generates intermediate points:

Linear interpolation (lerp):

z_t = (1 - t) · z_1 + t · z_2,  t ∈ [0, 1]

This is the simplest approach. For VAEs with Gaussian priors, spherical linear interpolation (slerp) often works better because it follows the surface of the hypersphere where most of the probability mass concentrates:

z_t = [sin((1 - t)θ) / sin θ] · z_1 + [sin(tθ) / sin θ] · z_2

where θ = arccos( (z_1 · z_2) / (‖z_1‖ ‖z_2‖) ).
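Both schemes are a few lines of numpy. A minimal sketch (function names are my own):

```python
import numpy as np

def lerp(z1, z2, t):
    """Linear interpolation: straight line from z1 to z2."""
    return (1 - t) * z1 + t * z2

def slerp(z1, z2, t):
    """Spherical linear interpolation: follows the arc between z1 and z2,
    keeping the norm roughly constant along the path."""
    cos_theta = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return lerp(z1, z2, t)  # nearly parallel vectors: slerp degenerates to lerp
    return (np.sin((1 - t) * theta) * z1 + np.sin(t * theta) * z2) / np.sin(theta)
```

For high-dimensional Gaussian latents, most probability mass lies near a sphere of radius √d, which is why slerp's roughly constant-norm path tends to stay in the high-density region.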

Good interpolation is a sign of a well-structured latent space. If intermediate points produce coherent outputs (e.g., smooth transitions between two faces), the model has learned a smooth manifold. If intermediate points produce garbage, the latent space has “holes” where no real data lives.

flowchart LR
  subgraph Latent["2D Latent Space"]
      Z1["z₁
(smiling face)"]
      ZM1["z₀.₃₃"]
      ZM2["z₀.₆₇"]
      Z2["z₂
(serious face)"]
      Z1 --> ZM1 --> ZM2 --> Z2
  end

  subgraph Outputs["Decoded Images"]
      I1["😊 Smiling"]
      I2["🙂 Slight smile"]
      I3["😐 Neutral"]
      I4["😑 Serious"]
  end

  Z1 -.-> I1
  ZM1 -.-> I2
  ZM2 -.-> I3
  Z2 -.-> I4

  style Z1 fill:#9775fa,color:#fff
  style Z2 fill:#ff6b6b,color:#fff
  style ZM1 fill:#b197fc,color:#fff
  style ZM2 fill:#ff8787,color:#fff

Latent arithmetic in GANs

GANs also learn latent spaces with linear structure, even though they’re trained very differently from word embedding models.

In the GAN latent space, you can find directions that correspond to visual attributes. For example, in a face GAN:

z_smiling = z_neutral + Δz_smile

where Δz_smile is a “smile direction” discovered by averaging the latent codes of smiling faces minus neutral faces.

StyleGAN makes this especially clean because its W space is more disentangled than the raw Z space. The mapping network z → w “unfolds” the latent space, making linear directions correspond more cleanly to single attributes.

Common latent arithmetic operations:

  • Attribute manipulation: add or subtract attribute directions
  • Style mixing: combine different layers’ latent codes from different images
  • Truncation: move latent codes toward the mean to increase quality at the cost of diversity
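The attribute-direction idea can be sketched with synthetic data. Here the “smile” factor is planted as a shift along one latent dimension so the recovered direction is checkable; with a real GAN the labeled codes would come from an attribute classifier, and z_edited would be fed to the generator:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

# Synthetic stand-ins for latent codes of labeled images. We plant the
# "smile" factor as a +2.0 shift along dimension 0 (an assumption made
# so the result is checkable; real codes come from a trained GAN).
z_smiling = rng.normal(size=(100, dim)) + 2.0 * np.eye(dim)[0]
z_neutral = rng.normal(size=(100, dim))

# Smile direction = difference of the class means in latent space.
delta_smile = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

# Manipulate a new code: push it along the smile direction.
z = rng.normal(size=dim)
z_edited = z + 1.5 * delta_smile  # decode with the generator to render it

print(delta_smile[0])  # close to the planted shift of 2.0
```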

Representation types comparison

| Type | Example | Dims needed for 50K concepts | Similarity built-in | Generalizes to unseen |
|---|---|---|---|---|
| One-hot (local) | Word indices | 50,000 | No (all equidistant) | No |
| Bag of words | Document vectors | 50,000 (sparse) | Partial (overlap-based) | Somewhat |
| Distributed (dense) | Word2Vec, GloVe | 100-300 | Yes (cosine similarity) | Yes |
| Contextual | BERT, GPT embeddings | 768-1024 | Yes (context-dependent) | Yes |
| VAE latent | Image features | 32-512 | Yes (Euclidean distance) | Yes (interpolation) |
| GAN latent | StyleGAN W space | 512 | Yes (linear structure) | Yes (arithmetic) |

Example 1: word embedding analogy

Given the following 3-dimensional word vectors:

v(king) = [0.8, 0.3, -0.1]
v(man) = [0.4, 0.2, -0.2]
v(woman) = [0.3, 0.5, 0.1]
v(queen) = [0.7, 0.6, 0.0]

Analogy computation: king - man + woman = ?

v(king) - v(man) + v(woman) = [0.8 - 0.4 + 0.3, 0.3 - 0.2 + 0.5, -0.1 - (-0.2) + 0.1] = [0.7, 0.6, 0.2]

Now compare this result [0.7, 0.6, 0.2] to candidate vectors using cosine similarity:

Cosine with queen [0.7, 0.6, 0.0]:

cos(r, v(queen)) = [(0.7)(0.7) + (0.6)(0.6) + (0.2)(0.0)] / [√(0.49 + 0.36 + 0.04) · √(0.49 + 0.36 + 0.0)] = (0.49 + 0.36 + 0) / (√0.89 · √0.85) = 0.85 / (0.9434 × 0.9220) = 0.85 / 0.8698 = 0.9772

Distractor 1: v(prince) = [0.6, 0.1, 0.3]

cos(r, v(prince)) = (0.42 + 0.06 + 0.06) / (√0.89 · √0.46) = 0.54 / (0.9434 × 0.6782) = 0.54 / 0.6398 = 0.8440

Distractor 2: v(castle) = [-0.1, 0.8, 0.4]

cos(r, v(castle)) = (-0.07 + 0.48 + 0.08) / (√0.89 · √0.81) = 0.49 / (0.9434 × 0.9) = 0.49 / 0.8491 = 0.5771

Results: queen (0.977), prince (0.844), castle (0.577). The analogy correctly identifies “queen” as the nearest neighbor. The difference vector v(king) - v(man) = [0.4, 0.1, 0.1] captures “royalty,” and adding it to “woman” lands closest to “queen.”
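The same computation in numpy, using the vectors from this example:

```python
import numpy as np

vecs = {
    "king":   np.array([0.8, 0.3, -0.1]),
    "man":    np.array([0.4, 0.2, -0.2]),
    "woman":  np.array([0.3, 0.5, 0.1]),
    "queen":  np.array([0.7, 0.6, 0.0]),
    "prince": np.array([0.6, 0.1, 0.3]),
    "castle": np.array([-0.1, 0.8, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = vecs["king"] - vecs["man"] + vecs["woman"]  # [0.7, 0.6, 0.2]

# Standard practice: exclude the query words from the candidate set.
scores = {w: cosine(r, v) for w, v in vecs.items()
          if w not in ("king", "man", "woman")}
best = max(scores, key=scores.get)
print(best, scores)  # "queen" scores highest, ~0.977
```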

Example 2: latent space interpolation

Given two latent codes:

z_1 = [2.0, -1.5, 0.3]
z_2 = [-1.0, 2.0, 0.8]

Linear interpolation at 4 points (t = 0, 0.33, 0.67, 1.0):

z_t = (1 - t) · z_1 + t · z_2

t = 0 (start):

z_0 = 1.0 · [2.0, -1.5, 0.3] + 0 · [-1.0, 2.0, 0.8] = [2.0, -1.5, 0.3]

t = 0.33:

z_0.33 = 0.67 · [2.0, -1.5, 0.3] + 0.33 · [-1.0, 2.0, 0.8] = [1.34, -1.005, 0.201] + [-0.33, 0.66, 0.264] = [1.01, -0.345, 0.465]

t = 0.67:

z_0.67 = 0.33 · [2.0, -1.5, 0.3] + 0.67 · [-1.0, 2.0, 0.8] = [0.66, -0.495, 0.099] + [-0.67, 1.34, 0.536] = [-0.01, 0.845, 0.635]

t = 1.0 (end):

z_1.0 = 0 · [2.0, -1.5, 0.3] + 1.0 · [-1.0, 2.0, 0.8] = [-1.0, 2.0, 0.8]

Summary of the interpolation path:

| t | z_t | Character |
|---|---|---|
| 0.00 | [2.0, -1.5, 0.3] | Source (e.g., face A) |
| 0.33 | [1.01, -0.345, 0.465] | Blend, mostly A |
| 0.67 | [-0.01, 0.845, 0.635] | Blend, mostly B |
| 1.00 | [-1.0, 2.0, 0.8] | Target (e.g., face B) |

Notice how the path crosses through the origin area at t ≈ 0.67. In a well-structured latent space, this would produce a coherent intermediate output. In a poorly structured space, you might get artifacts here, especially if the path passes through low-density regions.

The norm of each point: ‖z_0‖ = 2.52, ‖z_0.33‖ = 1.16, ‖z_0.67‖ = 1.06, ‖z_1.0‖ = 2.37. The midpoints have smaller norms than the endpoints. This is why spherical interpolation can work better: it keeps the norm more constant.
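A quick numpy check of this interpolation path and its norms:

```python
import numpy as np

z1 = np.array([2.0, -1.5, 0.3])
z2 = np.array([-1.0, 2.0, 0.8])

# Linear interpolation at the four values of t used above.
path = {t: (1 - t) * z1 + t * z2 for t in (0.0, 0.33, 0.67, 1.0)}
for t, zt in path.items():
    print(f"t={t:.2f}  z_t={np.round(zt, 3)}  ||z_t||={np.linalg.norm(zt):.2f}")
```

Swapping in slerp here would hold the norm near the endpoints' values instead of letting the path dip toward the origin.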

Example 3: disentanglement

Consider a 2D latent space encoding two factors: shape (3 values: circle, square, triangle) and color (4 values: red, green, blue, yellow). That’s 3 × 4 = 12 possible combinations.

Perfectly disentangled case:

Dimension 1 (z_1) encodes shape only:

  • Circle: z_1 ≈ -1
  • Square: z_1 ≈ 0
  • Triangle: z_1 ≈ 1

Dimension 2 (z_2) encodes color only:

  • Red: z_2 ≈ -1.5
  • Green: z_2 ≈ -0.5
  • Blue: z_2 ≈ 0.5
  • Yellow: z_2 ≈ 1.5

A red circle is at (-1, -1.5). A blue triangle is at (1, 0.5). Changing z_1 changes shape without affecting color. The 12 combinations form a clean grid in 2D.

MIG for this representation: For the shape factor, z_1 has very high mutual information (it perfectly predicts shape), while z_2 has near-zero mutual information with shape. The gap is large. Similarly for the color factor. MIG would be close to 1.0 (the maximum).

Entangled case:

Both dimensions contribute to both factors:

  • Red circle: (-1.2, -0.8)
  • Red square: (-0.5, -1.3)
  • Green circle: (0.3, -0.4)
  • Green square: (0.8, 0.2)

Now z_1 is correlated with both shape and color. To predict shape, you need to look at both dimensions. MIG would be low because the top-1 and top-2 informative dimensions have similar mutual information with each factor. There’s no single dimension that clearly owns a factor.
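Both cases can be scored with a toy implementation of MIG. The sketch below works with discrete factors and discrete latents (real MIG implementations estimate mutual information from continuous latent samples, typically by binning); the disentangled assignment scores 1.0 and the fully entangled one scores 0.0:

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """I(X; Y) in nats for paired discrete samples."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def entropy(x):
    n = len(x)
    return -sum((c / n) * np.log(c / n) for c in Counter(x).values())

def mig(factors, latents):
    """Mean normalized gap between the two latents most informative
    about each factor (the MIG definition above, discrete version)."""
    gaps = []
    for v in factors:
        mis = sorted((mutual_info(z, v) for z in latents), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(v))
    return float(np.mean(gaps))

# Toy dataset: all 12 shape/color combinations, equally represented.
shapes = ["circle", "square", "triangle"]
colors = ["red", "green", "blue", "yellow"]
data = [(s, c) for s in shapes for c in colors] * 50
shape = [s for s, c in data]
color = [c for s, c in data]

# Disentangled: latent 1 carries shape only, latent 2 carries color only.
mig_disentangled = mig([shape, color], [shape, color])

# Entangled: each latent mixes both factors (it encodes the full pair).
z_mix1 = [s + "/" + c for s, c in data]
z_mix2 = [c + "/" + s for s, c in data]
mig_entangled = mig([shape, color], [z_mix1, z_mix2])

print(mig_disentangled, mig_entangled)  # ≈ 1.0 and ≈ 0.0
```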

In practice, achieving perfect disentanglement requires either supervision (telling the model what the factors are) or strong inductive biases. The β-VAE encourages disentanglement by increasing the weight on the KL term in the ELBO, which pushes the posterior closer to a factorial prior. This comes at the cost of reconstruction quality, because the model is forced to use a simpler (more independent) latent code.

Why this matters for deep learning

Distributed representations are the reason deep learning works at all. Consider the alternative: if a network used one-hot representations internally, it would need to see every possible combination of features during training. With distributed representations, it can generalize from seen combinations to unseen ones.

A face recognition model doesn’t need to see every person under every lighting condition. If it learns separate features for identity and lighting (distributed and partially disentangled), seeing person A under light 1 and person B under light 2 helps it recognize person A under light 2. This combinatorial generalization is what makes neural networks data-efficient despite having millions of parameters.

The latent spaces of generative models take this further. They don’t just classify; they parameterize the data manifold. This enables generation, interpolation, and manipulation. Every advance in GAN quality, VAE expressiveness, or self-supervised representation learning is fundamentally an advance in learning better distributed representations.

Summary

Local (one-hot) representations are simple but wasteful and capture no relationships. Distributed representations spread meaning across many dimensions, enabling efficiency, similarity, and generalization. Learned latent spaces develop geometric structure: clustering, linear relationships, and smoothness. Disentangled representations cleanly separate factors of variation, enabling controllable generation. Interpolation and arithmetic in latent space demonstrate that these models learn meaningful, structured representations of their data.

What comes next

This article completes the core deep learning series on representations, generative models, and learning strategies. The next article, on AutoML and hyperparameter optimization, explores how to automate model design and tuning, moving from hand-crafted choices to learned ones.
