
Distributed representations and latent spaces

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with:

Local vs distributed representations

There are two fundamentally different ways to represent concepts as vectors.

Local representation (one-hot): Each concept gets its own dedicated dimension. “Cat” might be [1, 0, 0, 0], “dog” is [0, 1, 0, 0], “car” is [0, 0, 1, 0]. One neuron fires per concept. Simple, but wasteful. With 50,000 concepts, you need 50,000 dimensions. And the representation says nothing about relationships: cat and dog are the same distance apart as cat and car.

Distributed representation: Each concept is a pattern of activation across many dimensions. “Cat” might be [0.8, -0.2, 0.5, 0.1]. Multiple dimensions contribute to each concept, and each dimension participates in representing many concepts. This is what neural networks learn.

Why is distributed better? Three reasons:

  1. Efficiency: n binary dimensions can represent 2^n concepts, while a local representation can only represent n concepts with n dimensions. With 300 binary dimensions, you can already distinguish more patterns than there are atoms in the observable universe.

  2. Generalization: Similar concepts get similar representations. A classifier that learns to recognize cats gets some ability to recognize dogs for free, because their representations overlap.

  3. Compositionality: Dimensions can combine to express new concepts never seen in training. A model that has seen “red car” and “blue truck” might generalize to “red truck” because the color and vehicle-type features are partially separable.
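The distance claims are easy to verify numerically. The snippet below uses toy 4-dimensional vectors (illustrative values made up for this comparison, not learned embeddings):

```python
import numpy as np

# Local (one-hot): every concept gets its own dimension.
one_hot = {
    "cat": np.array([1.0, 0.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0, 0.0]),
}

# Distributed: dense toy vectors with overlapping activations.
dense = {
    "cat": np.array([0.8, -0.2, 0.5, 0.1]),
    "dog": np.array([0.7, -0.1, 0.4, 0.2]),
    "car": np.array([-0.3, 0.9, -0.1, 0.6]),
}

def dist(vecs, a, b):
    """Euclidean distance between the vectors for concepts a and b."""
    return float(np.linalg.norm(vecs[a] - vecs[b]))

# One-hot: every pair is exactly sqrt(2) apart -- no similarity structure.
print(dist(one_hot, "cat", "dog"), dist(one_hot, "cat", "car"))
# Distributed: cat is much closer to dog than to car.
print(dist(dense, "cat", "dog"), dist(dense, "cat", "car"))
```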

flowchart LR
  subgraph Local["Local (one-hot)"]
      direction TB
      C1["cat = [1,0,0,0]"]
      C2["dog = [0,1,0,0]"]
      C3["car = [0,0,1,0]"]
      C4["truck = [0,0,0,1]"]
  end

  subgraph Distributed["Distributed"]
      direction TB
      D1["cat = [0.8, −0.2, 0.5, 0.1]"]
      D2["dog = [0.7, −0.1, 0.4, 0.2]"]
      D3["car = [−0.3, 0.9, −0.1, 0.6]"]
      D4["truck = [−0.2, 0.8, −0.2, 0.7]"]
  end

  Local -->|"No similarity structure"| NOTE1["d(cat,dog) = d(cat,car) = √2"]
  Distributed -->|"Similar things are close"| NOTE2["d(cat,dog) < d(cat,car)"]

  style NOTE1 fill:#ff6b6b,color:#fff
  style NOTE2 fill:#51cf66,color:#fff

Representation capacity: local vs distributed encoding (N=5 neurons)

Geometry of learned latent spaces

When neural networks learn representations, the resulting latent spaces develop geometric structure. This structure is not imposed by the architecture. It emerges from the data and the training objective.

Clustering: Samples from the same class cluster together. In the latent space of an image classifier, all cat images end up near each other, far from car images. This is what makes classification easy: a linear boundary can separate the clusters.

Linear structure: Many latent spaces exhibit approximate linear relationships between concepts. The most famous example is from word embeddings: king - man + woman ≈ queen, as vectors. This linearity means that semantic relationships correspond to geometric directions.

Smoothness: In well-trained generative models (VAEs, GANs), nearby points in latent space produce similar outputs. If you smoothly move from one point to another, the generated output changes gradually. There are no sudden jumps. This smoothness enables interpolation.

Manifold structure: Real data often lives on a low-dimensional manifold within the high-dimensional input space. A face image has millions of pixels but only varies along a few meaningful dimensions (pose, expression, lighting). Good representations learn to parameterize this manifold.

Linear structure in word embeddings: analogies

Word embedding models like Word2Vec and GloVe learn distributed representations where linear relationships encode semantic analogies.

The analogy “king is to queen as man is to woman” translates to:

v(king) - v(man) + v(woman) ≈ v(queen)

Why does this work? The difference v(king) - v(man) captures the “royalty” direction. Adding this direction to v(woman) moves along the same semantic axis, landing near v(queen).

This works for many relationship types:

  • Gender: v(brother) - v(sister) ≈ v(king) - v(queen)
  • Country-capital: v(Paris) - v(France) ≈ v(Tokyo) - v(Japan)
  • Tense: v(walking) - v(walked) ≈ v(swimming) - v(swam)

The linear structure isn’t perfect. It works best for common, well-represented relationships. Rare words or complex multi-step analogies often fail.

Disentangled representations

A disentangled representation is one where each dimension (or a small group of dimensions) corresponds to a single, independent factor of variation in the data.

Consider images of faces. The underlying factors might include:

  • Pose (left/right rotation)
  • Expression (happy, sad, neutral)
  • Lighting direction
  • Hair color
  • Age

In a perfectly disentangled representation, changing one latent dimension would change only one factor. Sliding the “pose” dimension rotates the face without changing expression. Sliding “expression” changes the smile without moving the head.

In an entangled representation, dimensions mix multiple factors. Changing one dimension might simultaneously rotate the face AND change the lighting. This makes the representation harder to interpret and control.

Why do we want disentanglement? It enables:

  • Controllable generation: change one attribute without affecting others
  • Interpretability: each dimension has a clear meaning
  • Better transfer: disentangled features often transfer better to new tasks
  • Data efficiency: downstream tasks need fewer examples when features are clean

Measuring disentanglement

Several metrics quantify how disentangled a representation is:

Mutual Information Gap (MIG): For each ground-truth factor k, find the latent dimension with the highest mutual information with that factor. MIG is the gap between the top-1 and top-2 mutual informations, normalized:

MIG = (1/K) Σ_{k=1}^{K} [ I(z_{j(1)}; v_k) - I(z_{j(2)}; v_k) ] / H(v_k)

where j(1) is the latent dimension most informative about factor k, and j(2) is the second most informative. Higher MIG means each factor is captured by a single dimension (large gap between top-1 and top-2).

Separated Attribute Predictability (SAP): Train a simple classifier to predict each factor from each latent dimension. SAP measures the gap between the most predictive and second most predictive dimension for each factor. Similar idea to MIG but uses predictive accuracy instead of mutual information.

Informally, both metrics ask: “Is each factor clearly captured by one dimension, not smeared across many?” A high score means yes.

Latent space interpolation

Given two points in latent space, z_1 and z_2, interpolation generates intermediate points:

Linear interpolation (lerp):

z_t = (1 - t) · z_1 + t · z_2,  t ∈ [0, 1]

This is the simplest approach. For VAEs with Gaussian priors, spherical linear interpolation (slerp) often works better because it follows the surface of the hypersphere where most of the probability mass concentrates:

z_t = [sin((1 - t)θ) / sin θ] · z_1 + [sin(tθ) / sin θ] · z_2

where θ = arccos( (z_1 · z_2) / (‖z_1‖ ‖z_2‖) ).
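Both schemes are a few lines of numpy. A minimal sketch (function names are my own):

```python
import numpy as np

def lerp(z1, z2, t):
    """Linear interpolation: straight line from z1 to z2."""
    return (1 - t) * z1 + t * z2

def slerp(z1, z2, t):
    """Spherical linear interpolation: follows the arc between z1 and z2,
    keeping the norm roughly constant along the path."""
    cos_theta = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return lerp(z1, z2, t)  # nearly parallel vectors: slerp degenerates to lerp
    return (np.sin((1 - t) * theta) * z1 + np.sin(t * theta) * z2) / np.sin(theta)
```

For high-dimensional Gaussian latents, most probability mass lies near a sphere of radius √d, which is why slerp's roughly constant-norm path tends to stay in the high-density region.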

Good interpolation is a sign of a well-structured latent space. If intermediate points produce coherent outputs (e.g., smooth transitions between two faces), the model has learned a smooth manifold. If intermediate points produce garbage, the latent space has “holes” where no real data lives.

flowchart LR
  subgraph Latent["2D Latent Space"]
      Z1["z₁
(smiling face)"]
      ZM1["z₀.₃₃"]
      ZM2["z₀.₆₇"]
      Z2["z₂
(serious face)"]
      Z1 --> ZM1 --> ZM2 --> Z2
  end

  subgraph Outputs["Decoded Images"]
      I1["😊 Smiling"]
      I2["🙂 Slight smile"]
      I3["😐 Neutral"]
      I4["😑 Serious"]
  end

  Z1 -.-> I1
  ZM1 -.-> I2
  ZM2 -.-> I3
  Z2 -.-> I4

  style Z1 fill:#9775fa,color:#fff
  style Z2 fill:#ff6b6b,color:#fff
  style ZM1 fill:#b197fc,color:#fff
  style ZM2 fill:#ff8787,color:#fff

Latent arithmetic in GANs

GANs also learn latent spaces with linear structure, even though they’re trained very differently from word embedding models.

In the GAN latent space, you can find directions that correspond to visual attributes. For example, in a face GAN:

z_smiling = z_neutral + Δz_smile

where Δz_smile is a “smile direction” discovered by averaging the latent codes of smiling faces minus neutral faces.

StyleGAN makes this especially clean because its W space is more disentangled than the raw Z space. The mapping network z → w “unfolds” the latent space, making linear directions correspond more cleanly to single attributes.

Common latent arithmetic operations:

  • Attribute manipulation: add or subtract attribute directions
  • Style mixing: combine different layers’ latent codes from different images
  • Truncation: move latent codes toward the mean to increase quality at the cost of diversity
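The attribute-direction idea can be sketched with synthetic data. Here the “smile” factor is planted as a shift along one latent dimension so the recovered direction is checkable; with a real GAN the labeled codes would come from an attribute classifier, and z_edited would be fed to the generator:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

# Synthetic stand-ins for latent codes of labeled images. We plant the
# "smile" factor as a +2.0 shift along dimension 0 (an assumption made
# so the result is checkable; real codes come from a trained GAN).
z_smiling = rng.normal(size=(100, dim)) + 2.0 * np.eye(dim)[0]
z_neutral = rng.normal(size=(100, dim))

# Smile direction = difference of the class means in latent space.
delta_smile = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

# Manipulate a new code: push it along the smile direction.
z = rng.normal(size=dim)
z_edited = z + 1.5 * delta_smile  # decode with the generator to render it

print(delta_smile[0])  # close to the planted shift of 2.0
```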

Representation types comparison

| Type | Example | Dims needed for 50K concepts | Similarity built-in | Generalizes to unseen |
|---|---|---|---|---|
| One-hot (local) | Word indices | 50,000 | No (all equidistant) | No |
| Bag of words | Document vectors | 50,000 (sparse) | Partial (overlap-based) | Somewhat |
| Distributed (dense) | Word2Vec, GloVe | 100-300 | Yes (cosine similarity) | Yes |
| Contextual | BERT, GPT embeddings | 768-1024 | Yes (context-dependent) | Yes |
| VAE latent | Image features | 32-512 | Yes (Euclidean distance) | Yes (interpolation) |
| GAN latent | StyleGAN W space | 512 | Yes (linear structure) | Yes (arithmetic) |

Example 1: word embedding analogy

Given the following 3-dimensional word vectors:

v(king) = [0.8, 0.3, -0.1]
v(man) = [0.4, 0.2, -0.2]
v(woman) = [0.3, 0.5, 0.1]
v(queen) = [0.7, 0.6, 0.0]

Analogy computation: king - man + woman = ?

v(king) - v(man) + v(woman) = [0.8 - 0.4 + 0.3, 0.3 - 0.2 + 0.5, -0.1 - (-0.2) + 0.1] = [0.7, 0.6, 0.2]

Now compare this result [0.7, 0.6, 0.2] to candidate vectors using cosine similarity:

Cosine with queen [0.7, 0.6, 0.0]:

cos(r, v(queen)) = [(0.7)(0.7) + (0.6)(0.6) + (0.2)(0.0)] / [√(0.49 + 0.36 + 0.04) · √(0.49 + 0.36 + 0.0)] = (0.49 + 0.36 + 0) / (√0.89 · √0.85) = 0.85 / (0.9434 × 0.9220) = 0.85 / 0.8698 = 0.9772

Distractor 1: v(prince) = [0.6, 0.1, 0.3]

cos(r, v(prince)) = (0.42 + 0.06 + 0.06) / (√0.89 · √0.46) = 0.54 / (0.9434 × 0.6782) = 0.54 / 0.6398 = 0.8440

Distractor 2: v(castle) = [-0.1, 0.8, 0.4]

cos(r, v(castle)) = (-0.07 + 0.48 + 0.08) / (√0.89 · √0.81) = 0.49 / (0.9434 × 0.9) = 0.49 / 0.8491 = 0.5771

Results: queen (0.977), prince (0.844), castle (0.577). The analogy correctly identifies “queen” as the nearest neighbor. The difference vector v(king) - v(man) = [0.4, 0.1, 0.1] captures “royalty,” and adding it to “woman” lands closest to “queen.”
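The same computation in numpy, using the vectors from this example:

```python
import numpy as np

vecs = {
    "king":   np.array([0.8, 0.3, -0.1]),
    "man":    np.array([0.4, 0.2, -0.2]),
    "woman":  np.array([0.3, 0.5, 0.1]),
    "queen":  np.array([0.7, 0.6, 0.0]),
    "prince": np.array([0.6, 0.1, 0.3]),
    "castle": np.array([-0.1, 0.8, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = vecs["king"] - vecs["man"] + vecs["woman"]  # [0.7, 0.6, 0.2]

# Standard practice: exclude the query words from the candidate set.
scores = {w: cosine(r, v) for w, v in vecs.items()
          if w not in ("king", "man", "woman")}
best = max(scores, key=scores.get)
print(best, scores)  # "queen" scores highest, ~0.977
```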

Example 2: latent space interpolation

Given two latent codes:

z_1 = [2.0, -1.5, 0.3]
z_2 = [-1.0, 2.0, 0.8]

Linear interpolation at 4 points (t = 0, 0.33, 0.67, 1.0):

z_t = (1 - t) · z_1 + t · z_2

t = 0 (start):

z_0 = 1.0 · [2.0, -1.5, 0.3] + 0 · [-1.0, 2.0, 0.8] = [2.0, -1.5, 0.3]

t = 0.33:

z_0.33 = 0.67 · [2.0, -1.5, 0.3] + 0.33 · [-1.0, 2.0, 0.8] = [1.34, -1.005, 0.201] + [-0.33, 0.66, 0.264] = [1.01, -0.345, 0.465]

t = 0.67:

z_0.67 = 0.33 · [2.0, -1.5, 0.3] + 0.67 · [-1.0, 2.0, 0.8] = [0.66, -0.495, 0.099] + [-0.67, 1.34, 0.536] = [-0.01, 0.845, 0.635]

t = 1.0 (end):

z_1.0 = 0 · [2.0, -1.5, 0.3] + 1.0 · [-1.0, 2.0, 0.8] = [-1.0, 2.0, 0.8]

Summary of the interpolation path:

| t | z_t | Character |
|---|---|---|
| 0.00 | [2.0, -1.5, 0.3] | Source (e.g., face A) |
| 0.33 | [1.01, -0.345, 0.465] | Blend, mostly A |
| 0.67 | [-0.01, 0.845, 0.635] | Blend, mostly B |
| 1.00 | [-1.0, 2.0, 0.8] | Target (e.g., face B) |

Notice how the path crosses through the origin area at t ≈ 0.67. In a well-structured latent space, this would produce a coherent intermediate output. In a poorly structured space, you might get artifacts here, especially if the path passes through low-density regions.

The norm of each point: ‖z_0‖ = 2.52, ‖z_0.33‖ = 1.16, ‖z_0.67‖ = 1.06, ‖z_1.0‖ = 2.37. The midpoints have smaller norms than the endpoints. This is why spherical interpolation can work better: it keeps the norm more constant.
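A quick numpy check of this interpolation path and its norms:

```python
import numpy as np

z1 = np.array([2.0, -1.5, 0.3])
z2 = np.array([-1.0, 2.0, 0.8])

# Linear interpolation at the four values of t used above.
path = {t: (1 - t) * z1 + t * z2 for t in (0.0, 0.33, 0.67, 1.0)}
for t, zt in path.items():
    print(f"t={t:.2f}  z_t={np.round(zt, 3)}  ||z_t||={np.linalg.norm(zt):.2f}")
```

Swapping in slerp here would hold the norm near the endpoints' values instead of letting the path dip toward the origin.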

Example 3: disentanglement

Consider a 2D latent space encoding two factors: shape (3 values: circle, square, triangle) and color (4 values: red, green, blue, yellow). That’s 3 × 4 = 12 possible combinations.

Perfectly disentangled case:

Dimension 1 (z_1) encodes shape only:

  • Circle: z_1 ≈ -1
  • Square: z_1 ≈ 0
  • Triangle: z_1 ≈ 1

Dimension 2 (z_2) encodes color only:

  • Red: z_2 ≈ -1.5
  • Green: z_2 ≈ -0.5
  • Blue: z_2 ≈ 0.5
  • Yellow: z_2 ≈ 1.5

A red circle is at (-1, -1.5). A blue triangle is at (1, 0.5). Changing z_1 changes shape without affecting color. The 12 combinations form a clean grid in 2D.

MIG for this representation: For the shape factor, z_1 has very high mutual information (it perfectly predicts shape), while z_2 has near-zero mutual information with shape. The gap is large. Similarly for the color factor. MIG would be close to 1.0 (the maximum).

Entangled case:

Both dimensions contribute to both factors:

  • Red circle: (-1.2, -0.8)
  • Red square: (-0.5, -1.3)
  • Green circle: (0.3, -0.4)
  • Green square: (0.8, 0.2)

Now z_1 is correlated with both shape and color. To predict shape, you need to look at both dimensions. MIG would be low because the top-1 and top-2 informative dimensions have similar mutual information with each factor. There’s no single dimension that clearly owns a factor.
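Both cases can be scored with a toy implementation of MIG. The sketch below works with discrete factors and discrete latents (real MIG implementations estimate mutual information from continuous latent samples, typically by binning); the disentangled assignment scores 1.0 and the fully entangled one scores 0.0:

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """I(X; Y) in nats for paired discrete samples."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def entropy(x):
    n = len(x)
    return -sum((c / n) * np.log(c / n) for c in Counter(x).values())

def mig(factors, latents):
    """Mean normalized gap between the two latents most informative
    about each factor (the MIG definition above, discrete version)."""
    gaps = []
    for v in factors:
        mis = sorted((mutual_info(z, v) for z in latents), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(v))
    return float(np.mean(gaps))

# Toy dataset: all 12 shape/color combinations, equally represented.
shapes = ["circle", "square", "triangle"]
colors = ["red", "green", "blue", "yellow"]
data = [(s, c) for s in shapes for c in colors] * 50
shape = [s for s, c in data]
color = [c for s, c in data]

# Disentangled: latent 1 carries shape only, latent 2 carries color only.
mig_disentangled = mig([shape, color], [shape, color])

# Entangled: each latent mixes both factors (it encodes the full pair).
z_mix1 = [s + "/" + c for s, c in data]
z_mix2 = [c + "/" + s for s, c in data]
mig_entangled = mig([shape, color], [z_mix1, z_mix2])

print(mig_disentangled, mig_entangled)  # ≈ 1.0 and ≈ 0.0
```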

In practice, achieving perfect disentanglement requires either supervision (telling the model what the factors are) or strong inductive biases. The β-VAE encourages disentanglement by increasing the weight on the KL term in the ELBO, which pushes the posterior closer to a factorial prior. This comes at the cost of reconstruction quality, because the model is forced to use a simpler (more independent) latent code.

Why this matters for deep learning

Distributed representations are the reason deep learning works at all. Consider the alternative: if a network used one-hot representations internally, it would need to see every possible combination of features during training. With distributed representations, it can generalize from seen combinations to unseen ones.

A face recognition model doesn’t need to see every person under every lighting condition. If it learns separate features for identity and lighting (distributed and partially disentangled), seeing person A under light 1 and person B under light 2 helps it recognize person A under light 2. This combinatorial generalization is what makes neural networks data-efficient despite having millions of parameters.

The latent spaces of generative models take this further. They don’t just classify; they parameterize the data manifold. This enables generation, interpolation, and manipulation. Every advance in GAN quality, VAE expressiveness, or self-supervised representation learning is fundamentally an advance in learning better distributed representations.

Summary

Local (one-hot) representations are simple but wasteful and capture no relationships. Distributed representations spread meaning across many dimensions, enabling efficiency, similarity, and generalization. Learned latent spaces develop geometric structure: clustering, linear relationships, and smoothness. Disentangled representations cleanly separate factors of variation, enabling controllable generation. Interpolation and arithmetic in latent space demonstrate that these models learn meaningful, structured representations of their data.

What comes next

This article completes the core deep learning series on representations, generative models, and learning strategies. The next article, on AutoML and hyperparameter optimization, explores how to automate model design and tuning, moving from hand-crafted choices to learned ones.
