Variational Autoencoders
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.
What is a VAE, intuitively?
A VAE learns to compress data into a small code, then reconstruct it. The code captures meaningful features. Unlike a plain autoencoder that maps each input to a single point, a VAE maps each input to a region in latent space. This makes generation possible: sample any point from that space and decode it into something plausible.
Imagine describing faces with just 5 numbers. Each number controls one feature:
| Latent Dimension | Meaning | Example Value | Effect |
|---|---|---|---|
| z1 | Face width | 0.8 | Wide face |
| z2 | Nose size | -0.3 | Small nose |
| z3 | Hair darkness | 1.2 | Dark hair |
| z4 | Smile intensity | 0.5 | Slight smile |
| z5 | Age factor | -1.0 | Young |
Change z4 from 0.5 to 2.0 and you get a big grin. Change z5 from -1.0 to 1.5 and the face looks older. The VAE discovers these dimensions automatically from data, without labels telling it what each dimension should mean.
VAE pipeline: encode to latent code, decode back
graph LR
IMG["Input Image"] --> ENC["Encoder Network"]
ENC --> MU["Mean vector"]
ENC --> SIG["Std dev vector"]
MU --> Z["Latent code z"]
SIG --> Z
NOISE["Random noise"] -.-> Z
Z --> DEC["Decoder Network"]
DEC --> OUT["Reconstructed Image"]
style Z fill:#ff9,stroke:#333,color:#000
style NOISE fill:#f0f0f0,stroke:#999,color:#000
The encoder outputs a mean and a standard deviation. Instead of picking a single point, the VAE samples from this distribution. That sampling step requires a clever trick (reparameterization) to train with gradient descent.
Now let’s formalize the generative model and derive the ELBO objective.
A Variational Autoencoder (VAE) is a generative model that learns the distribution of data $x$ by introducing latent variables $z$. The encoder maps data $x$ to a distribution over $z$. The decoder maps samples of $z$ back to data. The training objective, the ELBO, balances reconstruction quality against latent space regularity.
The generative model
We assume the data is generated by a two-step process:
- Sample a latent variable $z$ from a prior $p(z)$, typically $p(z) = \mathcal{N}(0, I)$.
- Sample the data $x$ from a conditional distribution $p_\theta(x \mid z)$, parameterized by the decoder network.
The marginal likelihood of the data is:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is intractable for neural network decoders because it requires integrating over all possible values of $z$. We cannot compute $p_\theta(x)$ directly, and we cannot compute the true posterior $p_\theta(z \mid x)$ either, since by Bayes' theorem $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$, which requires the same intractable $p_\theta(x)$.
The encoder as approximate posterior
Since the true posterior is intractable, we introduce an encoder network $q_\phi(z \mid x)$ that approximates it. For each input $x$, the encoder outputs the parameters of a Gaussian:
$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)$$
The encoder network takes $x$ and produces two vectors: $\mu_\phi(x)$ (the mean) and $\sigma_\phi(x)$ (the standard deviation) of the approximate posterior.
The decoder as likelihood
The decoder network takes a latent vector $z$ and produces the parameters of $p_\theta(x \mid z)$. For continuous data (like images), this is often a Gaussian. For binary data, it is a Bernoulli. The decoder outputs either pixel means or logits, and the reconstruction loss follows accordingly.
graph LR
X["Input x"] --> ENC["Encoder"]
ENC --> MU["μ"]
ENC --> SIGMA["σ"]
MU --> SAMPLE["z = μ + σ ⊙ ε"]
SIGMA --> SAMPLE
EPS["ε ~ N(0,I)"] -.-> SAMPLE
SAMPLE --> DEC["Decoder"]
DEC --> XHAT["x̂"]
style SAMPLE fill:#ff9,stroke:#333,color:#000
style EPS fill:#f0f0f0,stroke:#999,color:#000
Figure 1: VAE architecture. The encoder produces mean μ and standard deviation σ. A sample z is drawn using the reparameterization trick (deterministic path through μ and σ, randomness from ε). The decoder reconstructs x from z.
Deriving the ELBO
We want to maximize $\log p_\theta(x)$. Start with an identity that decomposes the log-likelihood using the KL divergence to the true posterior:
$$\log p_\theta(x) = \mathrm{ELBO}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$
Since KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood:
$$\log p_\theta(x) \geq \mathrm{ELBO}(\theta, \phi; x)$$
The ELBO (Evidence Lower BOund) can be written as:
$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
This has two terms:
- Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures how well the decoder reconstructs $x$ from $z$.
- KL term: $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$ measures how close the encoder's approximate posterior is to the prior.
Maximizing the ELBO pushes both: better reconstruction and a more regular latent space.
The KL divergence to a standard Gaussian
When the prior is $p(z) = \mathcal{N}(0, I)$ and the approximate posterior is $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, the KL divergence has a closed-form solution:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$$
where $d$ is the dimensionality of $z$. No sampling needed. This is one of the reasons the Gaussian choice is so popular.
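This closed form is a one-liner in practice. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Matching the prior exactly gives zero divergence.
print(kl_to_standard_normal(np.zeros(5), np.ones(5)))  # 0.0
```

Any deviation of $\mu$ from 0 or $\sigma$ from 1 makes this quantity strictly positive.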
The reparameterization trick
To optimize the ELBO with gradient descent, we need gradients with respect to $\phi$ (the encoder parameters). The problem: the reconstruction term involves sampling $z \sim q_\phi(z \mid x)$, and you cannot backpropagate through a random sampling operation.
The reparameterization trick rewrites the sampling as a deterministic function of the parameters plus external noise:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reparameterization: moving randomness outside the learnable path
graph LR
MU["Learnable: mu"] --> Z["z = mu + sigma * epsilon"]
SIGMA["Learnable: sigma"] --> Z
EPS["Fixed: epsilon ~ N(0,I)"] -.-> Z
Z --> DEC["Decoder"]
DEC -->|"Gradient flows back"| Z
Z -->|"dz/dmu = 1"| MU
Z -->|"dz/dsigma = epsilon"| SIGMA
style EPS fill:#f0f0f0,stroke:#999,color:#000
style Z fill:#ff9,stroke:#333,color:#000
Randomness enters only through epsilon, which has no learnable parameters. Gradients pass through the deterministic path from decoder back to mu and sigma.
Now $z$ is a deterministic function of $\mu$, $\sigma$, and $\epsilon$. The randomness is isolated in $\epsilon$, which does not depend on any learnable parameter. Gradients flow through the deterministic path:
graph TD
subgraph Without["Without reparameterization"]
A1["μ, σ"] --> A2["Sample z ~ N(μ,σ²)"]
A2 --> A3["decoder(z)"]
A2 -.->|"✗ no gradient"| A1
end
subgraph With["With reparameterization"]
B1["μ, σ"] --> B3["z = μ + σ⊙ε"]
B2["ε ~ N(0,I)"] -.-> B3
B3 --> B4["decoder(z)"]
B4 -->|"✓ gradient flows"| B3
B3 -->|"✓ ∂z/∂μ = 1"| B1
end
style Without fill:#ffe6e6,stroke:#333,color:#000
style With fill:#e6ffe6,stroke:#333,color:#000
Figure 2: The reparameterization trick. Without it, sampling blocks gradient flow. With it, the sampling is rewritten as a deterministic transformation of learnable parameters plus fixed noise, and gradients flow through normally.
Example 2: Reparameterization in action
Given encoder outputs $\mu$ and $\sigma$, and a noise sample $\epsilon$, the reparameterized sample is:
$$z = \mu + \sigma\,\epsilon$$
The gradients are:
$$\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon$$
If the decoder's loss gives us $\frac{\partial \mathcal{L}}{\partial z}$, then by the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot 1, \qquad \frac{\partial \mathcal{L}}{\partial \sigma} = \frac{\partial \mathcal{L}}{\partial z} \cdot \epsilon$$
Without reparameterization, these gradients would not exist. The sampling operation has no gradient with respect to $\mu$ or $\sigma$ because sampling is not a differentiable function. The reparameterization trick turns it into one.
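The chain-rule argument can be verified numerically. A sketch with made-up values for $\mu$, $\sigma$, and $\epsilon$, comparing the analytic gradients against finite differences:

```python
# Hypothetical values for illustration: encoder outputs and one noise draw.
mu, sigma, eps = 0.5, 0.8, 0.3
z = mu + sigma * eps                     # reparameterized sample

# Analytic gradients: dz/dmu = 1, dz/dsigma = eps. Check with finite differences.
h = 1e-6
dz_dmu_numeric = ((mu + h + sigma * eps) - z) / h
dz_dsigma_numeric = ((mu + (sigma + h) * eps) - z) / h

assert abs(dz_dmu_numeric - 1.0) < 1e-4
assert abs(dz_dsigma_numeric - eps) < 1e-4
```

Because $z$ is now an ordinary differentiable expression, any autodiff framework computes these gradients automatically.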
VAE loss components
The total loss is the negative ELBO:
$$\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

| Term | Formula | Role | Effect When Too High | Effect When Too Low |
|---|---|---|---|---|
| Reconstruction | $-\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ | Measures decoding quality | Blurry outputs (underfitting) | Overfitting to training data |
| KL divergence | $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z))$ | Regularizes latent space | Posterior collapse (ignores $z$) | Irregular latent space, poor generation |
| Total (negative ELBO) | Reconstruction + KL | Balanced training | Model ignores data | Model memorizes data |
VAE loss: two forces in balance
graph LR
LOSS["Total VAE Loss"] --> RECON["Reconstruction Loss (how well does the decoder rebuild x?)"]
LOSS --> KL["KL Divergence (how far is the encoder from the prior?)"]
RECON --> SHARP["Push: make outputs crisp and accurate"]
KL --> SMOOTH["Push: keep latent space smooth and regular"]
style RECON fill:#4a9eff,color:#fff
style KL fill:#ff9,stroke:#333,color:#000
style SHARP fill:#e6f3ff,stroke:#333,color:#000
style SMOOTH fill:#fff3e6,stroke:#333,color:#000
These two forces compete. If reconstruction dominates, the encoder maps each input to a tight, distant point and generation suffers. If KL dominates, the encoder ignores the input and maps everything to the same region (posterior collapse).
Example 1: Computing the ELBO
Suppose for a given input $x$:
- Encoder outputs $\mu = 0.5$, $\sigma = 0.8$ (1D latent space for simplicity)
- Reconstruction loss (negative log-likelihood): 0.3
Step 1: Compute the KL divergence.
$$D_{\mathrm{KL}} = \frac{1}{2}\left(\mu^2 + \sigma^2 - 1 - \log \sigma^2\right) = \frac{1}{2}\left(0.25 + 0.64 - 1 + 0.446\right) \approx 0.168$$
Step 2: Compute the ELBO.
$$\mathrm{ELBO} = -0.3 - 0.168 = -0.468$$
Step 3: The total loss is the negative ELBO: $\mathcal{L} = 0.468$.
The reconstruction loss (0.3) dominates the KL loss (0.168). If we push $\mu$ toward 0 and $\sigma$ toward 1, we reduce the KL term but might hurt reconstruction (the encoder would be forced to map everything to the same region). The ELBO balances these two forces.
Note: if the encoder output were exactly $\mu = 0$, $\sigma = 1$ (matching the prior), then $D_{\mathrm{KL}} = 0$. The KL term penalizes any deviation from the standard Gaussian prior.
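The arithmetic above can be checked in a few lines, assuming $\mu = 0.5$ and $\sigma = 0.8$, a pair consistent with the stated KL of 0.168:

```python
import math

mu, sigma = 0.5, 0.8          # assumed 1D encoder outputs (reproduce the stated KL)
recon_loss = 0.3              # given negative log-likelihood

kl = 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))
elbo = -recon_loss - kl       # ELBO = reconstruction term minus KL
total_loss = -elbo            # training minimizes the negative ELBO

print(round(kl, 3), round(total_loss, 3))  # 0.168 0.468
```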
Posterior collapse
Sometimes the KL term wins too easily. The encoder learns to output $\mu \approx 0$ and $\sigma \approx 1$ for all inputs, making the KL term zero. The decoder then ignores $z$ entirely and generates output based only on its own internal biases.
This is posterior collapse. The latent variable becomes useless. The model degenerates into a standard autoregressive decoder (if the decoder is powerful enough).
Common remedies:
- KL annealing: start training with a small weight on the KL term and gradually increase it to 1.
- Free bits: set a minimum KL value per dimension, so the model is forced to use each latent dimension.
- Weaker decoders: use simpler decoders so the model must rely on $z$ to encode information.
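KL annealing, the first remedy, is simple to implement. A minimal sketch with a hypothetical warm-up length:

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: the weight ramps from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

# Used as: loss = recon_loss + kl_weight(step) * kl_loss
print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
```

Early in training the model is free to use $z$ for reconstruction; the regularization pressure arrives only gradually.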
VAE vs standard autoencoder
A standard autoencoder maps $x \to z \to \hat{x}$ with a deterministic encoder and decoder, minimizing reconstruction loss only. It has no probabilistic interpretation and no regularization of the latent space.
A VAE adds two crucial elements:
- The encoder outputs a distribution (not a point), so the latent space is smooth.
- The KL term forces the latent space to be organized (close to the prior $\mathcal{N}(0, I)$).
These two properties make the VAE latent space meaningful. You can interpolate between points, sample new data, and perform arithmetic in latent space. A standard autoencoder’s latent space has gaps and irregular structure that makes generation unreliable.
Autoencoder vs VAE: point vs distribution
graph TD
subgraph AE["Standard Autoencoder"]
AE_IN["Input x"] --> AE_ENC["Encoder"]
AE_ENC --> AE_Z["Single point z"]
AE_Z --> AE_DEC["Decoder"]
AE_DEC --> AE_OUT["Reconstruction"]
end
subgraph VAE_BLOCK["Variational Autoencoder"]
VAE_IN["Input x"] --> VAE_ENC["Encoder"]
VAE_ENC --> VAE_DIST["Distribution N(mu, sigma)"]
VAE_DIST --> VAE_SAMPLE["Sample z"]
VAE_SAMPLE --> VAE_DEC["Decoder"]
VAE_DEC --> VAE_OUT["Reconstruction"]
end
style AE_Z fill:#ff6b6b,color:#fff
style VAE_DIST fill:#51cf66,color:#fff
style AE fill:#ffe6e6,stroke:#333,color:#000
style VAE_BLOCK fill:#e6ffe6,stroke:#333,color:#000
The autoencoder maps to a single point. Gaps between points in latent space decode to garbage. The VAE maps to a distribution, and the KL penalty forces these distributions to overlap. Every region of the latent space decodes to something meaningful.
Latent space properties
Figure 3: 2D latent space of a VAE trained on digits 0 to 4. Each cluster corresponds to a different digit class, with smooth interpolation between them.
Because the KL term pushes $q_\phi(z \mid x)$ toward $\mathcal{N}(0, I)$, the latent space is:
- Continuous: nearby points in $z$-space decode to similar outputs.
- Complete: every point in $z$-space decodes to something plausible.
These properties enable interpolation: given two data points $x_1$ and $x_2$, encode them to $z_1$ and $z_2$, interpolate in latent space, and decode. The intermediate points should smoothly transition between the two inputs.
Example 3: Latent space interpolation
Given two latent codes $z_a$ and $z_b$, linear interpolation at 5 equally spaced points ($t = 0, 0.25, 0.5, 0.75, 1$) is:
$$z(t) = (1 - t)\, z_a + t\, z_b$$

| $t$ | $z(t)$ |
|---|---|
| 0.00 | $z_a$ |
| 0.25 | $0.75\, z_a + 0.25\, z_b$ |
| 0.50 | $0.5\, z_a + 0.5\, z_b$ |
| 0.75 | $0.25\, z_a + 0.75\, z_b$ |
| 1.00 | $z_b$ |

If $z_a$ encodes a smiling face and $z_b$ encodes a frowning face, the intermediate points should show gradually changing expressions. The smoothness of this transition is a direct consequence of the KL regularization.
In practice, spherical interpolation (slerp) often works better than linear interpolation because high-dimensional Gaussians concentrate their mass on a thin shell, not near the origin. Linear interpolation passes through the low-density center.
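A sketch of both interpolation schemes in numpy, assuming nothing beyond the formulas above (the 512-dimensional codes are random stand-ins for encoder outputs):

```python
import numpy as np

def lerp(za, zb, t):
    """Linear interpolation between two latent codes."""
    return (1 - t) * za + t * zb

def slerp(za, zb, t):
    """Spherical interpolation: move along the arc between za and zb."""
    omega = np.arccos(np.clip(
        np.dot(za, zb) / (np.linalg.norm(za) * np.linalg.norm(zb)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(za, zb, t)  # (nearly) parallel vectors: fall back to lerp
    return (np.sin((1 - t) * omega) * za + np.sin(t * omega) * zb) / np.sin(omega)

# In high dimensions, lerp's midpoint shrinks toward the low-density origin,
# while slerp's midpoint keeps roughly the norm of the endpoints.
rng = np.random.default_rng(0)
za, zb = rng.standard_normal(512), rng.standard_normal(512)
print(np.linalg.norm(lerp(za, zb, 0.5)) < np.linalg.norm(slerp(za, zb, 0.5)))  # True
```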
Training in practice
A typical VAE training loop:
- Sample a mini-batch of data $\{x_i\}$.
- For each $x_i$, encode to get $\mu_i$ and $\sigma_i$.
- Sample $\epsilon_i \sim \mathcal{N}(0, I)$ and compute $z_i = \mu_i + \sigma_i \odot \epsilon_i$.
- Decode $z_i$ to get $\hat{x}_i$.
- Compute the reconstruction loss and KL loss.
- Backpropagate through the entire network (reparameterization makes this possible).
- Update encoder and decoder parameters with Adam or another optimizer.
The reconstruction loss depends on the data type. For binary data (like binarized MNIST), use binary cross-entropy. For continuous data, use MSE or a Gaussian log-likelihood.
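Steps 1 through 5 of this loop can be sketched in numpy. This is a forward-pass sketch only, with hypothetical layer sizes and untrained random weights; steps 6 and 7 (backpropagation and the parameter update) are what an autodiff framework handles:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, Z = 784, 128, 16               # hypothetical sizes: input, hidden, latent

# Toy parameters; a real model would learn these.
W_enc = rng.standard_normal((D, H)) * 0.01
W_mu = rng.standard_normal((H, Z)) * 0.01
W_logvar = rng.standard_normal((H, Z)) * 0.01
W_dec = rng.standard_normal((Z, D)) * 0.01

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def vae_forward_loss(x):
    # Encode: predict mean and log-variance (log-variance keeps sigma positive).
    h = np.tanh(x @ W_enc)
    mu, logvar = h @ W_mu, h @ W_logvar
    # Reparameterize: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # Decode to Bernoulli pixel probabilities.
    x_hat = sigmoid(z @ W_dec)
    # Reconstruction loss: binary cross-entropy, summed over pixels.
    recon = -np.sum(x * np.log(x_hat + 1e-9)
                    + (1 - x) * np.log(1 - x_hat + 1e-9), axis=1)
    # Closed-form KL to the standard normal prior, summed over latent dimensions.
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=1)
    return np.mean(recon + kl)       # negative ELBO, averaged over the batch

x = (rng.random((32, D)) > 0.5).astype(float)   # fake binary mini-batch
loss = vae_forward_loss(x)
```

In a framework like PyTorch or JAX, calling backward/grad on this scalar loss completes steps 6 and 7.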
Common extensions
β-VAE: scales the KL term by a factor $\beta > 1$ to encourage disentanglement in the latent space. Each latent dimension should capture an independent factor of variation.
VQ-VAE: replaces the continuous latent space with a discrete codebook. Instead of sampling from a Gaussian, the encoder picks the nearest codebook vector. This avoids posterior collapse and produces sharper outputs.
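The nearest-codebook lookup at the heart of VQ-VAE can be sketched in a few lines of numpy (the codebook and encoder outputs here are made-up toy values):

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each encoder output with its nearest codebook vector."""
    # Squared distances between every encoder vector and every codebook entry.
    d = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = np.argmin(d, axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.9, 1.1], [0.1, -0.2]])
z_q, idx = quantize(z_e, codebook)
print(idx)  # [1 0]
```

The argmin is not differentiable, so VQ-VAE trains with a straight-through gradient estimator and extra codebook losses; the lookup itself is just nearest-neighbor search.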
Conditional VAE: conditions both encoder and decoder on additional information (class label, text, etc.). Generates data conditioned on a specific attribute.
Hierarchical VAE: uses multiple layers of latent variables, with each layer capturing structure at a different scale. This is the idea behind models like NVAE and VDVAE that produce high-quality images.
Why VAE outputs tend to be blurry
VAEs trained with pixel-wise MSE or Gaussian log-likelihood produce blurry images. This is not a bug; it is a direct consequence of the objective. When the decoder must explain the data with a single Gaussian per pixel, it averages over all plausible values.
Consider a dataset where a pixel is sometimes black and sometimes white. The optimal Gaussian mean is gray, the average. The model assigns maximum likelihood to the mean, producing a blurry compromise rather than a crisp sample.
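A tiny numpy illustration of this averaging effect, with made-up pixel data:

```python
import numpy as np

# A pixel that is black (0.0) in half the images and white (1.0) in the other half.
pixel_values = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

# Under a Gaussian likelihood (MSE), the best single prediction is the mean: gray.
best_constant = pixel_values.mean()
print(best_constant)  # 0.5

# Any crisp guess (0.0 or 1.0) has higher MSE than the blurry average.
mse = lambda c: np.mean((pixel_values - c) ** 2)
assert mse(best_constant) < mse(0.0) and mse(best_constant) < mse(1.0)
```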
Possible fixes include using more expressive decoders (autoregressive decoders, PixelCNN), using perceptual loss instead of pixel-wise loss, or switching to a discrete latent space (VQ-VAE). Each approach trades some of the VAE’s simplicity for sharper outputs.
Evaluating VAEs
ELBO as a metric. The ELBO itself serves as a training metric, but it is a lower bound, not the true log-likelihood. Two models with the same ELBO might have very different true likelihoods if one has a tighter bound.
Importance-weighted estimate. To get a tighter bound, use importance weighting with $K$ samples:
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right]$$
This is called the IWAE bound. As $K$ increases, it approaches the true log-likelihood.
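A toy sanity check, under assumptions not in the article: a 1D model with $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, so that $\log p(x)$ is known exactly (marginally $p(x) = \mathcal{N}(0, 2)$), and a deliberately mismatched proposal $q(z \mid x) = \mathcal{N}(0, 1)$. The $K = 1$ bound is the plain ELBO; larger $K$ tightens it:

```python
import numpy as np

rng = np.random.default_rng(0)

x = 1.0
# Marginally p(x) = N(0, 2), so log p(x) is available in closed form.
true_log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def log_w(z):
    # log [ p(x|z) p(z) / q(z) ]; with q = p(z) = N(0,1), only log p(x|z) remains.
    return -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2.0

def iwae_bound(K, n_outer=5000):
    """Monte Carlo estimate of the K-sample importance-weighted bound."""
    lw = log_w(rng.standard_normal((n_outer, K)))
    m = lw.max(axis=1, keepdims=True)            # stabilized log-mean-exp
    return float(np.mean(m[:, 0] + np.log(np.mean(np.exp(lw - m), axis=1))))

elbo = iwae_bound(K=1)     # K = 1 recovers the plain ELBO
tight = iwae_bound(K=10)   # larger K gives a tighter bound
print(round(elbo, 2), round(tight, 2))
```

With the mismatched proposal the single-sample ELBO sits well below $\log p(x)$, while the $K = 10$ estimate closes most of the gap.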
Reconstruction quality. Encode test images, decode them, and visually inspect. Good VAEs preserve identity and structure, even if details are smoothed.
Latent space structure. Encode the test set, color-code by class, and plot in 2D (using PCA or t-SNE if the latent dimension is greater than 2). A well-trained VAE clusters similar data together while keeping the overall distribution close to the prior.
What comes next
VAEs give us a principled way to learn latent representations and generate data, but their outputs tend to be blurry because the reconstruction loss (often MSE) averages over possible outputs. The next step is to look at models that can produce sharper results. GANs take a completely different approach: instead of maximizing likelihood, they train a generator to fool a discriminator, and the adversarial game produces remarkably sharp samples.