
Variational Autoencoders

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.

What is a VAE, intuitively?

A VAE learns to compress data into a small code, then reconstruct it. The code captures meaningful features. Unlike a plain autoencoder that maps each input to a single point, a VAE maps each input to a region in latent space. This makes generation possible: sample any point from that space and decode it into something plausible.

Imagine describing faces with just 5 numbers. Each number controls one feature:

| Latent Dimension | Meaning         | Example Value | Effect       |
|------------------|-----------------|---------------|--------------|
| z1               | Face width      | 0.8           | Wide face    |
| z2               | Nose size       | -0.3          | Small nose   |
| z3               | Hair darkness   | 1.2           | Dark hair    |
| z4               | Smile intensity | 0.5           | Slight smile |
| z5               | Age factor      | -1.0          | Young        |

Change z4 from 0.5 to 2.0 and you get a big grin. Change z5 from -1.0 to 1.5 and the face looks older. The VAE discovers these dimensions automatically from data, without labels telling it what each dimension should mean.

VAE pipeline: encode to latent code, decode back

graph LR
  IMG["Input Image"] --> ENC["Encoder Network"]
  ENC --> MU["Mean vector"]
  ENC --> SIG["Std dev vector"]
  MU --> Z["Latent code z"]
  SIG --> Z
  NOISE["Random noise"] -.-> Z
  Z --> DEC["Decoder Network"]
  DEC --> OUT["Reconstructed Image"]

  style Z fill:#ff9,stroke:#333,color:#000
  style NOISE fill:#f0f0f0,stroke:#999,color:#000

The encoder outputs a mean and a standard deviation. Instead of picking a single point, the VAE samples from this distribution. That sampling step requires a clever trick (reparameterization) to train with gradient descent.

Now let’s formalize the generative model and derive the ELBO objective.

A Variational Autoencoder (VAE) is a generative model that learns the distribution of data P(x) by introducing latent variables z. The encoder maps data to a distribution over z. The decoder maps samples of z back to data. The training objective, the ELBO, balances reconstruction quality against latent space regularity.

The generative model

We assume the data is generated by a two-step process:

  1. Sample a latent variable z from a prior P(z), typically N(0, I).
  2. Sample the data x from a conditional distribution P_θ(x | z), parameterized by the decoder network.

The marginal likelihood of the data is:

P_\theta(x) = \int P_\theta(x \mid z) \, P(z) \, dz

This integral is intractable for neural network decoders because it requires integrating over all possible z values. We cannot compute P_θ(x) directly, and we cannot compute the true posterior P_θ(z | x) either (since it requires P_θ(x) via Bayes’ theorem).
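To see the problem concretely, here is a naive Monte Carlo estimate of P(x) for a toy one-dimensional model. This is an illustrative sketch, not part of any real VAE implementation: the toy model uses P(z) = N(0, 1) and P_θ(x | z) = N(z, 1), so the true marginal is N(0, 2) and we can check the answer.

```python
import math
import random

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def naive_marginal(x, num_samples, rng=None):
    """Estimate P(x) = E_{z ~ P(z)}[P(x | z)] by sampling z from the prior.
    Toy model: P(z) = N(0, 1), P(x | z) = N(z, 1)."""
    rng = rng or random.Random(0)
    total = sum(math.exp(log_normal(x, rng.gauss(0.0, 1.0), 1.0))
                for _ in range(num_samples))
    return total / num_samples
```

For this toy model, sampling from the prior works because z is one-dimensional and every z explains x at least somewhat. For a real decoder, almost no z drawn from the prior assigns the given x non-negligible likelihood, so the estimator's variance explodes; that is exactly why an encoder q_φ(z | x) is introduced to guide the sampling.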

The encoder as approximate posterior

Since the true posterior P_θ(z | x) is intractable, we introduce an encoder network q_φ(z | x) that approximates it. For each input x, the encoder outputs the parameters of a Gaussian:

q_\phi(z \mid x) = \mathcal{N}\left(\mu_\phi(x), \; \operatorname{diag}(\sigma_\phi^2(x))\right)

The encoder network takes x and produces two vectors: μ (the mean) and σ (the standard deviation) of the approximate posterior.

The decoder as likelihood

The decoder network takes a latent vector z and produces parameters for P_θ(x | z). For continuous data (like images), this is often a Gaussian. For binary data, it is a Bernoulli. The decoder outputs either pixel means or logits, and the reconstruction loss follows accordingly.

graph LR
  X["Input x"] --> ENC["Encoder"]
  ENC --> MU["μ"]
  ENC --> SIGMA["σ"]
  MU --> SAMPLE["z = μ + σ ⊙ ε"]
  SIGMA --> SAMPLE
  EPS["ε ~ N(0,I)"] -.-> SAMPLE
  SAMPLE --> DEC["Decoder"]
  DEC --> XHAT["x̂"]
  style SAMPLE fill:#ff9,stroke:#333,color:#000
  style EPS fill:#f0f0f0,stroke:#999,color:#000

Figure 1: VAE architecture. The encoder produces mean μ and standard deviation σ. A sample z is drawn using the reparameterization trick (deterministic path through μ and σ, randomness from ε). The decoder reconstructs x from z.

Deriving the ELBO

We want to maximize log P_θ(x). Start with an identity from KL divergence:

\log P_\theta(x) = \text{ELBO} + D_{KL}\left(q_\phi(z \mid x) \,\|\, P_\theta(z \mid x)\right)

Since KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood:

\log P_\theta(x) \geq \text{ELBO}

The ELBO (Evidence Lower BOund) can be written as:

\text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log P_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, P(z)\right)

This has two terms:

  1. Reconstruction term: E_{q_φ(z|x)}[log P_θ(x|z)] measures how well the decoder reconstructs x from z.
  2. KL term: D_KL(q_φ(z|x) ‖ P(z)) measures how close the encoder’s approximate posterior is to the prior.

Maximizing the ELBO pushes both: better reconstruction and a more regular latent space.

The KL divergence to a standard Gaussian

When the prior is P(z) = N(0, I) and the approximate posterior is q_φ(z | x) = N(μ, diag(σ²)), the KL divergence has a closed-form solution:

D_{KL} = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1\right)

where d is the dimensionality of z. No sampling needed. This is one of the reasons the Gaussian choice is so popular.
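The closed form is easy to sanity-check in a few lines of Python (a minimal sketch; the function name is ours):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

# Matching the prior exactly gives zero KL
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0
```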

The reparameterization trick

To optimize the ELBO with gradient descent, we need gradients with respect to φ (the encoder parameters). The problem: the reconstruction term involves sampling z ~ q_φ(z | x), and you cannot backpropagate through a random sampling operation.

The reparameterization trick rewrites the sampling as a deterministic function of the parameters plus external noise:

Reparameterization: moving randomness outside the learnable path

graph LR
  MU["Learnable: mu"] --> Z["z = mu + sigma * epsilon"]
  SIGMA["Learnable: sigma"] --> Z
  EPS["Fixed: epsilon ~ N(0,I)"] -.-> Z
  Z --> DEC["Decoder"]
  DEC -->|"Gradient flows back"| Z
  Z -->|"dz/dmu = 1"| MU
  Z -->|"dz/dsigma = epsilon"| SIGMA

  style EPS fill:#f0f0f0,stroke:#999,color:#000
  style Z fill:#ff9,stroke:#333,color:#000

Randomness enters only through epsilon, which has no learnable parameters. Gradients pass through the deterministic path from decoder back to mu and sigma.

z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Now z is a deterministic function of μ, σ, and ε. The randomness is in ε, which does not depend on any learnable parameter. Gradients flow through the deterministic path:

\frac{\partial z}{\partial \mu} = I, \quad \frac{\partial z}{\partial \sigma} = \operatorname{diag}(\epsilon)

graph TD
  subgraph Without["Without reparameterization"]
      A1["μ, σ"] --> A2["Sample z ~ N(μ,σ²)"]
      A2 --> A3["decoder(z)"]
      A2 -.->|"✗ no gradient"| A1
  end
  subgraph With["With reparameterization"]
      B1["μ, σ"] --> B3["z = μ + σ⊙ε"]
      B2["ε ~ N(0,I)"] -.-> B3
      B3 --> B4["decoder(z)"]
      B4 -->|"✓ gradient flows"| B3
      B3 -->|"✓ ∂z/∂μ = 1"| B1
  end
  style Without fill:#ffe6e6,stroke:#333,color:#000
  style With fill:#e6ffe6,stroke:#333,color:#000

Figure 2: The reparameterization trick. Without it, sampling blocks gradient flow. With it, the sampling is rewritten as a deterministic transformation of learnable parameters plus fixed noise, and gradients flow through normally.

Example 2: Reparameterization in action

Given encoder outputs μ = 0.5, σ = 0.8, and a noise sample ε = 1.2:

z = \mu + \sigma \cdot \epsilon = 0.5 + 0.8 \times 1.2 = 0.5 + 0.96 = 1.46

The gradients are:

\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon = 1.2

If the decoder’s loss gives us ∂L/∂z = 0.3, then by the chain rule:

\frac{\partial \mathcal{L}}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \mu} = 0.3 \times 1 = 0.3

\frac{\partial \mathcal{L}}{\partial \sigma} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \sigma} = 0.3 \times 1.2 = 0.36

Without reparameterization, these gradients would not exist. The sampling operation z ~ N(μ, σ²) has no gradient with respect to μ or σ because sampling is not a differentiable function. The reparameterization trick turns it into one.
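The chain-rule arithmetic above can be verified numerically with a finite-difference check (plain Python; names are illustrative):

```python
def reparam(mu, sigma, eps):
    """Deterministic sample: z = mu + sigma * eps."""
    return mu + sigma * eps

mu, sigma, eps = 0.5, 0.8, 1.2
z = reparam(mu, sigma, eps)                           # 1.46, as in Example 2

# Finite-difference check of the analytic gradients
h = 1e-6
dz_dmu = (reparam(mu + h, sigma, eps) - z) / h        # close to 1.0
dz_dsigma = (reparam(mu, sigma + h, eps) - z) / h     # close to eps = 1.2

dL_dz = 0.3                    # upstream gradient from the decoder loss
dL_dmu = dL_dz * dz_dmu        # close to 0.3
dL_dsigma = dL_dz * dz_dsigma  # close to 0.36
```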

VAE loss components

The total loss is:

\mathcal{L}_{\text{VAE}} = -\text{ELBO} = \underbrace{-\mathbb{E}_{q_\phi}[\log P_\theta(x \mid z)]}_{\text{reconstruction loss}} + \underbrace{D_{KL}(q_\phi(z \mid x) \,\|\, P(z))}_{\text{KL loss}}

| Term                  | Formula             | Role                      | Effect When Too High           | Effect When Too Low                     |
|-----------------------|---------------------|---------------------------|--------------------------------|-----------------------------------------|
| Reconstruction        | −E[log P_θ(x ∣ z)]  | Measures decoding quality | Blurry outputs (underfitting)  | Overfitting to training data            |
| KL divergence         | D_KL(q ‖ p)         | Regularizes latent space  | Posterior collapse (ignores z) | Irregular latent space, poor generation |
| Total (negative ELBO) | Reconstruction + KL | Balanced training         | Model ignores data             | Model memorizes data                    |

VAE loss: two forces in balance

graph LR
  LOSS["Total VAE Loss"] --> RECON["Reconstruction Loss (how well does the decoder rebuild x?)"]
  LOSS --> KL["KL Divergence (how far is the encoder from the prior?)"]
  RECON --> SHARP["Push: make outputs crisp and accurate"]
  KL --> SMOOTH["Push: keep latent space smooth and regular"]

  style RECON fill:#4a9eff,color:#fff
  style KL fill:#ff9,stroke:#333,color:#000
  style SHARP fill:#e6f3ff,stroke:#333,color:#000
  style SMOOTH fill:#fff3e6,stroke:#333,color:#000

These two forces compete. If reconstruction dominates, the encoder maps each input to a tight, distant point and generation suffers. If KL dominates, the encoder ignores the input and maps everything to the same region (posterior collapse).

Example 1: Computing the ELBO

Suppose for a given input xx:

  • Encoder outputs μ = 0.5, σ = 0.8 (1D latent space for simplicity)
  • Reconstruction loss (negative log-likelihood): 0.3

Step 1: Compute the KL divergence.

D_{KL} = \frac{1}{2}\left(\mu^2 + \sigma^2 - \ln \sigma^2 - 1\right)
       = \frac{1}{2}\left(0.25 + 0.64 - \ln(0.64) - 1\right)
       = \frac{1}{2}\left(0.25 + 0.64 - (-0.446) - 1\right)
       = \frac{1}{2}(0.336) = 0.168

Step 2: Compute the ELBO.

\text{ELBO} = \underbrace{(-0.3)}_{\text{reconstruction}} - \underbrace{0.168}_{\text{KL}} = -0.468

Step 3: The total loss (negative ELBO).

\mathcal{L}_{\text{VAE}} = -\text{ELBO} = 0.468

The reconstruction loss (0.3) dominates over the KL loss (0.168). Pushing μ toward 0 and σ toward 1 would reduce the KL term, but it might hurt reconstruction (every input would be encoded near the same region, so the decoder loses information about which x it is reconstructing). The ELBO balances these two forces.

Note: if the encoder output were exactly μ = 0, σ = 1 (matching the prior), then D_KL = 0. The KL term penalizes any deviation from the standard Gaussian prior.
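Example 1 can be checked end to end with a few lines of Python (the numbers are the ones above):

```python
import math

mu, sigma = 0.5, 0.8     # encoder outputs (1D latent space)
recon_loss = 0.3         # negative log-likelihood from the decoder

kl = 0.5 * (mu ** 2 + sigma ** 2 - math.log(sigma ** 2) - 1)   # about 0.168
elbo = -recon_loss - kl                                        # about -0.468
total_loss = -elbo                                             # about 0.468
```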

Posterior collapse

Sometimes the KL term wins too easily. The encoder learns to set q_φ(z | x) = P(z) = N(0, I) for all inputs, making the KL term zero. The decoder then ignores z entirely and generates output based only on its own internal biases.

This is posterior collapse. The latent variable becomes useless, and the model degenerates into an unconditional generator (a sufficiently powerful decoder, such as an autoregressive one, can model the data without using z at all).

Common remedies:

  • KL annealing: start training with a small weight on the KL term and gradually increase it to 1.
  • Free bits: set a minimum KL value per dimension, so the model is forced to use each latent dimension.
  • Weaker decoders: use simpler decoders so the model must rely on z to encode information.
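The first two remedies are simple enough to sketch directly (function names and default constants here are illustrative choices, not from any specific library):

```python
def kl_weight(step, warmup_steps=10_000):
    """KL annealing: linearly ramp the KL coefficient from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, min_nats=0.5):
    """Free bits: each latent dimension contributes at least min_nats to the KL
    term, so shrinking a dimension's KL below the floor yields no further reward."""
    return sum(max(kl, min_nats) for kl in kl_per_dim)
```

During training, the total loss would then be `recon_loss + kl_weight(step) * free_bits_kl(kl_per_dim)` or a similar combination, depending on which remedies are in use.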

VAE vs standard autoencoder

A standard autoencoder maps x → z → x̂ with a deterministic encoder and decoder, minimizing reconstruction loss only. It has no probabilistic interpretation and no regularization of the latent space.

A VAE adds two crucial elements:

  1. The encoder outputs a distribution (not a point), so the latent space is smooth.
  2. The KL term forces the latent space to be organized (close to N(0, I)).

These two properties make the VAE latent space meaningful. You can interpolate between points, sample new data, and perform arithmetic in latent space. A standard autoencoder’s latent space has gaps and irregular structure that makes generation unreliable.

Autoencoder vs VAE: point vs distribution

graph TD
  subgraph AE["Standard Autoencoder"]
      AE_IN["Input x"] --> AE_ENC["Encoder"]
      AE_ENC --> AE_Z["Single point z"]
      AE_Z --> AE_DEC["Decoder"]
      AE_DEC --> AE_OUT["Reconstruction"]
  end
  subgraph VAE_BLOCK["Variational Autoencoder"]
      VAE_IN["Input x"] --> VAE_ENC["Encoder"]
      VAE_ENC --> VAE_DIST["Distribution N(mu, sigma)"]
      VAE_DIST --> VAE_SAMPLE["Sample z"]
      VAE_SAMPLE --> VAE_DEC["Decoder"]
      VAE_DEC --> VAE_OUT["Reconstruction"]
  end

  style AE_Z fill:#ff6b6b,color:#fff
  style VAE_DIST fill:#51cf66,color:#fff
  style AE fill:#ffe6e6,stroke:#333,color:#000
  style VAE_BLOCK fill:#e6ffe6,stroke:#333,color:#000

The autoencoder maps to a single point. Gaps between points in latent space decode to garbage. The VAE maps to a distribution, and the KL penalty forces these distributions to overlap. Every region of the latent space decodes to something meaningful.

Latent space properties

2D latent space of a VAE trained on digits 0 to 4. Each cluster corresponds to a different digit class, with smooth interpolation between them.

Because the KL term pushes q_φ(z | x) toward N(0, I), the latent space is:

  • Continuous: nearby points in z-space decode to similar outputs.
  • Complete: every point in z-space decodes to something plausible.

These properties enable interpolation: given two data points x1 and x2, encode them to z1 and z2, interpolate in latent space, and decode. The intermediate points should smoothly transition between the two inputs.

Example 3: Latent space interpolation

Given two latent codes:

z_1 = [1.2, -0.5], \quad z_2 = [-0.8, 1.1]

Linear interpolation at 5 equally spaced points (α = 0, 0.25, 0.5, 0.75, 1.0):

z(\alpha) = (1 - \alpha) \, z_1 + \alpha \, z_2

| α    | z(α)            |
|------|-----------------|
| 0.00 | [1.200, -0.500] |
| 0.25 | [0.700, -0.100] |
| 0.50 | [0.200, 0.300]  |
| 0.75 | [-0.300, 0.700] |
| 1.00 | [-0.800, 1.100] |

If z1 encodes a smiling face and z2 encodes a frowning face, the intermediate points should show gradually changing expressions. The smoothness of this transition is a direct consequence of the KL regularization.

In practice, spherical interpolation (slerp) often works better than linear interpolation because high-dimensional Gaussians concentrate their mass on a thin shell, not near the origin. Linear interpolation passes through the low-density center.
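Both interpolation schemes fit in a few lines. This sketch (function names ours) reproduces the linear case from Example 3 and adds slerp:

```python
import math

def lerp(z1, z2, a):
    """Linear interpolation between two latent codes."""
    return [(1 - a) * u + a * v for u, v in zip(z1, z2)]

def slerp(z1, z2, a):
    """Spherical interpolation: stays near the shell where a
    high-dimensional Gaussian concentrates its mass."""
    dot = sum(u * v for u, v in zip(z1, z2))
    norm1 = math.sqrt(sum(u * u for u in z1))
    norm2 = math.sqrt(sum(v * v for v in z2))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm1 * norm2))))
    so = math.sin(omega)
    if so < 1e-8:                      # (nearly) parallel: fall back to lerp
        return lerp(z1, z2, a)
    return [(math.sin((1 - a) * omega) * u + math.sin(a * omega) * v) / so
            for u, v in zip(z1, z2)]

print([round(v, 3) for v in lerp([1.2, -0.5], [-0.8, 1.1], 0.25)])  # [0.7, -0.1]
```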

Training in practice

A typical VAE training loop:

  1. Sample a mini-batch of data {x_i}.
  2. For each x_i, encode to get μ_i and σ_i.
  3. Sample ε_i ~ N(0, I) and compute z_i = μ_i + σ_i ⊙ ε_i.
  4. Decode z_i to get x̂_i.
  5. Compute the reconstruction loss and KL loss.
  6. Backpropagate through the entire network (reparameterization makes this possible).
  7. Update encoder and decoder parameters with Adam or another optimizer.

The reconstruction loss depends on the data type. For binary data (like binarized MNIST), use binary cross-entropy. For continuous data, use MSE or a Gaussian log-likelihood.
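For binarized data, the reconstruction term is the Bernoulli negative log-likelihood summed over pixels. A minimal stdlib sketch (the function name is ours; clamping guards against log(0)):

```python
import math

def bce_recon_loss(x, x_hat, eps=1e-7):
    """Bernoulli negative log-likelihood, summed over pixels.
    x: binary targets in {0, 1}; x_hat: decoder outputs in (0, 1)."""
    total = 0.0
    for target, p in zip(x, x_hat):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total
```

Confident, correct predictions give a small loss; e.g. `bce_recon_loss([1, 0], [0.9, 0.1])` is about 0.21, while `bce_recon_loss([1, 0], [0.5, 0.5])` is about 1.39.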

Common extensions

β-VAE: scales the KL term by a factor β > 1 to encourage disentanglement in the latent space. Each latent dimension should capture an independent factor of variation.

VQ-VAE: replaces the continuous latent space with a discrete codebook. Instead of sampling from a Gaussian, the encoder picks the nearest codebook vector. This avoids posterior collapse and produces sharper outputs.

Conditional VAE: conditions both encoder and decoder on additional information (class label, text, etc.). Generates data conditioned on a specific attribute.

Hierarchical VAE: uses multiple layers of latent variables, with each layer capturing structure at a different scale. This is the idea behind models like NVAE and VDVAE that produce high-quality images.

Why VAE outputs tend to be blurry

VAEs trained with pixel-wise MSE or Gaussian log-likelihood produce blurry images. This is not a bug; it is a direct consequence of the objective. When the decoder must explain the data with a single Gaussian per pixel, it averages over all plausible values.

Consider a dataset where a pixel is sometimes black and sometimes white. The optimal Gaussian mean is gray, the average. The model assigns maximum likelihood to the mean, producing a blurry compromise rather than a crisp sample.
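A tiny numeric experiment makes the averaging effect concrete (illustrative values):

```python
pixels = [0.0, 1.0] * 500   # a pixel that is black half the time, white half

def mse(pred):
    """Mean squared error of always predicting the constant `pred` for this pixel."""
    return sum((p - pred) ** 2 for p in pixels) / len(pixels)

# The MSE-optimal constant prediction is the data mean: gray, not black or white.
assert mse(0.5) < mse(0.0) and mse(0.5) < mse(1.0)
```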

Possible fixes include using more expressive decoders (autoregressive decoders, PixelCNN), using perceptual loss instead of pixel-wise loss, or switching to a discrete latent space (VQ-VAE). Each approach trades some of the VAE’s simplicity for sharper outputs.

Evaluating VAEs

ELBO as a metric. The ELBO itself serves as a training metric, but it is a lower bound, not the true log-likelihood. Two models with the same ELBO might have very different true likelihoods if one has a tighter bound.

Importance-weighted estimate. To get a tighter bound, use importance weighting with K samples:

\log P(x) \geq \mathbb{E}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{P_\theta(x \mid z_k) \, P(z_k)}{q_\phi(z_k \mid x)}\right]

This is called the IWAE bound. As K increases, it approaches the true log-likelihood.
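The bound can be sketched for a toy one-dimensional model (everything here, including the model and function names, is illustrative):

```python
import math
import random

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def iwae_bound(x, q_mu, q_sigma, decoder, K, rng=None):
    """K-sample importance-weighted bound for a toy 1D model with prior N(0, 1).
    decoder(z) returns (mu_x, sigma_x) of P(x | z)."""
    rng = rng or random.Random(0)
    log_w = []
    for _ in range(K):
        z = q_mu + q_sigma * rng.gauss(0.0, 1.0)
        mu_x, sigma_x = decoder(z)
        log_w.append(log_normal(x, mu_x, sigma_x)       # log P(x | z)
                     + log_normal(z, 0.0, 1.0)          # + log P(z)
                     - log_normal(z, q_mu, q_sigma))    # - log q(z | x)
    m = max(log_w)                                      # log-sum-exp for stability
    return m + math.log(sum(math.exp(w - m) for w in log_w) / K)
```

A useful property for testing: with `decoder = lambda z: (z, 1.0)` the true marginal is N(0, 2), and if q_φ(z | x) equals the exact posterior N(x/2, √(1/2)), every importance weight equals P(x), so the bound is exact for any K.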

Reconstruction quality. Encode test images, decode them, and visually inspect. Good VAEs preserve identity and structure, even if details are smoothed.

Latent space structure. Encode the test set, color-code by class, and plot in 2D (using PCA or t-SNE if the latent dimension is greater than 2). A well-trained VAE clusters similar data together while keeping the overall distribution close to the prior.

What comes next

VAEs give us a principled way to learn latent representations and generate data, but their outputs tend to be blurry because the reconstruction loss (often MSE) averages over possible outputs. The next step is to look at models that can produce sharper results. GANs take a completely different approach: instead of maximizing likelihood, they train a generator to fool a discriminator, and the adversarial game produces remarkably sharp samples.
