Variational Autoencoders
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.
What is a VAE, intuitively?
A VAE learns to compress data into a small code, then reconstruct it. The code captures meaningful features. Unlike a plain autoencoder that maps each input to a single point, a VAE maps each input to a region in latent space. This makes generation possible: sample any point from that space and decode it into something plausible.
Imagine describing faces with just 5 numbers. Each number controls one feature:
| Latent Dimension | Meaning | Example Value | Effect |
|---|---|---|---|
| z1 | Face width | 0.8 | Wide face |
| z2 | Nose size | -0.3 | Small nose |
| z3 | Hair darkness | 1.2 | Dark hair |
| z4 | Smile intensity | 0.5 | Slight smile |
| z5 | Age factor | -1.0 | Young |
Change z4 from 0.5 to 2.0 and you get a big grin. Change z5 from -1.0 to 1.5 and the face looks older. The VAE discovers these dimensions automatically from data, without labels telling it what each dimension should mean.
VAE pipeline: encode to latent code, decode back
graph LR
IMG["Input Image"] --> ENC["Encoder Network"]
ENC --> MU["Mean vector"]
ENC --> SIG["Std dev vector"]
MU --> Z["Latent code z"]
SIG --> Z
NOISE["Random noise"] -.-> Z
Z --> DEC["Decoder Network"]
DEC --> OUT["Reconstructed Image"]
style Z fill:#ff9,stroke:#333,color:#000
style NOISE fill:#f0f0f0,stroke:#999,color:#000
The encoder outputs a mean and a standard deviation. Instead of picking a single point, the VAE samples from this distribution. That sampling step requires a clever trick (reparameterization) to train with gradient descent.
Now let’s formalize the generative model and derive the ELBO objective.
A Variational Autoencoder (VAE) is a generative model that learns the distribution of data $x$ by introducing latent variables $z$. The encoder maps data $x$ to a distribution over $z$. The decoder maps samples of $z$ back to data. The training objective, the ELBO, balances reconstruction quality against latent space regularity.
The generative model
We assume the data is generated by a two-step process:
- Sample a latent variable $z$ from a prior $p(z)$, typically $p(z) = \mathcal{N}(0, I)$.
- Sample the data $x$ from a conditional distribution $p_\theta(x \mid z)$, parameterized by the decoder network.
The marginal likelihood of the data is:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is intractable for neural network decoders because it requires integrating over all possible values of $z$. We cannot compute $p_\theta(x)$ directly, and we cannot compute the true posterior $p_\theta(z \mid x)$ either, since by Bayes' theorem $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$, which requires the same intractable $p_\theta(x)$.
The encoder as approximate posterior
Since the true posterior is intractable, we introduce an encoder network $q_\phi(z \mid x)$ that approximates it. For each input $x$, the encoder outputs the parameters of a Gaussian:
$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)$$
The encoder network takes $x$ and produces two vectors: $\mu_\phi(x)$ (the mean) and $\sigma_\phi(x)$ (the standard deviation) of the approximate posterior.
The decoder as likelihood
The decoder network takes a latent vector $z$ and produces the parameters of $p_\theta(x \mid z)$. For continuous data (like images), this is often a Gaussian. For binary data, it is a Bernoulli. The decoder outputs either pixel means or logits, and the reconstruction loss follows accordingly.
graph LR
X["Input x"] --> ENC["Encoder"]
ENC --> MU["μ"]
ENC --> SIGMA["σ"]
MU --> SAMPLE["z = μ + σ ⊙ ε"]
SIGMA --> SAMPLE
EPS["ε ~ N(0,I)"] -.-> SAMPLE
SAMPLE --> DEC["Decoder"]
DEC --> XHAT["x̂"]
style SAMPLE fill:#ff9,stroke:#333,color:#000
style EPS fill:#f0f0f0,stroke:#999,color:#000
Figure 1: VAE architecture. The encoder produces mean μ and standard deviation σ. A sample z is drawn using the reparameterization trick (deterministic path through μ and σ, randomness from ε). The decoder reconstructs x from z.
Deriving the ELBO
We want to maximize $\log p_\theta(x)$. Start with an identity that decomposes the log-likelihood using the KL divergence to the true posterior:
$$\log p_\theta(x) = \mathrm{ELBO}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$
Since KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood:
$$\log p_\theta(x) \geq \mathrm{ELBO}(\theta, \phi; x)$$
The ELBO (Evidence Lower BOund) can be written as:
$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
This has two terms:
- Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures how well the decoder reconstructs $x$ from $z$.
- KL term: $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$ measures how close the encoder's approximate posterior is to the prior.
Maximizing the ELBO pushes both: better reconstruction and a more regular latent space.
The KL divergence to a standard Gaussian
When the prior is $p(z) = \mathcal{N}(0, I)$ and the approximate posterior is $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, the KL divergence has a closed-form solution:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$$
where $d$ is the dimensionality of $z$. No sampling needed. This is one of the reasons the Gaussian choice is so popular.
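This closed form is a one-liner in practice. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Matching the prior exactly gives zero divergence.
print(kl_to_standard_normal(np.zeros(5), np.ones(5)))  # 0.0
```

Any deviation of $\mu$ from 0 or $\sigma$ from 1 makes this quantity strictly positive.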
The reparameterization trick
To optimize the ELBO with gradient descent, we need gradients with respect to $\phi$ (the encoder parameters). The problem: the reconstruction term involves sampling $z \sim q_\phi(z \mid x)$, and you cannot backpropagate through a random sampling operation.
The reparameterization trick rewrites the sampling as a deterministic function of the parameters plus external noise:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reparameterization: moving randomness outside the learnable path
graph LR
MU["Learnable: mu"] --> Z["z = mu + sigma * epsilon"]
SIGMA["Learnable: sigma"] --> Z
EPS["Fixed: epsilon ~ N(0,I)"] -.-> Z
Z --> DEC["Decoder"]
DEC -->|"Gradient flows back"| Z
Z -->|"dz/dmu = 1"| MU
Z -->|"dz/dsigma = epsilon"| SIGMA
style EPS fill:#f0f0f0,stroke:#999,color:#000
style Z fill:#ff9,stroke:#333,color:#000
Randomness enters only through epsilon, which has no learnable parameters. Gradients pass through the deterministic path from decoder back to mu and sigma.
Now $z$ is a deterministic function of $\mu$, $\sigma$, and $\epsilon$. The randomness is isolated in $\epsilon$, which does not depend on any learnable parameter. Gradients flow through the deterministic path:
graph TD
subgraph Without["Without reparameterization"]
A1["μ, σ"] --> A2["Sample z ~ N(μ,σ²)"]
A2 --> A3["decoder(z)"]
A2 -.->|"✗ no gradient"| A1
end
subgraph With["With reparameterization"]
B1["μ, σ"] --> B3["z = μ + σ⊙ε"]
B2["ε ~ N(0,I)"] -.-> B3
B3 --> B4["decoder(z)"]
B4 -->|"✓ gradient flows"| B3
B3 -->|"✓ ∂z/∂μ = 1"| B1
end
style Without fill:#ffe6e6,stroke:#333,color:#000
style With fill:#e6ffe6,stroke:#333,color:#000
Figure 2: The reparameterization trick. Without it, sampling blocks gradient flow. With it, the sampling is rewritten as a deterministic transformation of learnable parameters plus fixed noise, and gradients flow through normally.
Example 2: Reparameterization in action
Given encoder outputs $\mu$ and $\sigma$, and a noise sample $\epsilon$, the reparameterized sample is:
$$z = \mu + \sigma\,\epsilon$$
The gradients are:
$$\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon$$
If the decoder's loss gives us $\frac{\partial \mathcal{L}}{\partial z}$, then by the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot 1, \qquad \frac{\partial \mathcal{L}}{\partial \sigma} = \frac{\partial \mathcal{L}}{\partial z} \cdot \epsilon$$
Without reparameterization, these gradients would not exist. The sampling operation has no gradient with respect to $\mu$ or $\sigma$ because sampling is not a differentiable function. The reparameterization trick turns it into one.
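The chain-rule argument can be verified numerically. A sketch with made-up values for $\mu$, $\sigma$, and $\epsilon$, comparing the analytic gradients against finite differences:

```python
# Hypothetical values for illustration: encoder outputs and one noise draw.
mu, sigma, eps = 0.5, 0.8, 0.3
z = mu + sigma * eps                     # reparameterized sample

# Analytic gradients: dz/dmu = 1, dz/dsigma = eps. Check with finite differences.
h = 1e-6
dz_dmu_numeric = ((mu + h + sigma * eps) - z) / h
dz_dsigma_numeric = ((mu + (sigma + h) * eps) - z) / h

assert abs(dz_dmu_numeric - 1.0) < 1e-4
assert abs(dz_dsigma_numeric - eps) < 1e-4
```

Because $z$ is now an ordinary differentiable expression, any autodiff framework computes these gradients automatically.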
VAE loss components
The total loss is the negative ELBO:
$$\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

| Term | Formula | Role | Effect When Too High | Effect When Too Low |
|---|---|---|---|---|
| Reconstruction | $-\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ | Measures decoding quality | Blurry outputs (underfitting) | Overfitting to training data |
| KL divergence | $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z))$ | Regularizes latent space | Posterior collapse (ignores $z$) | Irregular latent space, poor generation |
| Total (negative ELBO) | Reconstruction + KL | Balanced training | Model ignores data | Model memorizes data |
VAE loss: two forces in balance
graph LR
LOSS["Total VAE Loss"] --> RECON["Reconstruction Loss (how well does the decoder rebuild x?)"]
LOSS --> KL["KL Divergence (how far is the encoder from the prior?)"]
RECON --> SHARP["Push: make outputs crisp and accurate"]
KL --> SMOOTH["Push: keep latent space smooth and regular"]
style RECON fill:#4a9eff,color:#fff
style KL fill:#ff9,stroke:#333,color:#000
style SHARP fill:#e6f3ff,stroke:#333,color:#000
style SMOOTH fill:#fff3e6,stroke:#333,color:#000
These two forces compete. If reconstruction dominates, the encoder maps each input to a tight, distant point and generation suffers. If KL dominates, the encoder ignores the input and maps everything to the same region (posterior collapse).
Example 1: Computing the ELBO
Suppose for a given input $x$:
- Encoder outputs $\mu = 0.5$, $\sigma = 0.8$ (1D latent space for simplicity)
- Reconstruction loss (negative log-likelihood): 0.3
Step 1: Compute the KL divergence.
$$D_{\mathrm{KL}} = \frac{1}{2}\left(\mu^2 + \sigma^2 - 1 - \log \sigma^2\right) = \frac{1}{2}\left(0.25 + 0.64 - 1 + 0.446\right) \approx 0.168$$
Step 2: Compute the ELBO.
$$\mathrm{ELBO} = -0.3 - 0.168 = -0.468$$
Step 3: The total loss is the negative ELBO: $\mathcal{L} = 0.468$.
The reconstruction loss (0.3) dominates the KL loss (0.168). If we push $\mu$ toward 0 and $\sigma$ toward 1, we reduce the KL term but might hurt reconstruction (the encoder would be forced to map everything to the same region). The ELBO balances these two forces.
Note: if the encoder output were exactly $\mu = 0$, $\sigma = 1$ (matching the prior), then $D_{\mathrm{KL}} = 0$. The KL term penalizes any deviation from the standard Gaussian prior.
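The arithmetic above can be checked in a few lines, assuming $\mu = 0.5$ and $\sigma = 0.8$, a pair consistent with the stated KL of 0.168:

```python
import math

mu, sigma = 0.5, 0.8          # assumed 1D encoder outputs (reproduce the stated KL)
recon_loss = 0.3              # given negative log-likelihood

kl = 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))
elbo = -recon_loss - kl       # ELBO = reconstruction term minus KL
total_loss = -elbo            # training minimizes the negative ELBO

print(round(kl, 3), round(total_loss, 3))  # 0.168 0.468
```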
Posterior collapse
Sometimes the KL term wins too easily. The encoder learns to output $\mu \approx 0$ and $\sigma \approx 1$ for all inputs, making the KL term zero. The decoder then ignores $z$ entirely and generates output based only on its own internal biases.
This is posterior collapse. The latent variable becomes useless. The model degenerates into a standard autoregressive decoder (if the decoder is powerful enough).
Common remedies:
- KL annealing: start training with a small weight on the KL term and gradually increase it to 1.
- Free bits: set a minimum KL value per dimension, so the model is forced to use each latent dimension.
- Weaker decoders: use simpler decoders so the model must rely on $z$ to encode information.
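KL annealing, the first remedy, is simple to implement. A minimal sketch with a hypothetical warm-up length:

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: the weight ramps from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

# Used as: loss = recon_loss + kl_weight(step) * kl_loss
print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
```

Early in training the model is free to use $z$ for reconstruction; the regularization pressure arrives only gradually.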
VAE vs standard autoencoder
A standard autoencoder maps $x \to z \to \hat{x}$ with a deterministic encoder and decoder, minimizing reconstruction loss only. It has no probabilistic interpretation and no regularization of the latent space.
A VAE adds two crucial elements:
- The encoder outputs a distribution (not a point), so the latent space is smooth.
- The KL term forces the latent space to be organized (close to the prior $\mathcal{N}(0, I)$).
These two properties make the VAE latent space meaningful. You can interpolate between points, sample new data, and perform arithmetic in latent space. A standard autoencoder’s latent space has gaps and irregular structure that makes generation unreliable.
Autoencoder vs VAE: point vs distribution
graph TD
subgraph AE["Standard Autoencoder"]
AE_IN["Input x"] --> AE_ENC["Encoder"]
AE_ENC --> AE_Z["Single point z"]
AE_Z --> AE_DEC["Decoder"]
AE_DEC --> AE_OUT["Reconstruction"]
end
subgraph VAE_BLOCK["Variational Autoencoder"]
VAE_IN["Input x"] --> VAE_ENC["Encoder"]
VAE_ENC --> VAE_DIST["Distribution N(mu, sigma)"]
VAE_DIST --> VAE_SAMPLE["Sample z"]
VAE_SAMPLE --> VAE_DEC["Decoder"]
VAE_DEC --> VAE_OUT["Reconstruction"]
end
style AE_Z fill:#ff6b6b,color:#fff
style VAE_DIST fill:#51cf66,color:#fff
style AE fill:#ffe6e6,stroke:#333,color:#000
style VAE_BLOCK fill:#e6ffe6,stroke:#333,color:#000
The autoencoder maps to a single point. Gaps between points in latent space decode to garbage. The VAE maps to a distribution, and the KL penalty forces these distributions to overlap. Every region of the latent space decodes to something meaningful.
Latent space properties
Figure 3: 2D latent space of a VAE trained on digits 0 to 4. Each cluster corresponds to a different digit class, with smooth interpolation between them.
Because the KL term pushes $q_\phi(z \mid x)$ toward $\mathcal{N}(0, I)$, the latent space is:
- Continuous: nearby points in $z$-space decode to similar outputs.
- Complete: every point in $z$-space decodes to something plausible.
These properties enable interpolation: given two data points $x_1$ and $x_2$, encode them to $z_1$ and $z_2$, interpolate in latent space, and decode. The intermediate points should smoothly transition between the two inputs.
Example 3: Latent space interpolation
Given two latent codes $z_a$ and $z_b$, linear interpolation at 5 equally spaced points ($t = 0, 0.25, 0.5, 0.75, 1$) is:
$$z(t) = (1 - t)\, z_a + t\, z_b$$

| $t$ | $z(t)$ |
|---|---|
| 0.00 | $z_a$ |
| 0.25 | $0.75\, z_a + 0.25\, z_b$ |
| 0.50 | $0.5\, z_a + 0.5\, z_b$ |
| 0.75 | $0.25\, z_a + 0.75\, z_b$ |
| 1.00 | $z_b$ |

If $z_a$ encodes a smiling face and $z_b$ encodes a frowning face, the intermediate points should show gradually changing expressions. The smoothness of this transition is a direct consequence of the KL regularization.
In practice, spherical interpolation (slerp) often works better than linear interpolation because high-dimensional Gaussians concentrate their mass on a thin shell, not near the origin. Linear interpolation passes through the low-density center.
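A sketch of both interpolation schemes in numpy, assuming nothing beyond the formulas above (the 512-dimensional codes are random stand-ins for encoder outputs):

```python
import numpy as np

def lerp(za, zb, t):
    """Linear interpolation between two latent codes."""
    return (1 - t) * za + t * zb

def slerp(za, zb, t):
    """Spherical interpolation: move along the arc between za and zb."""
    omega = np.arccos(np.clip(
        np.dot(za, zb) / (np.linalg.norm(za) * np.linalg.norm(zb)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(za, zb, t)  # (nearly) parallel vectors: fall back to lerp
    return (np.sin((1 - t) * omega) * za + np.sin(t * omega) * zb) / np.sin(omega)

# In high dimensions, lerp's midpoint shrinks toward the low-density origin,
# while slerp's midpoint keeps roughly the norm of the endpoints.
rng = np.random.default_rng(0)
za, zb = rng.standard_normal(512), rng.standard_normal(512)
print(np.linalg.norm(lerp(za, zb, 0.5)) < np.linalg.norm(slerp(za, zb, 0.5)))  # True
```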
Training in practice
A typical VAE training loop:
- Sample a mini-batch of data $\{x_i\}$.
- For each $x_i$, encode to get $\mu_i$ and $\sigma_i$.
- Sample $\epsilon_i \sim \mathcal{N}(0, I)$ and compute $z_i = \mu_i + \sigma_i \odot \epsilon_i$.
- Decode $z_i$ to get $\hat{x}_i$.
- Compute the reconstruction loss and KL loss.
- Backpropagate through the entire network (reparameterization makes this possible).
- Update encoder and decoder parameters with Adam or another optimizer.
The reconstruction loss depends on the data type. For binary data (like binarized MNIST), use binary cross-entropy. For continuous data, use MSE or a Gaussian log-likelihood.
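Steps 1 through 5 of this loop can be sketched in numpy. This is a forward-pass sketch only, with hypothetical layer sizes and untrained random weights; steps 6 and 7 (backpropagation and the parameter update) are what an autodiff framework handles:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, Z = 784, 128, 16               # hypothetical sizes: input, hidden, latent

# Toy parameters; a real model would learn these.
W_enc = rng.standard_normal((D, H)) * 0.01
W_mu = rng.standard_normal((H, Z)) * 0.01
W_logvar = rng.standard_normal((H, Z)) * 0.01
W_dec = rng.standard_normal((Z, D)) * 0.01

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def vae_forward_loss(x):
    # Encode: predict mean and log-variance (log-variance keeps sigma positive).
    h = np.tanh(x @ W_enc)
    mu, logvar = h @ W_mu, h @ W_logvar
    # Reparameterize: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # Decode to Bernoulli pixel probabilities.
    x_hat = sigmoid(z @ W_dec)
    # Reconstruction loss: binary cross-entropy, summed over pixels.
    recon = -np.sum(x * np.log(x_hat + 1e-9)
                    + (1 - x) * np.log(1 - x_hat + 1e-9), axis=1)
    # Closed-form KL to the standard normal prior, summed over latent dimensions.
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=1)
    return np.mean(recon + kl)       # negative ELBO, averaged over the batch

x = (rng.random((32, D)) > 0.5).astype(float)   # fake binary mini-batch
loss = vae_forward_loss(x)
```

In a framework like PyTorch or JAX, calling backward/grad on this scalar loss completes steps 6 and 7.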
Common extensions
β-VAE: scales the KL term by a factor $\beta > 1$ to encourage disentanglement in the latent space. Each latent dimension should capture an independent factor of variation.
VQ-VAE: replaces the continuous latent space with a discrete codebook. Instead of sampling from a Gaussian, the encoder picks the nearest codebook vector. This avoids posterior collapse and produces sharper outputs.
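The nearest-codebook lookup at the heart of VQ-VAE can be sketched in a few lines of numpy (the codebook and encoder outputs here are made-up toy values):

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each encoder output with its nearest codebook vector."""
    # Squared distances between every encoder vector and every codebook entry.
    d = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = np.argmin(d, axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.9, 1.1], [0.1, -0.2]])
z_q, idx = quantize(z_e, codebook)
print(idx)  # [1 0]
```

The argmin is not differentiable, so VQ-VAE trains with a straight-through gradient estimator and extra codebook losses; the lookup itself is just nearest-neighbor search.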
Conditional VAE: conditions both encoder and decoder on additional information (class label, text, etc.). Generates data conditioned on a specific attribute.
Hierarchical VAE: uses multiple layers of latent variables, with each layer capturing structure at a different scale. This is the idea behind models like NVAE and VDVAE that produce high-quality images.
Why VAE outputs tend to be blurry
VAEs trained with pixel-wise MSE or Gaussian log-likelihood produce blurry images. This is not a bug; it is a direct consequence of the objective. When the decoder must explain the data with a single Gaussian per pixel, it averages over all plausible values.
Consider a dataset where a pixel is sometimes black and sometimes white. The optimal Gaussian mean is gray, the average. The model assigns maximum likelihood to the mean, producing a blurry compromise rather than a crisp sample.
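A tiny numpy illustration of this averaging effect, with made-up pixel data:

```python
import numpy as np

# A pixel that is black (0.0) in half the images and white (1.0) in the other half.
pixel_values = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

# Under a Gaussian likelihood (MSE), the best single prediction is the mean: gray.
best_constant = pixel_values.mean()
print(best_constant)  # 0.5

# Any crisp guess (0.0 or 1.0) has higher MSE than the blurry average.
mse = lambda c: np.mean((pixel_values - c) ** 2)
assert mse(best_constant) < mse(0.0) and mse(best_constant) < mse(1.0)
```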
Possible fixes include using more expressive decoders (autoregressive decoders, PixelCNN), using perceptual loss instead of pixel-wise loss, or switching to a discrete latent space (VQ-VAE). Each approach trades some of the VAE’s simplicity for sharper outputs.
Evaluating VAEs
ELBO as a metric. The ELBO itself serves as a training metric, but it is a lower bound, not the true log-likelihood. Two models with the same ELBO might have very different true likelihoods if one has a tighter bound.
Importance-weighted estimate. To get a tighter bound, use importance weighting with $K$ samples:
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right]$$
This is called the IWAE bound. As $K$ increases, it approaches the true log-likelihood.
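A toy sanity check, under assumptions not in the article: a 1D model with $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, so that $\log p(x)$ is known exactly (marginally $p(x) = \mathcal{N}(0, 2)$), and a deliberately mismatched proposal $q(z \mid x) = \mathcal{N}(0, 1)$. The $K = 1$ bound is the plain ELBO; larger $K$ tightens it:

```python
import numpy as np

rng = np.random.default_rng(0)

x = 1.0
# Marginally p(x) = N(0, 2), so log p(x) is available in closed form.
true_log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def log_w(z):
    # log [ p(x|z) p(z) / q(z) ]; with q = p(z) = N(0,1), only log p(x|z) remains.
    return -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2.0

def iwae_bound(K, n_outer=5000):
    """Monte Carlo estimate of the K-sample importance-weighted bound."""
    lw = log_w(rng.standard_normal((n_outer, K)))
    m = lw.max(axis=1, keepdims=True)            # stabilized log-mean-exp
    return float(np.mean(m[:, 0] + np.log(np.mean(np.exp(lw - m), axis=1))))

elbo = iwae_bound(K=1)     # K = 1 recovers the plain ELBO
tight = iwae_bound(K=10)   # larger K gives a tighter bound
print(round(elbo, 2), round(tight, 2))
```

With the mismatched proposal the single-sample ELBO sits well below $\log p(x)$, while the $K = 10$ estimate closes most of the gap.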
Reconstruction quality. Encode test images, decode them, and visually inspect. Good VAEs preserve identity and structure, even if details are smoothed.
Latent space structure. Encode the test set, color-code by class, and plot in 2D (using PCA or t-SNE if the latent dimension is greater than 2). A well-trained VAE clusters similar data together while keeping the overall distribution close to the prior.
What comes next
VAEs give us a principled way to learn latent representations and generate data, but their outputs tend to be blurry because the reconstruction loss (often MSE) averages over possible outputs. The next step is to look at models that can produce sharper results. GANs take a completely different approach: instead of maximizing likelihood, they train a generator to fool a discriminator, and the adversarial game produces remarkably sharp samples.