
Generative models: an overview

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.

Discriminative models learn to distinguish between classes. Generative models learn to create new data. That difference sounds simple, but it changes everything about how you train, evaluate, and use the model.

Discriminative vs generative

A discriminative model learns P(y \mid x): given an input, what is the label? A logistic regression classifier, an image classifier, a sentiment analyzer. These models draw decision boundaries.

A generative model learns P(x): what does the data itself look like? Or equivalently, it learns the joint P(x, y) and can derive P(x \mid y) to generate data conditioned on a class. If you train a generative model on face images, you can sample new faces that never existed.

Why is generative modeling harder? Discriminative models only need to learn the boundary between classes. Generative models need to learn the full distribution of the data, which lives in a much higher-dimensional space. A 256x256 RGB image has 196,608 dimensions. Learning a distribution over that space is extremely challenging.
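The dimension count is plain arithmetic, and writing it out makes the scale concrete:

```python
# A 256x256 RGB image, flattened, is a point in a ~200,000-dimensional
# space: one dimension per pixel per color channel.
height, width, channels = 256, 256, 3
dims = height * width * channels
print(dims)  # 196608
```

A generative model must assign probability mass over this entire space, not just separate it with a boundary.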

At a glance: four generative model families

| Model | Approach | Strengths | Weaknesses |
| --- | --- | --- | --- |
| VAE | Encode input to latent space, decode back | Smooth latent space, fast sampling | Blurry outputs |
| GAN | Generator vs discriminator game | Sharp, realistic samples | Unstable training, mode collapse |
| Autoregressive | Predict one element at a time | Exact likelihood, stable training | Slow sequential generation |
| Flow-based | Invertible transformations | Exact likelihood, fast sampling | Restrictive architecture constraints |

Each family makes different tradeoffs between sample quality, training stability, and computational cost. The sections below explain how.

Three families of generative models

Generative models split into three broad families based on how they represent or approximate P(x).

graph TD
  G["Generative Models"] --> AR["Autoregressive"]
  G --> LV["Latent Variable"]
  G --> IMP["Implicit"]
  AR --> AR1["PixelRNN/CNN"]
  AR --> AR2["GPT"]
  AR --> AR3["WaveNet"]
  LV --> LV1["VAE"]
  LV --> LV2["Flow-based"]
  LV --> LV3["Diffusion"]
  IMP --> IMP1["GAN"]
  IMP --> IMP2["Implicit VAE"]
  style G fill:#ff9,stroke:#333,color:#000

Figure 1: Taxonomy of generative model families. Autoregressive models factor the joint distribution into conditionals. Latent variable models introduce hidden variables. Implicit models learn to sample without an explicit density.

Autoregressive models

These factor the joint distribution using the chain rule of probability:

P(x) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_{<T})

Each factor is a conditional distribution, and the model learns each one. Generation proceeds left to right: sample x_1, then x_2 given x_1, and so on.

Strengths: Exact likelihood computation. Stable training. No mode collapse. Weaknesses: Sequential generation is slow. No latent representation.

Examples: GPT (text), PixelCNN (images), WaveNet (audio).
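The chain-rule factorization can be made concrete with the simplest possible autoregressive model: a bigram model, where each factor conditions only on the previous symbol. The probability tables below are made up for illustration:

```python
import math

# Toy character-level autoregressive model over the alphabet {a, b}.
# P(x_1) and P(x_t | x_{t-1}) are hand-picked tables; a real model
# would parameterize these conditionals with a neural network.
p_first = {"a": 0.6, "b": 0.4}
p_next = {
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.5, "b": 0.5},
}

def log_likelihood(seq):
    """log P(x) = log P(x_1) + sum_t log P(x_t | x_{t-1})."""
    logp = math.log(p_first[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(p_next[prev][cur])
    return logp

print(log_likelihood("aab"))  # log(0.6) + log(0.7) + log(0.3)
```

Because the factorization is exact, the model assigns a well-defined probability to every sequence — this is what "exact likelihood" means in the strengths list above.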

Autoregressive vs latent variable models

graph TD
  A["Autoregressive"] --> B["Models P(x) directly
Factors into conditionals"]
  B --> C["Generates left to right
Each token depends on
all previous tokens"]
  D["Latent Variable"] --> E["Introduces hidden z
Models P(x) = integral P(x|z)P(z)"]
  E --> F["Generates by sampling z
then decoding to x
in one pass"]

Latent variable models

These introduce a hidden variable z and model:

P(x) = \int P(x \mid z) \, P(z) \, dz

The latent z captures the underlying structure. A face image might have latent variables for pose, lighting, and expression. You generate by sampling z from the prior P(z) and then sampling x from P(x \mid z).

The integral is usually intractable, so you approximate it. VAEs use variational inference. Flow-based models use invertible transformations to make it exact. Diffusion models gradually add and remove noise.
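The simplest approximation of the integral is Monte Carlo: sample z from the prior and average P(x | z). A toy one-dimensional model (my choice for illustration) makes the estimate checkable against the exact answer:

```python
import math
import random

random.seed(0)

# Toy latent variable model (an assumption for illustration):
#   z ~ N(0, 1)           (prior)
#   x | z ~ N(z, 0.25)    (fixed-variance "decoder")
# The marginal is then analytically x ~ N(0, 1.25), so we can check
# the Monte Carlo estimate P(x) ~ (1/K) * sum_k P(x | z_k), z_k ~ P(z).

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 0.5
K = 200_000
estimate = sum(normal_pdf(x, random.gauss(0, 1), 0.25) for _ in range(K)) / K
exact = normal_pdf(x, 0.0, 1.25)
print(estimate, exact)  # the two values agree closely
```

In one dimension this works fine; in the high-dimensional spaces real models live in, naive Monte Carlo is hopeless, which is exactly why VAEs resort to variational inference.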

The generative process with latent variables

graph LR
  A["Sample z
from prior P(z)"] --> B["Decoder network
P(x | z)"]
  B --> C["Generated sample x"]
  D["Prior: simple distribution
e.g. standard normal"] --> A

You sample a latent code z from a simple distribution (usually a Gaussian), feed it through a decoder network, and get a data sample. The decoder learns to map points in latent space to realistic data. Nearby points in latent space produce similar outputs.

Strengths: Meaningful latent space. Fast parallel generation (for some variants). Weaknesses: Training can be unstable. Approximate inference introduces error.
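The two-step generative process in Figure form above is short enough to sketch end to end. The decoder here is a hypothetical fixed affine map standing in for a trained network:

```python
import random

random.seed(1)

# Minimal sketch of latent-variable generation: sample z from a
# standard normal prior, then map it through a "decoder". A real
# decoder is a trained neural network; this fixed affine map is a
# stand-in so the pipeline is runnable.
def decoder(z):
    return [2.0 * z + 1.0, -0.5 * z]   # latent scalar -> 2-D "data"

z = random.gauss(0.0, 1.0)        # 1. sample from the prior P(z)
x = decoder(z)                    # 2. decode to a data sample
x_near = decoder(z + 0.01)        # nearby latent points...
print(x, x_near)                  # ...decode to nearby data points
```

The last two lines illustrate the smoothness property: because the decoder is continuous, small moves in latent space produce small moves in data space.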

Implicit models (GANs)

Generative Adversarial Networks do not model P(x) at all. Instead, they train a generator G that transforms noise z into samples, and a discriminator D that tries to distinguish real from generated data. The two play a minimax game:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]

The generator never computes a likelihood. It just learns to fool the discriminator.

Strengths: Often produces the sharpest, most realistic samples. Weaknesses: No likelihood. Training is notoriously unstable. Mode collapse.
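The minimax objective turns into two concrete loss functions, one per player. A sketch with toy discriminator outputs (D(x) is the probability the discriminator assigns to "real"); the non-saturating generator loss is the standard practical variant rather than the literal minimax term:

```python
import math

def d_loss(d_real, d_fake):
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # equivalently, it minimizes the negation.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    # Non-saturating generator loss: maximize log D(G(z)) instead of
    # minimizing log(1 - D(G(z))) for stronger early gradients.
    return -math.log(d_fake)

# Early in training the discriminator wins easily (d_fake near 0),
# so the generator loss is large:
print(d_loss(d_real=0.9, d_fake=0.1), g_loss(d_fake=0.1))
# At the game's equilibrium D outputs 0.5 everywhere:
print(d_loss(d_real=0.5, d_fake=0.5))  # 2*log(2) ~ 1.386
```

Note that neither loss ever touches a likelihood: the only training signal the generator receives is the discriminator's opinion.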

Training procedures compared

graph LR
  subgraph Autoregressive
      A1["Input x"] --> A2["Predict x_t given x < t"]
      A2 --> A3["Cross-entropy loss"]
  end
  subgraph VAE
      B1["Input x"] --> B2["Encode → μ, σ"]
      B2 --> B3["Sample z"]
      B3 --> B4["Decode → x̂"]
      B4 --> B5["Reconstruction + KL loss"]
  end
  subgraph GAN
      C1["Noise z"] --> C2["Generator → fake x"]
      C2 --> C3["Discriminator: real or fake?"]
      C3 --> C4["Adversarial loss"]
  end

Figure 2: Training procedures. Autoregressive models maximize next-token likelihood. VAEs minimize reconstruction error plus KL divergence. GANs play a minimax game between generator and discriminator.

Evaluating generative models

Evaluation is one of the hardest parts of generative modeling. There is no single metric that captures everything. Here are the most common ones.

Log-likelihood. Measures how probable the real data is under the model. Higher is better. Only available for models with explicit densities (autoregressive, VAE, flows). You compute:

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i)

Maximum likelihood: the training idea

graph LR
  A["Real data
distribution"] --> B["Compare"]
  C["Model distribution
P_theta"] --> B
  B --> D["Adjust theta to
minimize gap"]
  D --> C

Maximum likelihood training adjusts the model parameters until the model’s distribution assigns high probability to the real data. You measure the gap with log-likelihood, and gradient descent closes it iteratively.
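The loop in the diagram can be run in the smallest possible setting: fitting the mean theta of a unit-variance Gaussian by gradient descent on the negative log-likelihood. The gradient of the average NLL with respect to theta is simply (theta - mean of the data), so each step pulls theta toward the sample mean, which is the known maximum likelihood solution:

```python
# Maximum likelihood by gradient descent on a toy model: a Gaussian
# with unknown mean theta and fixed variance 1. The data below is
# made up for illustration; its sample mean is 2.1.
data = [1.8, 2.1, 2.4, 1.9, 2.3]
theta = 0.0
lr = 0.1
for _ in range(200):
    grad = theta - sum(data) / len(data)   # d/dtheta of average NLL
    theta -= lr * grad
print(theta)  # converges to the sample mean 2.1
```

Real generative models replace the single scalar theta with millions of network weights, but the loop — score the data, compute a gradient, close the gap — is the same.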

Frechet Inception Distance (FID). Compares the distribution of generated images to real images in the feature space of a pretrained Inception network. Fits a Gaussian to each set of features and computes:

\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)

Lower FID means the generated distribution is closer to the real one.

Inception Score (IS). Measures both quality (each image should look like a clear class) and diversity (the set of images should cover many classes). Higher is better.

Precision and Recall. Precision measures what fraction of generated samples fall within the real data distribution (quality). Recall measures what fraction of the real distribution is covered by generated samples (diversity).

| Metric | What It Measures | Needs Real Data? | Captures Diversity? | Captures Quality? | Available For |
| --- | --- | --- | --- | --- | --- |
| Log-likelihood | How probable real data is | Yes | Partially | Partially | Autoregressive, VAE, flows |
| FID | Distribution distance | Yes | Yes | Yes | All (needs samples) |
| IS | Quality + diversity | No (uses classifier) | Yes | Yes | Image models |
| Precision/Recall | Quality / coverage | Yes | Recall only | Precision only | All (needs samples) |

Example 1: FID between two 1D Gaussians

Suppose the real data follows \mathcal{N}(0, 1) and the generated data follows \mathcal{N}(0.5, 1.44) (i.e., \mu_g = 0.5, \sigma_g^2 = 1.44, so \sigma_g = 1.2).

In one dimension, the FID formula simplifies. The covariance matrices are scalars (\sigma^2), and the matrix square root is just \sigma:

\text{FID} = (\mu_r - \mu_g)^2 + \sigma_r^2 + \sigma_g^2 - 2\,\sigma_r\,\sigma_g

Plugging in:

\text{FID} = (0 - 0.5)^2 + 1^2 + 1.2^2 - 2(1)(1.2) = 0.25 + 1 + 1.44 - 2.4 = 0.29

The FID is 0.29. If the generated distribution were exactly \mathcal{N}(0, 1), FID would be 0. The nonzero value comes from both the shifted mean (contributes 0.25) and the different variance (contributes 0.04).
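The same calculation in code, including the decomposition into the mean and variance contributions:

```python
import math

# 1-D FID between N(mu_r, var_r) and N(mu_g, var_g): in one dimension
# the trace term collapses to var_r + var_g - 2*sqrt(var_r*var_g).
def fid_1d(mu_r, var_r, mu_g, var_g):
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2 * math.sqrt(var_r * var_g)

fid = fid_1d(0.0, 1.0, 0.5, 1.44)
print(round(fid, 2))  # 0.29
# Mean shift alone, then variance mismatch alone:
print(round(fid_1d(0.0, 1.0, 0.5, 1.0), 2), round(fid_1d(0.0, 1.0, 0.0, 1.44), 2))  # 0.25 0.04
```

The two partial results confirm the decomposition stated above: 0.25 from the shifted mean plus 0.04 from the variance mismatch gives 0.29.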

Example 2: Temperature in autoregressive sampling

A character-level language model predicts the next character from vocabulary \{a, b, c\} with logits [2.0, 1.0, 0.5].

Temperature T = 1.0 (standard softmax):

P_i = \frac{e^{l_i / 1.0}}{\sum_j e^{l_j / 1.0}}

e^{2.0} = 7.389, \quad e^{1.0} = 2.718, \quad e^{0.5} = 1.649

Z = 7.389 + 2.718 + 1.649 = 11.756

P = [0.628,\; 0.231,\; 0.140]

Temperature T = 0.5 (sharper):

P_i = \frac{e^{l_i / 0.5}}{\sum_j e^{l_j / 0.5}} = \frac{e^{2 l_i}}{\sum_j e^{2 l_j}}

e^{4.0} = 54.598, \quad e^{2.0} = 7.389, \quad e^{1.0} = 2.718

Z = 54.598 + 7.389 + 2.718 = 64.705

P = [0.844,\; 0.114,\; 0.042]

Lower temperature makes the distribution sharper. At T = 0.5, character “a” gets 84.4% probability instead of 62.8%. Sampling at low temperature produces more predictable, repetitive text. High temperature produces more diverse but less coherent text.

To sample a 3-character sequence, you repeat this process: sample the first character, feed it back as context, compute new logits for the second character, sample, and continue.
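The temperature computation above is easy to check directly. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax_with_temperature(logits, T):
    scaled = [l / T for l in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    Z = sum(exps)
    return [e / Z for e in exps]

logits = [2.0, 1.0, 0.5]                   # model's scores for {a, b, c}
print(softmax_with_temperature(logits, 1.0))  # close to [0.628, 0.231, 0.140]
print(softmax_with_temperature(logits, 0.5))  # close to [0.844, 0.114, 0.042]
```

To turn this into generation, you would draw a character from the returned distribution, append it to the context, and ask the model for fresh logits — exactly the loop described next.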

Example 3: Mode dropping and evaluation metrics

Consider a real distribution with two clusters: half the data comes from cluster A (cats) and half from cluster B (dogs).

| Model | Behavior | Log-likelihood | FID | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| Model 1 | Generates both cats and dogs accurately | High | Low (good) | High | High |
| Model 2 | Generates only cats, but perfect cats | Moderate (zero prob on dogs) | Moderate | High | Low (0.5) |
| Model 3 | Generates blurry images covering both | Low (spread too thin) | Moderate | Low | High |

Model 2 demonstrates mode dropping: it ignores half the distribution. Its precision is high (every generated image is a valid cat), but its recall is only 0.5 (dogs are never generated). Log-likelihood on dog images is extremely low, which penalizes the overall score. FID captures the mismatch in means.

Model 3 demonstrates the opposite problem: it covers everything but nothing looks sharp. No single metric tells the full story. That is why researchers report multiple metrics.
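Model 2's numbers can be reproduced in a toy setting. Here the real data lives in two 1-D clusters, the generator only produces one of them, and "support" is approximated by a fixed radius around each sample — a crude stand-in for the nearest-neighbor supports used by the actual precision/recall-for-distributions metrics:

```python
# Toy mode dropping: real data has cats (near 0) and dogs (near 10);
# the generator produces cats only.
real = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]     # half cats, half dogs
generated = [0.05, -0.05, 0.02, 0.08]         # cats only
radius = 0.5                                  # crude support estimate

def in_support(point, reference):
    return any(abs(point - r) <= radius for r in reference)

# Precision: fraction of generated samples inside the real support.
precision = sum(in_support(g, real) for g in generated) / len(generated)
# Recall: fraction of real samples inside the generated support.
recall = sum(in_support(r, generated) for r in real) / len(real)
print(precision, recall)  # 1.0 0.5
```

Every generated sample is a valid cat (precision 1.0), but the dog half of the distribution is never covered (recall 0.5) — the Model 2 row of the table in miniature.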

Figure 3: Sample quality comparison by FID score (lower is better).

Generative model comparison

| Type | Explicit Density | Training Objective | Sample Quality | Inference Speed | Stability | Representative Model |
| --- | --- | --- | --- | --- | --- | --- |
| Autoregressive | Yes | Max likelihood | Good | Slow (sequential) | Stable | GPT, PixelCNN |
| VAE | Yes (lower bound) | ELBO | Moderate (blurry) | Fast (single pass) | Stable | VAE, VQ-VAE |
| Flow-based | Yes (exact) | Max likelihood | Good | Fast (single pass) | Stable | Glow, RealNVP |
| GAN | No | Adversarial game | Excellent | Fast (single pass) | Unstable | StyleGAN, BigGAN |
| Diffusion | Yes (lower bound) | Denoising score | Excellent | Slow (many steps) | Stable | DDPM, Stable Diffusion |

Why generative modeling is hard

Three fundamental challenges make generative modeling difficult:

High dimensionality. A 256x256 image has ~200,000 dimensions. The model must learn correlations between all of them. A pixel in the top-left corner of a face image is correlated with a pixel in the bottom-right corner (both are part of the same face), and the model needs to capture this.

Evaluation. Unlike classification where accuracy is clear, there is no single number that says “this generative model is good.” You need multiple metrics, and they often disagree.

Training instability. GANs suffer from mode collapse (generator ignores parts of the distribution) and training oscillation (generator and discriminator chase each other). VAEs can suffer from posterior collapse (the model ignores the latent variable). Even autoregressive models face exposure bias.

Despite these challenges, generative models have become remarkably capable. Large language models generate coherent text. Diffusion models generate photorealistic images. The field continues to advance rapidly.

Likelihood-based vs likelihood-free

One useful way to categorize models is by whether they provide an explicit density.

Likelihood-based models (autoregressive, VAE, flows) can compute P_\theta(x) or a lower bound on it. This lets you do model comparison: train two models, pick the one that assigns higher likelihood to a held-out test set. You can also detect out-of-distribution data by checking if new inputs have low likelihood.
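A minimal sketch of held-out model comparison, using two candidate Gaussian models scoring data drawn from a standard normal (all distributions chosen here for illustration):

```python
import math
import random

random.seed(0)

# Average log-likelihood of a dataset under a Gaussian N(mean, var).
def avg_loglik(data, mean, var):
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in data
    ) / len(data)

held_out = [random.gauss(0.0, 1.0) for _ in range(5000)]
model_a = avg_loglik(held_out, mean=0.0, var=1.0)   # well matched
model_b = avg_loglik(held_out, mean=2.0, var=1.0)   # mean is off
print(model_a > model_b)  # True: the held-out data prefers model A
```

No such comparison is possible for a GAN: with samples only, you are back to FID-style distances between sample sets.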

Likelihood-free models (GANs) cannot compute P_\theta(x). You can only draw samples. This makes evaluation harder, but GANs compensate with exceptional sample quality. The adversarial training signal is a different kind of feedback than maximum likelihood, and it often produces sharper images.

Some hybrid approaches exist. Adversarial autoencoders combine VAE-like structure with GAN-like training. Wasserstein GANs reformulate the adversarial objective using optimal transport, which provides a more stable training signal.

The mode coverage vs quality tradeoff

A recurring theme across all generative model families is the tension between coverage and quality.

Mode coverage means the model generates samples from all parts of the real distribution. A face generator with good coverage produces faces of all ages, ethnicities, and expressions.

Sample quality means each individual sample looks realistic. A model with high quality generates sharp, convincing images, but it might only produce a narrow range of faces.

Autoregressive models and VAEs tend toward good coverage but sometimes lower quality (blurry samples for VAEs). GANs tend toward high quality but sometimes poor coverage (mode collapse). Diffusion models have recently shown that you can achieve both, at the cost of slow sampling.

Understanding this tradeoff helps you pick the right model for your application. If you need diversity (drug discovery, creative tools), prioritize coverage. If you need realism (photo editing, super-resolution), prioritize quality.

What comes next

This article gave you the landscape. Now we will go deep into specific generative model families. We start with one of the earliest successful approaches: Restricted Boltzmann Machines. They introduced ideas about energy-based modeling and unsupervised feature learning that influenced everything that came after.
