
Generative models: an overview

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.

Discriminative models learn to distinguish between classes. Generative models learn to create new data. That difference sounds simple, but it changes everything about how you train, evaluate, and use the model.

Discriminative vs generative

A discriminative model learns P(y \mid x): given an input, what is the label? A logistic regression classifier, an image classifier, a sentiment analyzer. These models draw decision boundaries.

A generative model learns P(x): what does the data itself look like? Or equivalently, it learns the joint P(x, y) and can derive P(x \mid y) to generate data conditioned on a class. If you train a generative model on face images, you can sample new faces that never existed.

Why is generative modeling harder? Discriminative models only need to learn the boundary between classes. Generative models need to learn the full distribution of the data, which lives in a much higher-dimensional space. A 256x256 RGB image has 196,608 dimensions. Learning a distribution over that space is extremely challenging.
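The dimension count is plain arithmetic, and writing it out makes the scale concrete:

```python
# A 256x256 RGB image, flattened, is a point in a ~200,000-dimensional
# space: one dimension per pixel per color channel.
height, width, channels = 256, 256, 3
dims = height * width * channels
print(dims)  # 196608
```

A generative model must assign probability mass over this entire space, not just separate it with a boundary.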

At a glance: four generative model families

| Model | Approach | Strengths | Weaknesses |
| --- | --- | --- | --- |
| VAE | Encode input to latent space, decode back | Smooth latent space, fast sampling | Blurry outputs |
| GAN | Generator vs discriminator game | Sharp, realistic samples | Unstable training, mode collapse |
| Autoregressive | Predict one element at a time | Exact likelihood, stable training | Slow sequential generation |
| Flow-based | Invertible transformations | Exact likelihood, fast sampling | Restrictive architecture constraints |

Each family makes different tradeoffs between sample quality, training stability, and computational cost. The sections below explain how.

Three families of generative models

Generative models split into three broad families based on how they represent or approximate P(x).

graph TD
  G["Generative Models"] --> AR["Autoregressive"]
  G --> LV["Latent Variable"]
  G --> IMP["Implicit"]
  AR --> AR1["PixelRNN/CNN"]
  AR --> AR2["GPT"]
  AR --> AR3["WaveNet"]
  LV --> LV1["VAE"]
  LV --> LV2["Flow-based"]
  LV --> LV3["Diffusion"]
  IMP --> IMP1["GAN"]
  IMP --> IMP2["Implicit VAE"]
  style G fill:#ff9,stroke:#333,color:#000

Figure 1: Taxonomy of generative model families. Autoregressive models factor the joint distribution into conditionals. Latent variable models introduce hidden variables. Implicit models learn to sample without an explicit density.

Autoregressive models

These factor the joint distribution using the chain rule of probability:

P(x) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_{<T})

Each factor is a conditional distribution, and the model learns each one. Generation proceeds left to right: sample x_1, then x_2 given x_1, and so on.

Strengths: Exact likelihood computation. Stable training. No mode collapse. Weaknesses: Sequential generation is slow. No latent representation.

Examples: GPT (text), PixelCNN (images), WaveNet (audio).
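The chain-rule factorization can be made concrete with the simplest possible autoregressive model: a bigram model, where each factor conditions only on the previous symbol. The probability tables below are made up for illustration:

```python
import math

# Toy character-level autoregressive model over the alphabet {a, b}.
# P(x_1) and P(x_t | x_{t-1}) are hand-picked tables; a real model
# would parameterize these conditionals with a neural network.
p_first = {"a": 0.6, "b": 0.4}
p_next = {
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.5, "b": 0.5},
}

def log_likelihood(seq):
    """log P(x) = log P(x_1) + sum_t log P(x_t | x_{t-1})."""
    logp = math.log(p_first[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(p_next[prev][cur])
    return logp

print(log_likelihood("aab"))  # log(0.6) + log(0.7) + log(0.3)
```

Because the factorization is exact, the model assigns a well-defined probability to every sequence — this is what "exact likelihood" means in the strengths list above.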

Autoregressive vs latent variable models

graph TD
  A["Autoregressive"] --> B["Models P(x) directly
Factors into conditionals"]
  B --> C["Generates left to right
Each token depends on
all previous tokens"]
  D["Latent Variable"] --> E["Introduces hidden z
Models P(x) = integral P(x|z)P(z)"]
  E --> F["Generates by sampling z
then decoding to x
in one pass"]

Latent variable models

These introduce a hidden variable z and model:

P(x) = \int P(x \mid z) \, P(z) \, dz

The latent z captures the underlying structure. A face image might have latent variables for pose, lighting, and expression. You generate by sampling z from the prior P(z) and then sampling x from P(x \mid z).

The integral is usually intractable, so you approximate it. VAEs use variational inference. Flow-based models use invertible transformations to make it exact. Diffusion models gradually add and remove noise.
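The simplest approximation of the integral is Monte Carlo: sample z from the prior and average P(x | z). A toy one-dimensional model (my choice for illustration) makes the estimate checkable against the exact answer:

```python
import math
import random

random.seed(0)

# Toy latent variable model (an assumption for illustration):
#   z ~ N(0, 1)           (prior)
#   x | z ~ N(z, 0.25)    (fixed-variance "decoder")
# The marginal is then analytically x ~ N(0, 1.25), so we can check
# the Monte Carlo estimate P(x) ~ (1/K) * sum_k P(x | z_k), z_k ~ P(z).

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 0.5
K = 200_000
estimate = sum(normal_pdf(x, random.gauss(0, 1), 0.25) for _ in range(K)) / K
exact = normal_pdf(x, 0.0, 1.25)
print(estimate, exact)  # the two values agree closely
```

In one dimension this works fine; in the high-dimensional spaces real models live in, naive Monte Carlo is hopeless, which is exactly why VAEs resort to variational inference.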

The generative process with latent variables

graph LR
  A["Sample z
from prior P(z)"] --> B["Decoder network
P(x | z)"]
  B --> C["Generated sample x"]
  D["Prior: simple distribution
e.g. standard normal"] --> A

You sample a latent code z from a simple distribution (usually a Gaussian), feed it through a decoder network, and get a data sample. The decoder learns to map points in latent space to realistic data. Nearby points in latent space produce similar outputs.

Strengths: Meaningful latent space. Fast parallel generation (for some variants). Weaknesses: Training can be unstable. Approximate inference introduces error.
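The two-step generative process in Figure form above is short enough to sketch end to end. The decoder here is a hypothetical fixed affine map standing in for a trained network:

```python
import random

random.seed(1)

# Minimal sketch of latent-variable generation: sample z from a
# standard normal prior, then map it through a "decoder". A real
# decoder is a trained neural network; this fixed affine map is a
# stand-in so the pipeline is runnable.
def decoder(z):
    return [2.0 * z + 1.0, -0.5 * z]   # latent scalar -> 2-D "data"

z = random.gauss(0.0, 1.0)        # 1. sample from the prior P(z)
x = decoder(z)                    # 2. decode to a data sample
x_near = decoder(z + 0.01)        # nearby latent points...
print(x, x_near)                  # ...decode to nearby data points
```

The last two lines illustrate the smoothness property: because the decoder is continuous, small moves in latent space produce small moves in data space.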

Implicit models (GANs)

Generative Adversarial Networks do not model P(x) at all. Instead, they train a generator G that transforms noise z into samples, and a discriminator D that tries to distinguish real from generated data. The two play a minimax game:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]

The generator never computes a likelihood. It just learns to fool the discriminator.

Strengths: Often produces the sharpest, most realistic samples. Weaknesses: No likelihood. Training is notoriously unstable. Mode collapse.
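The minimax objective turns into two concrete loss functions, one per player. A sketch with toy discriminator outputs (D(x) is the probability the discriminator assigns to "real"); the non-saturating generator loss is the standard practical variant rather than the literal minimax term:

```python
import math

def d_loss(d_real, d_fake):
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # equivalently, it minimizes the negation.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    # Non-saturating generator loss: maximize log D(G(z)) instead of
    # minimizing log(1 - D(G(z))) for stronger early gradients.
    return -math.log(d_fake)

# Early in training the discriminator wins easily (d_fake near 0),
# so the generator loss is large:
print(d_loss(d_real=0.9, d_fake=0.1), g_loss(d_fake=0.1))
# At the game's equilibrium D outputs 0.5 everywhere:
print(d_loss(d_real=0.5, d_fake=0.5))  # 2*log(2) ~ 1.386
```

Note that neither loss ever touches a likelihood: the only training signal the generator receives is the discriminator's opinion.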

Training procedures compared

graph LR
  subgraph Autoregressive
      A1["Input x"] --> A2["Predict x_t given x < t"]
      A2 --> A3["Cross-entropy loss"]
  end
  subgraph VAE
      B1["Input x"] --> B2["Encode → μ, σ"]
      B2 --> B3["Sample z"]
      B3 --> B4["Decode → x̂"]
      B4 --> B5["Reconstruction + KL loss"]
  end
  subgraph GAN
      C1["Noise z"] --> C2["Generator → fake x"]
      C2 --> C3["Discriminator: real or fake?"]
      C3 --> C4["Adversarial loss"]
  end

Figure 2: Training procedures. Autoregressive models maximize next-token likelihood. VAEs minimize reconstruction error plus KL divergence. GANs play a minimax game between generator and discriminator.

Evaluating generative models

Evaluation is one of the hardest parts of generative modeling. There is no single metric that captures everything. Here are the most common ones.

Log-likelihood. Measures how probable the real data is under the model. Higher is better. Only available for models with explicit densities (autoregressive, VAE, flows). You compute:

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i)

Maximum likelihood: the training idea

graph LR
  A["Real data
distribution"] --> B["Compare"]
  C["Model distribution
P_theta"] --> B
  B --> D["Adjust theta to
minimize gap"]
  D --> C

Maximum likelihood training adjusts the model parameters until the model’s distribution assigns high probability to the real data. You measure the gap with log-likelihood, and gradient descent closes it iteratively.
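The loop in the diagram can be run in the smallest possible setting: fitting the mean theta of a unit-variance Gaussian by gradient descent on the negative log-likelihood. The gradient of the average NLL with respect to theta is simply (theta - mean of the data), so each step pulls theta toward the sample mean, which is the known maximum likelihood solution:

```python
# Maximum likelihood by gradient descent on a toy model: a Gaussian
# with unknown mean theta and fixed variance 1. The data below is
# made up for illustration; its sample mean is 2.1.
data = [1.8, 2.1, 2.4, 1.9, 2.3]
theta = 0.0
lr = 0.1
for _ in range(200):
    grad = theta - sum(data) / len(data)   # d/dtheta of average NLL
    theta -= lr * grad
print(theta)  # converges to the sample mean 2.1
```

Real generative models replace the single scalar theta with millions of network weights, but the loop — score the data, compute a gradient, close the gap — is the same.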

Frechet Inception Distance (FID). Compares the distribution of generated images to real images in the feature space of a pretrained Inception network. Fits a Gaussian to each set of features and computes:

\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)

Lower FID means the generated distribution is closer to the real one.

Inception Score (IS). Measures both quality (each image should look like a clear class) and diversity (the set of images should cover many classes). Higher is better.

Precision and Recall. Precision measures what fraction of generated samples fall within the real data distribution (quality). Recall measures what fraction of the real distribution is covered by generated samples (diversity).

| Metric | What It Measures | Needs Real Data? | Captures Diversity? | Captures Quality? | Available For |
| --- | --- | --- | --- | --- | --- |
| Log-likelihood | How probable real data is | Yes | Partially | Partially | Autoregressive, VAE, flows |
| FID | Distribution distance | Yes | Yes | Yes | All (needs samples) |
| IS | Quality + diversity | No (uses classifier) | Yes | Yes | Image models |
| Precision/Recall | Quality / coverage | Yes | Recall only | Precision only | All (needs samples) |

Example 1: FID between two 1D Gaussians

Suppose the real data follows \mathcal{N}(0, 1) and the generated data follows \mathcal{N}(0.5, 1.44) (i.e., \mu_g = 0.5, \sigma_g^2 = 1.44, so \sigma_g = 1.2).

In one dimension, the FID formula simplifies. The covariance matrices are scalars (\sigma^2), and the matrix square root is just \sigma:

\text{FID} = (\mu_r - \mu_g)^2 + \sigma_r^2 + \sigma_g^2 - 2\,\sigma_r\,\sigma_g

Plugging in:

\text{FID} = (0 - 0.5)^2 + 1^2 + 1.2^2 - 2(1)(1.2) = 0.25 + 1 + 1.44 - 2.4 = 0.29

The FID is 0.29. If the generated distribution were exactly \mathcal{N}(0, 1), FID would be 0. The nonzero value comes from both the shifted mean (contributes 0.25) and the different variance (contributes 0.04).
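The same calculation in code, including the decomposition into the mean and variance contributions:

```python
import math

# 1-D FID between N(mu_r, var_r) and N(mu_g, var_g): in one dimension
# the trace term collapses to var_r + var_g - 2*sqrt(var_r*var_g).
def fid_1d(mu_r, var_r, mu_g, var_g):
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2 * math.sqrt(var_r * var_g)

fid = fid_1d(0.0, 1.0, 0.5, 1.44)
print(round(fid, 2))  # 0.29
# Mean shift alone, then variance mismatch alone:
print(round(fid_1d(0.0, 1.0, 0.5, 1.0), 2), round(fid_1d(0.0, 1.0, 0.0, 1.44), 2))  # 0.25 0.04
```

The two partial results confirm the decomposition stated above: 0.25 from the shifted mean plus 0.04 from the variance mismatch gives 0.29.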

Example 2: Temperature in autoregressive sampling

A character-level language model predicts the next character from vocabulary \{a, b, c\} with logits [2.0, 1.0, 0.5].

Temperature T = 1.0 (standard softmax):

P_i = \frac{e^{l_i / 1.0}}{\sum_j e^{l_j / 1.0}}

e^{2.0} = 7.389, \quad e^{1.0} = 2.718, \quad e^{0.5} = 1.649

Z = 7.389 + 2.718 + 1.649 = 11.756

P = [0.628,\; 0.231,\; 0.140]

Temperature T = 0.5 (sharper):

P_i = \frac{e^{l_i / 0.5}}{\sum_j e^{l_j / 0.5}} = \frac{e^{2 l_i}}{\sum_j e^{2 l_j}}

e^{4.0} = 54.598, \quad e^{2.0} = 7.389, \quad e^{1.0} = 2.718

Z = 54.598 + 7.389 + 2.718 = 64.705

P = [0.844,\; 0.114,\; 0.042]

Lower temperature makes the distribution sharper. At T = 0.5, character “a” gets 84.4% probability instead of 62.8%. Sampling at low temperature produces more predictable, repetitive text. High temperature produces more diverse but less coherent text.

To sample a 3-character sequence, you repeat this process: sample the first character, feed it back as context, compute new logits for the second character, sample, and continue.
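The temperature computation above is easy to check directly. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax_with_temperature(logits, T):
    scaled = [l / T for l in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    Z = sum(exps)
    return [e / Z for e in exps]

logits = [2.0, 1.0, 0.5]                   # model's scores for {a, b, c}
print(softmax_with_temperature(logits, 1.0))  # close to [0.628, 0.231, 0.140]
print(softmax_with_temperature(logits, 0.5))  # close to [0.844, 0.114, 0.042]
```

To turn this into generation, you would draw a character from the returned distribution, append it to the context, and ask the model for fresh logits — exactly the loop described next.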

Example 3: Mode dropping and evaluation metrics

Consider a real distribution with two clusters: half the data comes from cluster A (cats) and half from cluster B (dogs).

| Model | Behavior | Log-likelihood | FID | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| Model 1 | Generates both cats and dogs accurately | High | Low (good) | High | High |
| Model 2 | Generates only cats, but perfect cats | Moderate (zero prob on dogs) | Moderate | High | Low (0.5) |
| Model 3 | Generates blurry images covering both | Low (spread too thin) | Moderate | Low | High |

Model 2 demonstrates mode dropping: it ignores half the distribution. Its precision is high (every generated image is a valid cat), but its recall is only 0.5 (dogs are never generated). Log-likelihood on dog images is extremely low, which penalizes the overall score. FID captures the mismatch in means.

Model 3 demonstrates the opposite problem: it covers everything but nothing looks sharp. No single metric tells the full story. That is why researchers report multiple metrics.
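Model 2's numbers can be reproduced in a toy setting. Here the real data lives in two 1-D clusters, the generator only produces one of them, and "support" is approximated by a fixed radius around each sample — a crude stand-in for the nearest-neighbor supports used by the actual precision/recall-for-distributions metrics:

```python
# Toy mode dropping: real data has cats (near 0) and dogs (near 10);
# the generator produces cats only.
real = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]     # half cats, half dogs
generated = [0.05, -0.05, 0.02, 0.08]         # cats only
radius = 0.5                                  # crude support estimate

def in_support(point, reference):
    return any(abs(point - r) <= radius for r in reference)

# Precision: fraction of generated samples inside the real support.
precision = sum(in_support(g, real) for g in generated) / len(generated)
# Recall: fraction of real samples inside the generated support.
recall = sum(in_support(r, generated) for r in real) / len(real)
print(precision, recall)  # 1.0 0.5
```

Every generated sample is a valid cat (precision 1.0), but the dog half of the distribution is never covered (recall 0.5) — the Model 2 row of the table in miniature.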

Figure 3: Sample quality comparison by FID score (lower is better).

Generative model comparison

| Type | Explicit Density | Training Objective | Sample Quality | Inference Speed | Stability | Representative Model |
| --- | --- | --- | --- | --- | --- | --- |
| Autoregressive | Yes | Max likelihood | Good | Slow (sequential) | Stable | GPT, PixelCNN |
| VAE | Yes (lower bound) | ELBO | Moderate (blurry) | Fast (single pass) | Stable | VAE, VQ-VAE |
| Flow-based | Yes (exact) | Max likelihood | Good | Fast (single pass) | Stable | Glow, RealNVP |
| GAN | No | Adversarial game | Excellent | Fast (single pass) | Unstable | StyleGAN, BigGAN |
| Diffusion | Yes (lower bound) | Denoising score | Excellent | Slow (many steps) | Stable | DDPM, Stable Diffusion |

Why generative modeling is hard

Three fundamental challenges make generative modeling difficult:

High dimensionality. A 256x256 image has ~200,000 dimensions. The model must learn correlations between all of them. A pixel in the top-left corner of a face image is correlated with a pixel in the bottom-right corner (both are part of the same face), and the model needs to capture this.

Evaluation. Unlike classification where accuracy is clear, there is no single number that says “this generative model is good.” You need multiple metrics, and they often disagree.

Training instability. GANs suffer from mode collapse (generator ignores parts of the distribution) and training oscillation (generator and discriminator chase each other). VAEs can suffer from posterior collapse (the model ignores the latent variable). Even autoregressive models face exposure bias.

Despite these challenges, generative models have become remarkably capable. Large language models generate coherent text. Diffusion models generate photorealistic images. The field continues to advance rapidly.

Likelihood-based vs likelihood-free

One useful way to categorize models is by whether they provide an explicit density.

Likelihood-based models (autoregressive, VAE, flows) can compute P_\theta(x) or a lower bound on it. This lets you do model comparison: train two models, pick the one that assigns higher likelihood to a held-out test set. You can also detect out-of-distribution data by checking if new inputs have low likelihood.
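A minimal sketch of held-out model comparison, using two candidate Gaussian models scoring data drawn from a standard normal (all distributions chosen here for illustration):

```python
import math
import random

random.seed(0)

# Average log-likelihood of a dataset under a Gaussian N(mean, var).
def avg_loglik(data, mean, var):
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in data
    ) / len(data)

held_out = [random.gauss(0.0, 1.0) for _ in range(5000)]
model_a = avg_loglik(held_out, mean=0.0, var=1.0)   # well matched
model_b = avg_loglik(held_out, mean=2.0, var=1.0)   # mean is off
print(model_a > model_b)  # True: the held-out data prefers model A
```

No such comparison is possible for a GAN: with samples only, you are back to FID-style distances between sample sets.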

Likelihood-free models (GANs) cannot compute P_\theta(x). You can only draw samples. This makes evaluation harder, but GANs compensate with exceptional sample quality. The adversarial training signal is a different kind of feedback than maximum likelihood, and it often produces sharper images.

Some hybrid approaches exist. Adversarial autoencoders combine VAE-like structure with GAN-like training. Wasserstein GANs reformulate the adversarial objective using optimal transport, which provides a more stable training signal.

The mode coverage vs quality tradeoff

A recurring theme across all generative model families is the tension between coverage and quality.

Mode coverage means the model generates samples from all parts of the real distribution. A face generator with good coverage produces faces of all ages, ethnicities, and expressions.

Sample quality means each individual sample looks realistic. A model with high quality generates sharp, convincing images, but it might only produce a narrow range of faces.

Autoregressive models and VAEs tend toward good coverage but sometimes lower quality (blurry samples for VAEs). GANs tend toward high quality but sometimes poor coverage (mode collapse). Diffusion models have recently shown that you can achieve both, at the cost of slow sampling.

Understanding this tradeoff helps you pick the right model for your application. If you need diversity (drug discovery, creative tools), prioritize coverage. If you need realism (photo editing, super-resolution), prioritize quality.

What comes next

This article gave you the landscape. Now we will go deep into specific generative model families. We start with one of the earliest successful approaches: Restricted Boltzmann Machines. They introduced ideas about energy-based modeling and unsupervised feature learning that influenced everything that came after.
