Generative models: an overview
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Encoder-decoder architectures, Information theory, and Bayes’ theorem.
Discriminative models learn to distinguish between classes. Generative models learn to create new data. That difference sounds simple, but it changes everything about how you train, evaluate, and use the model.
Discriminative vs generative
A discriminative model learns $P(y \mid x)$: given an input, what is the label? A logistic regression classifier, an image classifier, a sentiment analyzer. These models draw decision boundaries.
A generative model learns $P(x)$: what does the data itself look like? Or equivalently, it learns the joint $P(x, y)$ and can derive $P(x \mid y)$ to generate data conditioned on a class. If you train a generative model on face images, you can sample new faces that never existed.
Why is generative modeling harder? Discriminative models only need to learn the boundary between classes. Generative models need to learn the full distribution of the data, which lives in a much higher-dimensional space. A 256x256 RGB image has 196,608 dimensions. Learning a distribution over that space is extremely challenging.
At a glance: four generative model families
| Model | Approach | Strengths | Weaknesses |
|---|---|---|---|
| VAE | Encode input to latent space, decode back | Smooth latent space, fast sampling | Blurry outputs |
| GAN | Generator vs discriminator game | Sharp, realistic samples | Unstable training, mode collapse |
| Autoregressive | Predict one element at a time | Exact likelihood, stable training | Slow sequential generation |
| Flow-based | Invertible transformations | Exact likelihood, fast sampling | Restrictive architecture constraints |
Each family makes different tradeoffs between sample quality, training stability, and computational cost. The sections below explain how.
Three families of generative models
Generative models split into three broad families based on how they represent or approximate $P(x)$.
graph TD
G["Generative Models"] --> AR["Autoregressive"]
G --> LV["Latent Variable"]
G --> IMP["Implicit"]
AR --> AR1["PixelRNN/CNN"]
AR --> AR2["GPT"]
AR --> AR3["WaveNet"]
LV --> LV1["VAE"]
LV --> LV2["Flow-based"]
LV --> LV3["Diffusion"]
IMP --> IMP1["GAN"]
IMP --> IMP2["Implicit VAE"]
style G fill:#ff9,stroke:#333,color:#000
Figure 1: Taxonomy of generative model families. Autoregressive models factor the joint distribution into conditionals. Latent variable models introduce hidden variables. Implicit models learn to sample without an explicit density.
Autoregressive models
These factor the joint distribution using the chain rule of probability:

$$P(x) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

Each factor is a conditional distribution, and the model learns each one. Generation proceeds left to right: sample $x_1$, then $x_2$ given $x_1$, and so on.
Strengths: Exact likelihood computation. Stable training. No mode collapse. Weaknesses: Sequential generation is slow. No latent representation.
Examples: GPT (text), PixelCNN (images), WaveNet (audio).
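The chain-rule factorization is easy to make concrete with a toy first-order model. The probability tables below are invented for illustration; a real autoregressive model would condition on the full history, not just the previous character.

```python
# Toy character-level autoregressive model over the vocabulary {a, b}.
# These conditional tables are made up for illustration.
p_first = {"a": 0.6, "b": 0.4}                      # P(x1)
p_next = {"a": {"a": 0.7, "b": 0.3},                # P(x_t | x_{t-1})
          "b": {"a": 0.2, "b": 0.8}}

def sequence_prob(seq):
    """P(x) = P(x1) * prod_t P(x_t | x_{t-1}) for a first-order model."""
    prob = p_first[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_next[prev][cur]
    return prob

print(round(sequence_prob("ab"), 2))  # 0.18, i.e. 0.6 * 0.3

# Sanity check: probabilities over all length-2 sequences sum to 1.
total = sum(sequence_prob(a + b) for a in "ab" for b in "ab")
print(round(total, 10))  # 1.0
```

Because each factor is a normalized distribution, the product over all sequences of a fixed length sums to exactly 1, which is what makes exact likelihood computation possible.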
Autoregressive vs latent variable models
graph TD
A["Autoregressive"] --> B["Models P(x) directly. Factors into conditionals"]
B --> C["Generates left to right. Each token depends on all previous tokens"]
D["Latent Variable"] --> E["Introduces hidden z. Models P(x) = integral P(x|z)P(z)"]
E --> F["Generates by sampling z, then decoding to x in one pass"]
Latent variable models
These introduce a hidden variable $z$ and model:

$$P(x) = \int P(x \mid z)\, P(z)\, dz$$
The latent $z$ captures the underlying structure. A face image might have latent variables for pose, lighting, and expression. You generate by sampling $z$ from the prior $P(z)$ and then decoding through $P(x \mid z)$.
The integral is usually intractable, so you approximate it. VAEs use variational inference. Flow-based models use invertible transformations to make it exact. Diffusion models gradually add and remove noise.
The generative process with latent variables
graph LR
A["Sample z from prior P(z)"] --> B["Decoder network P(x | z)"]
B --> C["Generated sample x"]
D["Prior: simple distribution e.g. standard normal"] --> A
You sample a latent code z from a simple distribution (usually a Gaussian), feed it through a decoder network, and get a data sample. The decoder learns to map points in latent space to realistic data. Nearby points in latent space produce similar outputs.
Strengths: Meaningful latent space. Fast parallel generation (for some variants). Weaknesses: Training can be unstable. Approximate inference introduces error.
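The sample-then-decode pipeline can be sketched in a few lines. Here a fixed linear map stands in for a trained decoder network; the latent dimension, data dimension, and weights are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D latent space and a fixed linear "decoder" standing in
# for a trained network: x = W z + b maps latent codes to 4-D data.
W = rng.normal(size=(4, 2))
b = rng.normal(size=4)

def decode(z):
    return W @ z + b

# Generation: sample z from the standard normal prior, decode in one pass.
z = rng.normal(size=2)
x = decode(z)
print(x.shape)  # (4,)

# Nearby latent codes decode to nearby outputs: the map is continuous,
# which is why latent spaces support smooth interpolation.
x_near = decode(z + 1e-3)
print(np.linalg.norm(x - x_near) < 0.01)  # True
```

A real decoder is a deep nonlinear network, but the structure is the same: one forward pass from latent code to sample, with no sequential loop.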
Implicit models (GANs)
Generative Adversarial Networks do not model $P(x)$ at all. Instead, they train a generator $G$ that transforms noise into samples, and a discriminator $D$ that tries to distinguish real from generated data. The two play a minimax game:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
The generator never computes a likelihood. It just learns to fool the discriminator.
Strengths: Often produces the sharpest, most realistic samples. Weaknesses: No likelihood. Training is notoriously unstable. Mode collapse.
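The two sides of the minimax objective can be evaluated directly on hypothetical discriminator outputs. The helpers below are illustrative, not any library's API, and the probability values are made up.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator maximizes log D(x) + log(1 - D(G(z)));
    we negate it so both players minimize a loss."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

def g_loss(d_fake):
    """Generator minimizes log(1 - D(G(z))): it wants D(G(z)) near 1."""
    return np.mean(np.log(1 - d_fake))

d_real = np.array([0.9, 0.8])   # discriminator is confident on real data
d_fake = np.array([0.1, 0.2])   # ...and currently spots the fakes
print(d_loss(d_real, d_fake) > 0)  # True

# A generator whose fakes fool the discriminator more (D(G(z)) = 0.4
# instead of 0.1) achieves a lower generator loss:
print(g_loss(np.array([0.4])) < g_loss(np.array([0.1])))  # True
```

Note that neither function ever evaluates a density over data: the only training signal is the discriminator's judgment, which is exactly what makes GANs likelihood-free.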
Training procedures compared
graph LR
subgraph Autoregressive
A1["Input x"] --> A2["Predict x_t given x < t"]
A2 --> A3["Cross-entropy loss"]
end
subgraph VAE
B1["Input x"] --> B2["Encode → μ, σ"]
B2 --> B3["Sample z"]
B3 --> B4["Decode → x̂"]
B4 --> B5["Reconstruction + KL loss"]
end
subgraph GAN
C1["Noise z"] --> C2["Generator → fake x"]
C2 --> C3["Discriminator: real or fake?"]
C3 --> C4["Adversarial loss"]
end
Figure 2: Training procedures. Autoregressive models maximize next-token likelihood. VAEs minimize reconstruction error plus KL divergence. GANs play a minimax game between generator and discriminator.
Evaluating generative models
Evaluation is one of the hardest parts of generative modeling. There is no single metric that captures everything. Here are the most common ones.
Log-likelihood. Measures how probable the real data is under the model. Higher is better. Only available for models with explicit densities (autoregressive, VAE, flows). You compute:

$$\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i)$$
Maximum likelihood: the training idea
graph LR
A["Real data distribution"] --> B["Compare"]
C["Model distribution P_theta"] --> B
B --> D["Adjust theta to minimize gap"]
D --> C
Maximum likelihood training adjusts the model parameters until the model’s distribution assigns high probability to the real data. You measure the gap with log-likelihood, and gradient descent closes it iteratively.
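This loop can be sketched with the simplest possible model: a unit-variance 1-D Gaussian whose only parameter is its mean. Gradient ascent on the average log-likelihood pulls the model mean toward the data mean; the data and hyperparameters below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=2000)  # "real" data

# Fit the mean of a unit-variance Gaussian by gradient ascent on the
# average log-likelihood: a minimal sketch of maximum likelihood.
mu = 0.0
lr = 0.1
for _ in range(200):
    grad = np.mean(data - mu)   # d/dmu of mean log N(x | mu, 1)
    mu += lr * grad

print(round(mu, 2))  # converges close to the true mean 3.0
```

For this model the maximum-likelihood solution is exactly the sample mean, so the loop converges to `data.mean()`; deep generative models follow the same recipe with millions of parameters and stochastic gradients.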
Fréchet Inception Distance (FID). Compares the distribution of generated images to real images in the feature space of a pretrained Inception network. Fit a Gaussian to each set of features, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$, and compute:

$$\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
Lower FID means the generated distribution is closer to the real one.
Inception Score (IS). Measures both quality (each image should look like a clear class) and diversity (the set of images should cover many classes). Higher is better.
Precision and Recall. Precision measures what fraction of generated samples fall within the real data distribution (quality). Recall measures what fraction of the real distribution is covered by generated samples (diversity).
| Metric | What It Measures | Needs Real Data? | Captures Diversity? | Captures Quality? | Available For |
|---|---|---|---|---|---|
| Log-likelihood | How probable real data is | Yes | Partially | Partially | Autoregressive, VAE, flows |
| FID | Distribution distance | Yes | Yes | Yes | All (needs samples) |
| IS | Quality + diversity | No (uses classifier) | Yes | Yes | Image models |
| Precision/Recall | Quality / Coverage | Yes | Recall only | Precision only | All (needs samples) |
Example 1: FID between two 1D Gaussians
Suppose the real data follows $\mathcal{N}(0, 1)$ and the generated data follows $\mathcal{N}(0.5, 0.64)$ (i.e., $\mu_g = 0.5$, $\sigma_g = 0.8$, so $\sigma_g^2 = 0.64$).

In one dimension, the FID formula simplifies. The covariance matrices are scalars ($\sigma_r^2$ and $\sigma_g^2$), and the matrix square root is just $\sigma_r \sigma_g$:

$$\text{FID} = (\mu_r - \mu_g)^2 + \sigma_r^2 + \sigma_g^2 - 2\sigma_r\sigma_g = (\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$$

Plugging in:

$$\text{FID} = (0 - 0.5)^2 + (1 - 0.8)^2 = 0.25 + 0.04 = 0.29$$

The FID is 0.29. If the generated distribution were exactly $\mathcal{N}(0, 1)$, FID would be 0. The nonzero value comes from both the shifted mean (contributes 0.25) and the different variance (contributes 0.04).
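The 1-D calculation is easy to check numerically. `fid_1d` is a hypothetical helper implementing the scalar simplification, not a real FID library function.

```python
def fid_1d(mu_r, sigma_r, mu_g, sigma_g):
    """FID between two 1-D Gaussians: squared mean gap plus squared
    standard-deviation gap (the scalar form of the general formula)."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

print(round(fid_1d(0.0, 1.0, 0.5, 0.8), 2))  # 0.29
print(fid_1d(0.0, 1.0, 0.0, 1.0))            # 0.0 for a perfect match
```

Real FID uses 2048-dimensional Inception features, so the mean term becomes a squared Euclidean norm and the variance term involves a matrix square root, but the two-part structure (mean mismatch plus spread mismatch) is the same.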
Example 2: Temperature in autoregressive sampling
A character-level language model predicts the next character from vocabulary $\{a, b, c\}$ with logits $z = (2.0, 1.0, 0.5)$.

Temperature $T = 1$ (standard softmax):

$$P(x_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \;\Rightarrow\; P(a) \approx 0.628,\quad P(b) \approx 0.231,\quad P(c) \approx 0.140$$

Temperature $T = 0.5$ (sharper):

$$P(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}} \;\Rightarrow\; P(a) \approx 0.844,\quad P(b) \approx 0.114,\quad P(c) \approx 0.042$$

Lower temperature makes the distribution sharper. At $T = 0.5$, character “a” gets 84.4% probability instead of 62.8%. Sampling at low temperature produces more predictable, repetitive text. High temperature produces more diverse but less coherent text.
To sample a 3-character sequence, you repeat this process: sample the first character, feed it back as context, compute new logits for the second character, sample, and continue.
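Temperature scaling is only a few lines of code. The logits below are assumptions chosen to match the percentages quoted in the example.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumed logits for a hypothetical vocabulary {a, b, c}.
logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, T=1.0).round(2))  # [0.63 0.23 0.14]
print(softmax_with_temperature(logits, T=0.5).round(2))  # [0.84 0.11 0.04]
```

Dividing the logits by $T < 1$ widens the gaps between them before the softmax, which concentrates probability on the top choice; $T > 1$ does the opposite.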
Example 3: Mode dropping and evaluation metrics
Consider a real distribution with two clusters: half the data comes from cluster A (cats) and half from cluster B (dogs).
| Model | Behavior | Log-likelihood | FID | Precision | Recall |
|---|---|---|---|---|---|
| Model 1 | Generates both cats and dogs accurately | High | Low (good) | High | High |
| Model 2 | Generates only cats, but perfect cats | Moderate (zero prob on dogs) | Moderate | High | Low (0.5) |
| Model 3 | Generates blurry images covering both | Low (spread too thin) | Moderate | Low | High |
Model 2 demonstrates mode dropping: it ignores half the distribution. Its precision is high (every generated image is a valid cat), but its recall is only 0.5 (dogs are never generated). Log-likelihood on dog images is extremely low, which penalizes the overall score. FID captures the mismatch in means.
Model 3 demonstrates the opposite problem: it covers everything but nothing looks sharp. No single metric tells the full story. That is why researchers report multiple metrics.
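A toy 1-D version of the two-cluster setup makes the precision/recall asymmetry concrete. The cluster locations and the support-based definitions below are simplifications invented for illustration; real precision/recall metrics work in a learned feature space.

```python
import numpy as np

rng = np.random.default_rng(2)

# Real data: two clusters ("cats" near 0, "dogs" near 10).
real = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])
# Mode-dropping model (Model 2): generates only cats.
fake = rng.normal(0, 1, 1000)

def support(samples, center, radius=3.0):
    return np.abs(samples - center) < radius

# Precision: fraction of generated samples inside the real support.
precision = (support(fake, 0) | support(fake, 10)).mean()
# Recall: fraction of real samples inside the generated support.
recall = support(real, 0).mean()

print(precision > 0.95)          # True: every fake looks like a real cat
print(abs(recall - 0.5) < 0.05)  # True: the dog half is never covered
```

Precision is near 1 because every generated sample falls inside the real distribution, while recall sits near 0.5 because half of the real distribution has no generated samples near it, matching the Model 2 row of the table.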
Figure 3: Sample quality comparison by FID score (lower is better).
Generative model comparison
| Type | Explicit Density | Training Objective | Sample Quality | Inference Speed | Stability | Representative Model |
|---|---|---|---|---|---|---|
| Autoregressive | Yes | Max likelihood | Good | Slow (sequential) | Stable | GPT, PixelCNN |
| VAE | Yes (lower bound) | ELBO | Moderate (blurry) | Fast (single pass) | Stable | VAE, VQ-VAE |
| Flow-based | Yes (exact) | Max likelihood | Good | Fast (single pass) | Stable | Glow, RealNVP |
| GAN | No | Adversarial game | Excellent | Fast (single pass) | Unstable | StyleGAN, BigGAN |
| Diffusion | Yes (lower bound) | Denoising score | Excellent | Slow (many steps) | Stable | DDPM, Stable Diffusion |
Why generative modeling is hard
Three fundamental challenges make generative modeling difficult:
High dimensionality. A 256x256 image has ~200,000 dimensions. The model must learn correlations between all of them. A pixel in the top-left corner of a face image is correlated with a pixel in the bottom-right corner (both are part of the same face), and the model needs to capture this.
Evaluation. Unlike classification where accuracy is clear, there is no single number that says “this generative model is good.” You need multiple metrics, and they often disagree.
Training instability. GANs suffer from mode collapse (generator ignores parts of the distribution) and training oscillation (generator and discriminator chase each other). VAEs can suffer from posterior collapse (the model ignores the latent variable). Even autoregressive models face exposure bias.
Despite these challenges, generative models have become remarkably capable. Large language models generate coherent text. Diffusion models generate photorealistic images. The field continues to advance rapidly.
Likelihood-based vs likelihood-free
One useful way to categorize models is by whether they provide an explicit density.
Likelihood-based models (autoregressive, VAE, flows) can compute or a lower bound on it. This lets you do model comparison: train two models, pick the one that assigns higher likelihood to a held-out test set. You can also detect out-of-distribution data by checking if new inputs have low likelihood.
Likelihood-free models (GANs) cannot compute . You can only draw samples. This makes evaluation harder, but GANs compensate with exceptional sample quality. The adversarial training signal is a different kind of feedback than maximum likelihood, and it often produces sharper images.
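Out-of-distribution detection with a likelihood-based model can be sketched in a few lines. Here a 1-D Gaussian fitted by maximum likelihood stands in for the density model; the data and thresholds are invented for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, 5000)   # in-distribution data

# "Train" a Gaussian density model by maximum likelihood.
mu, sigma = train.mean(), train.std()

def log_likelihood(x):
    """Log-density of x under the fitted Gaussian N(mu, sigma^2)."""
    return (-0.5 * math.log(2 * math.pi * sigma**2)
            - (x - mu) ** 2 / (2 * sigma**2))

# A typical in-distribution point scores far higher than an outlier,
# so a simple threshold on log-likelihood flags out-of-distribution inputs.
print(log_likelihood(0.1) > log_likelihood(8.0))  # True
```

A GAN offers no analogue of `log_likelihood`: given a new input, it can neither score it nor compare it against the training distribution, which is the practical cost of being likelihood-free.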
Some hybrid approaches exist. Adversarial autoencoders combine VAE-like structure with GAN-like training. Wasserstein GANs reformulate the adversarial objective using optimal transport, which provides a more stable training signal.
The mode coverage vs quality tradeoff
A recurring theme across all generative model families is the tension between coverage and quality.
Mode coverage means the model generates samples from all parts of the real distribution. A face generator with good coverage produces faces of all ages, ethnicities, and expressions.
Sample quality means each individual sample looks realistic. A model with high quality generates sharp, convincing images, but it might only produce a narrow range of faces.
Autoregressive models and VAEs tend toward good coverage but sometimes lower quality (blurry samples for VAEs). GANs tend toward high quality but sometimes poor coverage (mode collapse). Diffusion models have recently shown that you can achieve both, at the cost of slow sampling.
Understanding this tradeoff helps you pick the right model for your application. If you need diversity (drug discovery, creative tools), prioritize coverage. If you need realism (photo editing, super-resolution), prioritize quality.
What comes next
This article gave you the landscape. Now we will go deep into specific generative model families. We start with one of the earliest successful approaches: Restricted Boltzmann Machines. They introduced ideas about energy-based modeling and unsupervised feature learning that influenced everything that came after.