Generative Adversarial Networks: training and theory
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this article, make sure you are comfortable with:
- Generative models: what it means to learn a data distribution and sample from it
- Cross-entropy and KL divergence: how we measure the difference between two distributions
- Convexity and optimization: why loss landscape shape matters for training
What is a GAN, intuitively?
A GAN has two networks training against each other: a counterfeiter (generator) tries to make fake data, and a detective (discriminator) tries to catch the fakes. The generator never sees real data directly. It only learns through the discriminator’s feedback.
Picture an art forger and a museum curator. Early on, the forger’s paintings are crude and obviously fake. The curator catches them easily. But the forger studies why the paintings were rejected and improves. Over many rounds, the fakes become so good that even the curator struggles.
| Training Round | Generator Quality | Discriminator Accuracy | Status |
|---|---|---|---|
| Round 1 | Crude fakes, random noise | 98% correct | Easy to detect |
| Round 5 | Blurry but recognizable shapes | 85% correct | Getting harder |
| Round 10 | Realistic structure, minor flaws | 70% correct | Challenging |
| Round 20 | Near-perfect fakes | 52% correct | Almost indistinguishable |
At equilibrium, the discriminator outputs 50% for everything. It truly cannot tell real from fake.
GAN training loop
graph LR
    NOISE["Random Noise z"] --> G["Generator"]
    G --> FAKE["Fake Data"]
    REAL["Real Data"] --> D["Discriminator"]
    FAKE --> D
    D --> DLOSS["D Loss: real vs fake"]
    D --> GLOSS["G Loss: fool D"]
    DLOSS -.->|"Update D"| D
    GLOSS -.->|"Update G"| G
    style G fill:#4a9eff,color:#fff
    style D fill:#ff6b6b,color:#fff
The generator transforms random noise into realistic data. The discriminator acts as a learned loss function that improves alongside the generator. This adversarial dynamic is both the source of GANs’ power and the cause of their training difficulties.
Now let’s formalize the minimax objective and derive the theoretical results.
The two-player game
Generator and discriminator loss over 100 training iterations. The discriminator loss stabilizes near 0.5, while the generator loss gradually decreases as it learns to fool the discriminator.
A GAN has two neural networks that train against each other. The generator $G$ takes random noise $z$ and produces fake data $G(z)$. The discriminator $D$ takes data $x$ (real or fake) and outputs a probability $D(x)$ that the input is real.
Think of it like a counterfeiter and a detective. The counterfeiter ($G$) tries to produce bills that look real. The detective ($D$) tries to tell real bills from fake ones. Both get better over time. Training succeeds when the counterfeiter's fakes are indistinguishable from real bills.
The noise $z$ is typically sampled from a simple distribution, like $\mathcal{N}(0, I)$ or $\mathcal{U}(-1, 1)$. The generator learns a mapping from this simple space to the complex data distribution.
The minimax objective
The GAN training objective is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$
Let’s break this down term by term.
First term: $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]$. Sample real data and ask the discriminator to classify it. $D(x)$ should be close to 1 for real data, making $\log D(x)$ close to 0 (its maximum). The discriminator wants to maximize this.
Second term: $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. Sample noise $z$, generate fake data $G(z)$, and ask the discriminator to classify it. For the discriminator, $D(G(z))$ should be close to 0 (correctly identifying fakes), making $\log(1 - D(G(z)))$ close to 0. The discriminator maximizes this. The generator wants $D(G(z))$ close to 1 (fooling the discriminator), which makes $\log(1 - D(G(z)))$ very negative. So the generator minimizes the whole expression.
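Both expectations are estimated on mini-batches in practice. A minimal NumPy sketch of the game value $V(D, G)$; the discriminator outputs below are made-up illustrative values, not outputs of a real network:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs on a mini-batch of 3 real and 3 fake samples.
d_real = np.array([0.9, 0.8, 0.7])  # D(x) for real samples
d_fake = np.array([0.2, 0.3, 0.1])  # D(G(z)) for generated samples

v = gan_value(d_real, d_fake)
# D updates push v up (toward its maximum of 0); G updates push v down.
```

Both log terms are nonpositive, so $V$ is always at most 0; the discriminator can only approach 0 when it classifies everything perfectly.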
The minimax game: opposing objectives
graph TD
    OBJ["V(D, G)"] --> DGOAL["Discriminator: maximize V"]
    OBJ --> GGOAL["Generator: minimize V"]
    DGOAL --> EQUIL["Equilibrium: D outputs 0.5 everywhere"]
    GGOAL --> EQUIL
    style DGOAL fill:#ff6b6b,color:#fff
    style GGOAL fill:#4a9eff,color:#fff
    style EQUIL fill:#51cf66,color:#fff
D pushes V up by getting better at detecting fakes. G pushes V down by producing more convincing fakes. When neither can improve, the generator’s distribution matches the real data distribution.
The optimal discriminator
For a fixed generator $G$, we can solve for the discriminator that maximizes $V(D, G)$. The optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

where $p_g$ is the distribution of generated samples. This result comes from calculus. For each point $x$, the integrand in $V(D, G)$ is:

$$p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big)$$

Taking the derivative with respect to $D(x)$ and setting it to zero gives $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$.
When $p_g = p_{\text{data}}$, we get $D^*(x) = \tfrac{1}{2}$ everywhere. The discriminator can't tell real from fake; it's just guessing.
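A quick numeric sanity check of the pointwise optimum. The densities at a single point are arbitrary illustrative values; we maximize the integrand over a grid of possible $D(x)$ values and compare with the closed form:

```python
import numpy as np

p_data, p_g = 0.3, 0.2  # hypothetical densities at one point x

def integrand(d):
    # p_data(x) log D(x) + p_g(x) log(1 - D(x)), as a function of d = D(x)
    return p_data * np.log(d) + p_g * np.log(1.0 - d)

d_star = p_data / (p_data + p_g)          # closed-form optimum: 0.6
grid = np.linspace(0.01, 0.99, 999)       # candidate values for D(x)
d_best = grid[np.argmax(integrand(grid))] # numeric optimum, should land near 0.6
```

The integrand is concave in $D(x)$, so the grid maximum sits right at the closed-form solution.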
Connection to JS divergence
Plugging $D^*$ back into the objective, we get:

$$C(G) = \max_D V(D, G) = -\log 4 + 2 \cdot \mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big)$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence. The JS divergence is symmetric and bounded between 0 and $\log 2$. It equals zero only when $p_{\text{data}} = p_g$.
So the GAN game, at its theoretical optimum, is minimizing the JS divergence between the real and generated distributions. The global minimum of $C(G)$ is $-\log 4$, achieved when $p_g = p_{\text{data}}$.
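For discrete distributions the identity can be verified directly. A sketch with two arbitrary three-point distributions (the probability vectors are illustrative):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions p and q."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

# Value of the game under the optimal discriminator D*(x) = p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)
v_opt = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Check the identity C(G) = -log 4 + 2 * JSD(p_data || p_g)
assert np.isclose(v_opt, -np.log(4) + 2 * jsd(p_data, p_g))
```

When `p_g` equals `p_data`, the JSD term is zero and the value bottoms out at $-\log 4 \approx -1.386$.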
Saturating vs non-saturating loss
The original (saturating) generator loss is:

$$L_G = \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

The problem: early in training, $G$ is terrible and $D$ easily rejects fakes. So $D(G(z)) \approx 0$, and $\log(1 - D(G(z))) \approx \log 1 = 0$. The gradient is tiny. The generator gets almost no learning signal when it needs it most.
The non-saturating alternative flips the objective:

$$L_G = -\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]$$

Now when $D(G(z)) \approx 0$, the loss $-\log D(G(z))$ is very large, giving strong gradients. Same fixed point (both are minimized when $D(G(z)) \to 1$), but much better gradients early in training.
In practice, almost everyone uses the non-saturating loss.
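The difference is easy to see at the discriminator's output logit. With a sigmoid discriminator $d = \sigma(a)$, the closed-form derivatives are $\frac{d}{da}\log(1-\sigma(a)) = -\sigma(a)$ and $\frac{d}{da}(-\log \sigma(a)) = \sigma(a) - 1$. A sketch with an illustrative logit value for a confidently rejected fake:

```python
import numpy as np

a = -4.6                       # logit for a fake sample; D confidently says "fake"
d = 1.0 / (1.0 + np.exp(-a))   # sigmoid(a) ≈ 0.01

# Gradients w.r.t. the logit a (the signal that flows back through D into G):
grad_saturating     = -d        # d/da log(1 - sigmoid(a)): vanishes as d -> 0
grad_non_saturating = d - 1.0   # d/da (-log sigmoid(a)):   stays near -1
```

The saturating gradient is roughly 100 times weaker here, which is exactly the early-training regime where the generator most needs a signal.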
Mode collapse
Mode collapse is the most common GAN failure mode. The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of learning the full data distribution, it “collapses” to a few modes.
Imagine training on images of digits 0 through 9. A mode-collapsed generator might produce only 3s and 7s, because it found that these fool the discriminator reliably. The discriminator eventually catches on, so the generator might switch to only producing 1s and 5s. This cycling behavior never converges.
Why does this happen? The generator objective doesn’t directly penalize lack of diversity. It only cares about fooling . Producing one very convincing sample is a valid (if degenerate) strategy.
Mode collapse: the generator takes shortcuts
graph TD
    REAL["Real Distribution (digits 0 through 9)"] --> IDEAL["Ideal Generator (produces all 10 digits)"]
    REAL --> COLLAPSED["Collapsed Generator (only produces 3s and 7s)"]
    IDEAL --> DIVERSE["Diverse, realistic samples"]
    COLLAPSED --> LIMITED["High quality but no variety"]
    LIMITED --> CYCLE["D catches on, G switches to different modes, repeats"]
    style IDEAL fill:#51cf66,color:#fff
    style DIVERSE fill:#51cf66,color:#fff
    style COLLAPSED fill:#ff6b6b,color:#fff
    style LIMITED fill:#ff6b6b,color:#fff
    style CYCLE fill:#ff9,stroke:#333,color:#000
The generator’s loss rewards only fooling the discriminator, not diversity. Producing a few perfect samples is easier than covering the full data distribution.
Training instability
Beyond mode collapse, GANs face several instability issues:
Vanishing gradients: If $D$ becomes too strong, it perfectly classifies everything, and the generator's gradients vanish. The loss saturates.
Oscillation: The two networks chase each other without converging. $D$ adapts to $G$'s outputs, then $G$ shifts, then $D$ re-adapts. There's no guarantee of convergence with gradient descent on this non-convex, two-player game.
Sensitivity to hyperparameters: Learning rates, architecture choices, and batch sizes all require careful tuning. Small changes can cause training to diverge.
These problems motivated a search for better loss functions and training procedures.
Techniques for stable GAN training
graph TD
subgraph LossDesign["Loss Design"]
LS["Label Smoothing:
use 0.9 instead of 1.0
for real labels"]
NS["Non-Saturating Loss:
flip G objective for
stronger early gradients"]
end
subgraph ArchTricks["Architecture"]
SN["Spectral Normalization:
control Lipschitz constant
of D weights"]
PG["Progressive Growing:
start small, add
layers gradually"]
end
subgraph TrainStrategy["Training Strategy"]
CR["Critic Ratio:
train D more often
than G (e.g. 5:1)"]
GP["Gradient Penalty:
penalize D gradient norm
deviating from 1"]
end
style LossDesign fill:#e6f3ff,stroke:#333,color:#000
style ArchTricks fill:#fff3e6,stroke:#333,color:#000
style TrainStrategy fill:#e6ffe6,stroke:#333,color:#000
No single trick solves GAN training. Practitioners combine multiple stabilization methods. WGAN-GP uses the Wasserstein loss with gradient penalty and a higher critic-to-generator training ratio.
Wasserstein GAN: Earth Mover’s Distance
The Wasserstein-1 distance (Earth Mover's Distance) between distributions $p_r$ and $p_g$ is:

$$W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$$

where $\Pi(p_r, p_g)$ is the set of all joint distributions whose marginals are $p_r$ and $p_g$.
Think of it as: if $p_r$ is a pile of dirt and $p_g$ is where you want the dirt, $W(p_r, p_g)$ is the minimum total work to move the dirt. Unlike JS divergence, the Wasserstein distance provides smooth gradients even when the distributions don't overlap.
This matters because early in training, $p_{\text{data}}$ and $p_g$ often have disjoint supports (they don't overlap in high-dimensional space). JS divergence is constant (and maxed out at $\log 2$) in this case, giving zero gradient. Wasserstein distance still tells you how far apart the distributions are.
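The classic illustration is two one-dimensional point masses separated by a distance $\theta$. For this pair the closed forms are known: the JS divergence is $\log 2$ whenever the supports are disjoint, while the Wasserstein-1 distance is simply $|\theta|$. A sketch encoding those closed forms (not a general divergence computation):

```python
import numpy as np

def divergences(theta):
    """p_r = point mass at 0, p_g = point mass at theta. Closed-form values."""
    if theta == 0:
        return 0.0, 0.0
    jsd = np.log(2)    # disjoint supports: JSD saturates, so d(JSD)/d(theta) = 0
    w1  = abs(theta)   # W1 = cost of moving the mass, so d(W1)/d(theta) = ±1
    return jsd, w1

jsd_far, w_far   = divergences(5.0)
jsd_near, w_near = divergences(0.5)
# JSD is identical at theta = 5.0 and theta = 0.5: no learning signal.
# W1 shrinks as the distributions approach: a usable gradient.
```

A generator parameterized by $\theta$ gets zero gradient from the JS objective until the supports overlap, but a constant-magnitude gradient from the Wasserstein objective the whole way.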
By the Kantorovich-Rubinstein duality, we can rewrite this as:

$$W(p_r, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where the supremum is over all 1-Lipschitz functions $f$. In WGAN, the discriminator (now called a critic) approximates this $f$. Its output is no longer a probability; it's an unbounded score.
The WGAN critic loss is:

$$L_D = \mathbb{E}_{z \sim p_z}\big[D(G(z))\big] - \mathbb{E}_{x \sim p_r}\big[D(x)\big]$$

The generator loss is:

$$L_G = -\mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$
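Mini-batch estimates of both losses in NumPy. The critic scores below are made-up values; note they are unbounded, not probabilities:

```python
import numpy as np

c_real = np.array([ 2.1, 1.7, 2.4])   # hypothetical critic scores on real samples
c_fake = np.array([-0.8, 0.1, -1.2])  # hypothetical critic scores on fakes

critic_loss = np.mean(c_fake) - np.mean(c_real)  # minimize: reals up, fakes down
gen_loss    = -np.mean(c_fake)                   # minimize: push fake scores up
```

The negative of the critic loss estimates the Wasserstein distance itself, which is why WGAN critic loss curves actually correlate with sample quality.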
flowchart LR
subgraph VanillaGAN["Vanilla GAN"]
A1["D output: sigmoid → probability"] --> A2["Loss: BCE"]
A2 --> A3["Gradient vanishes when distributions don't overlap"]
end
subgraph WGAN["Wasserstein GAN"]
B1["Critic output: unbounded score"] --> B2["Loss: Wasserstein distance"]
B2 --> B3["Smooth gradients even with disjoint supports"]
end
style A3 fill:#ff6b6b,color:#fff
style B3 fill:#51cf66,color:#fff
Gradient penalty (WGAN-GP)
The original WGAN enforced the Lipschitz constraint by weight clipping: after each gradient update, clamp all critic weights to $[-c, c]$ (the paper used $c = 0.01$). This works but causes problems. It biases the critic toward simple functions and can lead to exploding or vanishing gradients depending on $c$.
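Weight clipping itself is a one-liner per parameter tensor. A sketch on a random matrix standing in for a critic weight after a gradient step:

```python
import numpy as np

c = 0.01  # clipping threshold from the WGAN paper
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(4, 4))  # stand-in critic weight matrix

w = np.clip(w, -c, c)  # clamp every weight into [-c, c] after the update
```

The crude part is visible even here: most weights end up pinned at exactly $\pm c$, which is what biases the critic toward overly simple functions.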
WGAN-GP replaces weight clipping with a gradient penalty. The idea: a 1-Lipschitz function has gradients with norm at most 1 everywhere. So we penalize the critic when its gradient norm deviates from 1.
We sample interpolated points between real and fake data:

$$\hat{x} = \epsilon\, x + (1 - \epsilon)\, G(z), \qquad \epsilon \sim \mathcal{U}[0, 1]$$

Then add the penalty:

$$\lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big]$$

The full WGAN-GP critic loss is:

$$L_D = \mathbb{E}_{z}\big[D(G(z))\big] - \mathbb{E}_{x}\big[D(x)\big] + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big]$$

Typical $\lambda = 10$. WGAN-GP also removes batch normalization from the critic, since BN introduces correlations between samples in a batch, which conflicts with the per-sample gradient penalty.
GAN variants comparison
| Variant | Loss function change | Key fix | Training stability | Year |
|---|---|---|---|---|
| Vanilla GAN | BCE (minimax) | Original formulation | Unstable, mode collapse | 2014 |
| WGAN | Wasserstein distance | Smooth gradients via EMD | Better, but weight clipping issues | 2017 |
| WGAN-GP | Wasserstein + gradient penalty | Replaces weight clipping | Significantly more stable | 2017 |
| Spectral Norm GAN | BCE + spectral normalization | Controls Lipschitz of D | Stable, simple to implement | 2018 |
| Hinge GAN | Hinge loss | Bounded D loss | Stable for large-scale models | 2017 |
| Relativistic GAN | Relativistic avg discriminator | D compares real vs fake | Improved convergence | 2018 |
Example 1: discriminator and generator loss
Suppose we have a mini-batch of 3 real samples and 3 fake samples. The discriminator outputs (illustrative values):
- Real samples: $D(x_1) = 0.9$, $D(x_2) = 0.8$, $D(x_3) = 0.7$
- Fake samples: $D(G(z_1)) = 0.2$, $D(G(z_2)) = 0.3$, $D(G(z_3)) = 0.1$
Discriminator loss (BCE):
For each real sample, the target is 1. For each fake sample, the target is 0.
Real part: $-\tfrac{1}{3}\big[\log 0.9 + \log 0.8 + \log 0.7\big] \approx 0.228$
Fake part: $-\tfrac{1}{3}\big[\log(1 - 0.2) + \log(1 - 0.3) + \log(1 - 0.1)\big] \approx 0.228$
Total discriminator loss: $L_D \approx 0.228 + 0.228 \approx 0.457$
Generator loss (non-saturating): $L_G = -\tfrac{1}{3}\big[\log 0.2 + \log 0.3 + \log 0.1\big] \approx 1.705$
The generator loss is much higher than the discriminator loss. That makes sense: the discriminator is already doing a decent job (high on reals, low on fakes), while the generator hasn’t fooled it yet.
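A short script to check this kind of calculation; the discriminator outputs are illustrative values (high on reals, low on fakes):

```python
import numpy as np

d_real = np.array([0.9, 0.8, 0.7])  # illustrative D outputs on real samples
d_fake = np.array([0.2, 0.3, 0.1])  # illustrative D outputs on fake samples

# Discriminator BCE: target 1 for real samples, target 0 for fakes
d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# Non-saturating generator loss: -E[log D(G(z))]
g_loss = -np.mean(np.log(d_fake))
```

With these outputs the generator loss comes out several times larger than the discriminator loss, matching the intuition that the discriminator is currently winning.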
Example 2: optimal discriminator
Suppose at a particular point $x$ in data space the densities are (illustrative values): $p_{\text{data}}(x) = 0.3$ and $p_g(x) = 0.2$.
The optimal discriminator at that point:

$$D^*(x) = \frac{0.3}{0.3 + 0.2} = 0.6$$

The discriminator assigns 60% probability to "real." Since the real distribution has more mass here, that's correct.
Now suppose training has converged and $p_g = p_{\text{data}}$, so $p_g(x) = 0.3$ as well:

$$D^*(x) = \frac{0.3}{0.3 + 0.3} = 0.5$$

The discriminator outputs 0.5, meaning it truly cannot distinguish real from fake. This is the equilibrium. At every point in data space, $D^*(x) = \tfrac{1}{2}$.
Let's also check a point where the generator has too much mass. Say $p_{\text{data}}(x) = 0.1$ and $p_g(x) = 0.4$:

$$D^*(x) = \frac{0.1}{0.1 + 0.4} = 0.2$$

The discriminator correctly says "probably fake" because the generator places too much density here relative to the real data.
Example 3: WGAN-GP gradient penalty
Given (illustrative values):
- Real sample: $x = (2.0,\ 1.0)$
- Fake sample: $G(z) = (0.0,\ 3.0)$
- Interpolation coefficient: $\epsilon = 0.6$, penalty weight: $\lambda = 10$
Step 1: Compute interpolated point

$$\hat{x} = 0.6 \cdot (2.0,\ 1.0) + 0.4 \cdot (0.0,\ 3.0) = (1.2,\ 1.8)$$

Step 2: Get gradient of critic at interpolated point
Suppose we compute the critic's output at $\hat{x}$ and backpropagate to get:

$$\nabla_{\hat{x}} D(\hat{x}) = (1.2,\ 0.8)$$

Step 3: Compute gradient norm

$$\|\nabla_{\hat{x}} D(\hat{x})\|_2 = \sqrt{1.2^2 + 0.8^2} = \sqrt{2.08} \approx 1.442$$

Step 4: Compute gradient penalty

$$\lambda \big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2 = 10 \cdot (1.442 - 1)^2 \approx 1.955$$
The penalty is 1.955. The gradient norm is 1.44, which exceeds the target of 1.0, so the penalty pushes the critic toward having smaller gradients. If the gradient norm were exactly 1.0, the penalty would be zero.
This penalty gets added to the critic loss. If the Wasserstein loss terms sum to, say, $-3.2$, then the total critic loss is $-3.2 + 1.955 = -1.245$.
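The same arithmetic in NumPy. In a real model the gradient would come from backpropagation through the critic; here it is an assumed value for illustration:

```python
import numpy as np

x_real = np.array([2.0, 1.0])   # illustrative real sample
x_fake = np.array([0.0, 3.0])   # illustrative generated sample
eps, lam = 0.6, 10.0            # interpolation coefficient and penalty weight

x_hat = eps * x_real + (1 - eps) * x_fake   # interpolated point (1.2, 1.8)
grad = np.array([1.2, 0.8])                 # assumed critic gradient at x_hat
grad_norm = np.linalg.norm(grad)            # sqrt(2.08) ≈ 1.442
penalty = lam * (grad_norm - 1.0) ** 2      # ≈ 1.955
```

If the gradient norm were exactly 1, the penalty term would vanish, which is the fixed point the regularizer pushes the critic toward.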
Practical training tips
- Train the critic more than the generator. In WGAN-GP, it's common to do 5 critic updates per 1 generator update. The critic needs to be a good approximation of the Wasserstein distance before the generator uses its gradients.
- Use Adam with low learning rates. Typical for WGAN-GP: $\alpha = 10^{-4}$, $\beta_1 = 0$, $\beta_2 = 0.9$. The low $\beta_1$ is deliberate: momentum can destabilize GAN training.
- Monitor both losses and generated samples. Unlike supervised learning, a decreasing loss doesn't always mean improvement. Look at the actual outputs.
- Spectral normalization is an alternative to gradient penalty that's simpler to implement. It normalizes each weight matrix by its largest singular value, enforcing the Lipschitz constraint directly on the network weights.
- Use the non-saturating loss for vanilla GANs. For Wasserstein-based models, use the WGAN or WGAN-GP critic loss.
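The spectral normalization mentioned in the tips can be sketched with power iteration, which is how it is done efficiently in practice (one or a few iterations per training step; here we run more to converge). The matrix is a random stand-in for a critic weight:

```python
import numpy as np

def spectral_normalize(w, n_iters=50):
    """Divide w by its largest singular value, estimated via power iteration."""
    u = np.ones(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v   # estimate of the largest singular value
    return w / sigma

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))      # stand-in critic weight matrix
w_sn = spectral_normalize(w)     # spectral norm of w_sn is ~1
```

After normalization the layer's Lipschitz constant (with respect to the Euclidean norm) is at most about 1, enforcing the constraint without any penalty term in the loss.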
Summary
GANs formulate generative modeling as a two-player minimax game. The theoretical foundation connects to JS divergence (vanilla GAN) or Wasserstein distance (WGAN). The key practical challenges are mode collapse and training instability. WGAN-GP addresses these with the Earth Mover’s Distance and gradient penalty, giving smoother gradients and more stable training.
The core ideas: the generator never sees real data directly. It only learns through the discriminator’s gradients. The discriminator acts as a learned loss function that adapts during training. This is powerful but fragile, which is why so much research has gone into stabilizing the training process.
What comes next
Now that you understand GAN theory and training dynamics, the next article on DCGAN, conditional GANs, and GAN variants covers practical architectures. You’ll see how convolutional GANs generate images, how to condition generation on labels or other images, and how models like CycleGAN and StyleGAN push the boundaries of what GANs can create.