
Generative Adversarial Networks: training and theory

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with the earlier entries in this series, in particular Generative models: an overview and Variational Autoencoders.

What is a GAN, intuitively?

A GAN has two networks training against each other: a counterfeiter (generator) tries to make fake data, and a detective (discriminator) tries to catch the fakes. The generator never sees real data directly. It only learns through the discriminator’s feedback.

Picture an art forger and a museum curator. Early on, the forger’s paintings are crude and obviously fake. The curator catches them easily. But the forger studies why the paintings were rejected and improves. Over many rounds, the fakes become so good that even the curator struggles.

| Training Round | Generator Quality | Discriminator Accuracy | Status |
|---|---|---|---|
| Round 1 | Crude fakes, random noise | 98% correct | Easy to detect |
| Round 5 | Blurry but recognizable shapes | 85% correct | Getting harder |
| Round 10 | Realistic structure, minor flaws | 70% correct | Challenging |
| Round 20 | Near-perfect fakes | 52% correct | Almost indistinguishable |

At equilibrium, the discriminator outputs 50% for everything. It truly cannot tell real from fake.

GAN training loop

graph LR
  NOISE["Random Noise z"] --> G["Generator"]
  G --> FAKE["Fake Data"]
  REAL["Real Data"] --> D["Discriminator"]
  FAKE --> D
  D --> DLOSS["D Loss: real vs fake"]
  D --> GLOSS["G Loss: fool D"]
  DLOSS -.->|"Update D"| D
  GLOSS -.->|"Update G"| G

  style G fill:#4a9eff,color:#fff
  style D fill:#ff6b6b,color:#fff

The generator transforms random noise into realistic data. The discriminator acts as a learned loss function that improves alongside the generator. This adversarial dynamic is both the source of GANs’ power and the cause of their training difficulties.

Now let’s formalize the minimax objective and derive the theoretical results.

The two-player game

Generator and discriminator loss over 100 training iterations. The discriminator loss stabilizes near 0.5, while the generator loss gradually decreases as it learns to fool the discriminator.

A GAN has two neural networks that train against each other. The generator G takes random noise z and produces fake data G(z). The discriminator D takes data (real or fake) and outputs a probability that the input is real.

Think of it like a counterfeiter and a detective. The counterfeiter (G) tries to produce bills that look real. The detective (D) tries to tell real bills from fake ones. Both get better over time. Training succeeds when the counterfeiter’s fakes are indistinguishable from real bills.

The noise z is typically sampled from a simple distribution, like z \sim \mathcal{N}(0, I) or z \sim \text{Uniform}(-1, 1). The generator learns a mapping from this simple space to the complex data distribution.
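To make the latent sampling concrete, here is a minimal NumPy sketch; the 100-dimensional latent size and the fixed seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100  # arbitrary choice for illustration

# z ~ N(0, I): standard Gaussian noise vector
z_gauss = rng.standard_normal(latent_dim)

# z ~ Uniform(-1, 1): uniform noise vector
z_uniform = rng.uniform(-1.0, 1.0, size=latent_dim)

print(z_gauss.shape)  # (100,)
print(bool(z_uniform.min() >= -1.0 and z_uniform.max() <= 1.0))  # True
```

Either choice works; the generator's job is to warp this simple distribution into the data distribution.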

The minimax objective

The GAN training objective is:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Let’s break this down term by term.

First term: \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]. Sample real data x and ask the discriminator to classify it. D(x) should be close to 1 for real data, making \log D(x) close to 0 (its maximum). The discriminator wants to maximize this term.

Second term: \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. Sample noise z, generate fake data G(z), and ask the discriminator to classify it. The discriminator wants D(G(z)) close to 0 (correctly identifying fakes), which keeps \log(1 - D(G(z))) close to 0, its maximum, so it maximizes this term too. The generator wants D(G(z)) close to 1 (fooling the discriminator), which makes \log(1 - D(G(z))) very negative. So the generator minimizes the whole expression.

The minimax game: opposing objectives

graph TD
  OBJ["V(D, G)"] --> DGOAL["Discriminator:
maximize V"]
  OBJ --> GGOAL["Generator:
minimize V"]
  DGOAL --> EQUIL["Equilibrium:
D outputs 0.5 everywhere"]
  GGOAL --> EQUIL

  style DGOAL fill:#ff6b6b,color:#fff
  style GGOAL fill:#4a9eff,color:#fff
  style EQUIL fill:#51cf66,color:#fff

D pushes V up by getting better at detecting fakes. G pushes V down by producing more convincing fakes. When neither can improve, the generator’s distribution matches the real data distribution.

The optimal discriminator

For a fixed generator G, we can solve for the discriminator that maximizes V(D, G). The optimal discriminator is:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

where p_g is the distribution of generated samples. This result comes from pointwise calculus. For each point x, the integrand in V is:

p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))

Taking the derivative with respect to D(x) gives \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}; setting this to zero and solving yields D^*(x).

When p_g = p_{\text{data}}, we get D^*(x) = \frac{1}{2} everywhere. The discriminator can’t tell real from fake; it’s just guessing.
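The formula is easy to evaluate directly. A small NumPy sketch (the density values are illustrative):

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)), evaluated pointwise."""
    p_data = np.asarray(p_data, dtype=float)
    p_g = np.asarray(p_g, dtype=float)
    return p_data / (p_data + p_g)

# Where real data has more density, D* leans toward "real"
print(optimal_discriminator(0.6, 0.4))  # ~0.6

# When p_g matches p_data, D* is 0.5 everywhere: pure guessing
p = np.array([0.1, 0.3, 0.6])
print(optimal_discriminator(p, p))  # [0.5 0.5 0.5]
```

The second call is the equilibrium case: with identical densities, every point gets probability 0.5.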

Connection to JS divergence

Plugging D^* back into the objective, we get:

V(D^*, G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \| p_g)

where D_{JS} is the Jensen-Shannon divergence. The JS divergence is symmetric and bounded between 0 and \log 2. It equals zero only when p_g = p_{\text{data}}.

So the GAN game, at its theoretical optimum, is minimizing the JS divergence between the real and generated distributions. The global minimum of V is -\log 4, achieved when p_g = p_{\text{data}}.
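This identity can be checked numerically for discrete distributions. The sketch below, with made-up categorical distributions, verifies both the identity and the global minimum of -\log 4:

```python
import numpy as np

def v_at_optimal_d(p_data, p_g):
    """V(D*, G) = sum_x [ p_data log D* + p_g log(1 - D*) ] for discrete x."""
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star) + p_g * np.log(1 - d_star))

def js_divergence(p, q):
    """D_JS(p || q) with natural log, so it is bounded by log 2."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.7, 0.2, 0.1])
p_g = np.array([0.1, 0.3, 0.6])

lhs = v_at_optimal_d(p_data, p_g)
rhs = -np.log(4) + 2 * js_divergence(p_data, p_g)
print(bool(np.isclose(lhs, rhs)))  # True

# At p_g = p_data the game sits at its global minimum, -log 4
print(bool(np.isclose(v_at_optimal_d(p_data, p_data), -np.log(4))))  # True
```

Any pair of strictly positive categorical distributions would pass the same check.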

Saturating vs non-saturating loss

The original (saturating) generator loss is:

\mathcal{L}_G^{\text{sat}} = \log(1 - D(G(z)))

The problem: early in training, G is terrible and D easily rejects fakes. So D(G(z)) \approx 0, and \log(1 - D(G(z))) \approx \log(1) = 0, a flat region of the loss. The gradient is tiny. The generator gets almost no learning signal when it needs it most.

The non-saturating alternative flips the objective:

\mathcal{L}_G^{\text{ns}} = -\log(D(G(z)))

Now when D(G(z)) \approx 0, the loss is -\log(D(G(z))) \to \infty, giving strong gradients. Both losses share the same fixed point (each is minimized when D(G(z)) = 1), but the non-saturating version gives much better gradients early in training.

In practice, almost everyone uses the non-saturating loss.
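To see why, differentiate each loss with respect to the discriminator's output d = D(G(z)): the saturating loss has slope -1/(1 - d), the non-saturating loss -1/d. A tiny sketch (the function names are mine):

```python
def grad_saturating(d):
    """Derivative of log(1 - d) w.r.t. d: shallow when d is near 0."""
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    """Derivative of -log(d) w.r.t. d: steep when d is near 0."""
    return -1.0 / d

# Early in training the discriminator rejects fakes, so d = D(G(z)) is near 0
d_early = 0.01
print(grad_saturating(d_early))      # ~ -1.01: almost no learning signal
print(grad_non_saturating(d_early))  # -100.0: strong learning signal
```

Both gradients push d upward, but the non-saturating one is two orders of magnitude larger exactly where the generator is weakest.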

Mode collapse

Mode collapse is the most common GAN failure mode. The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of learning the full data distribution, it “collapses” to a few modes.

Imagine training on images of digits 0 through 9. A mode-collapsed generator might produce only 3s and 7s, because it found that these fool the discriminator reliably. The discriminator eventually catches on, so the generator might switch to only producing 1s and 5s. This cycling behavior never converges.

Why does this happen? The generator objective doesn’t directly penalize lack of diversity. It only cares about fooling D. Producing one very convincing sample is a valid (if degenerate) strategy.

Mode collapse: the generator takes shortcuts

graph TD
  REAL["Real Distribution
(digits 0 through 9)"] --> IDEAL["Ideal Generator
(produces all 10 digits)"]
  REAL --> COLLAPSED["Collapsed Generator
(only produces 3s and 7s)"]
  IDEAL --> DIVERSE["Diverse, realistic samples"]
  COLLAPSED --> LIMITED["High quality but no variety"]
  LIMITED --> CYCLE["D catches on, G switches
to different modes, repeats"]

  style IDEAL fill:#51cf66,color:#fff
  style DIVERSE fill:#51cf66,color:#fff
  style COLLAPSED fill:#ff6b6b,color:#fff
  style LIMITED fill:#ff6b6b,color:#fff
  style CYCLE fill:#ff9,stroke:#333,color:#000

The generator’s loss rewards only fooling the discriminator, not diversity. Producing a few perfect samples is easier than covering the full data distribution.

Training instability

Beyond mode collapse, GANs face several instability issues:

Vanishing gradients: If D becomes too strong, it perfectly classifies everything, and the generator’s gradients vanish. The loss saturates.

Oscillation: The two networks chase each other without converging. D adapts to G’s outputs, then G shifts, then D re-adapts. There’s no guarantee of convergence when running gradient descent on this non-convex, two-player game.

Sensitivity to hyperparameters: Learning rates, architecture choices, and batch sizes all require careful tuning. Small changes can cause training to diverge.

These problems motivated a search for better loss functions and training procedures.

Techniques for stable GAN training

graph TD
  subgraph LossDesign["Loss Design"]
      LS["Label Smoothing:
use 0.9 instead of 1.0
for real labels"]
      NS["Non-Saturating Loss:
flip G objective for
stronger early gradients"]
  end
  subgraph ArchTricks["Architecture"]
      SN["Spectral Normalization:
control Lipschitz constant
of D weights"]
      PG["Progressive Growing:
start small, add
layers gradually"]
  end
  subgraph TrainStrategy["Training Strategy"]
      CR["Critic Ratio:
train D more often
than G (e.g. 5:1)"]
      GP["Gradient Penalty:
penalize D gradient norm
deviating from 1"]
  end

  style LossDesign fill:#e6f3ff,stroke:#333,color:#000
  style ArchTricks fill:#fff3e6,stroke:#333,color:#000
  style TrainStrategy fill:#e6ffe6,stroke:#333,color:#000

No single trick solves GAN training. Practitioners combine multiple stabilization methods. WGAN-GP uses the Wasserstein loss with gradient penalty and a higher critic-to-generator training ratio.
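As one concrete instance, the label smoothing shown in the diagram is a one-line change to the discriminator's targets. A hedged NumPy sketch (the discriminator outputs are illustrative):

```python
import numpy as np

def bce(targets, preds):
    """Binary cross-entropy, averaged over the batch."""
    preds = np.clip(preds, 1e-7, 1 - 1e-7)
    return -np.mean(targets * np.log(preds) + (1 - targets) * np.log(1 - preds))

d_real = np.array([0.95, 0.9, 0.99])  # D outputs on real samples

hard_loss = bce(np.ones_like(d_real), d_real)         # targets = 1.0
smooth_loss = bce(np.full_like(d_real, 0.9), d_real)  # targets = 0.9

# With smoothed targets, an overconfident D is penalized: the loss
# stops rewarding outputs driven all the way to 1.
print(hard_loss, smooth_loss)
```

Here the smoothed loss is larger precisely because the discriminator is already overconfident relative to the 0.9 target, which is the behavior smoothing is meant to discourage.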


Wasserstein GAN: Earth Mover’s Distance

The Wasserstein-1 distance (Earth Mover’s Distance) between distributions p and q is:

W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]

Think of it as: if p is a pile of dirt and q is where you want the dirt, W(p, q) is the minimum total work needed to move the dirt. Unlike JS divergence, the Wasserstein distance provides smooth gradients even when the distributions don’t overlap.

This matters because early in training, p_g and p_{\text{data}} often have disjoint supports (they don’t overlap in high-dimensional space). The JS divergence is constant (and maxed out at \log 2) in this case, giving zero gradient. The Wasserstein distance still tells you how far apart the distributions are.

By the Kantorovich-Rubinstein duality, we can rewrite this as:

W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]

where the supremum is over all 1-Lipschitz functions f. In WGAN, the discriminator (now called a critic) approximates this f. Its output is no longer a probability; it’s an unbounded score.

The WGAN critic loss is:

\mathcal{L}_{\text{critic}} = \mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)]

The generator loss is:

\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[D(G(z))]

flowchart LR
  subgraph VanillaGAN["Vanilla GAN"]
      A1["D output: sigmoid → probability"] --> A2["Loss: BCE"]
      A2 --> A3["Gradient vanishes when distributions don't overlap"]
  end

  subgraph WGAN["Wasserstein GAN"]
      B1["Critic output: unbounded score"] --> B2["Loss: Wasserstein distance"]
      B2 --> B3["Smooth gradients even with disjoint supports"]
  end

  style A3 fill:#ff6b6b,color:#fff
  style B3 fill:#51cf66,color:#fff
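Both WGAN losses reduce to differences of mean critic scores, so they are a few lines of NumPy (the scores below are illustrative):

```python
import numpy as np

def wgan_critic_loss(scores_fake, scores_real):
    """E_{x~p_g}[D(x)] - E_{x~p_data}[D(x)]: minimized when real scores are high and fake scores low."""
    return np.mean(scores_fake) - np.mean(scores_real)

def wgan_generator_loss(scores_fake):
    """-E_z[D(G(z))]: minimized when the critic scores fakes highly."""
    return -np.mean(scores_fake)

# Unbounded critic scores, not probabilities
real_scores = np.array([2.1, 1.8, 2.5])
fake_scores = np.array([-0.5, 0.2, -1.1])

print(wgan_critic_loss(fake_scores, real_scores))  # ~ -2.6: critic separates real from fake well
print(wgan_generator_loss(fake_scores))            # ~ 0.467: generator not yet fooling the critic
```

Note there is no log and no sigmoid anywhere: the critic's raw scores enter the loss directly.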

Gradient penalty (WGAN-GP)

The original WGAN enforced the Lipschitz constraint by weight clipping: after each gradient update, clamp all critic weights to [-c, c]. This works but causes problems. It biases the critic toward overly simple functions and can lead to exploding or vanishing gradients depending on the choice of c.

WGAN-GP replaces weight clipping with a gradient penalty. The idea: a 1-Lipschitz function has gradients with norm at most 1 everywhere. So we penalize the critic when its gradient norm deviates from 1.

We sample interpolated points between real and fake data:

\hat{x} = \alpha x_{\text{real}} + (1 - \alpha) x_{\text{fake}}, \quad \alpha \sim \text{Uniform}(0, 1)

Then add the penalty:

\text{GP} = \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]

The full WGAN-GP critic loss is:

\mathcal{L}_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]

A typical value is \lambda = 10. WGAN-GP also removes batch normalization from the critic, since batch norm introduces correlations between samples in a batch, which conflicts with the per-sample gradient penalty.
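In practice the gradient \nabla_{\hat{x}} D(\hat{x}) comes from automatic differentiation. The sketch below instead uses a hypothetical linear critic D(x) = w \cdot x, whose gradient with respect to its input is just w, purely to show how the penalty itself is computed:

```python
import numpy as np

LAMBDA = 10.0  # standard penalty coefficient

def critic_grad(x, w):
    """Gradient of the toy linear critic D(x) = w . x w.r.t. x: constant w.
    In a real implementation this comes from autodiff, not a closed form."""
    return w

def gradient_penalty(x_real, x_fake, w, alpha, lam=LAMBDA):
    # Interpolate between a real and a fake sample
    x_hat = alpha * x_real + (1 - alpha) * x_fake
    grad = critic_grad(x_hat, w)        # backprop through the critic in practice
    grad_norm = np.linalg.norm(grad)
    # Penalize deviation of the gradient norm from 1 (the Lipschitz target)
    return lam * (grad_norm - 1.0) ** 2

x_real = np.array([1.5, 2.0])
x_fake = np.array([0.5, 1.0])
w = np.array([1.2, 0.8])  # makes the critic's input gradient [1.2, 0.8]

print(round(gradient_penalty(x_real, x_fake, w, alpha=0.4), 3))  # 1.956
```

The penalty is zero only when the gradient norm is exactly 1, and grows quadratically on either side.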

GAN variants comparison

| Variant | Loss function change | Key fix | Training stability | Year |
|---|---|---|---|---|
| Vanilla GAN | BCE (minimax) | Original formulation | Unstable, mode collapse | 2014 |
| WGAN | Wasserstein distance | Smooth gradients via EMD | Better, but weight clipping issues | 2017 |
| WGAN-GP | Wasserstein + gradient penalty | Replaces weight clipping | Significantly more stable | 2017 |
| Spectral Norm GAN | BCE + spectral normalization | Controls Lipschitz of D | Stable, simple to implement | 2018 |
| Hinge GAN | Hinge loss | Bounded D loss | Stable for large-scale models | 2017 |
| Relativistic GAN | Relativistic avg discriminator | D compares real vs fake | Improved convergence | 2018 |

Example 1: discriminator and generator loss

Suppose we have a mini-batch of 3 real samples and 3 fake samples. The discriminator outputs:

  • Real samples: D(x_1) = 0.9, D(x_2) = 0.8, D(x_3) = 0.7
  • Fake samples: D(G(z_1)) = 0.3, D(G(z_2)) = 0.4, D(G(z_3)) = 0.2

Discriminator loss (BCE):

For each real sample, the target is 1. For each fake sample, the target is 0.

\mathcal{L}_D = -\frac{1}{3}\sum_{i=1}^{3}\log D(x_i) - \frac{1}{3}\sum_{j=1}^{3}\log(1 - D(G(z_j)))

Real part:

-\frac{1}{3}[\log(0.9) + \log(0.8) + \log(0.7)] = -\frac{1}{3}[-0.1054 - 0.2231 - 0.3567] = -\frac{1}{3}(-0.6852) = 0.2284

Fake part:

-\frac{1}{3}[\log(1-0.3) + \log(1-0.4) + \log(1-0.2)] = -\frac{1}{3}[\log(0.7) + \log(0.6) + \log(0.8)] = -\frac{1}{3}(-1.0906) = 0.3635

Total discriminator loss:

\mathcal{L}_D = 0.2284 + 0.3635 = 0.5919

Generator loss (non-saturating):

\mathcal{L}_G = -\frac{1}{3}\sum_{j=1}^{3}\log D(G(z_j)) = -\frac{1}{3}[\log(0.3) + \log(0.4) + \log(0.2)] = -\frac{1}{3}(-3.7297) = 1.2432

The generator loss is much higher than the discriminator loss. That makes sense: the discriminator is already doing a decent job (high on reals, low on fakes), while the generator hasn’t fooled it yet.
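These hand calculations can be reproduced in a few lines of NumPy:

```python
import numpy as np

d_real = np.array([0.9, 0.8, 0.7])  # D outputs on real samples
d_fake = np.array([0.3, 0.4, 0.2])  # D outputs on fake samples

# Discriminator BCE loss: real targets are 1, fake targets are 0
loss_d = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))

# Non-saturating generator loss
loss_g = -np.mean(np.log(d_fake))

print(round(loss_d, 4))  # 0.5919
print(round(loss_g, 4))  # 1.2432
```

Both values match the hand-computed results above.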

Example 2: optimal discriminator

Suppose at a particular point x in data space:

  • p_{\text{data}}(x) = 0.6
  • p_g(x) = 0.4

The optimal discriminator at that point:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \frac{0.6}{0.6 + 0.4} = 0.6

The discriminator assigns 60% probability to “real.” Since the real distribution has more mass here, that’s correct.

Now suppose training has converged and p_g = p_{\text{data}}, so p_g(x) = 0.6 as well:

D^*(x) = \frac{0.6}{0.6 + 0.6} = 0.5

The discriminator outputs 0.5, meaning it truly cannot distinguish real from fake. This is the equilibrium: at every point in data space, D^*(x) = 0.5.

Let’s also check a point where the generator has too much mass. Say p_{\text{data}}(x') = 0.2 and p_g(x') = 0.8:

D^*(x') = \frac{0.2}{0.2 + 0.8} = 0.2

The discriminator correctly says “probably fake” because the generator places too much density here relative to the real data.

Example 3: WGAN-GP gradient penalty

Given:

  • Real sample: x_r = [1.5, 2.0]
  • Fake sample: x_f = [0.5, 1.0]
  • Interpolation coefficient: \alpha = 0.4

Step 1: Compute interpolated point

\hat{x} = \alpha \, x_r + (1 - \alpha) \, x_f = 0.4 \cdot [1.5, 2.0] + 0.6 \cdot [0.5, 1.0] = [0.6, 0.8] + [0.3, 0.6] = [0.9, 1.4]

Step 2: Get gradient of critic at interpolated point

Suppose we compute the critic’s output at x^\hat{x} and backpropagate to get:

\nabla_{\hat{x}} D(\hat{x}) = [1.2, 0.8]

Step 3: Compute gradient norm

\|\nabla_{\hat{x}} D(\hat{x})\|_2 = \sqrt{1.2^2 + 0.8^2} = \sqrt{1.44 + 0.64} = \sqrt{2.08} \approx 1.4422

Step 4: Compute gradient penalty

\text{GP} = \lambda \cdot (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 = 10 \cdot (1.4422 - 1)^2 = 10 \cdot (0.4422)^2 \approx 1.955

The penalty is 1.955. The gradient norm is 1.44, which exceeds the target of 1.0, so the penalty pushes the critic toward having smaller gradients. If the gradient norm were exactly 1.0, the penalty would be zero.

This penalty gets added to the critic loss. If the Wasserstein loss terms sum to, say, -0.5, then the total critic loss is -0.5 + 1.955 = 1.455.
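The full computation can be reproduced in NumPy; the critic gradient is taken as a given vector here, since in practice it comes from backpropagation:

```python
import numpy as np

x_r = np.array([1.5, 2.0])   # real sample
x_f = np.array([0.5, 1.0])   # fake sample
alpha = 0.4

# Step 1: interpolated point
x_hat = alpha * x_r + (1 - alpha) * x_f
print(np.round(x_hat, 4))              # [0.9 1.4]

# Steps 2-3: critic gradient (given) and its norm
grad = np.array([1.2, 0.8])
grad_norm = np.linalg.norm(grad)
print(round(grad_norm, 4))             # 1.4422

# Step 4: gradient penalty with lambda = 10
gp = 10.0 * (grad_norm - 1.0) ** 2
print(round(gp, 3))                    # 1.956
```

The last digit differs slightly from the worked example (1.956 vs 1.955) because the example rounds the norm to 1.4422 before squaring.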

Practical training tips

  1. Train the critic more than the generator. In WGAN-GP, it’s common to do 5 critic updates per 1 generator update. The critic needs to be a good approximation of the Wasserstein distance before the generator uses its gradients.

  2. Use Adam with low learning rates. Typical WGAN-GP settings: learning rate 0.0001, \beta_1 = 0.0, \beta_2 = 0.9. The low \beta_1 is deliberate: momentum can destabilize GAN training.

  3. Monitor both losses and generated samples. Unlike supervised learning, a decreasing loss doesn’t always mean improvement. Look at the actual outputs.

  4. Spectral normalization is an alternative to gradient penalty that’s simpler to implement. It normalizes each weight matrix by its largest singular value, enforcing the Lipschitz constraint directly on the network weights.

  5. Use the non-saturating loss for vanilla GANs. For Wasserstein-based models, use the WGAN or WGAN-GP critic loss.
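Tip 1's update schedule is just an inner loop. A schematic sketch with stub update functions (all names are hypothetical; no real training happens here):

```python
N_CRITIC = 5  # critic updates per generator update, as in WGAN-GP
counts = {"critic": 0, "generator": 0}

def update_critic():
    """Placeholder: sample a batch, compute critic loss + gradient penalty, step Adam."""
    counts["critic"] += 1

def update_generator():
    """Placeholder: sample noise, compute -E[D(G(z))], step Adam."""
    counts["generator"] += 1

for step in range(100):          # 100 generator steps
    for _ in range(N_CRITIC):    # keep the critic ahead of the generator
        update_critic()
    update_generator()

print(counts)  # {'critic': 500, 'generator': 100}
```

The point of the 5:1 ratio is that the generator's gradients are only meaningful when the critic is a decent estimate of the Wasserstein distance.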

Summary

GANs formulate generative modeling as a two-player minimax game. The theoretical foundation connects to JS divergence (vanilla GAN) or Wasserstein distance (WGAN). The key practical challenges are mode collapse and training instability. WGAN-GP addresses these with the Earth Mover’s Distance and gradient penalty, giving smoother gradients and more stable training.

The core ideas: the generator never sees real data directly. It only learns through the discriminator’s gradients. The discriminator acts as a learned loss function that adapts during training. This is powerful but fragile, which is why so much research has gone into stabilizing the training process.

What comes next

Now that you understand GAN theory and training dynamics, the next article on DCGAN, conditional GANs, and GAN variants covers practical architectures. You’ll see how convolutional GANs generate images, how to condition generation on labels or other images, and how models like CycleGAN and StyleGAN push the boundaries of what GANs can create.
