
Generative Adversarial Networks: training and theory

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with the earlier entries in this series, in particular Generative models: an overview and Variational Autoencoders.

What is a GAN, intuitively?

A GAN has two networks training against each other: a counterfeiter (generator) tries to make fake data, and a detective (discriminator) tries to catch the fakes. The generator never sees real data directly. It only learns through the discriminator’s feedback.

Picture an art forger and a museum curator. Early on, the forger’s paintings are crude and obviously fake. The curator catches them easily. But the forger studies why the paintings were rejected and improves. Over many rounds, the fakes become so good that even the curator struggles.

| Training Round | Generator Quality | Discriminator Accuracy | Status |
|---|---|---|---|
| Round 1 | Crude fakes, random noise | 98% correct | Easy to detect |
| Round 5 | Blurry but recognizable shapes | 85% correct | Getting harder |
| Round 10 | Realistic structure, minor flaws | 70% correct | Challenging |
| Round 20 | Near-perfect fakes | 52% correct | Almost indistinguishable |

At equilibrium, the discriminator outputs 50% for everything. It truly cannot tell real from fake.

GAN training loop

graph LR
  NOISE["Random Noise z"] --> G["Generator"]
  G --> FAKE["Fake Data"]
  REAL["Real Data"] --> D["Discriminator"]
  FAKE --> D
  D --> DLOSS["D Loss: real vs fake"]
  D --> GLOSS["G Loss: fool D"]
  DLOSS -.->|"Update D"| D
  GLOSS -.->|"Update G"| G

  style G fill:#4a9eff,color:#fff
  style D fill:#ff6b6b,color:#fff

The generator transforms random noise into realistic data. The discriminator acts as a learned loss function that improves alongside the generator. This adversarial dynamic is both the source of GANs’ power and the cause of their training difficulties.

Now let’s formalize the minimax objective and derive the theoretical results.

The two-player game

Generator and discriminator loss over 100 training iterations. The discriminator loss stabilizes near 0.5, while the generator loss gradually decreases as it learns to fool the discriminator.

A GAN has two neural networks that train against each other. The generator G takes random noise z and produces fake data G(z). The discriminator D takes data (real or fake) and outputs a probability that the input is real.

Think of it like a counterfeiter and a detective. The counterfeiter (G) tries to produce bills that look real. The detective (D) tries to tell real bills from fake ones. Both get better over time. Training succeeds when the counterfeiter’s fakes are indistinguishable from real bills.

The noise z is typically sampled from a simple distribution, like z \sim \mathcal{N}(0, I) or z \sim \text{Uniform}(-1, 1). The generator learns a mapping from this simple space to the complex data distribution.
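To make the latent sampling concrete, here is a minimal NumPy sketch; the 100-dimensional latent size and the fixed seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100  # arbitrary choice for illustration

# z ~ N(0, I): standard Gaussian noise vector
z_gauss = rng.standard_normal(latent_dim)

# z ~ Uniform(-1, 1): uniform noise vector
z_uniform = rng.uniform(-1.0, 1.0, size=latent_dim)

print(z_gauss.shape)  # (100,)
print(bool(z_uniform.min() >= -1.0 and z_uniform.max() <= 1.0))  # True
```

Either choice works; the generator's job is to warp this simple distribution into the data distribution.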

The minimax objective

The GAN training objective is:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Let’s break this down term by term.

First term: \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]. Sample real data x and ask the discriminator to classify it. D(x) should be close to 1 for real data, making \log D(x) close to 0 (its maximum). The discriminator wants to maximize this term.

Second term: \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. Sample noise z, generate fake data G(z), and ask the discriminator to classify it. The discriminator wants D(G(z)) close to 0 (correctly identifying fakes), which keeps \log(1 - D(G(z))) close to 0, its maximum, so it maximizes this term too. The generator wants D(G(z)) close to 1 (fooling the discriminator), which makes \log(1 - D(G(z))) very negative. So the generator minimizes the whole expression.

The minimax game: opposing objectives

graph TD
  OBJ["V(D, G)"] --> DGOAL["Discriminator:
maximize V"]
  OBJ --> GGOAL["Generator:
minimize V"]
  DGOAL --> EQUIL["Equilibrium:
D outputs 0.5 everywhere"]
  GGOAL --> EQUIL

  style DGOAL fill:#ff6b6b,color:#fff
  style GGOAL fill:#4a9eff,color:#fff
  style EQUIL fill:#51cf66,color:#fff

D pushes V up by getting better at detecting fakes. G pushes V down by producing more convincing fakes. When neither can improve, the generator’s distribution matches the real data distribution.

The optimal discriminator

For a fixed generator G, we can solve for the discriminator that maximizes V(D, G). The optimal discriminator is:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

where p_g is the distribution of generated samples. This result comes from pointwise calculus. For each point x, the integrand in V is:

p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))

Taking the derivative with respect to D(x) gives \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}; setting this to zero and solving yields D^*(x).

When p_g = p_{\text{data}}, we get D^*(x) = \frac{1}{2} everywhere. The discriminator can’t tell real from fake; it’s just guessing.
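The formula is easy to evaluate directly. A small NumPy sketch (the density values are illustrative):

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)), evaluated pointwise."""
    p_data = np.asarray(p_data, dtype=float)
    p_g = np.asarray(p_g, dtype=float)
    return p_data / (p_data + p_g)

# Where real data has more density, D* leans toward "real"
print(optimal_discriminator(0.6, 0.4))  # ~0.6

# When p_g matches p_data, D* is 0.5 everywhere: pure guessing
p = np.array([0.1, 0.3, 0.6])
print(optimal_discriminator(p, p))  # [0.5 0.5 0.5]
```

The second call is the equilibrium case: with identical densities, every point gets probability 0.5.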

Connection to JS divergence

Plugging D^* back into the objective, we get:

V(D^*, G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \| p_g)

where D_{JS} is the Jensen-Shannon divergence. The JS divergence is symmetric and bounded between 0 and \log 2. It equals zero only when p_g = p_{\text{data}}.

So the GAN game, at its theoretical optimum, is minimizing the JS divergence between the real and generated distributions. The global minimum of V is -\log 4, achieved when p_g = p_{\text{data}}.
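This identity can be checked numerically for discrete distributions. The sketch below, with made-up categorical distributions, verifies both the identity and the global minimum of -\log 4:

```python
import numpy as np

def v_at_optimal_d(p_data, p_g):
    """V(D*, G) = sum_x [ p_data log D* + p_g log(1 - D*) ] for discrete x."""
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star) + p_g * np.log(1 - d_star))

def js_divergence(p, q):
    """D_JS(p || q) with natural log, so it is bounded by log 2."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.7, 0.2, 0.1])
p_g = np.array([0.1, 0.3, 0.6])

lhs = v_at_optimal_d(p_data, p_g)
rhs = -np.log(4) + 2 * js_divergence(p_data, p_g)
print(bool(np.isclose(lhs, rhs)))  # True

# At p_g = p_data the game sits at its global minimum, -log 4
print(bool(np.isclose(v_at_optimal_d(p_data, p_data), -np.log(4))))  # True
```

Any pair of strictly positive categorical distributions would pass the same check.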

Saturating vs non-saturating loss

The original (saturating) generator loss is:

\mathcal{L}_G^{\text{sat}} = \log(1 - D(G(z)))

The problem: early in training, G is terrible and D easily rejects fakes. So D(G(z)) \approx 0, and \log(1 - D(G(z))) \approx \log(1) = 0, a flat region of the loss. The gradient is tiny. The generator gets almost no learning signal when it needs it most.

The non-saturating alternative flips the objective:

\mathcal{L}_G^{\text{ns}} = -\log(D(G(z)))

Now when D(G(z)) \approx 0, the loss is -\log(D(G(z))) \to \infty, giving strong gradients. Both losses share the same fixed point (each is minimized when D(G(z)) = 1), but the non-saturating version gives much better gradients early in training.

In practice, almost everyone uses the non-saturating loss.
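To see why, differentiate each loss with respect to the discriminator's output d = D(G(z)): the saturating loss has slope -1/(1 - d), the non-saturating loss -1/d. A tiny sketch (the function names are mine):

```python
def grad_saturating(d):
    """Derivative of log(1 - d) w.r.t. d: shallow when d is near 0."""
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    """Derivative of -log(d) w.r.t. d: steep when d is near 0."""
    return -1.0 / d

# Early in training the discriminator rejects fakes, so d = D(G(z)) is near 0
d_early = 0.01
print(grad_saturating(d_early))      # ~ -1.01: almost no learning signal
print(grad_non_saturating(d_early))  # -100.0: strong learning signal
```

Both gradients push d upward, but the non-saturating one is two orders of magnitude larger exactly where the generator is weakest.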

Mode collapse

Mode collapse is the most common GAN failure mode. The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of learning the full data distribution, it “collapses” to a few modes.

Imagine training on images of digits 0 through 9. A mode-collapsed generator might produce only 3s and 7s, because it found that these fool the discriminator reliably. The discriminator eventually catches on, so the generator might switch to only producing 1s and 5s. This cycling behavior never converges.

Why does this happen? The generator objective doesn’t directly penalize lack of diversity. It only cares about fooling D. Producing one very convincing sample is a valid (if degenerate) strategy.

Mode collapse: the generator takes shortcuts

graph TD
  REAL["Real Distribution
(digits 0 through 9)"] --> IDEAL["Ideal Generator
(produces all 10 digits)"]
  REAL --> COLLAPSED["Collapsed Generator
(only produces 3s and 7s)"]
  IDEAL --> DIVERSE["Diverse, realistic samples"]
  COLLAPSED --> LIMITED["High quality but no variety"]
  LIMITED --> CYCLE["D catches on, G switches
to different modes, repeats"]

  style IDEAL fill:#51cf66,color:#fff
  style DIVERSE fill:#51cf66,color:#fff
  style COLLAPSED fill:#ff6b6b,color:#fff
  style LIMITED fill:#ff6b6b,color:#fff
  style CYCLE fill:#ff9,stroke:#333,color:#000

The generator’s loss rewards only fooling the discriminator, not diversity. Producing a few perfect samples is easier than covering the full data distribution.

Training instability

Beyond mode collapse, GANs face several instability issues:

Vanishing gradients: If D becomes too strong, it perfectly classifies everything, and the generator’s gradients vanish. The loss saturates.

Oscillation: The two networks chase each other without converging. D adapts to G’s outputs, then G shifts, then D re-adapts. There’s no guarantee of convergence when running gradient descent on this non-convex, two-player game.

Sensitivity to hyperparameters: Learning rates, architecture choices, and batch sizes all require careful tuning. Small changes can cause training to diverge.

These problems motivated a search for better loss functions and training procedures.

Techniques for stable GAN training

graph TD
  subgraph LossDesign["Loss Design"]
      LS["Label Smoothing:
use 0.9 instead of 1.0
for real labels"]
      NS["Non-Saturating Loss:
flip G objective for
stronger early gradients"]
  end
  subgraph ArchTricks["Architecture"]
      SN["Spectral Normalization:
control Lipschitz constant
of D weights"]
      PG["Progressive Growing:
start small, add
layers gradually"]
  end
  subgraph TrainStrategy["Training Strategy"]
      CR["Critic Ratio:
train D more often
than G (e.g. 5:1)"]
      GP["Gradient Penalty:
penalize D gradient norm
deviating from 1"]
  end

  style LossDesign fill:#e6f3ff,stroke:#333,color:#000
  style ArchTricks fill:#fff3e6,stroke:#333,color:#000
  style TrainStrategy fill:#e6ffe6,stroke:#333,color:#000

No single trick solves GAN training. Practitioners combine multiple stabilization methods. WGAN-GP uses the Wasserstein loss with gradient penalty and a higher critic-to-generator training ratio.
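As one concrete instance, the label smoothing shown in the diagram is a one-line change to the discriminator's targets. A hedged NumPy sketch (the discriminator outputs are illustrative):

```python
import numpy as np

def bce(targets, preds):
    """Binary cross-entropy, averaged over the batch."""
    preds = np.clip(preds, 1e-7, 1 - 1e-7)
    return -np.mean(targets * np.log(preds) + (1 - targets) * np.log(1 - preds))

d_real = np.array([0.95, 0.9, 0.99])  # D outputs on real samples

hard_loss = bce(np.ones_like(d_real), d_real)         # targets = 1.0
smooth_loss = bce(np.full_like(d_real, 0.9), d_real)  # targets = 0.9

# With smoothed targets, an overconfident D is penalized: the loss
# stops rewarding outputs driven all the way to 1.
print(hard_loss, smooth_loss)
```

Here the smoothed loss is larger precisely because the discriminator is already overconfident relative to the 0.9 target, which is the behavior smoothing is meant to discourage.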


Wasserstein GAN: Earth Mover’s Distance

The Wasserstein-1 distance (Earth Mover’s Distance) between distributions p and q is:

W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]

Think of it as: if p is a pile of dirt and q is where you want the dirt, W(p, q) is the minimum total work needed to move the dirt. Unlike JS divergence, the Wasserstein distance provides smooth gradients even when the distributions don’t overlap.

This matters because early in training, p_g and p_{\text{data}} often have disjoint supports (they don’t overlap in high-dimensional space). The JS divergence is constant (and maxed out at \log 2) in this case, giving zero gradient. The Wasserstein distance still tells you how far apart the distributions are.

By the Kantorovich-Rubinstein duality, we can rewrite this as:

W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]

where the supremum is over all 1-Lipschitz functions f. In WGAN, the discriminator (now called a critic) approximates this f. Its output is no longer a probability; it’s an unbounded score.

The WGAN critic loss is:

\mathcal{L}_{\text{critic}} = \mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)]

The generator loss is:

\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[D(G(z))]

flowchart LR
  subgraph VanillaGAN["Vanilla GAN"]
      A1["D output: sigmoid → probability"] --> A2["Loss: BCE"]
      A2 --> A3["Gradient vanishes when distributions don't overlap"]
  end

  subgraph WGAN["Wasserstein GAN"]
      B1["Critic output: unbounded score"] --> B2["Loss: Wasserstein distance"]
      B2 --> B3["Smooth gradients even with disjoint supports"]
  end

  style A3 fill:#ff6b6b,color:#fff
  style B3 fill:#51cf66,color:#fff
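Both WGAN losses reduce to differences of mean critic scores, so they are a few lines of NumPy (the scores below are illustrative):

```python
import numpy as np

def wgan_critic_loss(scores_fake, scores_real):
    """E_{x~p_g}[D(x)] - E_{x~p_data}[D(x)]: minimized when real scores are high and fake scores low."""
    return np.mean(scores_fake) - np.mean(scores_real)

def wgan_generator_loss(scores_fake):
    """-E_z[D(G(z))]: minimized when the critic scores fakes highly."""
    return -np.mean(scores_fake)

# Unbounded critic scores, not probabilities
real_scores = np.array([2.1, 1.8, 2.5])
fake_scores = np.array([-0.5, 0.2, -1.1])

print(wgan_critic_loss(fake_scores, real_scores))  # ~ -2.6: critic separates real from fake well
print(wgan_generator_loss(fake_scores))            # ~ 0.467: generator not yet fooling the critic
```

Note there is no log and no sigmoid anywhere: the critic's raw scores enter the loss directly.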

Gradient penalty (WGAN-GP)

The original WGAN enforced the Lipschitz constraint by weight clipping: after each gradient update, clamp all critic weights to [-c, c]. This works but causes problems. It biases the critic toward overly simple functions and can lead to exploding or vanishing gradients depending on the choice of c.

WGAN-GP replaces weight clipping with a gradient penalty. The idea: a 1-Lipschitz function has gradients with norm at most 1 everywhere. So we penalize the critic when its gradient norm deviates from 1.

We sample interpolated points between real and fake data:

\hat{x} = \alpha x_{\text{real}} + (1 - \alpha) x_{\text{fake}}, \quad \alpha \sim \text{Uniform}(0, 1)

Then add the penalty:

\text{GP} = \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]

The full WGAN-GP critic loss is:

\mathcal{L}_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]

A typical value is \lambda = 10. WGAN-GP also removes batch normalization from the critic, since batch norm introduces correlations between samples in a batch, which conflicts with the per-sample gradient penalty.
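In practice the gradient \nabla_{\hat{x}} D(\hat{x}) comes from automatic differentiation. The sketch below instead uses a hypothetical linear critic D(x) = w \cdot x, whose gradient with respect to its input is just w, purely to show how the penalty itself is computed:

```python
import numpy as np

LAMBDA = 10.0  # standard penalty coefficient

def critic_grad(x, w):
    """Gradient of the toy linear critic D(x) = w . x w.r.t. x: constant w.
    In a real implementation this comes from autodiff, not a closed form."""
    return w

def gradient_penalty(x_real, x_fake, w, alpha, lam=LAMBDA):
    # Interpolate between a real and a fake sample
    x_hat = alpha * x_real + (1 - alpha) * x_fake
    grad = critic_grad(x_hat, w)        # backprop through the critic in practice
    grad_norm = np.linalg.norm(grad)
    # Penalize deviation of the gradient norm from 1 (the Lipschitz target)
    return lam * (grad_norm - 1.0) ** 2

x_real = np.array([1.5, 2.0])
x_fake = np.array([0.5, 1.0])
w = np.array([1.2, 0.8])  # makes the critic's input gradient [1.2, 0.8]

print(round(gradient_penalty(x_real, x_fake, w, alpha=0.4), 3))  # 1.956
```

The penalty is zero only when the gradient norm is exactly 1, and grows quadratically on either side.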

GAN variants comparison

| Variant | Loss function change | Key fix | Training stability | Year |
|---|---|---|---|---|
| Vanilla GAN | BCE (minimax) | Original formulation | Unstable, mode collapse | 2014 |
| WGAN | Wasserstein distance | Smooth gradients via EMD | Better, but weight clipping issues | 2017 |
| WGAN-GP | Wasserstein + gradient penalty | Replaces weight clipping | Significantly more stable | 2017 |
| Spectral Norm GAN | BCE + spectral normalization | Controls Lipschitz of D | Stable, simple to implement | 2018 |
| Hinge GAN | Hinge loss | Bounded D loss | Stable for large-scale models | 2017 |
| Relativistic GAN | Relativistic avg discriminator | D compares real vs fake | Improved convergence | 2018 |

Example 1: discriminator and generator loss

Suppose we have a mini-batch of 3 real samples and 3 fake samples. The discriminator outputs:

  • Real samples: D(x_1) = 0.9, D(x_2) = 0.8, D(x_3) = 0.7
  • Fake samples: D(G(z_1)) = 0.3, D(G(z_2)) = 0.4, D(G(z_3)) = 0.2

Discriminator loss (BCE):

For each real sample, the target is 1. For each fake sample, the target is 0.

\mathcal{L}_D = -\frac{1}{3}\sum_{i=1}^{3}\log D(x_i) - \frac{1}{3}\sum_{j=1}^{3}\log(1 - D(G(z_j)))

Real part:

-\frac{1}{3}[\log(0.9) + \log(0.8) + \log(0.7)] = -\frac{1}{3}[-0.1054 - 0.2231 - 0.3567] = -\frac{1}{3}(-0.6852) = 0.2284

Fake part:

-\frac{1}{3}[\log(1-0.3) + \log(1-0.4) + \log(1-0.2)] = -\frac{1}{3}[\log(0.7) + \log(0.6) + \log(0.8)] = -\frac{1}{3}(-1.0906) = 0.3635

Total discriminator loss:

\mathcal{L}_D = 0.2284 + 0.3635 = 0.5919

Generator loss (non-saturating):

\mathcal{L}_G = -\frac{1}{3}\sum_{j=1}^{3}\log D(G(z_j)) = -\frac{1}{3}[\log(0.3) + \log(0.4) + \log(0.2)] = -\frac{1}{3}(-3.7297) = 1.2432

The generator loss is much higher than the discriminator loss. That makes sense: the discriminator is already doing a decent job (high on reals, low on fakes), while the generator hasn’t fooled it yet.
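These hand calculations can be reproduced in a few lines of NumPy:

```python
import numpy as np

d_real = np.array([0.9, 0.8, 0.7])  # D outputs on real samples
d_fake = np.array([0.3, 0.4, 0.2])  # D outputs on fake samples

# Discriminator BCE loss: real targets are 1, fake targets are 0
loss_d = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))

# Non-saturating generator loss
loss_g = -np.mean(np.log(d_fake))

print(round(loss_d, 4))  # 0.5919
print(round(loss_g, 4))  # 1.2432
```

Both values match the hand-computed results above.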

Example 2: optimal discriminator

Suppose at a particular point x in data space:

  • p_{\text{data}}(x) = 0.6
  • p_g(x) = 0.4

The optimal discriminator at that point:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \frac{0.6}{0.6 + 0.4} = 0.6

The discriminator assigns 60% probability to “real.” Since the real distribution has more mass here, that’s correct.

Now suppose training has converged and p_g = p_{\text{data}}, so p_g(x) = 0.6 as well:

D^*(x) = \frac{0.6}{0.6 + 0.6} = 0.5

The discriminator outputs 0.5, meaning it truly cannot distinguish real from fake. This is the equilibrium: at every point in data space, D^*(x) = 0.5.

Let’s also check a point where the generator has too much mass. Say p_{\text{data}}(x') = 0.2 and p_g(x') = 0.8:

D^*(x') = \frac{0.2}{0.2 + 0.8} = 0.2

The discriminator correctly says “probably fake” because the generator places too much density here relative to the real data.

Example 3: WGAN-GP gradient penalty

Given:

  • Real sample: x_r = [1.5, 2.0]
  • Fake sample: x_f = [0.5, 1.0]
  • Interpolation coefficient: \alpha = 0.4

Step 1: Compute interpolated point

\hat{x} = \alpha \, x_r + (1 - \alpha) \, x_f = 0.4 \cdot [1.5, 2.0] + 0.6 \cdot [0.5, 1.0] = [0.6, 0.8] + [0.3, 0.6] = [0.9, 1.4]

Step 2: Get gradient of critic at interpolated point

Suppose we compute the critic’s output at x^\hat{x} and backpropagate to get:

\nabla_{\hat{x}} D(\hat{x}) = [1.2, 0.8]

Step 3: Compute gradient norm

\|\nabla_{\hat{x}} D(\hat{x})\|_2 = \sqrt{1.2^2 + 0.8^2} = \sqrt{1.44 + 0.64} = \sqrt{2.08} \approx 1.4422

Step 4: Compute gradient penalty

\text{GP} = \lambda \cdot (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 = 10 \cdot (1.4422 - 1)^2 = 10 \cdot (0.4422)^2 \approx 1.955

The penalty is 1.955. The gradient norm is 1.44, which exceeds the target of 1.0, so the penalty pushes the critic toward having smaller gradients. If the gradient norm were exactly 1.0, the penalty would be zero.

This penalty gets added to the critic loss. If the Wasserstein loss terms sum to, say, -0.5, then the total critic loss is -0.5 + 1.955 = 1.455.
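The full computation can be reproduced in NumPy; the critic gradient is taken as a given vector here, since in practice it comes from backpropagation:

```python
import numpy as np

x_r = np.array([1.5, 2.0])   # real sample
x_f = np.array([0.5, 1.0])   # fake sample
alpha = 0.4

# Step 1: interpolated point
x_hat = alpha * x_r + (1 - alpha) * x_f
print(np.round(x_hat, 4))              # [0.9 1.4]

# Steps 2-3: critic gradient (given) and its norm
grad = np.array([1.2, 0.8])
grad_norm = np.linalg.norm(grad)
print(round(grad_norm, 4))             # 1.4422

# Step 4: gradient penalty with lambda = 10
gp = 10.0 * (grad_norm - 1.0) ** 2
print(round(gp, 3))                    # 1.956
```

The last digit differs slightly from the worked example (1.956 vs 1.955) because the example rounds the norm to 1.4422 before squaring.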

Practical training tips

  1. Train the critic more than the generator. In WGAN-GP, it’s common to do 5 critic updates per 1 generator update. The critic needs to be a good approximation of the Wasserstein distance before the generator uses its gradients.

  2. Use Adam with low learning rates. Typical WGAN-GP settings: learning rate 0.0001, \beta_1 = 0.0, \beta_2 = 0.9. The low \beta_1 is deliberate: momentum can destabilize GAN training.

  3. Monitor both losses and generated samples. Unlike supervised learning, a decreasing loss doesn’t always mean improvement. Look at the actual outputs.

  4. Spectral normalization is an alternative to gradient penalty that’s simpler to implement. It normalizes each weight matrix by its largest singular value, enforcing the Lipschitz constraint directly on the network weights.

  5. Use the non-saturating loss for vanilla GANs. For Wasserstein-based models, use the WGAN or WGAN-GP critic loss.
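Tip 1's update schedule is just an inner loop. A schematic sketch with stub update functions (all names are hypothetical; no real training happens here):

```python
N_CRITIC = 5  # critic updates per generator update, as in WGAN-GP
counts = {"critic": 0, "generator": 0}

def update_critic():
    """Placeholder: sample a batch, compute critic loss + gradient penalty, step Adam."""
    counts["critic"] += 1

def update_generator():
    """Placeholder: sample noise, compute -E[D(G(z))], step Adam."""
    counts["generator"] += 1

for step in range(100):          # 100 generator steps
    for _ in range(N_CRITIC):    # keep the critic ahead of the generator
        update_critic()
    update_generator()

print(counts)  # {'critic': 500, 'generator': 100}
```

The point of the 5:1 ratio is that the generator's gradients are only meaningful when the critic is a decent estimate of the Wasserstein distance.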

Summary

GANs formulate generative modeling as a two-player minimax game. The theoretical foundation connects to JS divergence (vanilla GAN) or Wasserstein distance (WGAN). The key practical challenges are mode collapse and training instability. WGAN-GP addresses these with the Earth Mover’s Distance and gradient penalty, giving smoother gradients and more stable training.

The core ideas: the generator never sees real data directly. It only learns through the discriminator’s gradients. The discriminator acts as a learned loss function that adapts during training. This is powerful but fragile, which is why so much research has gone into stabilizing the training process.

What comes next

Now that you understand GAN theory and training dynamics, the next article on DCGAN, conditional GANs, and GAN variants covers practical architectures. You’ll see how convolutional GANs generate images, how to condition generation on labels or other images, and how models like CycleGAN and StyleGAN push the boundaries of what GANs can create.
