DCGAN, conditional GANs, and GAN variants

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with:

  • GAN training and theory: the minimax objective, discriminator and generator losses, mode collapse, Wasserstein distance
  • Convolutional neural networks: convolutions, transposed convolutions, stride, padding, batch normalization

DCGAN: the convolutional recipe

Deep Convolutional GAN (DCGAN, 2015) was the first architecture that reliably generated decent images with GANs. The authors identified a set of architectural guidelines that stabilize training. These guidelines became the starting point for nearly all image GANs that followed.

The DCGAN rules:

  1. Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator)
  2. Use batch normalization in both generator and discriminator, except the generator’s output layer and the discriminator’s input layer
  3. Remove fully connected hidden layers for deeper architectures; feed the final convolutional features directly into the discriminator’s output
  4. Use ReLU activation in the generator for all layers except the output, which uses Tanh
  5. Use LeakyReLU activation in the discriminator for all layers

Why these choices? Strided convolutions let the network learn its own spatial downsampling/upsampling instead of using fixed pooling. Batch normalization stabilizes learning by reducing internal covariate shift. Tanh in the generator output bounds pixel values to $[-1, 1]$. LeakyReLU in the discriminator prevents dead neurons that would block gradient flow.

flowchart LR
  Z["z ∈ ℝ¹⁰⁰
(noise)"] --> FC["Project & Reshape
512 × 4 × 4"]
  FC --> DC1["TransConv
256 × 8 × 8
BN + ReLU"]
  DC1 --> DC2["TransConv
128 × 16 × 16
BN + ReLU"]
  DC2 --> DC3["TransConv
64 × 32 × 32
BN + ReLU"]
  DC3 --> DC4["TransConv
3 × 64 × 64
Tanh"]

  style Z fill:#9775fa,color:#fff
  style DC4 fill:#51cf66,color:#fff
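
To make the recipe concrete, here is a minimal PyTorch sketch of a generator with the layout in the diagram above. It is illustrative rather than a reference implementation: the "project and reshape" step is done with a stride-1 transposed convolution from a 1×1 input, a common substitute for the fully connected projection, and the channel widths simply mirror the diagram.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Sketch of a DCGAN-style generator: z (100-dim) -> 3 x 64 x 64 image."""
    def __init__(self, z_dim=100, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # "Project & reshape": stride-1 transposed conv from a 1x1 input -> 512 x 4 x 4
            nn.ConvTranspose2d(z_dim, base_channels * 8, kernel_size=4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(base_channels * 8),
            nn.ReLU(inplace=True),
            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4),
            nn.ReLU(inplace=True),
            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2),
            nn.ReLU(inplace=True),
            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels),
            nn.ReLU(inplace=True),
            # 64 x 32 x 32 -> 3 x 64 x 64; no BN on the output layer, Tanh bounds values to [-1, 1]
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, z_dim) -> (batch, z_dim, 1, 1) so the convolutions can act on it
        return self.net(z.view(z.size(0), -1, 1, 1))

fake = DCGANGenerator()(torch.randn(8, 100))
print(fake.shape)  # torch.Size([8, 3, 64, 64])
```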

Conditional GAN

A standard GAN has no control over what it generates. You sample noise $z$ and get a random output. A conditional GAN (cGAN) adds a conditioning variable $y$, typically a class label, to both the generator and discriminator.

The objective becomes:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y) \mid y))]$$

In practice, the label $y$ is usually embedded as a vector and concatenated with the noise input for $G$ or with feature maps for $D$. For image generation, you might one-hot encode the class and tile it spatially to match the feature map dimensions.

The key insight: the discriminator must now judge not only whether an image looks real, but whether it matches the given label. A realistic cat image paired with the label “dog” should be rejected.
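
A minimal sketch of the embed-and-concatenate scheme in PyTorch. The flat MLP generator and the layer sizes are illustrative choices for something like 28×28 grayscale images, not taken from a specific paper:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Sketch: embed the class label, concatenate it with the noise, generate a flat image."""
    def __init__(self, z_dim=100, num_classes=10, embed_dim=50, img_dim=28 * 28):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + embed_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, z, labels):
        # Conditioning: concatenate the label embedding with the noise vector
        cond = torch.cat([z, self.label_embed(labels)], dim=1)
        return self.net(cond)

G = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.tensor([0, 3, 7, 9])
print(G(z, labels).shape)  # torch.Size([4, 784])
```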

Pix2Pix: paired image translation

Pix2Pix (2016) applies conditional GANs to paired image-to-image translation. Given aligned pairs of images (e.g., satellite photos and maps, sketches and photographs), it learns to translate from one domain to another.

The loss combines an adversarial term with an L1 reconstruction term:

$$\mathcal{L}_{\text{Pix2Pix}} = \mathcal{L}_{\text{cGAN}}(G, D) + \lambda \, \mathbb{E}_{x,y}[\|y - G(x)\|_1]$$

The adversarial loss encourages sharp, realistic outputs. The L1 loss ensures the output is structurally close to the target. Without L1, the generator might produce realistic-looking images that don’t match the input. Without the adversarial loss, the output is blurry (L1 alone produces the mean of possible outputs).
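
A sketch of the generator objective under these assumptions. `G`, `D`, and the image batches `x` (input) and `y` (target) are placeholders for models and data defined elsewhere; `lam=100` follows the weighting reported in the paper:

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(G, D, x, y, lam=100.0):
    fake = G(x)
    # Conditional PatchGAN: the discriminator sees the input concatenated with the output
    pred_fake = D(torch.cat([x, fake], dim=1))
    # Non-saturating adversarial term: label every patch prediction as "real"
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    # L1 reconstruction term keeps the output structurally close to the target
    recon = F.l1_loss(fake, y)
    return adv + lam * recon
```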

Pix2Pix uses a U-Net generator with skip connections. Low-level details (edges, textures) pass directly from encoder to decoder through skip connections, while high-level structure is captured in the bottleneck. The discriminator is a PatchGAN that classifies overlapping patches of the image as real or fake, rather than the whole image at once.

CycleGAN: unpaired translation

Pix2Pix requires paired data, but paired data is often unavailable. You might have photos and Monet paintings, but no pixel-aligned pairs. CycleGAN (2017) handles unpaired image-to-image translation by introducing cycle consistency.

Two generators: $G: X \to Y$ and $F: Y \to X$. Two discriminators: $D_Y$ judges whether images look like domain $Y$, $D_X$ judges domain $X$.

The cycle consistency loss ensures that translating an image to the other domain and back recovers the original:

$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_Y}[\|G(F(y)) - y\|_1]$$

Total loss:

$$\mathcal{L} = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + \lambda \, \mathcal{L}_{\text{cyc}}(G, F)$$

Without cycle consistency, the generators can map all inputs to a single output in the target domain (a form of mode collapse). Cycle consistency forces them to preserve enough information for the reverse mapping.
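
The cycle term is short to express in code. In this sketch, `G_xy` and `G_yx` stand for the two generators and `x`, `y` for batches from the two domains; all are assumed to be defined elsewhere:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y, lam=10.0):
    x_rec = G_yx(G_xy(x))   # x -> fake painting -> reconstructed photo
    y_rec = G_xy(G_yx(y))   # y -> fake photo -> reconstructed painting
    return lam * (F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y))
```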

flowchart LR
  subgraph Forward["Forward cycle"]
      X1["x (photo)"] -->|"G"| Y1["ŷ (painting)"]
      Y1 -->|"F"| X1R["x̂ ≈ x"]
  end

  subgraph Backward["Backward cycle"]
      Y2["y (painting)"] -->|"F"| X2["x̂ (photo)"]
      X2 -->|"G"| Y2R["ŷ ≈ y"]
  end

  DY["D_Y: real painting?"]
  DX["D_X: real photo?"]

  Y1 --> DY
  X2 --> DX

  style X1 fill:#74c0fc,color:#000
  style Y1 fill:#ffa94d,color:#000
  style X1R fill:#74c0fc,color:#000
  style Y2 fill:#ffa94d,color:#000
  style X2 fill:#74c0fc,color:#000
  style Y2R fill:#ffa94d,color:#000

StyleGAN: style-based generation

StyleGAN (2018) redesigned the generator architecture for high-resolution face generation. Instead of feeding noise directly into the first layer, it uses a mapping network and Adaptive Instance Normalization (AdaIN).

Mapping network: An 8-layer MLP transforms the noise $z$ into an intermediate latent code $w$. This $w$ space is less entangled than the raw $z$ space, meaning individual dimensions of $w$ tend to control separate visual attributes.

AdaIN: At each layer of the generator, the style $w$ controls the feature statistics:

$$\text{AdaIN}(x_i, w) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $y_s$ and $y_b$ are learned affine transforms of $w$. Different layers control different scales: early layers control pose and shape, later layers control fine details like hair texture.
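
A minimal AdaIN module in PyTorch, assuming a 512-dimensional $w$ and a single linear layer that produces the per-channel scale and bias (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of Adaptive Instance Normalization: normalize each feature map,
    then scale and shift it with an affine transform of the style code w."""
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.to_style = nn.Linear(w_dim, 2 * num_channels)  # produces (y_s, y_b)

    def forward(self, x, w):
        # x: (batch, C, H, W), w: (batch, w_dim)
        y_s, y_b = self.to_style(w).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (x - mu) / sigma + y_b

ada = AdaIN(w_dim=512, num_channels=256)
out = ada(torch.randn(2, 256, 8, 8), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 256, 8, 8])
```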

Progressive growing: Train the generator and discriminator starting at low resolution (e.g., $4 \times 4$) and progressively add layers for higher resolutions. This stabilizes training because the networks first learn coarse structure, then refine details. StyleGAN2 later removed progressive growing in favor of skip connections and residual architectures.

Stochastic variation: Per-pixel noise is added at each layer to generate fine stochastic details (freckles, hair strands) that don’t depend on the global style.

[Figure: generator vs. discriminator loss curves during DCGAN training]

Evaluation metrics

Evaluating GANs is hard. You can’t just compute a loss on a test set. The most common metrics:

Fréchet Inception Distance (FID): Compute Inception network features for real and generated images. Model each set as a multivariate Gaussian. FID measures the distance between these Gaussians:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower FID = better. FID captures both quality (are individual images good?) and diversity (does the generator cover the full distribution?).
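Given precomputed Inception features for real and generated images (for example, arrays of shape (N, 2048)), the formula above reduces to a few lines of NumPy/SciPy. This is a sketch, not a drop-in replacement for the reference implementation:

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts caused by numerical error
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```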

Inception Score (IS): Measures two things. (1) Each generated image should be confidently classified by Inception, meaning $p(y|x)$ has low entropy. (2) The marginal distribution $p(y) = \int p(y|x) \, p_g(x) \, dx$ should have high entropy, meaning the generator produces diverse outputs.

$$\text{IS} = \exp\left(\mathbb{E}_x[D_{KL}(p(y|x) \,\|\, p(y))]\right)$$

Higher IS = better. IS has known limitations: it doesn’t compare to real data directly, and it can be gamed.
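A sketch of the score computed from a matrix of predicted class probabilities $p(y|x)$, one row per generated image:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) array of softmax outputs for generated images
    p_y = probs.mean(axis=0, keepdims=True)                               # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)  # KL(p(y|x) || p(y))
    return float(np.exp(kl.mean()))
```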

Precision and Recall: Precision measures what fraction of generated samples fall within the real data manifold (quality). Recall measures what fraction of real data is covered by generated samples (diversity). Together they give a more nuanced picture than FID alone.

GAN variants comparison

| Variant | Conditioning | Loss additions | Application | Key architecture change |
|---|---|---|---|---|
| DCGAN | None | Standard GAN | General image generation | Strided conv, BN, specific activations |
| cGAN | Class label | Class-conditional D | Controlled generation | Label embedding concat |
| Pix2Pix | Input image | L1 reconstruction | Paired image translation | U-Net generator, PatchGAN D |
| CycleGAN | Domain membership | Cycle consistency | Unpaired image translation | Two G + two D, cycle loss |
| StyleGAN | Style vector $w$ | Perceptual, style mixing | High-res face synthesis | Mapping network, AdaIN |
| ProGAN | None | Standard GAN | High-res generation | Progressive layer addition |
| BigGAN | Class label | Orthogonal regularization | Large-scale ImageNet | Class-conditional BN, large batch |

Example 1: DCGAN generator dimensions

Let’s trace the spatial dimensions through a DCGAN generator that produces $64 \times 64$ RGB images from a 100-dimensional noise vector.

Input: $z \in \mathbb{R}^{100}$

Project and reshape: A fully connected layer maps $z$ to $512 \times 4 \times 4 = 8192$ values, then reshape to a $512 \times 4 \times 4$ tensor.

Each transposed convolution with stride 2, kernel size 4, and padding 1 doubles the spatial dimensions:

| Layer | Input size | Operation | Output channels | Output size |
|---|---|---|---|---|
| Project | 100 | FC + Reshape | 512 | $4 \times 4$ |
| TransConv 1 | $512 \times 4 \times 4$ | stride 2, BN, ReLU | 256 | $8 \times 8$ |
| TransConv 2 | $256 \times 8 \times 8$ | stride 2, BN, ReLU | 128 | $16 \times 16$ |
| TransConv 3 | $128 \times 16 \times 16$ | stride 2, BN, ReLU | 64 | $32 \times 32$ |
| TransConv 4 | $64 \times 32 \times 32$ | stride 2, Tanh | 3 | $64 \times 64$ |

The output formula for transposed convolution is:

$$H_{\text{out}} = (H_{\text{in}} - 1) \times \text{stride} - 2 \times \text{padding} + \text{kernel\_size}$$

For TransConv 1: $(4 - 1) \times 2 - 2 \times 1 + 4 = 6 - 2 + 4 = 8$
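
This can be checked with a single layer in PyTorch:

```python
import torch
import torch.nn as nn

# Output size: (H_in - 1) * stride - 2 * padding + kernel_size
layer = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 512, 4, 4)
print(layer(x).shape)  # torch.Size([1, 256, 8, 8]) -> spatial size doubles from 4 to 8
```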

Total generator parameters (approximate):

  • Project: $100 \times 8192 = 819{,}200$
  • TransConv 1: $512 \times 256 \times 4 \times 4 = 2{,}097{,}152$
  • TransConv 2: $256 \times 128 \times 4 \times 4 = 524{,}288$
  • TransConv 3: $128 \times 64 \times 4 \times 4 = 131{,}072$
  • TransConv 4: $64 \times 3 \times 4 \times 4 = 3{,}072$

Notice how channels decrease while spatial dimensions increase. This is the opposite of a classification CNN, which increases channels while decreasing spatial dimensions.

Example 2: conditional GAN loss

A conditional GAN is trained on labeled data. The discriminator sees both the image and its label.

Given:

  • $D(\text{real\_image}, \text{correct\_label}) = 0.85$
  • $D(\text{fake\_image}, \text{label}) = 0.3$

For a single sample each, the conditional discriminator loss is:

$$\begin{aligned}
\mathcal{L}_D &= -[\log D(x, y) + \log(1 - D(G(z, y), y))] \\
&= -[\log(0.85) + \log(1 - 0.3)] \\
&= -[-0.1625 + \log(0.7)] \\
&= -[-0.1625 + (-0.3567)] \\
&= -(-0.5192) = 0.5192
\end{aligned}$$

The conditional generator loss (non-saturating):

$$\mathcal{L}_G = -\log D(G(z, y), y) = -\log(0.3) = 1.2040$$

The generator loss is high because $D$ gives the fake only 0.3 probability. The generator needs to produce images that $D$ rates highly when paired with the correct label.
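
These numbers are easy to verify:

```python
import math

d_real, d_fake = 0.85, 0.3
loss_D = -(math.log(d_real) + math.log(1 - d_fake))  # discriminator loss
loss_G = -math.log(d_fake)                           # non-saturating generator loss
print(round(loss_D, 4), round(loss_G, 4))            # 0.5192 1.204
```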

Now suppose $D(\text{real\_image}, \text{wrong\_label}) = 0.2$. Even though the image is real, the discriminator correctly identifies the mismatch. This is what makes conditional GANs powerful: the generator must produce outputs that are both realistic and semantically consistent with the condition.

Example 3: cycle consistency loss

CycleGAN translates between two domains without paired data. Let’s compute the cycle consistency loss.

We have a photo $x$ with 10 pixel values (simplified). Generator $G$ translates photo to painting, generator $F$ translates painting back to photo.

Forward cycle: $x \to G(x) = \hat{y} \to F(\hat{y}) = \hat{x}$

Given pixel values:

$$x = [0.50, 0.35, 0.80, 0.65, 0.20, 0.90, 0.45, 0.75, 0.30, 0.60]$$
$$\hat{x} = [0.52, 0.38, 0.77, 0.60, 0.25, 0.85, 0.40, 0.72, 0.35, 0.55]$$

Per-pixel absolute differences $|x_i - \hat{x}_i|$:

$$[0.02, 0.03, 0.03, 0.05, 0.05, 0.05, 0.05, 0.03, 0.05, 0.05]$$

The L1 distance (the sum of absolute differences):

$$\|x - \hat{x}\|_1 = 0.02 + 0.03 + 0.03 + 0.05 + 0.05 + 0.05 + 0.05 + 0.03 + 0.05 + 0.05 = 0.41$$
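
A quick check of the sum:

```python
x     = [0.50, 0.35, 0.80, 0.65, 0.20, 0.90, 0.45, 0.75, 0.30, 0.60]
x_hat = [0.52, 0.38, 0.77, 0.60, 0.25, 0.85, 0.40, 0.72, 0.35, 0.55]
print(round(sum(abs(a - b) for a, b in zip(x, x_hat)), 2))  # 0.41
```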

This is just the forward cycle. The backward cycle $y \to F(y) = \hat{x}' \to G(\hat{x}') = \hat{y}'$ adds a similar term. Suppose it also has an L1 distance of 0.41.

With $\lambda_{\text{cyc}} = 10$:

$$\mathcal{L}_{\text{cyc}} = \lambda_{\text{cyc}} \left(\|F(G(x)) - x\|_1 + \|G(F(y)) - y\|_1\right) = 10 \cdot (0.41 + 0.41) = 8.2$$

The total CycleGAN loss also includes the adversarial losses for both generators:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + 8.2$$

If $\mathcal{L}_{\text{GAN}}(G, D_Y) = 0.8$ and $\mathcal{L}_{\text{GAN}}(F, D_X) = 0.9$:

$$\mathcal{L}_{\text{total}} = 0.8 + 0.9 + 8.2 = 9.9$$

The cycle consistency term dominates. This is intentional: it strongly encourages the generators to be approximate inverses of each other.

Summary

DCGAN established the convolutional playbook for image GANs: strided convolutions, batch normalization, and careful activation choices. Conditional GANs added controllability through class labels or input images. Pix2Pix handles paired translation with L1 + adversarial loss. CycleGAN removes the need for paired data through cycle consistency. StyleGAN pushed quality to photorealistic levels with style-based generation.

Evaluation remains an open challenge. FID is the most widely used metric, but no single number captures all aspects of generation quality.

What comes next

With a solid understanding of GAN architectures and their applications, the next article on representation learning and self-supervised learning shifts focus from generation to learning useful features. You’ll see how contrastive learning, SimCLR, and masked autoencoders learn powerful representations without labels, and how these representations transfer to downstream tasks.
