DCGAN, conditional GANs, and GAN variants

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, make sure you are comfortable with:

  • GAN training and theory: the minimax objective, discriminator and generator losses, mode collapse, Wasserstein distance
  • Convolutional neural networks: convolutions, transposed convolutions, stride, padding, batch normalization

DCGAN: the convolutional recipe

Deep Convolutional GAN (DCGAN, 2015) was the first architecture that reliably generated decent images with GANs. The authors identified a set of architectural guidelines that stabilize training. These guidelines became the starting point for nearly all image GANs that followed.

The DCGAN rules:

  1. Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator)
  2. Use batch normalization in both generator and discriminator, except the generator’s output layer and the discriminator’s input layer
  3. Remove fully connected hidden layers for deeper architectures; feed the final convolutional features directly into the discriminator’s output
  4. Use ReLU activation in the generator for all layers except the output, which uses Tanh
  5. Use LeakyReLU activation in the discriminator for all layers

Why these choices? Strided convolutions let the network learn its own spatial downsampling/upsampling instead of using fixed pooling. Batch normalization stabilizes learning by reducing internal covariate shift. Tanh in the generator output bounds pixel values to $[-1, 1]$. LeakyReLU in the discriminator prevents dead neurons that would block gradient flow.

flowchart LR
  Z["z ∈ ℝ¹⁰⁰
(noise)"] --> FC["Project & Reshape
512 × 4 × 4"]
  FC --> DC1["TransConv
256 × 8 × 8
BN + ReLU"]
  DC1 --> DC2["TransConv
128 × 16 × 16
BN + ReLU"]
  DC2 --> DC3["TransConv
64 × 32 × 32
BN + ReLU"]
  DC3 --> DC4["TransConv
3 × 64 × 64
Tanh"]

  style Z fill:#9775fa,color:#fff
  style DC4 fill:#51cf66,color:#fff
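
To make the recipe concrete, here is a minimal PyTorch sketch of a generator with the layout in the diagram above. It is illustrative rather than a reference implementation: the "project and reshape" step is done with a stride-1 transposed convolution from a 1×1 input, a common substitute for the fully connected projection, and the channel widths simply mirror the diagram.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Sketch of a DCGAN-style generator: z (100-dim) -> 3 x 64 x 64 image."""
    def __init__(self, z_dim=100, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # "Project & reshape": stride-1 transposed conv from a 1x1 input -> 512 x 4 x 4
            nn.ConvTranspose2d(z_dim, base_channels * 8, kernel_size=4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(base_channels * 8),
            nn.ReLU(inplace=True),
            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4),
            nn.ReLU(inplace=True),
            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2),
            nn.ReLU(inplace=True),
            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels),
            nn.ReLU(inplace=True),
            # 64 x 32 x 32 -> 3 x 64 x 64; no BN on the output layer, Tanh bounds values to [-1, 1]
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, z_dim) -> (batch, z_dim, 1, 1) so the convolutions can act on it
        return self.net(z.view(z.size(0), -1, 1, 1))

fake = DCGANGenerator()(torch.randn(8, 100))
print(fake.shape)  # torch.Size([8, 3, 64, 64])
```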

Conditional GAN

A standard GAN has no control over what it generates. You sample noise $z$ and get a random output. A conditional GAN (cGAN) adds a conditioning variable $y$, typically a class label, to both the generator and discriminator.

The objective becomes:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y) \mid y))]$$

In practice, the label $y$ is usually embedded as a vector and concatenated with the noise input for $G$ or with feature maps for $D$. For image generation, you might one-hot encode the class and tile it spatially to match the feature map dimensions.

The key insight: the discriminator must now judge not only whether an image looks real, but whether it matches the given label. A realistic cat image paired with the label “dog” should be rejected.
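
A minimal sketch of the embed-and-concatenate scheme in PyTorch. The flat MLP generator and the layer sizes are illustrative choices for something like 28×28 grayscale images, not taken from a specific paper:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Sketch: embed the class label, concatenate it with the noise, generate a flat image."""
    def __init__(self, z_dim=100, num_classes=10, embed_dim=50, img_dim=28 * 28):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + embed_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, z, labels):
        # Conditioning: concatenate the label embedding with the noise vector
        cond = torch.cat([z, self.label_embed(labels)], dim=1)
        return self.net(cond)

G = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.tensor([0, 3, 7, 9])
print(G(z, labels).shape)  # torch.Size([4, 784])
```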

Pix2Pix: paired image translation

Pix2Pix (2016) applies conditional GANs to paired image-to-image translation. Given aligned pairs of images (e.g., satellite photos and maps, sketches and photographs), it learns to translate from one domain to another.

The loss combines an adversarial term with an L1 reconstruction term:

$$\mathcal{L}_{\text{Pix2Pix}} = \mathcal{L}_{\text{cGAN}}(G, D) + \lambda \, \mathbb{E}_{x,y}[\|y - G(x)\|_1]$$

The adversarial loss encourages sharp, realistic outputs. The L1 loss ensures the output is structurally close to the target. Without L1, the generator might produce realistic-looking images that don’t match the input. Without the adversarial loss, the output is blurry (L1 alone produces the mean of possible outputs).
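
A sketch of the generator objective under these assumptions. `G`, `D`, and the image batches `x` (input) and `y` (target) are placeholders for models and data defined elsewhere; `lam=100` follows the weighting reported in the paper:

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(G, D, x, y, lam=100.0):
    fake = G(x)
    # Conditional PatchGAN: the discriminator sees the input concatenated with the output
    pred_fake = D(torch.cat([x, fake], dim=1))
    # Non-saturating adversarial term: label every patch prediction as "real"
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    # L1 reconstruction term keeps the output structurally close to the target
    recon = F.l1_loss(fake, y)
    return adv + lam * recon
```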

Pix2Pix uses a U-Net generator with skip connections. Low-level details (edges, textures) pass directly from encoder to decoder through skip connections, while high-level structure is captured in the bottleneck. The discriminator is a PatchGAN that classifies overlapping patches of the image as real or fake, rather than the whole image at once.

CycleGAN: unpaired translation

Pix2Pix requires paired data, but paired data is often unavailable. You might have photos and Monet paintings, but no pixel-aligned pairs. CycleGAN (2017) handles unpaired image-to-image translation by introducing cycle consistency.

Two generators: $G: X \to Y$ and $F: Y \to X$. Two discriminators: $D_Y$ judges whether images look like domain $Y$, $D_X$ judges domain $X$.

The cycle consistency loss ensures that translating an image to the other domain and back recovers the original:

$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_Y}[\|G(F(y)) - y\|_1]$$

Total loss:

$$\mathcal{L} = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + \lambda \, \mathcal{L}_{\text{cyc}}(G, F)$$

Without cycle consistency, the generators can map all inputs to a single output in the target domain (a form of mode collapse). Cycle consistency forces them to preserve enough information for the reverse mapping.
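
The cycle term is short to express in code. In this sketch, `G_xy` and `G_yx` stand for the two generators and `x`, `y` for batches from the two domains; all are assumed to be defined elsewhere:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y, lam=10.0):
    x_rec = G_yx(G_xy(x))   # x -> fake painting -> reconstructed photo
    y_rec = G_xy(G_yx(y))   # y -> fake photo -> reconstructed painting
    return lam * (F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y))
```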

flowchart LR
  subgraph Forward["Forward cycle"]
      X1["x (photo)"] -->|"G"| Y1["ŷ (painting)"]
      Y1 -->|"F"| X1R["x̂ ≈ x"]
  end

  subgraph Backward["Backward cycle"]
      Y2["y (painting)"] -->|"F"| X2["x̂ (photo)"]
      X2 -->|"G"| Y2R["ŷ ≈ y"]
  end

  DY["D_Y: real painting?"]
  DX["D_X: real photo?"]

  Y1 --> DY
  X2 --> DX

  style X1 fill:#74c0fc,color:#000
  style Y1 fill:#ffa94d,color:#000
  style X1R fill:#74c0fc,color:#000
  style Y2 fill:#ffa94d,color:#000
  style X2 fill:#74c0fc,color:#000
  style Y2R fill:#ffa94d,color:#000

StyleGAN: style-based generation

StyleGAN (2018) redesigned the generator architecture for high-resolution face generation. Instead of feeding noise directly into the first layer, it uses a mapping network and Adaptive Instance Normalization (AdaIN).

Mapping network: An 8-layer MLP transforms the noise $z$ into an intermediate latent code $w$. This $w$ space is less entangled than the raw $z$ space, meaning individual dimensions of $w$ tend to control separate visual attributes.

AdaIN: At each layer of the generator, the style $w$ controls the feature statistics:

$$\text{AdaIN}(x_i, w) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $y_s$ and $y_b$ are learned affine transforms of $w$. Different layers control different scales: early layers control pose and shape, later layers control fine details like hair texture.
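
A minimal AdaIN module in PyTorch, assuming a 512-dimensional $w$ and a single linear layer that produces the per-channel scale and bias (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of Adaptive Instance Normalization: normalize each feature map,
    then scale and shift it with an affine transform of the style code w."""
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.to_style = nn.Linear(w_dim, 2 * num_channels)  # produces (y_s, y_b)

    def forward(self, x, w):
        # x: (batch, C, H, W), w: (batch, w_dim)
        y_s, y_b = self.to_style(w).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (x - mu) / sigma + y_b

ada = AdaIN(w_dim=512, num_channels=256)
out = ada(torch.randn(2, 256, 8, 8), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 256, 8, 8])
```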

Progressive growing: Train the generator and discriminator starting at low resolution (e.g., $4 \times 4$) and progressively add layers for higher resolutions. This stabilizes training because the networks first learn coarse structure, then refine details. StyleGAN2 later removed progressive growing in favor of skip connections and residual architectures.

Stochastic variation: Per-pixel noise is added at each layer to generate fine stochastic details (freckles, hair strands) that don’t depend on the global style.

[Figure: generator vs. discriminator loss curves during DCGAN training]

Evaluation metrics

Evaluating GANs is hard. You can’t just compute a loss on a test set. The most common metrics:

Fréchet Inception Distance (FID): Compute Inception network features for real and generated images. Model each set as a multivariate Gaussian. FID measures the distance between these Gaussians:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower FID = better. FID captures both quality (are individual images good?) and diversity (does the generator cover the full distribution?).
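Given precomputed Inception features for real and generated images (for example, arrays of shape (N, 2048)), the formula above reduces to a few lines of NumPy/SciPy. This is a sketch, not a drop-in replacement for the reference implementation:

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts caused by numerical error
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```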

Inception Score (IS): Measures two things. (1) Each generated image should be confidently classified by Inception, meaning $p(y|x)$ has low entropy. (2) The marginal distribution $p(y) = \int p(y|x) \, p_g(x) \, dx$ should have high entropy, meaning the generator produces diverse outputs.

$$\text{IS} = \exp\left(\mathbb{E}_x[D_{KL}(p(y|x) \,\|\, p(y))]\right)$$

Higher IS = better. IS has known limitations: it doesn’t compare to real data directly, and it can be gamed.
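A sketch of the score computed from a matrix of predicted class probabilities $p(y|x)$, one row per generated image:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) array of softmax outputs for generated images
    p_y = probs.mean(axis=0, keepdims=True)                               # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)  # KL(p(y|x) || p(y))
    return float(np.exp(kl.mean()))
```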

Precision and Recall: Precision measures what fraction of generated samples fall within the real data manifold (quality). Recall measures what fraction of real data is covered by generated samples (diversity). Together they give a more nuanced picture than FID alone.

GAN variants comparison

| Variant | Conditioning | Loss additions | Application | Key architecture change |
|---|---|---|---|---|
| DCGAN | None | Standard GAN | General image generation | Strided conv, BN, specific activations |
| cGAN | Class label | Class-conditional D | Controlled generation | Label embedding concat |
| Pix2Pix | Input image | L1 reconstruction | Paired image translation | U-Net generator, PatchGAN D |
| CycleGAN | Domain membership | Cycle consistency | Unpaired image translation | Two G + two D, cycle loss |
| StyleGAN | Style vector $w$ | Perceptual, style mixing | High-res face synthesis | Mapping network, AdaIN |
| ProGAN | None | Standard GAN | High-res generation | Progressive layer addition |
| BigGAN | Class label | Orthogonal regularization | Large-scale ImageNet | Class-conditional BN, large batch |

Example 1: DCGAN generator dimensions

Let’s trace the spatial dimensions through a DCGAN generator that produces $64 \times 64$ RGB images from a 100-dimensional noise vector.

Input: $z \in \mathbb{R}^{100}$

Project and reshape: A fully connected layer maps $z$ to $512 \times 4 \times 4 = 8192$ values, then reshape to a $512 \times 4 \times 4$ tensor.

Each transposed convolution with stride 2, kernel size 4, and padding 1 doubles the spatial dimensions:

| Layer | Input size | Operation | Output channels | Output size |
|---|---|---|---|---|
| Project | 100 | FC + Reshape | 512 | $4 \times 4$ |
| TransConv 1 | $512 \times 4 \times 4$ | stride 2, BN, ReLU | 256 | $8 \times 8$ |
| TransConv 2 | $256 \times 8 \times 8$ | stride 2, BN, ReLU | 128 | $16 \times 16$ |
| TransConv 3 | $128 \times 16 \times 16$ | stride 2, BN, ReLU | 64 | $32 \times 32$ |
| TransConv 4 | $64 \times 32 \times 32$ | stride 2, Tanh | 3 | $64 \times 64$ |

The output formula for transposed convolution is:

$$H_{\text{out}} = (H_{\text{in}} - 1) \times \text{stride} - 2 \times \text{padding} + \text{kernel\_size}$$

For TransConv 1: $(4 - 1) \times 2 - 2 \times 1 + 4 = 6 - 2 + 4 = 8$
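
This can be checked with a single layer in PyTorch:

```python
import torch
import torch.nn as nn

# Output size: (H_in - 1) * stride - 2 * padding + kernel_size
layer = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 512, 4, 4)
print(layer(x).shape)  # torch.Size([1, 256, 8, 8]) -> spatial size doubles from 4 to 8
```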

Total generator parameters (approximate):

  • Project: $100 \times 8192 = 819{,}200$
  • TransConv 1: $512 \times 256 \times 4 \times 4 = 2{,}097{,}152$
  • TransConv 2: $256 \times 128 \times 4 \times 4 = 524{,}288$
  • TransConv 3: $128 \times 64 \times 4 \times 4 = 131{,}072$
  • TransConv 4: $64 \times 3 \times 4 \times 4 = 3{,}072$

Notice how channels decrease while spatial dimensions increase. This is the opposite of a classification CNN, which increases channels while decreasing spatial dimensions.

Example 2: conditional GAN loss

A conditional GAN is trained on labeled data. The discriminator sees both the image and its label.

Given:

  • $D(\text{real\_image}, \text{correct\_label}) = 0.85$
  • $D(\text{fake\_image}, \text{label}) = 0.3$

For a single sample each, the conditional discriminator loss is:

$$\begin{aligned}
\mathcal{L}_D &= -[\log D(x, y) + \log(1 - D(G(z, y), y))] \\
&= -[\log(0.85) + \log(1 - 0.3)] \\
&= -[-0.1625 + \log(0.7)] \\
&= -[-0.1625 + (-0.3567)] \\
&= -(-0.5192) = 0.5192
\end{aligned}$$

The conditional generator loss (non-saturating):

$$\mathcal{L}_G = -\log D(G(z, y), y) = -\log(0.3) = 1.2040$$

The generator loss is high because $D$ gives the fake only 0.3 probability. The generator needs to produce images that $D$ rates highly when paired with the correct label.
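
These numbers are easy to verify:

```python
import math

d_real, d_fake = 0.85, 0.3
loss_D = -(math.log(d_real) + math.log(1 - d_fake))  # discriminator loss
loss_G = -math.log(d_fake)                           # non-saturating generator loss
print(round(loss_D, 4), round(loss_G, 4))            # 0.5192 1.204
```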

Now suppose $D(\text{real\_image}, \text{wrong\_label}) = 0.2$. Even though the image is real, the discriminator correctly identifies the mismatch. This is what makes conditional GANs powerful: the generator must produce outputs that are both realistic and semantically consistent with the condition.

Example 3: cycle consistency loss

CycleGAN translates between two domains without paired data. Let’s compute the cycle consistency loss.

We have a photo $x$ with 10 pixel values (simplified). Generator $G$ translates photo to painting, generator $F$ translates painting back to photo.

Forward cycle: $x \to G(x) = \hat{y} \to F(\hat{y}) = \hat{x}$

Given pixel values:

$$x = [0.50, 0.35, 0.80, 0.65, 0.20, 0.90, 0.45, 0.75, 0.30, 0.60]$$
$$\hat{x} = [0.52, 0.38, 0.77, 0.60, 0.25, 0.85, 0.40, 0.72, 0.35, 0.55]$$

Per-pixel absolute differences $|x_i - \hat{x}_i|$:

$$[0.02, 0.03, 0.03, 0.05, 0.05, 0.05, 0.05, 0.03, 0.05, 0.05]$$

The L1 distance (the sum of absolute differences):

$$\|x - \hat{x}\|_1 = 0.02 + 0.03 + 0.03 + 0.05 + 0.05 + 0.05 + 0.05 + 0.03 + 0.05 + 0.05 = 0.41$$
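
A quick check of the sum:

```python
x     = [0.50, 0.35, 0.80, 0.65, 0.20, 0.90, 0.45, 0.75, 0.30, 0.60]
x_hat = [0.52, 0.38, 0.77, 0.60, 0.25, 0.85, 0.40, 0.72, 0.35, 0.55]
print(round(sum(abs(a - b) for a, b in zip(x, x_hat)), 2))  # 0.41
```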

This is just the forward cycle. The backward cycle $y \to F(y) = \hat{x}' \to G(\hat{x}') = \hat{y}'$ adds a similar term. Suppose it also has an L1 distance of 0.41.

With $\lambda_{\text{cyc}} = 10$:

$$\mathcal{L}_{\text{cyc}} = \lambda_{\text{cyc}} \left(\|F(G(x)) - x\|_1 + \|G(F(y)) - y\|_1\right) = 10 \cdot (0.41 + 0.41) = 8.2$$

The total CycleGAN loss also includes the adversarial losses for both generators:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + 8.2$$

If $\mathcal{L}_{\text{GAN}}(G, D_Y) = 0.8$ and $\mathcal{L}_{\text{GAN}}(F, D_X) = 0.9$:

$$\mathcal{L}_{\text{total}} = 0.8 + 0.9 + 8.2 = 9.9$$

The cycle consistency term dominates. This is intentional: it strongly encourages the generators to be approximate inverses of each other.

Summary

DCGAN established the convolutional playbook for image GANs: strided convolutions, batch normalization, and careful activation choices. Conditional GANs added controllability through class labels or input images. Pix2Pix handles paired translation with L1 + adversarial loss. CycleGAN removes the need for paired data through cycle consistency. StyleGAN pushed quality to photorealistic levels with style-based generation.

Evaluation remains an open challenge. FID is the most widely used metric, but no single number captures all aspects of generation quality.

What comes next

With a solid understanding of GAN architectures and their applications, the next article on representation learning and self-supervised learning shifts focus from generation to learning useful features. You’ll see how contrastive learning, SimCLR, and masked autoencoders learn powerful representations without labels, and how these representations transfer to downstream tasks.
