DCGAN, conditional GANs, and GAN variants
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this article, make sure you are comfortable with:
- GAN training and theory: the minimax objective, discriminator and generator losses, mode collapse, Wasserstein distance
- Convolutional neural networks: convolutions, transposed convolutions, stride, padding, batch normalization
DCGAN: the convolutional recipe
Deep Convolutional GAN (DCGAN, 2015) was the first architecture that reliably generated decent images with GANs. The authors identified a set of architectural guidelines that stabilize training. These guidelines became the starting point for nearly all image GANs that followed.
The DCGAN rules:
- Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator)
- Use batch normalization in both generator and discriminator, except the generator’s output layer and the discriminator’s input layer
- Remove fully connected hidden layers in favor of deeper all-convolutional architectures (the paper notes that global average pooling in the discriminator improves stability but slows convergence)
- Use ReLU activation in the generator for all layers except the output, which uses Tanh
- Use LeakyReLU activation in the discriminator for all layers
Why these choices? Strided convolutions let the network learn its own spatial downsampling/upsampling instead of using fixed pooling. Batch normalization stabilizes learning by reducing internal covariate shift. Tanh in the generator output bounds pixel values to $[-1, 1]$, matching training images scaled to the same range. LeakyReLU in the discriminator prevents dead neurons that would block gradient flow.
flowchart LR
Z["z ∈ ℝ¹⁰⁰ (noise)"] --> FC["Project & Reshape 512 × 4 × 4"]
FC --> DC1["TransConv 256 × 8 × 8 BN + ReLU"]
DC1 --> DC2["TransConv 128 × 16 × 16 BN + ReLU"]
DC2 --> DC3["TransConv 64 × 32 × 32 BN + ReLU"]
DC3 --> DC4["TransConv 3 × 64 × 64 Tanh"]
style Z fill:#9775fa,color:#fff
style DC4 fill:#51cf66,color:#fff
Conditional GAN
A standard GAN has no control over what it generates. You sample noise and get a random output. A conditional GAN (cGAN) adds a conditioning variable $y$, typically a class label, to both the generator and discriminator.
The objective becomes:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y) \mid y)\right)\right]$$
In practice, the label is usually embedded as a vector and concatenated with the noise input for $G$ or with the feature maps for $D$. For image generation, you might one-hot encode the class and tile it spatially to match the feature map dimensions.
The key insight: the discriminator must now judge not only whether an image looks real, but whether it matches the given label. A realistic cat image paired with the label “dog” should be rejected.
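The spatial tiling trick can be sketched in a few lines of NumPy. The shapes here (10 classes, 64 feature channels, 16×16 maps) are illustrative assumptions, not values from any specific paper:

```python
import numpy as np

def tile_label(label, num_classes, height, width):
    """One-hot encode a class label and tile it spatially, producing
    num_classes constant 0/1 feature maps."""
    onehot = np.eye(num_classes)[label]  # (num_classes,)
    return np.broadcast_to(
        onehot[:, None, None], (num_classes, height, width)
    ).copy()

# Illustrative shapes: 64 feature channels over 16x16 spatial maps.
features = np.random.randn(64, 16, 16)
label_maps = tile_label(label=3, num_classes=10, height=16, width=16)

# Concatenate along the channel axis before the next conv layer.
conditioned = np.concatenate([features, label_maps], axis=0)
print(conditioned.shape)  # (74, 16, 16)
```

The discriminator then sees the label at every spatial position, so it can penalize label–image mismatches locally.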
Pix2Pix: paired image translation
Pix2Pix (2016) applies conditional GANs to paired image-to-image translation. Given aligned pairs of images (e.g., satellite photos and maps, sketches and photographs), it learns to translate from one domain to another.
The loss combines an adversarial term with an L1 reconstruction term:

$$G^* = \arg\min_G \max_D \; \mathcal{L}_{\text{cGAN}}(G, D) + \lambda \, \mathbb{E}_{x, y, z}\left[\|y - G(x, z)\|_1\right]$$

where $x$ is the input image and $y$ is the target.
The adversarial loss encourages sharp, realistic outputs. The L1 loss ensures the output is structurally close to the target. Without L1, the generator might produce realistic-looking images that don’t match the input. Without the adversarial loss, the output is blurry (L1 alone produces the mean of possible outputs).
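A minimal NumPy sketch of the generator-side objective, using the paper's weight $\lambda = 100$; the image sizes and discriminator score are toy assumptions:

```python
import numpy as np

def pix2pix_generator_loss(d_fake_prob, output, target, lam=100.0):
    """Generator-side Pix2Pix loss: non-saturating adversarial term
    plus a weighted L1 reconstruction term (the paper uses lam=100)."""
    adv = -np.log(d_fake_prob)             # push D(x, G(x)) toward 1
    l1 = np.mean(np.abs(output - target))  # structural fidelity
    return adv + lam * l1

# Toy 4x4 single-channel "images" (illustrative values).
target = np.full((4, 4), 0.5)
output = np.full((4, 4), 0.6)  # off by 0.1 everywhere
loss = pix2pix_generator_loss(0.25, output, target)
print(round(loss, 3))  # 11.386 = -log(0.25) + 100 * 0.1
```

With $\lambda = 100$ the L1 term dominates early in training, anchoring the output to the target while the adversarial term sharpens details.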
Pix2Pix uses a U-Net generator with skip connections. Low-level details (edges, textures) pass directly from encoder to decoder through skip connections, while high-level structure is captured in the bottleneck. The discriminator is a PatchGAN that classifies overlapping patches of the image as real or fake, rather than the whole image at once.
CycleGAN: unpaired translation
Pix2Pix requires paired data, but paired data is often unavailable. You might have photos and Monet paintings, but no pixel-aligned pairs. CycleGAN (2017) handles unpaired image-to-image translation by introducing cycle consistency.
Two generators: $G: X \to Y$ and $F: Y \to X$. Two discriminators: $D_Y$ judges whether images look like domain $Y$, $D_X$ judges domain $X$.
The cycle consistency loss ensures that translating an image to the other domain and back recovers the original:

$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\|G(F(y)) - y\|_1\right]$$
Total loss:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + \lambda \, \mathcal{L}_{\text{cyc}}(G, F)$$
Without cycle consistency, the generators can map all inputs to a single output in the target domain (a form of mode collapse). Cycle consistency forces them to preserve enough information for the reverse mapping.
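The effect is easy to see with toy scalar "generators" in NumPy. This is a sketch, not a real model: an invertible $G$ incurs zero cycle loss, while a lossy $G$ (which discards information) cannot be inverted:

```python
import numpy as np

def cycle_loss(x, y, G, F):
    """L1 cycle-consistency loss over both directions."""
    forward = np.mean(np.abs(F(G(x)) - x))   # x -> Y -> X
    backward = np.mean(np.abs(G(F(y)) - y))  # y -> X -> Y
    return forward + backward

x = np.linspace(0, 1, 10)  # toy "photo"
y = np.linspace(1, 0, 10)  # toy "painting"

# If F exactly inverts G, the cycle loss vanishes...
G = lambda v: 2 * v + 1
F = lambda v: (v - 1) / 2
print(cycle_loss(x, y, G, F))  # ≈ 0.0

# ...but a lossy G (rounding discards information) cannot be inverted.
G_lossy = lambda v: np.round(v, 1)
print(cycle_loss(x, y, G_lossy, F))  # > 0
```

Cycle consistency thus acts as an information-preservation constraint on both generators.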
flowchart LR
subgraph Forward["Forward cycle"]
X1["x (photo)"] -->|"G"| Y1["ŷ (painting)"]
Y1 -->|"F"| X1R["x̂ ≈ x"]
end
subgraph Backward["Backward cycle"]
Y2["y (painting)"] -->|"F"| X2["x̂ (photo)"]
X2 -->|"G"| Y2R["ŷ ≈ y"]
end
DY["D_Y: real painting?"]
DX["D_X: real photo?"]
Y1 --> DY
X2 --> DX
style X1 fill:#74c0fc,color:#000
style Y1 fill:#ffa94d,color:#000
style X1R fill:#74c0fc,color:#000
style Y2 fill:#ffa94d,color:#000
style X2 fill:#74c0fc,color:#000
style Y2R fill:#ffa94d,color:#000
StyleGAN: style-based generation
StyleGAN (2018) redesigned the generator architecture for high-resolution face generation. Instead of feeding noise directly into the first layer, it uses a mapping network and Adaptive Instance Normalization (AdaIN).
Mapping network: An 8-layer MLP transforms the noise $z \in \mathcal{Z}$ into an intermediate latent code $w \in \mathcal{W}$. This $\mathcal{W}$ space is less entangled than the raw $\mathcal{Z}$ space, meaning individual dimensions of $w$ tend to control separate visual attributes.
AdaIN: At each layer of the generator, the style controls the feature statistics:

$$\text{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where the scale $y_{s,i}$ and bias $y_{b,i}$ are learned affine transforms of $w$. Different layers control different scales: early layers control pose and shape, later layers control fine details like hair texture.
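The operation itself is small enough to implement directly. A NumPy sketch with assumed shapes (8 channels, 4×4 maps; real StyleGAN layers are much larger):

```python
import numpy as np

def adain(x, y_scale, y_bias, eps=1e-5):
    """Adaptive Instance Normalization: normalize each feature map to
    zero mean / unit variance, then apply style-derived scale and bias.
    x: (C, H, W); y_scale, y_bias: (C,), from an affine transform of w."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    normed = (x - mu) / (sigma + eps)
    return y_scale[:, None, None] * normed + y_bias[:, None, None]

np.random.seed(0)
x = np.random.randn(8, 4, 4) * 3 + 7  # arbitrary input statistics
out = adain(x, y_scale=np.full(8, 2.0), y_bias=np.full(8, 0.5))

# Each output channel now has mean ≈ 0.5 and std ≈ 2: the style,
# not the input, dictates the feature statistics.
print(out.mean(axis=(1, 2)).round(3))
```

Whatever statistics the input features had, the output statistics are set entirely by the style vector — that is the mechanism of style control.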
Progressive growing: Train the generator and discriminator starting at low resolution (e.g., $4 \times 4$) and progressively add layers for higher resolutions. This stabilizes training because the networks first learn coarse structure, then refine details. StyleGAN2 later removed progressive growing in favor of skip connections and residual architectures.
Stochastic variation: Per-pixel noise is added at each layer to generate fine stochastic details (freckles, hair strands) that don’t depend on the global style.
*Figure: generator vs. discriminator loss during DCGAN training.*
Evaluation metrics
Evaluating GANs is hard. You can’t just compute a loss on a test set. The most common metrics:
Frechet Inception Distance (FID): Compute Inception network features for real and generated images. Model each set as a multivariate Gaussian. FID measures the distance between these Gaussians:

$$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
Lower FID = better. FID captures both quality (are individual images good?) and diversity (does the generator cover the full distribution?).
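The real metric uses full covariance matrices and a matrix square root (typically computed with scipy). As a sketch, assuming diagonal covariances so the square root becomes elementwise:

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID under the simplifying assumption of diagonal covariances.
    Real FID uses full covariances of Inception features and a matrix
    square root; here the trace term reduces to an elementwise sum."""
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

# Toy 4-dimensional "feature" statistics (illustrative values).
mu_r, var_r = np.zeros(4), np.ones(4)
mu_g, var_g = np.full(4, 0.5), np.full(4, 1.44)

print(fid_diagonal(mu_r, var_r, mu_g, var_g))  # ≈ 1.16
print(fid_diagonal(mu_r, var_r, mu_r, var_r))  # identical stats -> 0.0
```

Both the mean shift and the variance mismatch contribute; matching distributions drive FID to zero.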
Inception Score (IS): Measures two things. (1) Each generated image should be confidently classified by Inception, meaning the conditional distribution $p(y \mid x)$ has low entropy. (2) The marginal distribution $p(y)$ should have high entropy, meaning the generator produces diverse outputs. Combined: $\text{IS} = \exp\big(\mathbb{E}_{x}\, D_{\text{KL}}\left(p(y \mid x) \,\|\, p(y)\right)\big)$.
Higher IS = better. IS has known limitations: it doesn’t compare to real data directly, and it can be gamed.
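The formula applies directly to a matrix of classifier outputs. A minimal sketch with toy 4-class probabilities (not real Inception features):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ) from an (N, K) matrix of
    per-image class probabilities (rows sum to 1)."""
    p_y = probs.mean(axis=0)  # marginal p(y) over generated samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse: each image confidently a different class.
diverse = np.eye(4)
# Confident but collapsed: every image is class 0 -> worst case, IS = 1.
collapsed = np.tile(np.eye(4)[0], (4, 1))

print(inception_score(diverse))    # ≈ 4.0 (upper bound = num classes)
print(inception_score(collapsed))  # ≈ 1.0
```

This also shows the gaming problem: one confident image per class already maxes out the score, regardless of image quality.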
Precision and Recall: Precision measures what fraction of generated samples fall within the real data manifold (quality). Recall measures what fraction of real data is covered by generated samples (diversity). Together they give a more nuanced picture than FID alone.
GAN variants comparison
| Variant | Conditioning | Loss additions | Application | Key architecture change |
|---|---|---|---|---|
| DCGAN | None | Standard GAN | General image generation | Strided conv, BN, specific activations |
| cGAN | Class label | Class-conditional D | Controlled generation | Label embedding concat |
| Pix2Pix | Input image | L1 reconstruction | Paired image translation | U-Net generator, PatchGAN D |
| CycleGAN | Domain membership | Cycle consistency | Unpaired image translation | Two G + two D, cycle loss |
| StyleGAN | Style vector | Style mixing regularization | High-res face synthesis | Mapping network, AdaIN |
| ProGAN | None | Standard GAN | High-res generation | Progressive layer addition |
| BigGAN | Class label | Orthogonal regularization | Large-scale ImageNet | Class-conditional BN, large batch |
Example 1: DCGAN generator dimensions
Let’s trace the spatial dimensions through a DCGAN generator that produces RGB images from a 100-dimensional noise vector.
Input: a noise vector $z \in \mathbb{R}^{100}$.
Project and reshape: A fully connected layer maps $z$ to $512 \cdot 4 \cdot 4 = 8192$ values, then reshape to a $512 \times 4 \times 4$ tensor.
Each transposed convolution with stride 2, kernel size 4, and padding 1 doubles the spatial dimensions:
| Layer | Input size | Operation | Output channels | Output size |
|---|---|---|---|---|
| Project | 100 | FC + Reshape | 512 | 4 × 4 |
| TransConv 1 | 4 × 4 | stride 2, BN, ReLU | 256 | 8 × 8 |
| TransConv 2 | 8 × 8 | stride 2, BN, ReLU | 128 | 16 × 16 |
| TransConv 3 | 16 × 16 | stride 2, BN, ReLU | 64 | 32 × 32 |
| TransConv 4 | 32 × 32 | stride 2, Tanh | 3 | 64 × 64 |
The output formula for transposed convolution is:

$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot \text{stride} - 2 \cdot \text{padding} + \text{kernel}$$

For TransConv 1: $(4 - 1) \cdot 2 - 2 \cdot 1 + 4 = 8$ ✓
Total generator parameters (approximate, weights only):
- Project: $100 \times 8192 = 819{,}200$
- TransConv 1: $512 \times 256 \times 4 \times 4 = 2{,}097{,}152$
- TransConv 2: $256 \times 128 \times 4 \times 4 = 524{,}288$
- TransConv 3: $128 \times 64 \times 4 \times 4 = 131{,}072$
- TransConv 4: $64 \times 3 \times 4 \times 4 = 3{,}072$
Altogether roughly 3.6 million parameters.
Notice how channels decrease while spatial dimensions increase. This is the opposite of a classification CNN, which increases channels while decreasing spatial dimensions.
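The dimension trace and weight counts can be verified in a few lines of plain Python:

```python
def transconv_out(size, stride=2, padding=1, kernel=4):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel

# Trace spatial sizes through the generator: 4 -> 8 -> 16 -> 32 -> 64.
sizes = [4]
for _ in range(4):
    sizes.append(transconv_out(sizes[-1]))
print(sizes)  # [4, 8, 16, 32, 64]

# Weight counts (biases ignored): in_ch * out_ch * k * k per layer.
channels = [512, 256, 128, 64, 3]
project = 100 * 512 * 4 * 4
layers = [cin * cout * 4 * 4 for cin, cout in zip(channels, channels[1:])]
print(project + sum(layers))  # 3574784 total weights
```

Swapping in other strides or kernel sizes in `transconv_out` is a quick way to sanity-check any generator design before writing the model.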
Example 2: conditional GAN loss
A conditional GAN is trained on labeled data. The discriminator sees both the image and its label.
Given: the discriminator outputs $D(G(z, y), y) = 0.3$ for a generated image paired with its conditioning label, and (say) $D(x, y) = 0.9$ for a real image with the correct label.
For a single sample each, the conditional discriminator loss is:

$$\mathcal{L}_D = -\log D(x, y) - \log\left(1 - D(G(z, y), y)\right) = -\log 0.9 - \log 0.7 \approx 0.462$$

The conditional generator loss (non-saturating):

$$\mathcal{L}_G = -\log D(G(z, y), y) = -\log 0.3 \approx 1.204$$

The generator loss is high because $D$ gives the fake only 0.3 probability. The generator needs to produce images that $D$ rates highly when paired with the correct label.
Now suppose the same real image is paired with the wrong label $y'$: a well-trained discriminator outputs a low $D(x, y')$. Even though the image is real, the discriminator correctly identifies the mismatch. This is what makes conditional GANs powerful: the generator must produce outputs that are both realistic and semantically consistent with the condition.
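These loss values are easy to verify numerically. A minimal sketch, using the fake-pair score 0.3 from the example and an illustrative 0.9 for the real pair:

```python
import math

def d_loss(d_real, d_fake):
    """Conditional discriminator loss for one real and one fake sample."""
    return -math.log(d_real) - math.log(1 - d_fake)

def g_loss(d_fake):
    """Non-saturating conditional generator loss."""
    return -math.log(d_fake)

d_real = 0.9  # assumed score for a correctly labeled real pair
d_fake = 0.3  # score for the fake, from the example above

print(round(d_loss(d_real, d_fake), 3))  # 0.462
print(round(g_loss(d_fake), 3))          # 1.204
```

As the generator improves, `d_fake` rises toward 0.5 and both losses converge toward the equilibrium values.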
Example 3: cycle consistency loss
CycleGAN translates between two domains without paired data. Let’s compute the cycle consistency loss.
We have a photo $x$ with 10 pixel values (simplified). Generator $G$ translates photo to painting, generator $F$ translates painting back to photo.
Forward cycle: $x \xrightarrow{\;G\;} \hat{y} \xrightarrow{\;F\;} \hat{x}$, where $\hat{x}$ should match $x$.
Suppose the average per-pixel absolute difference $|\hat{x}_i - x_i|$ is 0.15. Summing over the 10 pixels gives the forward-cycle L1 distance:

$$\|F(G(x)) - x\|_1 = 10 \times 0.15 = 1.5$$
This is just the forward cycle. The backward cycle adds a similar term. Suppose it also has L1 distance of 1.5.
With $\lambda = 10$ (the value used in the CycleGAN paper):

$$\lambda \, \mathcal{L}_{\text{cyc}} = 10 \times (1.5 + 1.5) = 30$$
The total CycleGAN loss also includes the adversarial losses for both generators:

$$\mathcal{L} = \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + \lambda \, \mathcal{L}_{\text{cyc}}$$

Taking illustrative adversarial losses of 1.0 each:

$$\mathcal{L} = 1.0 + 1.0 + 30 = 32$$
The cycle consistency term dominates. This is intentional: it strongly encourages the generators to be approximate inverses of each other.
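The arithmetic can be checked in a couple of lines; the cycle distances come from the worked example, while the adversarial loss values are illustrative assumptions:

```python
def cyclegan_total(adv_g, adv_f, forward_l1, backward_l1, lam=10.0):
    """Total CycleGAN objective: both adversarial terms plus the
    lambda-weighted cycle-consistency term."""
    return adv_g + adv_f + lam * (forward_l1 + backward_l1)

# Cycle distances from the example; adversarial losses illustrative.
total = cyclegan_total(adv_g=1.0, adv_f=1.0,
                       forward_l1=1.5, backward_l1=1.5)
print(total)  # 32.0
```

Scaling `lam` up or down directly trades off faithfulness to the input against freedom to restyle it.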
Summary
DCGAN established the convolutional playbook for image GANs: strided convolutions, batch normalization, and careful activation choices. Conditional GANs added controllability through class labels or input images. Pix2Pix handles paired translation with L1 + adversarial loss. CycleGAN removes the need for paired data through cycle consistency. StyleGAN pushed quality to photorealistic levels with style-based generation.
Evaluation remains an open challenge. FID is the most widely used metric, but no single number captures all aspects of generation quality.
What comes next
With a solid understanding of GAN architectures and their applications, the next article on representation learning and self-supervised learning shifts focus from generation to learning useful features. You’ll see how contrastive learning, SimCLR, and masked autoencoders learn powerful representations without labels, and how these representations transfer to downstream tasks.