Deep Belief Networks
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Restricted Boltzmann Machines.
A single RBM learns one layer of features. A Deep Belief Network (DBN) stacks multiple RBMs to learn a hierarchy: edges in layer 1, parts in layer 2, objects in layer 3. The trick is training them one layer at a time, bottom up.
From one RBM to many
A DBN with L hidden layers has one weight matrix per layer: W₁ connects the data to h¹, W₂ connects h¹ to h², and so on. The bottom layers form a directed generative model, and the top two layers form an undirected RBM. This hybrid structure is unusual, but it falls naturally out of the training procedure.
The generative story works top-down:
- Sample from the top-level RBM to get activations for the top hidden layer.
- Pass those activations down through sigmoid belief layers to generate visible data.
The key insight: each RBM we stack improves the variational lower bound on the log-likelihood of the data. So greedy training is not just a heuristic. It has a theoretical justification.
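Stated symbolically (in the form used by Hinton, Osindero, and Teh, 2006), this is the standard variational bound:

```latex
\log p(v) \;\ge\; \sum_{h^1} Q(h^1 \mid v)\left[\log p(h^1) + \log p(v \mid h^1)\right] + H\!\left(Q(h^1 \mid v)\right)
```

where Q(h¹|v) is the approximate posterior given by the first RBM's recognition weights and H is its entropy. Freezing W₁ fixes Q and p(v|h¹); training a higher-level model of p(h¹) then improves that term, which can only raise (or preserve) the bound.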
Greedy layer-wise pretraining
[Figure: layer-wise reconstruction error after pretraining]
The training procedure is simple:
- Train an RBM on the raw data. This gives you W₁.
- Use the trained RBM to transform the data: compute the hidden activations h¹ = σ(b₁ + W₁v) for every training example.
- Train a second RBM on these hidden representations. This gives you W₂.
- Repeat for as many layers as you want.
Each RBM only sees the output of the previous one. It never sees the raw data (except the first RBM).
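To make the recipe concrete, here is a minimal numpy sketch of greedy stacking, using a toy CD-1 RBM trainer. The function names, layer sizes, and hyperparameters are illustrative, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1; returns weights and biases."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    a = np.zeros(n_visible)  # visible biases
    b = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b)                      # positive phase
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + a)                    # one-step reconstruction
            ph1 = sigmoid(pv1 @ W + b)                     # negative phase
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            a += lr * (v0 - pv1)
            b += lr * (ph0 - ph1)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: train an RBM, transform, repeat."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        layers.append((W, a, b))
        x = sigmoid(x @ W + b)  # hidden activations feed the next RBM
    return layers

# Toy run: 4-dim binary data, stack 3 -> 2 hidden units.
data = rng.integers(0, 2, (20, 4)).astype(float)
dbn = pretrain_dbn(data, [3, 2])
print([W.shape for W, _, _ in dbn])  # [(4, 3), (3, 2)]
```

Note that each call to `train_rbm` only ever sees `x`, the frozen activations of the layer below, exactly as in Figure 1.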
graph TD
subgraph Step1["Step 1: Train RBM 1"]
V1["Data v"] ---|"W₁"| H1_1["h¹"]
end
subgraph Step2["Step 2: Train RBM 2"]
H1_2["h¹ activations"] ---|"W₂"| H2_2["h²"]
end
subgraph Step3["Step 3: Train RBM 3"]
H2_3["h² activations"] ---|"W₃"| H3_3["h³"]
end
Step1 -->|"freeze W₁, compute h¹"| Step2
Step2 -->|"freeze W₂, compute h²"| Step3
style Step1 fill:#e6f3ff,stroke:#333,color:#000
style Step2 fill:#d4edda,stroke:#333,color:#000
style Step3 fill:#fff3cd,stroke:#333,color:#000
Figure 1: Greedy layer-wise pretraining. Each step trains one RBM, then freezes its weights and uses its hidden activations as input for the next RBM.
Why greedy pretraining works
Hinton et al. (2006) proved that adding each new RBM layer either improves or leaves unchanged the variational lower bound on log p(v). Intuitively, each layer learns to model the distribution of the layer below it better than a single layer could.
Before this result, deep networks (more than 2-3 layers) were considered impractical. Random initialization put the weights in a terrible region of parameter space, and gradient descent could not escape. Greedy pretraining found a much better starting point.
The DBN as a generative model
Once trained, a DBN generates data top-down:
- Run Gibbs sampling in the top RBM (the top two hidden layers) to get a sample of the topmost layer.
- For each layer below, compute the activation probabilities σ(b + Wᵀh) from the layer above, and sample.
- The final sample at the visible layer is the generated data.
graph TD
subgraph Generative["Generative (top-down)"]
G3["h³ ~ RBM sampling"] -->|"σ(b₂ + W₃ᵀh³)"| G2["h²"]
G2 -->|"σ(b₁ + W₂ᵀh²)"| G1["h¹"]
G1 -->|"σ(a + W₁ᵀh¹)"| G0["v (generated)"]
end
subgraph Discriminative["Discriminative (bottom-up)"]
D0["v (input)"] -->|"σ(b₁ + W₁v)"| D1["h¹"]
D1 -->|"σ(b₂ + W₂h¹)"| D2["h²"]
D2 --> D3["Classifier"]
end
style Generative fill:#e6ffe6,stroke:#333,color:#000
style Discriminative fill:#ffe6e6,stroke:#333,color:#000
Figure 2: A DBN can be used generatively (top-down sampling) or discriminatively (bottom-up classification). The generative direction is the DBN’s natural mode. The discriminative direction reuses the learned features.
The generative direction uses the transpose of the weight matrices. This is one of the elegant properties of the DBN: the same weights serve both directions.
Fine-tuning with backpropagation
Pretraining gives good initial weights, but they are not optimized for any specific task. Fine-tuning adds a task-specific output layer (e.g., softmax for classification) and trains the entire network end-to-end with backpropagation.
The pretrained weights serve as initialization. Gradient descent then adjusts all weights jointly to minimize the supervised loss. Because pretraining placed the weights in a good region, fine-tuning converges faster and reaches better solutions than training from scratch.
Example 1: Forward pass through a 2-layer DBN
Layer 1 RBM (4 visible units, 3 hidden units): weights W₁ and hidden biases b₁.
Layer 2 RBM (3 visible for this RBM, 2 hidden): weights W₂ and hidden biases b₂.
Input: a 4-dimensional binary vector v.
Layer 1: compute h¹ = σ(b₁ + W₁v), a 3-dimensional vector.
Layer 2: compute h² = σ(b₂ + W₂h¹), a 2-dimensional vector.
The DBN has compressed a 4-dimensional input into a 2-dimensional representation through two layers of learned features.
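A minimal numpy sketch of this forward pass, with hypothetical weight values standing in for pretrained ones:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical pretrained weights (illustrative values, not from real training).
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.2,  0.6, -0.4],
               [-0.1,  0.4,  0.3],
               [ 0.7, -0.2,  0.1]])   # 4 visible -> 3 hidden
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.4, -0.5],
               [ 0.3,  0.2],
               [-0.6,  0.7]])         # 3 -> 2
b2 = np.array([0.0, 0.1])

x = np.array([1.0, 0.0, 1.0, 1.0])   # 4-dim binary input
h1 = sigmoid(x @ W1 + b1)            # layer-1 activations (3-dim)
h2 = sigmoid(h1 @ W2 + b2)           # layer-2 activations (2-dim)
print(h1.round(3), h2.round(3))
```

Two matrix multiplications and two sigmoids take the input from 4 dimensions down to 2.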
Example 2: Fine-tuning with backpropagation
Using the network from Example 1, add a single output neuron with sigmoid activation for binary classification, with output weights w and bias c.
Forward pass (using h² from Example 1): ŷ = σ(c + wᵀh²).
Loss (target y, using binary cross-entropy): L = −[y log ŷ + (1 − y) log(1 − ŷ)].
Backward pass. The output gradient simplifies because the sigmoid and cross-entropy derivatives cancel: δ = ŷ − y.
Gradient for w: ∂L/∂w = δ h².
Gradient propagated to h²: ∂L/∂h² = δ w.
This gradient continues backward through and , adjusting all pretrained weights to minimize the classification loss. The pretraining gives a much better starting point than random initialization.
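The same computation as a numpy sketch, with hypothetical values for h², the output weights, and the target:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical values picking up from a 2-layer forward pass (assumed, not computed).
h2 = np.array([0.6, 0.4])      # top hidden activations
w_out = np.array([0.8, -0.5])  # output weights
b_out = 0.1                    # output bias
y = 1.0                        # target label

# Forward: sigmoid output, binary cross-entropy loss.
y_hat = sigmoid(w_out @ h2 + b_out)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Backward: for sigmoid + cross-entropy the output error simplifies to (y_hat - y).
delta = y_hat - y
grad_w_out = delta * h2        # gradient for the output weights
grad_b_out = delta             # gradient for the output bias
grad_h2 = delta * w_out        # gradient flowing into h2
# grad_h2 then propagates through W2 and W1 via the sigmoid derivative
# h * (1 - h) at each layer, updating all pretrained weights.
print(round(loss, 4), grad_w_out.round(4))
```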
Example 3: Generating a sample from the DBN
To generate, we go top-down. Start by sampling from the top RBM.
Step 1: Sample from the top RBM. Run Gibbs sampling between h¹ and h² until the chain mixes, and keep the resulting binary sample h².
Step 2: Generate h¹ from h². Using the transposed weights W₂ᵀ and the layer-1 biases b₁: p(h¹ = 1 | h²) = σ(b₁ + W₂ᵀh²). Sample h¹ by comparing each probability against a uniform random number.
Step 3: Generate v from h¹. Using W₁ᵀ and visible biases a (assuming zero for simplicity): p(v = 1 | h¹) = σ(W₁ᵀh¹). Sample v the same way (each unit compared against its probability). This is one generated sample from the DBN.
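The full generation procedure as a numpy sketch. The weights here are random placeholders rather than trained parameters, so the samples are meaningless, but the mechanics match the three steps above:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights for a 4-3-2 DBN (random placeholders, not trained).
W1 = rng.normal(0, 0.5, (4, 3))  # data (4) <-> h1 (3)
W2 = rng.normal(0, 0.5, (3, 2))  # h1 (3) <-> h2 (2), the top RBM
a = np.zeros(4)                  # visible biases
b1 = np.zeros(3)                 # h1 biases
b2 = np.zeros(2)                 # h2 biases

# Step 1: Gibbs sampling in the top RBM (h1 <-> h2) to get a sample h2.
h1 = rng.integers(0, 2, 3).astype(float)
for _ in range(50):
    h2 = (rng.random(2) < sigmoid(h1 @ W2 + b2)).astype(float)
    h1 = (rng.random(3) < sigmoid(h2 @ W2.T + b1)).astype(float)

# Step 2: generate h1 from h2 through the directed connection (transposed W2).
h1 = (rng.random(3) < sigmoid(h2 @ W2.T + b1)).astype(float)

# Step 3: generate v from h1 (transposed W1, zero visible biases).
v = (rng.random(4) < sigmoid(h1 @ W1.T + a)).astype(float)
print(v)  # one generated 4-dim binary sample
```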
Why DBNs mattered
DBNs solved the deep network training problem in 2006. Before Hinton’s paper, the consensus was that deep networks could not be trained effectively. The gradients either vanished or exploded, and random initialization left the weights in a poor region.
Greedy pretraining showed that you could initialize deep networks well by treating each layer as a separate unsupervised learning problem. This was a proof of concept: deep learning works, you just need the right starting point.
Why DBNs faded
Within a few years, several developments made greedy pretraining unnecessary:
- ReLU activations reduced vanishing gradients, making deep networks trainable from random initialization.
- Batch normalization stabilized training by normalizing layer inputs.
- Dropout provided effective regularization.
- Large datasets (ImageNet) and GPUs made end-to-end supervised training practical.
By 2012, deep networks trained from scratch (AlexNet) outperformed DBN-based approaches. Greedy layer-wise pretraining was no longer needed.
DBN vs modern pretraining
The idea of pretraining did not die. It evolved. Modern approaches like BERT and SimCLR use self-supervised learning to pretrain on large unlabeled datasets, then fine-tune on smaller labeled datasets. The philosophy is the same: learn good representations from unlabeled data, then adapt them.
| Aspect | DBN Approach | BERT / SimCLR | Practical Winner |
|---|---|---|---|
| Pretraining method | Greedy layer-wise RBM | Self-supervised on full network | BERT / SimCLR |
| Data requirement | Modest (works with small datasets) | Massive (millions of examples) | Depends on data availability |
| Training objective | Maximize likelihood per layer | Masked prediction / contrastive | BERT / SimCLR |
| Fine-tuning | Backprop through all layers | Backprop through all layers | Tie |
| Scalability | Difficult beyond ~5 layers | Scales to hundreds of layers | BERT / SimCLR |
| Hardware needs | CPU-era feasible | GPU/TPU clusters required | Depends on budget |
| Theoretical grounding | Variational bound improvement | Empirical success | DBN (more theory) |
The connection is real: DBNs showed that unsupervised pretraining followed by supervised fine-tuning is a powerful recipe. Transfer learning follows the same pattern with different mechanics.
Practical considerations
How many layers? In the original experiments, 2-3 RBM layers worked best. More layers gave diminishing returns and sometimes hurt performance. Each additional RBM has to model an increasingly abstract distribution, and the greedy approximation gets less accurate.
How many hidden units per layer? The standard approach is a funnel: more units in lower layers, fewer in upper layers. For MNIST (784 input pixels), a typical DBN might use 500 units in layer 1, 500 in layer 2, and 2000 in the top RBM. The top RBM needs more capacity because it is responsible for capturing the global structure.
How long to pretrain each layer? There is no fixed rule. Monitor the reconstruction error of each RBM and stop when it plateaus. Over-training one layer can actually hurt the layers above it, because it overfits to noise patterns that the next layer then has to model.
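One way to monitor this, sketched in numpy (the function and its interface are illustrative):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(data, W, a, b):
    """Mean squared error between the data and its one-step reconstruction:
    up through the hidden layer, back down using the transposed weights."""
    h = sigmoid(data @ W + b)
    recon = sigmoid(h @ W.T + a)
    return float(np.mean((data - recon) ** 2))

# Toy check with random (untrained) weights: the error is finite here,
# and it should shrink as the RBM trains.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, (10, 4)).astype(float)
W, a, b = rng.normal(0, 0.1, (4, 3)), np.zeros(4), np.zeros(3)
print(reconstruction_error(data, W, a, b))
```

Track this value per layer across epochs and stop pretraining that layer once the curve flattens.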
Pretraining vs random init baseline. Always compare against a randomly initialized deep network trained with the same architecture and optimizer. In the modern era (with ReLU, batch normalization, and good optimizers), random initialization often wins. DBN pretraining is most beneficial when data is limited or the network is very deep.
The legacy of DBNs
DBNs are rarely used today, but their impact on the field is hard to overstate. They provided the first convincing evidence that deep networks could be trained at all. This triggered a wave of research into deep learning, eventually leading to the breakthroughs we see today: AlexNet in computer vision, Transformers in NLP, and diffusion models in image generation.
The key ideas from DBNs that persist:
- Unsupervised pretraining evolved into self-supervised learning (BERT, GPT, SimCLR).
- Layer-wise feature learning inspired work on understanding what each layer of a deep network learns.
- Generative models as regularizers showed that learning to generate data helps learn useful representations.
If you are starting a new project today, you will almost certainly not use a DBN. But understanding how they work gives you a deeper appreciation for why modern techniques exist and what problems they were designed to solve.
What comes next
DBNs are generative models built from stacked RBMs. But there is a more principled way to build deep generative models: the Variational Autoencoder. VAEs combine encoder-decoder architectures with variational inference to learn latent representations. They replaced DBNs as the go-to latent variable generative model.