Deep Belief Networks

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Restricted Boltzmann Machines.

A single RBM learns one layer of features. A Deep Belief Network (DBN) stacks multiple RBMs to learn a hierarchy: edges in layer 1, parts in layer 2, objects in layer 3. The trick is training them one layer at a time, bottom up.

From one RBM to many

A DBN with L layers has weights W_1, W_2, …, W_L. The bottom layers form a directed generative model, and the top two layers form an undirected RBM. This hybrid structure is unusual, but it falls naturally out of the training procedure.

The generative story works top-down:

  1. Sample from the top-level RBM to get activations for the top hidden layer.
  2. Pass those activations down through sigmoid belief layers to generate visible data.

The key insight: each RBM we stack improves the variational lower bound on the log-likelihood of the data. So greedy training is not just a heuristic. It has a theoretical justification.

Greedy layer-wise pretraining

[Figure: layer-wise reconstruction error after pretraining]

The training procedure is simple:

  1. Train an RBM on the raw data. This gives you W_1.
  2. Use the trained RBM to transform the data: compute P(h^{(1)} \mid v) for every training example.
  3. Train a second RBM on these hidden representations. This gives you W_2.
  4. Repeat for as many layers as you want.

Each RBM sees only the output of the RBM below it; only the first RBM ever sees the raw data.
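The four steps above can be sketched in NumPy. This is a minimal illustration, not a tuned implementation: the function names (`train_rbm`, `pretrain_dbn`), the CD-1 update, and all hyperparameters are assumptions for this sketch; the contrastive divergence details come from the RBM article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1; returns (W, hidden biases, visible biases)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    a = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        for v0 in data:
            # positive phase: up pass
            p_h0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            # negative phase: one step of Gibbs sampling (CD-1)
            p_v1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_visible) < p_v1).astype(float)
            p_h1 = sigmoid(b + v1 @ W)
            # contrastive divergence update
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b += lr * (p_h0 - p_h1)
            a += lr * (v0 - v1)
    return W, b, a

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: train an RBM, transform, repeat."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, a = train_rbm(x, n_hidden)
        layers.append((W, b, a))
        # feeding P(h|v) (rather than binary samples) to the next RBM
        # is a common simplification
        x = sigmoid(b + x @ W)
    return layers

# toy binary data: 100 examples, 8 visible units
data = (rng.random((100, 8)) < 0.5).astype(float)
dbn = pretrain_dbn(data, layer_sizes=[6, 4, 2])
print([W.shape for W, _, _ in dbn])  # [(8, 6), (6, 4), (4, 2)]
```

Each RBM in `dbn` was trained only on the activations produced by the one below it, exactly as in the diagram that follows.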

graph TD
  subgraph Step1["Step 1: Train RBM 1"]
      V1["Data v"] ---|"W₁"| H1_1["h¹"]
  end
  subgraph Step2["Step 2: Train RBM 2"]
      H1_2["h¹ activations"] ---|"W₂"| H2_2["h²"]
  end
  subgraph Step3["Step 3: Train RBM 3"]
      H2_3["h² activations"] ---|"W₃"| H3_3["h³"]
  end
  Step1 -->|"freeze W₁, compute h¹"| Step2
  Step2 -->|"freeze W₂, compute h²"| Step3
  style Step1 fill:#e6f3ff,stroke:#333,color:#000
  style Step2 fill:#d4edda,stroke:#333,color:#000
  style Step3 fill:#fff3cd,stroke:#333,color:#000

Figure 1: Greedy layer-wise pretraining. Each step trains one RBM, then freezes its weights and uses its hidden activations as input for the next RBM.

Why greedy pretraining works

Hinton et al. (2006) proved that adding each new RBM layer either improves or leaves unchanged the variational lower bound on log P(v). Intuitively, each layer learns to model the distribution of the layer below it better than a single layer could.

Before this result, deep networks (more than 2-3 layers) were considered impractical. Random initialization put the weights in a terrible region of parameter space, and gradient descent could not escape. Greedy pretraining found a much better starting point.

The DBN as a generative model

Once trained, a DBN generates data top-down:

  1. Run Gibbs sampling in the top RBM (layers L and L-1) to get a sample h^{(L-1)}.
  2. For each layer below, compute P(h^{(l-1)} \mid h^{(l)}) = \sigma(b^{(l-1)} + W_l h^{(l)}) and sample.
  3. The final sample at the visible layer is the generated data.

graph TD
  subgraph Generative["Generative (top-down)"]
      G3["h³ ~ RBM sampling"] -->|"σ(b₂ + W₃h³)"| G2["h²"]
      G2 -->|"σ(b₁ + W₂h²)"| G1["h¹"]
      G1 -->|"σ(a + W₁h¹)"| G0["v (generated)"]
  end
  subgraph Discriminative["Discriminative (bottom-up)"]
      D0["v (input)"] -->|"σ(b₁ + W₁ᵀv)"| D1["h¹"]
      D1 -->|"σ(b₂ + W₂ᵀh¹)"| D2["h²"]
      D2 --> D3["Classifier"]
  end
  style Generative fill:#e6ffe6,stroke:#333,color:#000
  style Discriminative fill:#ffe6e6,stroke:#333,color:#000

Figure 2: A DBN can be used generatively (top-down sampling) or discriminatively (bottom-up classification). The generative direction is the DBN’s natural mode. The discriminative direction reuses the learned features.

The two directions share the same weight matrices: the bottom-up (recognition) pass uses W_l^T, while the top-down (generative) pass uses W_l. This weight sharing is one of the elegant properties of the DBN.
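A quick numerical sketch of this weight sharing, using the layer-1 weights from Example 1 below (the zero visible biases are an assumption for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# layer-1 weights from Example 1: shape (visible, hidden) = (4, 3)
W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])  # hidden biases
a1 = np.zeros(4)                 # visible biases (assumed zero here)

v = np.array([1.0, 0.0, 1.0, 1.0])
h_up = sigmoid(b1 + W1.T @ v)     # recognition: bottom-up uses W^T
v_down = sigmoid(a1 + W1 @ h_up)  # generation: top-down uses the same W
print(h_up.round(3))              # [0.622 0.622 0.711]
```

One matrix, two directions: transpose it for recognition, use it as-is for generation.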

Fine-tuning with backpropagation

Pretraining gives good initial weights, but they are not optimized for any specific task. Fine-tuning adds a task-specific output layer (e.g., softmax for classification) and trains the entire network end-to-end with backpropagation.

The pretrained weights serve as initialization. Gradient descent then adjusts all weights jointly to minimize the supervised loss. Because pretraining placed the weights in a good region, fine-tuning converges faster and reaches better solutions than training from scratch.
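A minimal NumPy sketch of this fine-tuning loop, assuming weight matrices shaped (lower layer, upper layer) as in the worked examples; the stand-in "pretrained" weights, toy data, and hyperparameters are illustrative assumptions, not a recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xent(p, y):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict(layers, W_out, b_out, X):
    """Bottom-up pass through the pretrained layers plus the output unit."""
    h = X
    for W, b in layers:
        h = sigmoid(b + h @ W)
    return sigmoid(h @ W_out + b_out)

def finetune(layers, W_out, b_out, X, y, lr=0.5, epochs=200):
    """Adjust all weights jointly with batch gradient descent."""
    for _ in range(epochs):
        # forward pass, keeping every activation for backprop
        acts = [X]
        for W, b in layers:
            acts.append(sigmoid(b + acts[-1] @ W))
        y_hat = sigmoid(acts[-1] @ W_out + b_out)

        # sigmoid + cross-entropy: delta at the output is (y_hat - y)
        d = (y_hat - y) / len(X)
        d_top = np.outer(d, W_out)  # gradient reaching the top hidden layer
        W_out = W_out - lr * (acts[-1].T @ d)
        b_out = b_out - lr * d.sum()

        # propagate through the pretrained layers, updating each
        d = d_top
        for i in range(len(layers) - 1, -1, -1):
            d = d * acts[i + 1] * (1 - acts[i + 1])
            W, b = layers[i]
            layers[i] = (W - lr * (acts[i].T @ d), b - lr * d.sum(axis=0))
            d = d @ W.T  # pass gradient to the layer below
    return layers, W_out, b_out

# stand-in "pretrained" weights for a 4-3-2 stack; real use would take
# these from greedy RBM pretraining
layers = [(rng.normal(0, 0.5, (4, 3)), np.zeros(3)),
          (rng.normal(0, 0.5, (3, 2)), np.zeros(2))]
W_out, b_out = rng.normal(0, 0.5, 2), 0.0

X = (rng.random((32, 4)) < 0.5).astype(float)
y = X[:, 0]  # a toy target the network can easily fit

before = predict(layers, W_out, b_out, X)
layers, W_out, b_out = finetune(layers, W_out, b_out, X, y)
after = predict(layers, W_out, b_out, X)
print(xent(before, y), "->", xent(after, y))  # loss drops
```

The structure mirrors Example 2 below: the output delta is ŷ − y, and each layer's gradient is the incoming delta times the sigmoid derivative, pushed down through the transposed weights.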

Example 1: Forward pass through a 2-layer DBN

Layer 1 RBM (4 visible, 3 hidden):

W_1 = \begin{bmatrix} 0.3 & -0.1 & 0.5 \\ 0.2 & 0.4 & -0.3 \\ -0.1 & 0.6 & 0.2 \\ 0.4 & -0.2 & 0.1 \end{bmatrix}, \quad b_1 = [-0.1, 0.2, 0.1]

Layer 2 RBM (3 visible for this RBM, 2 hidden):

W_2 = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.7 \\ -0.4 & 0.1 \end{bmatrix}, \quad b_2 = [0.1, -0.2]

Input: v = [1, 0, 1, 1]

Layer 1: compute P(h^{(1)} \mid v).

z_1 = b_1 + W_1^T v = [-0.1, 0.2, 0.1] + W_1^T [1, 0, 1, 1]

W_1^T v = \begin{bmatrix} 0.3 + 0 + (-0.1) + 0.4 \\ -0.1 + 0 + 0.6 + (-0.2) \\ 0.5 + 0 + 0.2 + 0.1 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.3 \\ 0.8 \end{bmatrix}

z_1 = [-0.1 + 0.6,\; 0.2 + 0.3,\; 0.1 + 0.8] = [0.5,\; 0.5,\; 0.9]

h^{(1)} = \sigma(z_1) = [\sigma(0.5),\; \sigma(0.5),\; \sigma(0.9)] = [0.622,\; 0.622,\; 0.711]

Layer 2: compute P(h^{(2)} \mid h^{(1)}).

z_2 = b_2 + W_2^T h^{(1)} = [0.1, -0.2] + W_2^T [0.622, 0.622, 0.711]

W_2^T h^{(1)} = \begin{bmatrix} 0.5(0.622) + 0.2(0.622) + (-0.4)(0.711) \\ -0.3(0.622) + 0.7(0.622) + 0.1(0.711) \end{bmatrix} = \begin{bmatrix} 0.311 + 0.124 - 0.284 \\ -0.187 + 0.435 + 0.071 \end{bmatrix} = \begin{bmatrix} 0.151 \\ 0.320 \end{bmatrix}

z_2 = [0.1 + 0.151,\; -0.2 + 0.320] = [0.251,\; 0.120]

h^{(2)} = \sigma(z_2) = [\sigma(0.251),\; \sigma(0.120)] = [0.562,\; 0.530]

The DBN has compressed a 4-dimensional input into a 2-dimensional representation through two layers of learned features.
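This forward pass can be checked in a few lines of NumPy; the results match the hand-computed values up to rounding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# weights and biases from Example 1
W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])
W2 = np.array([[ 0.5, -0.3],
               [ 0.2,  0.7],
               [-0.4,  0.1]])
b2 = np.array([0.1, -0.2])
v = np.array([1.0, 0.0, 1.0, 1.0])

h1 = sigmoid(b1 + W1.T @ v)   # layer 1: P(h1 | v)
h2 = sigmoid(b2 + W2.T @ h1)  # layer 2: P(h2 | h1)
print(h1.round(3))            # [0.622 0.622 0.711]
print(h2.round(2))            # [0.56 0.53]
```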

Example 2: Fine-tuning with backpropagation

Using the network from Example 1, add a single output neuron with sigmoid activation for binary classification. Output weight W_3 = [0.6, -0.4], bias b_3 = 0.1.

Forward pass (using h^{(2)} = [0.562, 0.530] from Example 1):

z_3 = W_3^T h^{(2)} + b_3 = 0.6(0.562) + (-0.4)(0.530) + 0.1 = 0.337 - 0.212 + 0.1 = 0.225

\hat{y} = \sigma(0.225) = 0.556

Loss (target y = 1, using binary cross-entropy):

\mathcal{L} = -[y \log \hat{y} + (1-y)\log(1-\hat{y})] = -\log(0.556) = 0.587

Backward pass. The output gradient:

\delta_3 = \hat{y} - y = 0.556 - 1 = -0.444

Gradient for W_3:

\frac{\partial \mathcal{L}}{\partial W_3} = \delta_3 \cdot h^{(2)} = -0.444 \cdot [0.562, 0.530] = [-0.249, -0.235]

Gradient propagated to h^{(2)}:

\delta_{h^{(2)}} = \delta_3 \cdot W_3 \odot h^{(2)} \odot (1 - h^{(2)}) = -0.444 \cdot [0.6, -0.4] \odot [0.562 \cdot 0.438,\; 0.530 \cdot 0.470] = [-0.266, 0.178] \odot [0.246, 0.249] = [-0.065, 0.044]

This gradient continues backward through W2W_2 and W1W_1, adjusting all pretrained weights to minimize the classification loss. The pretraining gives a much better starting point than random initialization.
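These forward and backward computations can be verified in NumPy; the values match the hand calculations up to rounding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h2 = np.array([0.562, 0.530])  # hidden representation from Example 1
W3 = np.array([0.6, -0.4])
b3, y = 0.1, 1.0

# forward
y_hat = sigmoid(W3 @ h2 + b3)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# backward
delta3 = y_hat - y                      # sigmoid + cross-entropy shortcut
grad_W3 = delta3 * h2                   # dL/dW3
delta_h2 = delta3 * W3 * h2 * (1 - h2)  # gradient reaching h2

print(round(float(y_hat), 3))  # 0.556
print(grad_W3.round(3))        # [-0.249 -0.235]
```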

Example 3: Generating a sample from the DBN

To generate, we go top-down. Start by sampling from the top RBM.

Step 1: Sample from the top RBM. Suppose after Gibbs sampling in the top RBM, we get h^{(2)} = [1, 0] (binary samples).

Step 2: Generate h^{(1)} from h^{(2)}. Using W_2 and the layer 1 biases:

P(h^{(1)}_j = 1 \mid h^{(2)}) = \sigma(b_{1,j} + \sum_k W_{2,jk} \, h^{(2)}_k)

For the original biases b_1 = [-0.1, 0.2, 0.1]:

P(h^{(1)}_0 = 1) = \sigma(-0.1 + 0.5 \cdot 1 + (-0.3) \cdot 0) = \sigma(0.4) = 0.599

P(h^{(1)}_1 = 1) = \sigma(0.2 + 0.2 \cdot 1 + 0.7 \cdot 0) = \sigma(0.4) = 0.599

P(h^{(1)}_2 = 1) = \sigma(0.1 + (-0.4) \cdot 1 + 0.1 \cdot 0) = \sigma(-0.3) = 0.426

Suppose we sample h^{(1)} = [1, 1, 0].

Step 3: Generate v from h^{(1)}. Using W_1 and visible biases a = [0, 0, 0, 0] (assumed zero for simplicity):

P(v_i = 1 \mid h^{(1)}) = \sigma(a_i + \sum_j W_{1,ij} \, h^{(1)}_j)

P(v_0 = 1) = \sigma(0.3 \cdot 1 + (-0.1) \cdot 1 + 0.5 \cdot 0) = \sigma(0.2) = 0.550

P(v_1 = 1) = \sigma(0.2 \cdot 1 + 0.4 \cdot 1 + (-0.3) \cdot 0) = \sigma(0.6) = 0.646

P(v_2 = 1) = \sigma((-0.1) \cdot 1 + 0.6 \cdot 1 + 0.2 \cdot 0) = \sigma(0.5) = 0.622

P(v_3 = 1) = \sigma(0.4 \cdot 1 + (-0.2) \cdot 1 + 0.1 \cdot 0) = \sigma(0.2) = 0.550

Sample: v = [1, 1, 1, 1] (each unit drawn by comparing a uniform random number against its probability). This is one generated sample from the DBN.
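The three generation steps can be sketched in NumPy; fixing h^{(2)} = [1, 0] stands in for real Gibbs sampling in the top RBM:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # draw each binary unit by comparing a uniform draw to its probability
    return (rng.random(p.shape) < p).astype(float)

W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])
W2 = np.array([[ 0.5, -0.3],
               [ 0.2,  0.7],
               [-0.4,  0.1]])
a = np.zeros(4)  # visible biases, zero as in the example

# Step 1: pretend Gibbs sampling in the top RBM returned this state
h2 = np.array([1.0, 0.0])

# Step 2: top-down through the sigmoid belief layer (W2, untransposed)
p_h1 = sigmoid(b1 + W2 @ h2)
print(p_h1.round(3))  # [0.599 0.599 0.426]
h1 = sample(p_h1)

# Step 3: down to the visible layer
p_v = sigmoid(a + W1 @ h1)
v = sample(p_v)
```

The sampled h^{(1)} and v vary with the random seed; the probabilities match the hand-computed values.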

Why DBNs mattered

DBNs solved the deep network training problem in 2006. Before Hinton’s paper, the consensus was that deep networks could not be trained effectively. The gradients either vanished or exploded, and random initialization left the weights in a poor region.

Greedy pretraining showed that you could initialize deep networks well by treating each layer as a separate unsupervised learning problem. This was a proof of concept: deep learning works, you just need the right starting point.

Why DBNs faded

Within a few years, several developments made greedy pretraining unnecessary:

  • ReLU activations reduced vanishing gradients, making deep networks trainable from random initialization.
  • Batch normalization stabilized training by normalizing layer inputs.
  • Dropout provided effective regularization.
  • Large datasets (ImageNet) and GPUs made end-to-end supervised training practical.

By 2012, deep networks trained from scratch (AlexNet) outperformed DBN-based approaches. Greedy layer-wise pretraining was no longer needed.

DBN vs modern pretraining

The idea of pretraining did not die. It evolved. Modern approaches like BERT and SimCLR use self-supervised learning to pretrain on large unlabeled datasets, then fine-tune on smaller labeled datasets. The philosophy is the same: learn good representations from unlabeled data, then adapt them.

| Aspect | DBN Approach | BERT / SimCLR | Practical Winner |
|---|---|---|---|
| Pretraining method | Greedy layer-wise RBM | Self-supervised on full network | BERT / SimCLR |
| Data requirement | Modest (works with small datasets) | Massive (millions of examples) | Depends on data availability |
| Training objective | Maximize likelihood per layer | Masked prediction / contrastive | BERT / SimCLR |
| Fine-tuning | Backprop through all layers | Backprop through all layers | Tie |
| Scalability | Difficult beyond ~5 layers | Scales to hundreds of layers | BERT / SimCLR |
| Hardware needs | CPU-era feasible | GPU/TPU clusters required | Depends on budget |
| Theoretical grounding | Variational bound improvement | Empirical success | DBN (more theory) |

The connection is real: DBNs showed that unsupervised pretraining followed by supervised fine-tuning is a powerful recipe. Transfer learning follows the same pattern with different mechanics.

Practical considerations

How many layers? In the original experiments, 2-3 RBM layers worked best. More layers gave diminishing returns and sometimes hurt performance. Each additional RBM has to model an increasingly abstract distribution, and the greedy approximation gets less accurate.

How many hidden units per layer? The standard approach is a funnel: more units in lower layers, fewer in upper layers. For MNIST (784 input pixels), a typical DBN might use 500 units in layer 1, 500 in layer 2, and 2000 in the top RBM. The top RBM needs more capacity because it is responsible for capturing the global structure.

How long to pretrain each layer? There is no fixed rule. Monitor the reconstruction error of each RBM and stop when it plateaus. Over-training one layer can actually hurt the layers above it, because it overfits to noise patterns that the next layer then has to model.

Pretraining vs random init baseline. Always compare against a randomly initialized deep network trained with the same architecture and optimizer. In the modern era (with ReLU, batch normalization, and good optimizers), random initialization often wins. DBN pretraining is most beneficial when data is limited or the network is very deep.

The legacy of DBNs

DBNs are rarely used today, but their impact on the field is hard to overstate. They provided the first convincing evidence that deep networks could be trained at all. This triggered a wave of research into deep learning, eventually leading to the breakthroughs we see today: AlexNet in computer vision, Transformers in NLP, and diffusion models in image generation.

The key ideas from DBNs that persist:

  • Unsupervised pretraining evolved into self-supervised learning (BERT, GPT, SimCLR).
  • Layer-wise feature learning inspired work on understanding what each layer of a deep network learns.
  • Generative models as regularizers showed that learning to generate data helps learn useful representations.

If you are starting a new project today, you will almost certainly not use a DBN. But understanding how they work gives you a deeper appreciation for why modern techniques exist and what problems they were designed to solve.

What comes next

DBNs are generative models built from stacked RBMs. But there is a more principled way to build deep generative models: the Variational Autoencoder. VAEs combine encoder-decoder architectures with variational inference to learn latent representations. They replaced DBNs as the go-to latent variable generative model.
