Deep Belief Networks

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Restricted Boltzmann Machines.

A single RBM learns one layer of features. A Deep Belief Network (DBN) stacks multiple RBMs to learn a hierarchy: edges in layer 1, parts in layer 2, objects in layer 3. The trick is training them one layer at a time, bottom up.

From one RBM to many

A DBN with L layers has weights W_1, W_2, …, W_L. The bottom layers form a directed generative model, and the top two layers form an undirected RBM. This hybrid structure is unusual, but it falls naturally out of the training procedure.

The generative story works top-down:

  1. Sample from the top-level RBM to get activations for the top hidden layer.
  2. Pass those activations down through sigmoid belief layers to generate visible data.

The key insight: each RBM we stack improves the variational lower bound on the log-likelihood of the data. So greedy training is not just a heuristic. It has a theoretical justification.

Greedy layer-wise pretraining

[Figure: layer-wise reconstruction error after pretraining]

The training procedure is simple:

  1. Train an RBM on the raw data. This gives you W_1.
  2. Use the trained RBM to transform the data: compute P(h^{(1)} \mid v) for every training example.
  3. Train a second RBM on these hidden representations. This gives you W_2.
  4. Repeat for as many layers as you want.

Each RBM sees only the output of the RBM below it; only the first RBM ever sees the raw data.
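The four steps above can be sketched in NumPy. This is a minimal illustration, not a tuned implementation: the function names (`train_rbm`, `pretrain_dbn`), the CD-1 update, and all hyperparameters are assumptions for this sketch; the contrastive divergence details come from the RBM article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1; returns (W, hidden biases, visible biases)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    a = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        for v0 in data:
            # positive phase: up pass
            p_h0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            # negative phase: one step of Gibbs sampling (CD-1)
            p_v1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_visible) < p_v1).astype(float)
            p_h1 = sigmoid(b + v1 @ W)
            # contrastive divergence update
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b += lr * (p_h0 - p_h1)
            a += lr * (v0 - v1)
    return W, b, a

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: train an RBM, transform, repeat."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, a = train_rbm(x, n_hidden)
        layers.append((W, b, a))
        # feeding P(h|v) (rather than binary samples) to the next RBM
        # is a common simplification
        x = sigmoid(b + x @ W)
    return layers

# toy binary data: 100 examples, 8 visible units
data = (rng.random((100, 8)) < 0.5).astype(float)
dbn = pretrain_dbn(data, layer_sizes=[6, 4, 2])
print([W.shape for W, _, _ in dbn])  # [(8, 6), (6, 4), (4, 2)]
```

Each RBM in `dbn` was trained only on the activations produced by the one below it, exactly as in the diagram that follows.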

graph TD
  subgraph Step1["Step 1: Train RBM 1"]
      V1["Data v"] ---|"W₁"| H1_1["h¹"]
  end
  subgraph Step2["Step 2: Train RBM 2"]
      H1_2["h¹ activations"] ---|"W₂"| H2_2["h²"]
  end
  subgraph Step3["Step 3: Train RBM 3"]
      H2_3["h² activations"] ---|"W₃"| H3_3["h³"]
  end
  Step1 -->|"freeze W₁, compute h¹"| Step2
  Step2 -->|"freeze W₂, compute h²"| Step3
  style Step1 fill:#e6f3ff,stroke:#333,color:#000
  style Step2 fill:#d4edda,stroke:#333,color:#000
  style Step3 fill:#fff3cd,stroke:#333,color:#000

Figure 1: Greedy layer-wise pretraining. Each step trains one RBM, then freezes its weights and uses its hidden activations as input for the next RBM.

Why greedy pretraining works

Hinton et al. (2006) proved that adding each new RBM layer either improves or leaves unchanged the variational lower bound on log P(v). Intuitively, each layer learns to model the distribution of the layer below it better than a single layer could.

Before this result, deep networks (more than 2-3 layers) were considered impractical. Random initialization put the weights in a terrible region of parameter space, and gradient descent could not escape. Greedy pretraining found a much better starting point.

The DBN as a generative model

Once trained, a DBN generates data top-down:

  1. Run Gibbs sampling in the top RBM (layers L and L-1) to get a sample h^{(L-1)}.
  2. For each layer below, compute P(h^{(l-1)} \mid h^{(l)}) = \sigma(b^{(l-1)} + W_l h^{(l)}) and sample.
  3. The final sample at the visible layer is the generated data.

graph TD
  subgraph Generative["Generative (top-down)"]
      G3["h³ ~ RBM sampling"] -->|"σ(b₂ + W₃h³)"| G2["h²"]
      G2 -->|"σ(b₁ + W₂h²)"| G1["h¹"]
      G1 -->|"σ(a + W₁h¹)"| G0["v (generated)"]
  end
  subgraph Discriminative["Discriminative (bottom-up)"]
      D0["v (input)"] -->|"σ(b₁ + W₁ᵀv)"| D1["h¹"]
      D1 -->|"σ(b₂ + W₂ᵀh¹)"| D2["h²"]
      D2 --> D3["Classifier"]
  end
  style Generative fill:#e6ffe6,stroke:#333,color:#000
  style Discriminative fill:#ffe6e6,stroke:#333,color:#000

Figure 2: A DBN can be used generatively (top-down sampling) or discriminatively (bottom-up classification). The generative direction is the DBN’s natural mode. The discriminative direction reuses the learned features.

The two directions share the same weight matrices: the bottom-up (recognition) pass uses W_l^T, while the top-down (generative) pass uses W_l. This weight sharing is one of the elegant properties of the DBN.
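A quick numerical sketch of this weight sharing, using the layer-1 weights from Example 1 below (the zero visible biases are an assumption for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# layer-1 weights from Example 1: shape (visible, hidden) = (4, 3)
W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])  # hidden biases
a1 = np.zeros(4)                 # visible biases (assumed zero here)

v = np.array([1.0, 0.0, 1.0, 1.0])
h_up = sigmoid(b1 + W1.T @ v)     # recognition: bottom-up uses W^T
v_down = sigmoid(a1 + W1 @ h_up)  # generation: top-down uses the same W
print(h_up.round(3))              # [0.622 0.622 0.711]
```

One matrix, two directions: transpose it for recognition, use it as-is for generation.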

Fine-tuning with backpropagation

Pretraining gives good initial weights, but they are not optimized for any specific task. Fine-tuning adds a task-specific output layer (e.g., softmax for classification) and trains the entire network end-to-end with backpropagation.

The pretrained weights serve as initialization. Gradient descent then adjusts all weights jointly to minimize the supervised loss. Because pretraining placed the weights in a good region, fine-tuning converges faster and reaches better solutions than training from scratch.
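A minimal NumPy sketch of this fine-tuning loop, assuming weight matrices shaped (lower layer, upper layer) as in the worked examples; the stand-in "pretrained" weights, toy data, and hyperparameters are illustrative assumptions, not a recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xent(p, y):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict(layers, W_out, b_out, X):
    """Bottom-up pass through the pretrained layers plus the output unit."""
    h = X
    for W, b in layers:
        h = sigmoid(b + h @ W)
    return sigmoid(h @ W_out + b_out)

def finetune(layers, W_out, b_out, X, y, lr=0.5, epochs=200):
    """Adjust all weights jointly with batch gradient descent."""
    for _ in range(epochs):
        # forward pass, keeping every activation for backprop
        acts = [X]
        for W, b in layers:
            acts.append(sigmoid(b + acts[-1] @ W))
        y_hat = sigmoid(acts[-1] @ W_out + b_out)

        # sigmoid + cross-entropy: delta at the output is (y_hat - y)
        d = (y_hat - y) / len(X)
        d_top = np.outer(d, W_out)  # gradient reaching the top hidden layer
        W_out = W_out - lr * (acts[-1].T @ d)
        b_out = b_out - lr * d.sum()

        # propagate through the pretrained layers, updating each
        d = d_top
        for i in range(len(layers) - 1, -1, -1):
            d = d * acts[i + 1] * (1 - acts[i + 1])
            W, b = layers[i]
            layers[i] = (W - lr * (acts[i].T @ d), b - lr * d.sum(axis=0))
            d = d @ W.T  # pass gradient to the layer below
    return layers, W_out, b_out

# stand-in "pretrained" weights for a 4-3-2 stack; real use would take
# these from greedy RBM pretraining
layers = [(rng.normal(0, 0.5, (4, 3)), np.zeros(3)),
          (rng.normal(0, 0.5, (3, 2)), np.zeros(2))]
W_out, b_out = rng.normal(0, 0.5, 2), 0.0

X = (rng.random((32, 4)) < 0.5).astype(float)
y = X[:, 0]  # a toy target the network can easily fit

before = predict(layers, W_out, b_out, X)
layers, W_out, b_out = finetune(layers, W_out, b_out, X, y)
after = predict(layers, W_out, b_out, X)
print(xent(before, y), "->", xent(after, y))  # loss drops
```

The structure mirrors Example 2 below: the output delta is ŷ − y, and each layer's gradient is the incoming delta times the sigmoid derivative, pushed down through the transposed weights.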

Example 1: Forward pass through a 2-layer DBN

Layer 1 RBM (4 visible, 3 hidden):

W_1 = \begin{bmatrix} 0.3 & -0.1 & 0.5 \\ 0.2 & 0.4 & -0.3 \\ -0.1 & 0.6 & 0.2 \\ 0.4 & -0.2 & 0.1 \end{bmatrix}, \quad b_1 = [-0.1, 0.2, 0.1]

Layer 2 RBM (3 visible for this RBM, 2 hidden):

W_2 = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.7 \\ -0.4 & 0.1 \end{bmatrix}, \quad b_2 = [0.1, -0.2]

Input: v = [1, 0, 1, 1]

Layer 1: compute P(h^{(1)} \mid v).

z_1 = b_1 + W_1^T v = [-0.1, 0.2, 0.1] + W_1^T [1, 0, 1, 1]

W_1^T v = \begin{bmatrix} 0.3 + 0 + (-0.1) + 0.4 \\ -0.1 + 0 + 0.6 + (-0.2) \\ 0.5 + 0 + 0.2 + 0.1 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.3 \\ 0.8 \end{bmatrix}

z_1 = [-0.1 + 0.6,\; 0.2 + 0.3,\; 0.1 + 0.8] = [0.5,\; 0.5,\; 0.9]

h^{(1)} = \sigma(z_1) = [\sigma(0.5),\; \sigma(0.5),\; \sigma(0.9)] = [0.622,\; 0.622,\; 0.711]

Layer 2: compute P(h^{(2)} \mid h^{(1)}).

z_2 = b_2 + W_2^T h^{(1)} = [0.1, -0.2] + W_2^T [0.622, 0.622, 0.711]

W_2^T h^{(1)} = \begin{bmatrix} 0.5(0.622) + 0.2(0.622) + (-0.4)(0.711) \\ -0.3(0.622) + 0.7(0.622) + 0.1(0.711) \end{bmatrix} = \begin{bmatrix} 0.311 + 0.124 - 0.284 \\ -0.187 + 0.435 + 0.071 \end{bmatrix} = \begin{bmatrix} 0.151 \\ 0.320 \end{bmatrix}

z_2 = [0.1 + 0.151,\; -0.2 + 0.320] = [0.251,\; 0.120]

h^{(2)} = \sigma(z_2) = [\sigma(0.251),\; \sigma(0.120)] = [0.562,\; 0.530]

The DBN has compressed a 4-dimensional input into a 2-dimensional representation through two layers of learned features.
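This forward pass can be checked in a few lines of NumPy; the results match the hand-computed values up to rounding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# weights and biases from Example 1
W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])
W2 = np.array([[ 0.5, -0.3],
               [ 0.2,  0.7],
               [-0.4,  0.1]])
b2 = np.array([0.1, -0.2])
v = np.array([1.0, 0.0, 1.0, 1.0])

h1 = sigmoid(b1 + W1.T @ v)   # layer 1: P(h1 | v)
h2 = sigmoid(b2 + W2.T @ h1)  # layer 2: P(h2 | h1)
print(h1.round(3))            # [0.622 0.622 0.711]
print(h2.round(2))            # [0.56 0.53]
```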

Example 2: Fine-tuning with backpropagation

Using the network from Example 1, add a single output neuron with sigmoid activation for binary classification. Output weight W_3 = [0.6, -0.4], bias b_3 = 0.1.

Forward pass (using h^{(2)} = [0.562, 0.530] from Example 1):

z_3 = W_3^T h^{(2)} + b_3 = 0.6(0.562) + (-0.4)(0.530) + 0.1 = 0.337 - 0.212 + 0.1 = 0.225

\hat{y} = \sigma(0.225) = 0.556

Loss (target y = 1, using binary cross-entropy):

\mathcal{L} = -[y \log \hat{y} + (1-y)\log(1-\hat{y})] = -\log(0.556) = 0.587

Backward pass. The output gradient:

\delta_3 = \hat{y} - y = 0.556 - 1 = -0.444

Gradient for W_3:

\frac{\partial \mathcal{L}}{\partial W_3} = \delta_3 \cdot h^{(2)} = -0.444 \cdot [0.562, 0.530] = [-0.249, -0.235]

Gradient propagated to h^{(2)}:

\delta_{h^{(2)}} = \delta_3 \cdot W_3 \odot h^{(2)} \odot (1 - h^{(2)}) = -0.444 \cdot [0.6, -0.4] \odot [0.562 \cdot 0.438,\; 0.530 \cdot 0.470] = [-0.266, 0.178] \odot [0.246, 0.249] = [-0.065, 0.044]

This gradient continues backward through W2W_2 and W1W_1, adjusting all pretrained weights to minimize the classification loss. The pretraining gives a much better starting point than random initialization.
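These forward and backward computations can be verified in NumPy; the values match the hand calculations up to rounding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h2 = np.array([0.562, 0.530])  # hidden representation from Example 1
W3 = np.array([0.6, -0.4])
b3, y = 0.1, 1.0

# forward
y_hat = sigmoid(W3 @ h2 + b3)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# backward
delta3 = y_hat - y                      # sigmoid + cross-entropy shortcut
grad_W3 = delta3 * h2                   # dL/dW3
delta_h2 = delta3 * W3 * h2 * (1 - h2)  # gradient reaching h2

print(round(float(y_hat), 3))  # 0.556
print(grad_W3.round(3))        # [-0.249 -0.235]
```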

Example 3: Generating a sample from the DBN

To generate, we go top-down. Start by sampling from the top RBM.

Step 1: Sample from the top RBM. Suppose after Gibbs sampling in the top RBM, we get h^{(2)} = [1, 0] (binary samples).

Step 2: Generate h^{(1)} from h^{(2)}. Using W_2 and the layer 1 biases:

P(h^{(1)}_j = 1 \mid h^{(2)}) = \sigma(b_{1,j} + \sum_k W_{2,jk} \, h^{(2)}_k)

For the original biases b_1 = [-0.1, 0.2, 0.1]:

P(h^{(1)}_0 = 1) = \sigma(-0.1 + 0.5 \cdot 1 + (-0.3) \cdot 0) = \sigma(0.4) = 0.599

P(h^{(1)}_1 = 1) = \sigma(0.2 + 0.2 \cdot 1 + 0.7 \cdot 0) = \sigma(0.4) = 0.599

P(h^{(1)}_2 = 1) = \sigma(0.1 + (-0.4) \cdot 1 + 0.1 \cdot 0) = \sigma(-0.3) = 0.426

Suppose we sample h^{(1)} = [1, 1, 0].

Step 3: Generate v from h^{(1)}. Using W_1 and visible biases a = [0, 0, 0, 0] (assumed zero for simplicity):

P(v_i = 1 \mid h^{(1)}) = \sigma(a_i + \sum_j W_{1,ij} \, h^{(1)}_j)

P(v_0 = 1) = \sigma(0.3 \cdot 1 + (-0.1) \cdot 1 + 0.5 \cdot 0) = \sigma(0.2) = 0.550

P(v_1 = 1) = \sigma(0.2 \cdot 1 + 0.4 \cdot 1 + (-0.3) \cdot 0) = \sigma(0.6) = 0.646

P(v_2 = 1) = \sigma((-0.1) \cdot 1 + 0.6 \cdot 1 + 0.2 \cdot 0) = \sigma(0.5) = 0.622

P(v_3 = 1) = \sigma(0.4 \cdot 1 + (-0.2) \cdot 1 + 0.1 \cdot 0) = \sigma(0.2) = 0.550

Sample: v = [1, 1, 1, 1] (each unit drawn by comparing a uniform random number against its probability). This is one generated sample from the DBN.
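The three generation steps can be sketched in NumPy; fixing h^{(2)} = [1, 0] stands in for real Gibbs sampling in the top RBM:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # draw each binary unit by comparing a uniform draw to its probability
    return (rng.random(p.shape) < p).astype(float)

W1 = np.array([[ 0.3, -0.1,  0.5],
               [ 0.2,  0.4, -0.3],
               [-0.1,  0.6,  0.2],
               [ 0.4, -0.2,  0.1]])
b1 = np.array([-0.1, 0.2, 0.1])
W2 = np.array([[ 0.5, -0.3],
               [ 0.2,  0.7],
               [-0.4,  0.1]])
a = np.zeros(4)  # visible biases, zero as in the example

# Step 1: pretend Gibbs sampling in the top RBM returned this state
h2 = np.array([1.0, 0.0])

# Step 2: top-down through the sigmoid belief layer (W2, untransposed)
p_h1 = sigmoid(b1 + W2 @ h2)
print(p_h1.round(3))  # [0.599 0.599 0.426]
h1 = sample(p_h1)

# Step 3: down to the visible layer
p_v = sigmoid(a + W1 @ h1)
v = sample(p_v)
```

The sampled h^{(1)} and v vary with the random seed; the probabilities match the hand-computed values.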

Why DBNs mattered

DBNs solved the deep network training problem in 2006. Before Hinton’s paper, the consensus was that deep networks could not be trained effectively. The gradients either vanished or exploded, and random initialization left the weights in a poor region.

Greedy pretraining showed that you could initialize deep networks well by treating each layer as a separate unsupervised learning problem. This was a proof of concept: deep learning works, you just need the right starting point.

Why DBNs faded

Within a few years, several developments made greedy pretraining unnecessary:

  • ReLU activations reduced vanishing gradients, making deep networks trainable from random initialization.
  • Batch normalization stabilized training by normalizing layer inputs.
  • Dropout provided effective regularization.
  • Large datasets (ImageNet) and GPUs made end-to-end supervised training practical.

By 2012, deep networks trained from scratch (AlexNet) outperformed DBN-based approaches. Greedy layer-wise pretraining was no longer needed.

DBN vs modern pretraining

The idea of pretraining did not die. It evolved. Modern approaches like BERT and SimCLR use self-supervised learning to pretrain on large unlabeled datasets, then fine-tune on smaller labeled datasets. The philosophy is the same: learn good representations from unlabeled data, then adapt them.

| Aspect | DBN Approach | BERT / SimCLR | Practical Winner |
|---|---|---|---|
| Pretraining method | Greedy layer-wise RBM | Self-supervised on full network | BERT / SimCLR |
| Data requirement | Modest (works with small datasets) | Massive (millions of examples) | Depends on data availability |
| Training objective | Maximize likelihood per layer | Masked prediction / contrastive | BERT / SimCLR |
| Fine-tuning | Backprop through all layers | Backprop through all layers | Tie |
| Scalability | Difficult beyond ~5 layers | Scales to hundreds of layers | BERT / SimCLR |
| Hardware needs | CPU-era feasible | GPU/TPU clusters required | Depends on budget |
| Theoretical grounding | Variational bound improvement | Empirical success | DBN (more theory) |

The connection is real: DBNs showed that unsupervised pretraining followed by supervised fine-tuning is a powerful recipe. Transfer learning follows the same pattern with different mechanics.

Practical considerations

How many layers? In the original experiments, 2-3 RBM layers worked best. More layers gave diminishing returns and sometimes hurt performance. Each additional RBM has to model an increasingly abstract distribution, and the greedy approximation gets less accurate.

How many hidden units per layer? The standard approach is a funnel: more units in lower layers, fewer in upper layers. For MNIST (784 input pixels), a typical DBN might use 500 units in layer 1, 500 in layer 2, and 2000 in the top RBM. The top RBM needs more capacity because it is responsible for capturing the global structure.

How long to pretrain each layer? There is no fixed rule. Monitor the reconstruction error of each RBM and stop when it plateaus. Over-training one layer can actually hurt the layers above it, because it overfits to noise patterns that the next layer then has to model.

Pretraining vs random init baseline. Always compare against a randomly initialized deep network trained with the same architecture and optimizer. In the modern era (with ReLU, batch normalization, and good optimizers), random initialization often wins. DBN pretraining is most beneficial when data is limited or the network is very deep.

The legacy of DBNs

DBNs are rarely used today, but their impact on the field is hard to overstate. They provided the first convincing evidence that deep networks could be trained at all. This triggered a wave of research into deep learning, eventually leading to the breakthroughs we see today: AlexNet in computer vision, Transformers in NLP, and diffusion models in image generation.

The key ideas from DBNs that persist:

  • Unsupervised pretraining evolved into self-supervised learning (BERT, GPT, SimCLR).
  • Layer-wise feature learning inspired work on understanding what each layer of a deep network learns.
  • Generative models as regularizers showed that learning to generate data helps learn useful representations.

If you are starting a new project today, you will almost certainly not use a DBN. But understanding how they work gives you a deeper appreciation for why modern techniques exist and what problems they were designed to solve.

What comes next

DBNs are generative models built from stacked RBMs. But there is a more principled way to build deep generative models: the Variational Autoencoder. VAEs combine encoder-decoder architectures with variational inference to learn latent representations. They replaced DBNs as the go-to latent variable generative model.
