
Restricted Boltzmann Machines

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Probability fundamentals and Random variables and distributions.

What is an RBM, intuitively?

An RBM learns patterns by adjusting an energy function. Low energy means the model likes that pattern. High energy means the model considers it unlikely. Training pushes observed data toward low energy and everything else toward high energy.

Think of a preference system. You have visible features (things you observe) and hidden features (explanations you infer). Each combination gets an energy score. The model learns which combinations to prefer.

| Visible Pattern | Hidden Pattern | Energy | Interpretation |
| --- | --- | --- | --- |
| [1, 1, 0, 0] | [1, 0] | -2.4 | Strongly preferred |
| [1, 0, 1, 0] | [0, 1] | -1.8 | Preferred |
| [0, 1, 0, 1] | [1, 1] | 0.3 | Slightly disfavored |
| [0, 0, 1, 1] | [0, 0] | 1.9 | Strongly disfavored |

Lower energy means higher probability. The model assigns the most probability mass to patterns it sees frequently during training.

RBM structure: two layers, no within-layer connections

graph LR
  V["Visible Layer
(observed data)"] <-->|"Every visible unit connects
to every hidden unit"| H["Hidden Layer
(learned features)"]

  style V fill:#fff3e6,stroke:#333,color:#000
  style H fill:#e6f3ff,stroke:#333,color:#000

No connections exist within a layer. This restriction is what makes the math tractable: given one layer, you can compute the other layer’s activations in parallel.

Now let’s formalize the energy function and derive the training algorithm.

Restricted Boltzmann Machines (RBMs) turn an energy function into a probability distribution. Low energy configurations are more probable. High energy configurations are less probable. That is the entire idea. The rest is working out how to train this efficiently.

Energy-based models

An energy-based model assigns a scalar energy $E(x)$ to every possible configuration $x$. The probability of a configuration is:

$$P(x) = \frac{e^{-E(x)}}{Z}, \quad Z = \sum_{x'} e^{-E(x')}$$

The normalizing constant $Z$ (called the partition function) sums over all possible configurations to make the probabilities add up to 1. This is borrowed from statistical physics, where the Boltzmann distribution describes the probability of a system being in a particular state.

The model learns by adjusting its parameters so that training data gets low energy (high probability) and everything else gets high energy (low probability).
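
To make the normalization concrete, here is a minimal NumPy sketch that turns a handful of energies into probabilities. The four energy values are the illustrative numbers from the preference table above, not the output of a trained model.

```python
import numpy as np

# Illustrative energies (from the preference table above), one per configuration.
energies = np.array([-2.4, -1.8, 0.3, 1.9])

unnormalized = np.exp(-energies)   # e^{-E(x)} for each configuration
Z = unnormalized.sum()             # partition function: sum over all configurations
probabilities = unnormalized / Z   # P(x) = e^{-E(x)} / Z

for E, p in zip(energies, probabilities):
    print(f"E = {E:+.1f}  ->  P = {p:.3f}")
# Lower energy gets a larger share of the probability mass; the probabilities sum to 1.
```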

RBM structure

An RBM has two layers of binary units:

  • Visible units $v \in \{0, 1\}^m$: represent the observed data (pixels, features, etc.)
  • Hidden units $h \in \{0, 1\}^n$: represent learned features

The “restricted” part is the key constraint: there are no connections within a layer. Visible units connect only to hidden units, and hidden units connect only to visible units. This bipartite structure makes inference tractable.

graph TD
  subgraph Hidden["Hidden layer h"]
      H1["h₁"]
      H2["h₂"]
      H3["h₃"]
  end
  subgraph Visible["Visible layer v"]
      V1["v₁"]
      V2["v₂"]
      V3["v₃"]
      V4["v₄"]
  end
  V1 --- H1
  V1 --- H2
  V1 --- H3
  V2 --- H1
  V2 --- H2
  V2 --- H3
  V3 --- H1
  V3 --- H2
  V3 --- H3
  V4 --- H1
  V4 --- H2
  V4 --- H3
  style Hidden fill:#e6f3ff,stroke:#333,color:#000
  style Visible fill:#fff3e6,stroke:#333,color:#000

Figure 1: RBM bipartite graph. Every visible unit connects to every hidden unit. No connections exist within a layer. This restriction is what makes RBMs tractable.

The energy function

The energy of a joint configuration $(v, h)$ is:

$$E(v, h) = -a^T v - b^T h - v^T W h$$

where:

  • $W \in \mathbb{R}^{m \times n}$ is the weight matrix connecting visible to hidden units
  • $a \in \mathbb{R}^m$ is the visible bias vector
  • $b \in \mathbb{R}^n$ is the hidden bias vector

The joint probability is:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

And the probability of a visible configuration (what we actually care about) is:

$$P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$

Example 1: Computing RBM energy

Given 2 visible and 2 hidden units:

$$v = [1, 0], \quad h = [1, 1]$$
$$W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad a = [0.1, -0.1], \quad b = [0.2, 0.3]$$

Step 1: Visible bias term.

$$a^T v = 0.1 \cdot 1 + (-0.1) \cdot 0 = 0.1$$

Step 2: Hidden bias term.

$$b^T h = 0.2 \cdot 1 + 0.3 \cdot 1 = 0.5$$

Step 3: Interaction term. We need $v^T W h$:

$$v^T W = [1, 0] \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix} = [0.5, 0.3]$$

$$(v^T W) h = [0.5, 0.3] \cdot [1, 1] = 0.5 + 0.3 = 0.8$$

Step 4: Total energy.

$$E(v, h) = -0.1 - 0.5 - 0.8 = -1.4$$

The negative energy means this configuration is relatively probable. Configurations with more negative energy get higher probability under the Boltzmann distribution.
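
As a quick sanity check, the same arithmetic takes a few lines of NumPy; the values are copied directly from Example 1.

```python
import numpy as np

# Reproduce Example 1: E(v, h) = -a^T v - b^T h - v^T W h.
v = np.array([1, 0])
h = np.array([1, 1])
W = np.array([[0.5, 0.3],
              [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])

energy = -a @ v - b @ h - v @ W @ h
print(energy)   # -1.4 (up to floating-point rounding), matching the hand calculation
```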

Conditional distributions

The bipartite structure gives us a huge advantage: given the visible units, all hidden units are conditionally independent (and vice versa). This means we can compute:

$$P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right)$$

$$P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right)$$

where $\sigma$ is the sigmoid function. Each unit’s activation depends only on the units in the other layer, not on units in its own layer. This is what the “restricted” constraint buys us: we can sample an entire layer in parallel.
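
Here is a small NumPy sketch of these two conditionals. The helper names `p_h_given_v` and `p_v_given_h` are my own, not from any particular library; the quick check at the end reuses the parameters from Example 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# W has shape (m, n): rows index visible units, columns index hidden units.
def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j, computed in one vectorized step."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) for every visible unit i, computed in one vectorized step."""
    return sigmoid(a + W @ h)

# Quick check with the Example 1 parameters (2 visible, 2 hidden units).
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(p_h_given_v(np.array([1, 0]), W, b))   # ≈ [0.668, 0.646]
print(p_v_given_h(np.array([1, 1]), W, a))   # ≈ [0.711, 0.525]
```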

Gibbs sampling

To generate samples from an RBM, we use Gibbs sampling. We alternate between sampling hidden units given visible units, and visible units given hidden units:

$$v^{(0)} \to h^{(0)} \to v^{(1)} \to h^{(1)} \to v^{(2)} \to \cdots$$

graph LR
  V0["v⁰ (data)"] -->|"P(h|v)"| H0["h⁰"]
  H0 -->|"P(v|h)"| V1["v¹"]
  V1 -->|"P(h|v)"| H1["h¹"]
  H1 -->|"P(v|h)"| V2["v² ≈ sample"]
  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V2 fill:#e6ffe6,stroke:#333,color:#000

Figure 2: Gibbs sampling alternates between sampling hidden units from visible and visible units from hidden. After enough steps, the samples approximate the model distribution.

After many iterations, the Markov chain converges to the model’s equilibrium distribution. In practice, we run a finite number of steps.

Gibbs sampling bounces between visible and hidden layers

graph TD
  V0["v: visible state"] -->|"Sample each hj
from P(hj=1|v)"| H0["h: hidden state"]
  H0 -->|"Sample each vi
from P(vi=1|h)"| V1["v: new visible state"]
  V1 -->|"Repeat"| H1["h: new hidden state"]
  H1 -->|"Keep bouncing"| V2["v: converges to
model distribution"]

  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V2 fill:#e6ffe6,stroke:#333,color:#000

Each bounce updates one layer while holding the other fixed. Because there are no within-layer connections, all units in a layer can be sampled simultaneously. After many bounces, the samples reflect the model’s learned distribution.
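
A minimal sketch of that loop in NumPy, assuming 1-D binary state vectors; the toy parameters from Example 1 stand in for a trained model’s weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v, W, a, b, steps=500):
    """Alternate h ~ P(h|v) and v ~ P(v|h); return the final visible sample."""
    for _ in range(steps):
        p_h = sigmoid(b + v @ W)                         # all hidden units in parallel
        h = (rng.random(p_h.shape) < p_h).astype(float)  # binary sample of the hidden layer
        p_v = sigmoid(a + W @ h)                         # all visible units in parallel
        v = (rng.random(p_v.shape) < p_v).astype(float)  # binary sample of the visible layer
    return v

# Toy run with the Example 1 parameters; a trained RBM would supply learned W, a, b.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(gibbs_chain(np.array([1.0, 0.0]), W, a, b))
```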

Contrastive divergence

The log-likelihood gradient for an RBM weight $W_{ij}$ is:

$$\frac{\partial \log P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$

The first term (positive phase) is easy: clamp the visible units to a training example, compute $P(h_j = 1 \mid v)$, and take the expectation. The second term (negative phase) requires sampling from the model distribution, which means running Gibbs sampling to convergence. That is extremely expensive.

Contrastive Divergence (CD-$k$) is Hinton’s practical shortcut. Instead of running Gibbs sampling to convergence, run just $k$ steps (usually $k = 1$):

  1. Start with a training example $v^{(0)}$.
  2. Sample $h^{(0)}$ from $P(h \mid v^{(0)})$.
  3. Sample $v^{(1)}$ from $P(v \mid h^{(0)})$.
  4. Compute $P(h^{(1)} \mid v^{(1)})$ (no need to sample).

The update rule becomes:

$$\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$

This is biased (we are not running the chain to convergence), but it works surprisingly well in practice.
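
Below is a compact NumPy sketch of one CD-1 update on a single training example. The function name is mine, and it follows the common convention of using hidden probabilities rather than binary samples in the outer products; treat it as an illustration of the update rule, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One CD-1 step: positive phase from the data, negative phase from the reconstruction."""
    # Positive phase: hidden probabilities given the data, then a binary sample.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(a + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)          # probabilities are enough here; no need to sample

    # ΔW = η (⟨v h⟩_data − ⟨v h⟩_recon), with analogous updates for the biases.
    W = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a = a + lr * (v0 - v1)
    b = b + lr * (ph0 - ph1)
    return W, a, b

# One toy update using the parameters from the worked example below.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
W, a, b = cd1_update(np.array([1.0, 0.0]), W, a, b)
```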

Contrastive divergence: one step of learning

graph LR
  V0["Clamp training
data v0"] -->|"P(h|v)"| H0["Compute
hidden h0"]
  H0 -->|"P(v|h)"| V1["Reconstruct
visible v1"]
  V1 -->|"P(h|v)"| H1["Compute
hidden h1"]
  V0 -.->|"Positive phase:
v0 * h0"| UPD["Update Weights"]
  V1 -.->|"Negative phase:
v1 * h1"| UPD

  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V1 fill:#e6f3ff,stroke:#333,color:#000
  style UPD fill:#51cf66,color:#fff

The positive phase captures what the data likes. The negative phase captures what the model currently likes. The weight update is the difference between them.

Example 2: CD-1 weight update

Using the same parameters as Example 1:

$$W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad a = [0.1, -0.1], \quad b = [0.2, 0.3]$$

Start with training example $v^{(0)} = [1, 0]$.

Positive phase: compute $P(h \mid v^{(0)})$.

$$P(h_0 = 1 \mid v^{(0)}) = \sigma(b_0 + v_0 W_{00} + v_1 W_{10}) = \sigma(0.2 + 1 \cdot 0.5 + 0 \cdot (-0.2)) = \sigma(0.7) = 0.668$$

$$P(h_1 = 1 \mid v^{(0)}) = \sigma(b_1 + v_0 W_{01} + v_1 W_{11}) = \sigma(0.3 + 1 \cdot 0.3 + 0 \cdot 0.4) = \sigma(0.6) = 0.646$$

Suppose the random draws happen to turn both hidden units on, so we sample $h^{(0)} = [1, 1]$.

Negative phase: compute $P(v \mid h^{(0)})$.

$$P(v_0 = 1 \mid h^{(0)}) = \sigma(a_0 + W_{00} h_0 + W_{01} h_1) = \sigma(0.1 + 0.5 \cdot 1 + 0.3 \cdot 1) = \sigma(0.9) = 0.711$$

$$P(v_1 = 1 \mid h^{(0)}) = \sigma(a_1 + W_{10} h_0 + W_{11} h_1) = \sigma(-0.1 + (-0.2) \cdot 1 + 0.4 \cdot 1) = \sigma(0.1) = 0.525$$

Suppose we sample $v^{(1)} = [1, 1]$.

Reconstruction phase: compute $P(h \mid v^{(1)})$.

$$P(h_0 = 1 \mid v^{(1)}) = \sigma(0.2 + 1 \cdot 0.5 + 1 \cdot (-0.2)) = \sigma(0.5) = 0.622$$

$$P(h_1 = 1 \mid v^{(1)}) = \sigma(0.3 + 1 \cdot 0.3 + 1 \cdot 0.4) = \sigma(1.0) = 0.731$$

Weight update for $W_{00}$:

$$\text{positive} = v_0^{(0)} \cdot P(h_0 = 1 \mid v^{(0)}) = 1 \cdot 0.668 = 0.668$$
$$\text{negative} = v_0^{(1)} \cdot P(h_0 = 1 \mid v^{(1)}) = 1 \cdot 0.622 = 0.622$$
$$\Delta W_{00} = \eta \cdot (0.668 - 0.622) = 0.046\,\eta$$

The positive statistic (0.668) says “this weight should increase because the data likes this connection.” The negative statistic (0.622) says “but the model already partially explains it.” The small positive difference (0.046) means a slight increase to $W_{00}$.

What RBMs learn

Each hidden unit becomes a feature detector. For image data, hidden units learn to detect edges, textures, and parts. For text data, they learn topic-like features. The weights $W_{ij}$ encode which visible patterns activate which hidden features.

You can visualize what a hidden unit detects by looking at its weight vector $W_{:,j}$ reshaped to the input dimensions. For MNIST digits, hidden units learn stroke detectors, loops, and line segments.
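
If you want to try this, a rough matplotlib sketch follows. The weight matrix here is a random placeholder standing in for trained weights, and the 28×28 reshape assumes MNIST-sized inputs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weights: substitute the W learned by your RBM (shape: 784 visible x 25 hidden).
W = np.random.randn(784, 25) * 0.01

fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for j, ax in enumerate(axes.ravel()):
    ax.imshow(W[:, j].reshape(28, 28), cmap="gray")   # column j = hidden unit j's filter
    ax.axis("off")
plt.tight_layout()
plt.show()
```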

Energy landscape: valleys are learned patterns

graph TD
  HIGH["High Energy Region
(unlikely patterns)"] --> BARRIER["Energy Barriers"]
  BARRIER --> LOW1["Low Energy Valley
(learned pattern A)"]
  BARRIER --> LOW2["Low Energy Valley
(learned pattern B)"]
  BARRIER --> LOW3["Low Energy Valley
(learned pattern C)"]
  LOW1 -.- DESC1["Example: digit 3"]
  LOW2 -.- DESC2["Example: digit 7"]
  LOW3 -.- DESC3["Example: digit 9"]

  style HIGH fill:#ff6b6b,color:#fff
  style LOW1 fill:#51cf66,color:#fff
  style LOW2 fill:#51cf66,color:#fff
  style LOW3 fill:#51cf66,color:#fff

Training carves valleys in the energy surface. Each valley corresponds to a frequently observed pattern. Sampling from the model means descending into these valleys. The depth of a valley reflects how common that pattern is in the training data.

The partition function problem

Example 3: Why exact computation is intractable

With 2 visible and 2 hidden units, each binary, there are $2^{2+2} = 16$ possible configurations. Using the parameters from Example 1, we can enumerate all of them:

| $v$ | $h$ | $E(v, h)$ | $e^{-E}$ |
| --- | --- | --- | --- |
| [0,0] | [0,0] | 0 | 1.000 |
| [0,0] | [0,1] | -0.3 | 1.350 |
| [0,0] | [1,0] | -0.2 | 1.221 |
| [0,0] | [1,1] | -0.5 | 1.649 |
| [1,0] | [0,0] | -0.1 | 1.105 |
| [1,0] | [0,1] | -0.7 | 2.014 |
| [1,0] | [1,0] | -0.8 | 2.226 |
| [1,0] | [1,1] | -1.4 | 4.055 |
| [0,1] | [0,0] | 0.1 | 0.905 |
| [0,1] | [0,1] | -0.6 | 1.822 |
| [0,1] | [1,0] | 0.1 | 0.905 |
| [0,1] | [1,1] | -0.6 | 1.822 |
| [1,1] | [0,0] | 0 | 1.000 |
| [1,1] | [0,1] | -1.0 | 2.718 |
| [1,1] | [1,0] | -0.5 | 1.649 |
| [1,1] | [1,1] | -1.5 | 4.482 |

The partition function is $Z = \sum_{v,h} e^{-E(v,h)} \approx 29.92$ (the sum of all values in the last column). With 16 states, this is trivial.

Now scale up. With 100 visible and 100 hidden units:

$$\text{States} = 2^{200} \approx 1.6 \times 10^{60}$$

That is more states than there are atoms in the Earth. You cannot enumerate them. You cannot compute $Z$ exactly. This is precisely why we need contrastive divergence: it avoids computing $Z$ altogether by using the difference between the positive and negative phase gradients.
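
For the tiny 2×2 model, though, brute force is easy. This sketch enumerates all 16 configurations with the Example 1 parameters and reproduces the partition function from the table above.

```python
import numpy as np
from itertools import product

# Example 1 parameters: 2 visible units, 2 hidden units.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])

def energy(v, h):
    return -a @ v - b @ h - v @ W @ h

# Sum e^{-E(v, h)} over every joint configuration of (v, h).
Z = 0.0
for v_bits in product([0, 1], repeat=2):
    for h_bits in product([0, 1], repeat=2):
        Z += np.exp(-energy(np.array(v_bits), np.array(h_bits)))
print(round(float(Z), 2))   # ≈ 29.92, matching the table above
```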

RBM hyperparameters

| Parameter | Effect | Typical Range |
| --- | --- | --- |
| Hidden units $n$ | More units = more expressive features, but slower and risk of overfitting | 50 to 500 |
| Learning rate $\eta$ | Too high causes oscillation, too low causes slow training | 0.001 to 0.1 |
| CD-$k$ steps | More steps = less biased gradient, but slower per update | 1 to 10 (usually 1) |
| Mini-batch size | Larger batches give smoother gradients | 10 to 100 |
| Weight decay | Regularization to prevent large weights | $10^{-4}$ to $10^{-2}$ |
| Momentum | Accelerates training by accumulating gradient direction | 0.5 initially, 0.9 later |

Historical importance

RBMs were crucial to the deep learning revolution. In 2006, Hinton showed that stacking RBMs and training them greedily, layer by layer, could initialize deep networks far better than random initialization. Before this, deep networks were considered too hard to train.

This pre-training approach was eventually replaced by better initialization methods (Xavier, He), activation functions (ReLU), and normalization techniques (batch normalization). But RBMs proved that deep generative models could work, and they opened the door for everything that followed.

Today, RBMs are rarely used in production systems. But the concepts they introduced (energy-based modeling, contrastive learning, and unsupervised feature extraction) remain foundational.

Practical tips for training RBMs

[Figure: reconstruction error decreasing over 50 epochs of RBM training]

Monitor reconstruction error. After each CD step, compare $v^{(0)}$ (the data) with $v^{(1)}$ (the reconstruction). The mean squared error should decrease over training. If it does not, your learning rate is probably too high or too low.

Initialize weights small. Start with weights drawn from $\mathcal{N}(0, 0.01)$. Large initial weights can cause the sigmoid activations to saturate, making learning very slow.

Use persistent CD (PCD) for better gradients. Instead of starting the Gibbs chain from the data each time, maintain a persistent chain across updates. This gives a less biased estimate of the negative phase, especially later in training when the model distribution is harder to sample.
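
A sketch of the PCD negative phase, written for a single fantasy particle to keep it short (real implementations usually keep a mini-batch of persistent chains); the function name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_negative_phase(v_persistent, W, a, b, k=1):
    """Advance the persistent chain by k Gibbs steps and return its new visible state.

    Unlike CD-k, the chain continues from its previous state instead of restarting at the data.
    """
    v = v_persistent
    for _ in range(k):
        h = (rng.random(b.shape) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)
    return v   # use (v, P(h|v)) for the negative statistics, and keep v for the next update
```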

Sparsity regularization. If you want each hidden unit to activate for only a small subset of inputs, add a penalty that encourages the mean activation of each hidden unit to be near a target value (typically 0.05 to 0.1). This produces more interpretable features.

Monitor the free energy. The free energy of the visible units is $F(v) = -\log \sum_h e^{-E(v,h)}$. For an RBM with binary hidden units, this has the closed form:

$$F(v) = -a^T v - \sum_j \log\left(1 + e^{\,b_j + W_{:,j}^T v}\right)$$

Track the difference in free energy between training data and random samples. If training data has much lower free energy, the model is learning. If the gap shrinks to zero, the model may be overfitting.
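
A small NumPy helper for this closed form, checked against the Example 1 parameters; in practice you would average it over batches of training and held-out data.

```python
import numpy as np

def free_energy(v, W, a, b):
    """F(v) = -a^T v - sum_j log(1 + exp(b_j + W[:, j]^T v)) for a binary RBM."""
    return -v @ a - np.sum(np.log1p(np.exp(b + v @ W)))

# Toy check with the Example 1 parameters.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(free_energy(np.array([1.0, 0.0]), W, a, b))   # ≈ -2.24
```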

Variants and extensions

Gaussian-Bernoulli RBM. The standard RBM uses binary visible units, but real-valued data (like image pixels in [0,1]) requires Gaussian visible units. The energy function changes to:

$$E(v, h) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - b^T h - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j$$

This is harder to train but necessary for continuous data.
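
For reference, here is a short sketch of this energy function in NumPy; the visible values and the choice of unit variances are made up for illustration.

```python
import numpy as np

def gaussian_bernoulli_energy(v, h, W, a, b, sigma):
    """Energy of a Gaussian-Bernoulli RBM: real-valued visibles v, binary hiddens h."""
    quadratic = np.sum((v - a) ** 2 / (2 * sigma ** 2))
    interaction = np.sum((v / sigma)[:, None] * W * h[None, :])   # sum_ij (v_i/sigma_i) W_ij h_j
    return quadratic - b @ h - interaction

# Illustrative call: 2 real-valued visible units, 2 binary hidden units, unit variances.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
sigma = np.ones(2)
print(gaussian_bernoulli_energy(np.array([0.8, 0.2]), np.array([1, 0]), W, a, b, sigma))
```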

Conditional RBM. Adds visible context variables that influence the hidden units but are not generated by the model. Used for temporal data: condition on previous frames to predict the next one.

Convolutional RBM. Shares weights spatially, similar to a convolutional layer. Each hidden unit is a local feature detector rather than a global one. This reduces the parameter count and builds in spatial invariance.

What comes next

RBMs are powerful on their own, but their real impact came from stacking them into Deep Belief Networks. By training one RBM on top of another, you can build deep generative models that learn hierarchical features. That is exactly what we cover next.
