
Restricted Boltzmann Machines

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Probability fundamentals and Random variables and distributions.

What is an RBM, intuitively?

An RBM learns patterns by adjusting an energy function. Low energy means the model likes that pattern. High energy means the model considers it unlikely. Training pushes observed data toward low energy and everything else toward high energy.

Think of a preference system. You have visible features (things you observe) and hidden features (explanations you infer). Each combination gets an energy score. The model learns which combinations to prefer.

| Visible Pattern | Hidden Pattern | Energy | Interpretation |
| --- | --- | --- | --- |
| [1, 1, 0, 0] | [1, 0] | -2.4 | Strongly preferred |
| [1, 0, 1, 0] | [0, 1] | -1.8 | Preferred |
| [0, 1, 0, 1] | [1, 1] | 0.3 | Slightly disfavored |
| [0, 0, 1, 1] | [0, 0] | 1.9 | Strongly disfavored |

Lower energy means higher probability. The model assigns the most probability mass to patterns it sees frequently during training.

RBM structure: two layers, no within-layer connections

graph LR
  V["Visible Layer
(observed data)"] <-->|"Every visible unit connects
to every hidden unit"| H["Hidden Layer
(learned features)"]

  style V fill:#fff3e6,stroke:#333,color:#000
  style H fill:#e6f3ff,stroke:#333,color:#000

No connections exist within a layer. This restriction is what makes the math tractable: given one layer, you can compute the other layer’s activations in parallel.

Now let’s formalize the energy function and derive the training algorithm.

Restricted Boltzmann Machines (RBMs) turn an energy function into a probability distribution. Low energy configurations are more probable. High energy configurations are less probable. That is the entire idea. The rest is working out how to train this efficiently.

Energy-based models

An energy-based model assigns a scalar energy $E(x)$ to every possible configuration $x$. The probability of a configuration is:

$$P(x) = \frac{e^{-E(x)}}{Z}, \quad Z = \sum_{x'} e^{-E(x')}$$

The normalizing constant $Z$ (called the partition function) sums over all possible configurations to make the probabilities add up to 1. This is borrowed from statistical physics, where the Boltzmann distribution describes the probability of a system being in a particular state.

The model learns by adjusting its parameters so that training data gets low energy (high probability) and everything else gets high energy (low probability).
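
To make the normalization concrete, here is a minimal NumPy sketch that turns a handful of energies into probabilities. The four energy values are the illustrative numbers from the preference table above, not the output of a trained model.

```python
import numpy as np

# Illustrative energies (from the preference table above), one per configuration.
energies = np.array([-2.4, -1.8, 0.3, 1.9])

unnormalized = np.exp(-energies)   # e^{-E(x)} for each configuration
Z = unnormalized.sum()             # partition function: sum over all configurations
probabilities = unnormalized / Z   # P(x) = e^{-E(x)} / Z

for E, p in zip(energies, probabilities):
    print(f"E = {E:+.1f}  ->  P = {p:.3f}")
# Lower energy gets a larger share of the probability mass; the probabilities sum to 1.
```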

RBM structure

An RBM has two layers of binary units:

  • Visible units $v \in \{0, 1\}^m$: represent the observed data (pixels, features, etc.)
  • Hidden units $h \in \{0, 1\}^n$: represent learned features

The “restricted” part is the key constraint: there are no connections within a layer. Visible units connect only to hidden units, and hidden units connect only to visible units. This bipartite structure makes inference tractable.

graph TD
  subgraph Hidden["Hidden layer h"]
      H1["h₁"]
      H2["h₂"]
      H3["h₃"]
  end
  subgraph Visible["Visible layer v"]
      V1["v₁"]
      V2["v₂"]
      V3["v₃"]
      V4["v₄"]
  end
  V1 --- H1
  V1 --- H2
  V1 --- H3
  V2 --- H1
  V2 --- H2
  V2 --- H3
  V3 --- H1
  V3 --- H2
  V3 --- H3
  V4 --- H1
  V4 --- H2
  V4 --- H3
  style Hidden fill:#e6f3ff,stroke:#333,color:#000
  style Visible fill:#fff3e6,stroke:#333,color:#000

Figure 1: RBM bipartite graph. Every visible unit connects to every hidden unit. No connections exist within a layer. This restriction is what makes RBMs tractable.

The energy function

The energy of a joint configuration $(v, h)$ is:

$$E(v, h) = -a^T v - b^T h - v^T W h$$

where:

  • $W \in \mathbb{R}^{m \times n}$ is the weight matrix connecting visible to hidden units
  • $a \in \mathbb{R}^m$ is the visible bias vector
  • $b \in \mathbb{R}^n$ is the hidden bias vector

The joint probability is:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

And the probability of a visible configuration (what we actually care about) is:

$$P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$

Example 1: Computing RBM energy

Given 2 visible and 2 hidden units:

$$v = [1, 0], \quad h = [1, 1]$$
$$W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad a = [0.1, -0.1], \quad b = [0.2, 0.3]$$

Step 1: Visible bias term.

$$a^T v = 0.1 \cdot 1 + (-0.1) \cdot 0 = 0.1$$

Step 2: Hidden bias term.

$$b^T h = 0.2 \cdot 1 + 0.3 \cdot 1 = 0.5$$

Step 3: Interaction term. We need $v^T W h$:

$$v^T W = [1, 0] \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix} = [0.5, 0.3]$$

$$(v^T W) h = [0.5, 0.3] \cdot [1, 1] = 0.5 + 0.3 = 0.8$$

Step 4: Total energy.

$$E(v, h) = -0.1 - 0.5 - 0.8 = -1.4$$

The negative energy means this configuration is relatively probable. Configurations with more negative energy get higher probability under the Boltzmann distribution.
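
As a quick sanity check, the same arithmetic takes a few lines of NumPy; the values are copied directly from Example 1.

```python
import numpy as np

# Reproduce Example 1: E(v, h) = -a^T v - b^T h - v^T W h.
v = np.array([1, 0])
h = np.array([1, 1])
W = np.array([[0.5, 0.3],
              [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])

energy = -a @ v - b @ h - v @ W @ h
print(energy)   # -1.4 (up to floating-point rounding), matching the hand calculation
```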

Conditional distributions

The bipartite structure gives us a huge advantage: given the visible units, all hidden units are conditionally independent (and vice versa). This means we can compute:

$$P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right)$$

$$P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right)$$

where $\sigma$ is the sigmoid function. Each unit’s activation depends only on the units in the other layer, not on units in its own layer. This is what the “restricted” constraint buys us: we can sample an entire layer in parallel.
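
Here is a small NumPy sketch of these two conditionals. The helper names `p_h_given_v` and `p_v_given_h` are my own, not from any particular library; the quick check at the end reuses the parameters from Example 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# W has shape (m, n): rows index visible units, columns index hidden units.
def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j, computed in one vectorized step."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) for every visible unit i, computed in one vectorized step."""
    return sigmoid(a + W @ h)

# Quick check with the Example 1 parameters (2 visible, 2 hidden units).
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(p_h_given_v(np.array([1, 0]), W, b))   # ≈ [0.668, 0.646]
print(p_v_given_h(np.array([1, 1]), W, a))   # ≈ [0.711, 0.525]
```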

Gibbs sampling

To generate samples from an RBM, we use Gibbs sampling. We alternate between sampling hidden units given visible units, and visible units given hidden units:

$$v^{(0)} \to h^{(0)} \to v^{(1)} \to h^{(1)} \to v^{(2)} \to \cdots$$

graph LR
  V0["v⁰ (data)"] -->|"P(h|v)"| H0["h⁰"]
  H0 -->|"P(v|h)"| V1["v¹"]
  V1 -->|"P(h|v)"| H1["h¹"]
  H1 -->|"P(v|h)"| V2["v² ≈ sample"]
  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V2 fill:#e6ffe6,stroke:#333,color:#000

Figure 2: Gibbs sampling alternates between sampling hidden units from visible and visible units from hidden. After enough steps, the samples approximate the model distribution.

After many iterations, the Markov chain converges to the model’s equilibrium distribution. In practice, we run a finite number of steps.

Gibbs sampling bounces between visible and hidden layers

graph TD
  V0["v: visible state"] -->|"Sample each hj
from P(hj=1|v)"| H0["h: hidden state"]
  H0 -->|"Sample each vi
from P(vi=1|h)"| V1["v: new visible state"]
  V1 -->|"Repeat"| H1["h: new hidden state"]
  H1 -->|"Keep bouncing"| V2["v: converges to
model distribution"]

  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V2 fill:#e6ffe6,stroke:#333,color:#000

Each bounce updates one layer while holding the other fixed. Because there are no within-layer connections, all units in a layer can be sampled simultaneously. After many bounces, the samples reflect the model’s learned distribution.
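
A minimal sketch of that loop in NumPy, assuming 1-D binary state vectors; the toy parameters from Example 1 stand in for a trained model’s weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v, W, a, b, steps=500):
    """Alternate h ~ P(h|v) and v ~ P(v|h); return the final visible sample."""
    for _ in range(steps):
        p_h = sigmoid(b + v @ W)                         # all hidden units in parallel
        h = (rng.random(p_h.shape) < p_h).astype(float)  # binary sample of the hidden layer
        p_v = sigmoid(a + W @ h)                         # all visible units in parallel
        v = (rng.random(p_v.shape) < p_v).astype(float)  # binary sample of the visible layer
    return v

# Toy run with the Example 1 parameters; a trained RBM would supply learned W, a, b.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(gibbs_chain(np.array([1.0, 0.0]), W, a, b))
```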

Contrastive divergence

The log-likelihood gradient for an RBM weight $W_{ij}$ is:

$$\frac{\partial \log P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$

The first term (positive phase) is easy: clamp the visible units to a training example, compute $P(h_j = 1 \mid v)$, and take the expectation. The second term (negative phase) requires sampling from the model distribution, which means running Gibbs sampling to convergence. That is extremely expensive.

Contrastive Divergence (CD-$k$) is Hinton’s practical shortcut. Instead of running Gibbs sampling to convergence, run just $k$ steps (usually $k = 1$):

  1. Start with a training example $v^{(0)}$.
  2. Sample $h^{(0)}$ from $P(h \mid v^{(0)})$.
  3. Sample $v^{(1)}$ from $P(v \mid h^{(0)})$.
  4. Compute $P(h^{(1)} \mid v^{(1)})$ (no need to sample).

The update rule becomes:

$$\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$

This is biased (we are not running the chain to convergence), but it works surprisingly well in practice.
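
Below is a compact NumPy sketch of one CD-1 update on a single training example. The function name is mine, and it follows the common convention of using hidden probabilities rather than binary samples in the outer products; treat it as an illustration of the update rule, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One CD-1 step: positive phase from the data, negative phase from the reconstruction."""
    # Positive phase: hidden probabilities given the data, then a binary sample.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(a + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)          # probabilities are enough here; no need to sample

    # ΔW = η (⟨v h⟩_data − ⟨v h⟩_recon), with analogous updates for the biases.
    W = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a = a + lr * (v0 - v1)
    b = b + lr * (ph0 - ph1)
    return W, a, b

# One toy update using the parameters from the worked example below.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
W, a, b = cd1_update(np.array([1.0, 0.0]), W, a, b)
```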

Contrastive divergence: one step of learning

graph LR
  V0["Clamp training
data v0"] -->|"P(h|v)"| H0["Compute
hidden h0"]
  H0 -->|"P(v|h)"| V1["Reconstruct
visible v1"]
  V1 -->|"P(h|v)"| H1["Compute
hidden h1"]
  V0 -.->|"Positive phase:
v0 * h0"| UPD["Update Weights"]
  V1 -.->|"Negative phase:
v1 * h1"| UPD

  style V0 fill:#fff3e6,stroke:#333,color:#000
  style V1 fill:#e6f3ff,stroke:#333,color:#000
  style UPD fill:#51cf66,color:#fff

The positive phase captures what the data likes. The negative phase captures what the model currently likes. The weight update is the difference between them.

Example 2: CD-1 weight update

Using the same parameters as Example 1:

$$W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad a = [0.1, -0.1], \quad b = [0.2, 0.3]$$

Start with training example $v^{(0)} = [1, 0]$.

Positive phase: compute $P(h \mid v^{(0)})$.

$$P(h_0 = 1 \mid v^{(0)}) = \sigma(b_0 + v_0 W_{00} + v_1 W_{10}) = \sigma(0.2 + 1 \cdot 0.5 + 0 \cdot (-0.2)) = \sigma(0.7) = 0.668$$

$$P(h_1 = 1 \mid v^{(0)}) = \sigma(b_1 + v_0 W_{01} + v_1 W_{11}) = \sigma(0.3 + 1 \cdot 0.3 + 0 \cdot 0.4) = \sigma(0.6) = 0.646$$

Suppose the random draws happen to turn both hidden units on, so we sample $h^{(0)} = [1, 1]$.

Negative phase: compute $P(v \mid h^{(0)})$.

$$P(v_0 = 1 \mid h^{(0)}) = \sigma(a_0 + W_{00} h_0 + W_{01} h_1) = \sigma(0.1 + 0.5 \cdot 1 + 0.3 \cdot 1) = \sigma(0.9) = 0.711$$

$$P(v_1 = 1 \mid h^{(0)}) = \sigma(a_1 + W_{10} h_0 + W_{11} h_1) = \sigma(-0.1 + (-0.2) \cdot 1 + 0.4 \cdot 1) = \sigma(0.1) = 0.525$$

Suppose we sample $v^{(1)} = [1, 1]$.

Reconstruction phase: compute $P(h \mid v^{(1)})$.

$$P(h_0 = 1 \mid v^{(1)}) = \sigma(0.2 + 1 \cdot 0.5 + 1 \cdot (-0.2)) = \sigma(0.5) = 0.622$$

$$P(h_1 = 1 \mid v^{(1)}) = \sigma(0.3 + 1 \cdot 0.3 + 1 \cdot 0.4) = \sigma(1.0) = 0.731$$

Weight update for $W_{00}$:

$$\text{positive} = v_0^{(0)} \cdot P(h_0 = 1 \mid v^{(0)}) = 1 \cdot 0.668 = 0.668$$
$$\text{negative} = v_0^{(1)} \cdot P(h_0 = 1 \mid v^{(1)}) = 1 \cdot 0.622 = 0.622$$
$$\Delta W_{00} = \eta \cdot (0.668 - 0.622) = 0.046\,\eta$$

The positive statistic (0.668) says “this weight should increase because the data likes this connection.” The negative statistic (0.622) says “but the model already partially explains it.” The small positive difference (0.046) means a slight increase to $W_{00}$.

What RBMs learn

Each hidden unit becomes a feature detector. For image data, hidden units learn to detect edges, textures, and parts. For text data, they learn topic-like features. The weights $W_{ij}$ encode which visible patterns activate which hidden features.

You can visualize what a hidden unit detects by looking at its weight vector $W_{:,j}$ reshaped to the input dimensions. For MNIST digits, hidden units learn stroke detectors, loops, and line segments.
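
If you want to try this, a rough matplotlib sketch follows. The weight matrix here is a random placeholder standing in for trained weights, and the 28×28 reshape assumes MNIST-sized inputs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weights: substitute the W learned by your RBM (shape: 784 visible x 25 hidden).
W = np.random.randn(784, 25) * 0.01

fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for j, ax in enumerate(axes.ravel()):
    ax.imshow(W[:, j].reshape(28, 28), cmap="gray")   # column j = hidden unit j's filter
    ax.axis("off")
plt.tight_layout()
plt.show()
```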

Energy landscape: valleys are learned patterns

graph TD
  HIGH["High Energy Region
(unlikely patterns)"] --> BARRIER["Energy Barriers"]
  BARRIER --> LOW1["Low Energy Valley
(learned pattern A)"]
  BARRIER --> LOW2["Low Energy Valley
(learned pattern B)"]
  BARRIER --> LOW3["Low Energy Valley
(learned pattern C)"]
  LOW1 -.- DESC1["Example: digit 3"]
  LOW2 -.- DESC2["Example: digit 7"]
  LOW3 -.- DESC3["Example: digit 9"]

  style HIGH fill:#ff6b6b,color:#fff
  style LOW1 fill:#51cf66,color:#fff
  style LOW2 fill:#51cf66,color:#fff
  style LOW3 fill:#51cf66,color:#fff

Training carves valleys in the energy surface. Each valley corresponds to a frequently observed pattern. Sampling from the model means descending into these valleys. The depth of a valley reflects how common that pattern is in the training data.

The partition function problem

Example 3: Why exact computation is intractable

With 2 visible and 2 hidden units, each binary, there are $2^{2+2} = 16$ possible configurations. Using the parameters from Example 1, we can enumerate all of them:

| $v$ | $h$ | $E(v, h)$ | $e^{-E}$ |
| --- | --- | --- | --- |
| [0,0] | [0,0] | 0 | 1.000 |
| [0,0] | [0,1] | -0.3 | 1.350 |
| [0,0] | [1,0] | -0.2 | 1.221 |
| [0,0] | [1,1] | -0.5 | 1.649 |
| [1,0] | [0,0] | -0.1 | 1.105 |
| [1,0] | [0,1] | -0.7 | 2.014 |
| [1,0] | [1,0] | -0.8 | 2.226 |
| [1,0] | [1,1] | -1.4 | 4.055 |
| [0,1] | [0,0] | 0.1 | 0.905 |
| [0,1] | [0,1] | -0.6 | 1.822 |
| [0,1] | [1,0] | 0.1 | 0.905 |
| [0,1] | [1,1] | -0.6 | 1.822 |
| [1,1] | [0,0] | 0 | 1.000 |
| [1,1] | [0,1] | -1.0 | 2.718 |
| [1,1] | [1,0] | -0.5 | 1.649 |
| [1,1] | [1,1] | -1.5 | 4.482 |

The partition function is $Z = \sum_{v,h} e^{-E(v,h)} \approx 29.92$ (the sum of all values in the last column). With 16 states, this is trivial.

Now scale up. With 100 visible and 100 hidden units:

$$\text{States} = 2^{200} \approx 1.6 \times 10^{60}$$

That is more states than there are atoms in the Earth. You cannot enumerate them. You cannot compute $Z$ exactly. This is precisely why we need contrastive divergence: it avoids computing $Z$ altogether by using the difference between the positive and negative phase gradients.
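
For the tiny 2×2 model, though, brute force is easy. This sketch enumerates all 16 configurations with the Example 1 parameters and reproduces the partition function from the table above.

```python
import numpy as np
from itertools import product

# Example 1 parameters: 2 visible units, 2 hidden units.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])

def energy(v, h):
    return -a @ v - b @ h - v @ W @ h

# Sum e^{-E(v, h)} over every joint configuration of (v, h).
Z = 0.0
for v_bits in product([0, 1], repeat=2):
    for h_bits in product([0, 1], repeat=2):
        Z += np.exp(-energy(np.array(v_bits), np.array(h_bits)))
print(round(float(Z), 2))   # ≈ 29.92, matching the table above
```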

RBM hyperparameters

| Parameter | Effect | Typical Range |
| --- | --- | --- |
| Hidden units $n$ | More units = more expressive features, but slower and risk of overfitting | 50 to 500 |
| Learning rate $\eta$ | Too high causes oscillation, too low causes slow training | 0.001 to 0.1 |
| CD-$k$ steps | More steps = less biased gradient, but slower per update | 1 to 10 (usually 1) |
| Mini-batch size | Larger batches give smoother gradients | 10 to 100 |
| Weight decay | Regularization to prevent large weights | $10^{-4}$ to $10^{-2}$ |
| Momentum | Accelerates training by accumulating gradient direction | 0.5 initially, 0.9 later |

Historical importance

RBMs were crucial to the deep learning revolution. In 2006, Hinton showed that stacking RBMs and training them greedily, layer by layer, could initialize deep networks far better than random initialization. Before this, deep networks were considered too hard to train.

This pre-training approach was eventually replaced by better initialization methods (Xavier, He), activation functions (ReLU), and normalization techniques (batch normalization). But RBMs proved that deep generative models could work, and they opened the door for everything that followed.

Today, RBMs are rarely used in production systems. But the concepts they introduced (energy-based modeling, contrastive learning, and unsupervised feature extraction) remain foundational.

Practical tips for training RBMs

[Figure: reconstruction error decreasing over 50 epochs of RBM training]

Monitor reconstruction error. After each CD step, compare $v^{(0)}$ (the data) with $v^{(1)}$ (the reconstruction). The mean squared error should decrease over training. If it does not, your learning rate is probably too high or too low.

Initialize weights small. Start with weights drawn from $\mathcal{N}(0, 0.01)$. Large initial weights can cause the sigmoid activations to saturate, making learning very slow.

Use persistent CD (PCD) for better gradients. Instead of starting the Gibbs chain from the data each time, maintain a persistent chain across updates. This gives a less biased estimate of the negative phase, especially later in training when the model distribution is harder to sample.
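
A sketch of the PCD negative phase, written for a single fantasy particle to keep it short (real implementations usually keep a mini-batch of persistent chains); the function name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_negative_phase(v_persistent, W, a, b, k=1):
    """Advance the persistent chain by k Gibbs steps and return its new visible state.

    Unlike CD-k, the chain continues from its previous state instead of restarting at the data.
    """
    v = v_persistent
    for _ in range(k):
        h = (rng.random(b.shape) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)
    return v   # use (v, P(h|v)) for the negative statistics, and keep v for the next update
```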

Sparsity regularization. If you want each hidden unit to activate for only a small subset of inputs, add a penalty that encourages the mean activation of each hidden unit to be near a target value (typically 0.05 to 0.1). This produces more interpretable features.

Monitor the free energy. The free energy of the visible units is $F(v) = -\log \sum_h e^{-E(v,h)}$. For an RBM with binary hidden units, this has the closed form:

$$F(v) = -a^T v - \sum_j \log\left(1 + e^{\,b_j + W_{:,j}^T v}\right)$$

Track the difference in free energy between training data and random samples. If training data has much lower free energy, the model is learning. If the gap shrinks to zero, the model may be overfitting.
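
A small NumPy helper for this closed form, checked against the Example 1 parameters; in practice you would average it over batches of training and held-out data.

```python
import numpy as np

def free_energy(v, W, a, b):
    """F(v) = -a^T v - sum_j log(1 + exp(b_j + W[:, j]^T v)) for a binary RBM."""
    return -v @ a - np.sum(np.log1p(np.exp(b + v @ W)))

# Toy check with the Example 1 parameters.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
print(free_energy(np.array([1.0, 0.0]), W, a, b))   # ≈ -2.24
```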

Variants and extensions

Gaussian-Bernoulli RBM. The standard RBM uses binary visible units, but real-valued data (like image pixels in [0,1]) requires Gaussian visible units. The energy function changes to:

$$E(v, h) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - b^T h - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j$$

This is harder to train but necessary for continuous data.
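
For reference, here is a short sketch of this energy function in NumPy; the visible values and the choice of unit variances are made up for illustration.

```python
import numpy as np

def gaussian_bernoulli_energy(v, h, W, a, b, sigma):
    """Energy of a Gaussian-Bernoulli RBM: real-valued visibles v, binary hiddens h."""
    quadratic = np.sum((v - a) ** 2 / (2 * sigma ** 2))
    interaction = np.sum((v / sigma)[:, None] * W * h[None, :])   # sum_ij (v_i/sigma_i) W_ij h_j
    return quadratic - b @ h - interaction

# Illustrative call: 2 real-valued visible units, 2 binary hidden units, unit variances.
W = np.array([[0.5, 0.3], [-0.2, 0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.2, 0.3])
sigma = np.ones(2)
print(gaussian_bernoulli_energy(np.array([0.8, 0.2]), np.array([1, 0]), W, a, b, sigma))
```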

Conditional RBM. Adds visible context variables that influence the hidden units but are not generated by the model. Used for temporal data: condition on previous frames to predict the next one.

Convolutional RBM. Shares weights spatially, similar to a convolutional layer. Each hidden unit is a local feature detector rather than a global one. This reduces the parameter count and builds in spatial invariance.

What comes next

RBMs are powerful on their own, but their real impact came from stacking them into Deep Belief Networks. By training one RBM on top of another, you can build deep generative models that learn hierarchical features. That is exactly what we cover next.
