Restricted Boltzmann Machines
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: Probability fundamentals and Random variables and distributions.
What is an RBM, intuitively?
An RBM learns patterns by adjusting an energy function. Low energy means the model likes that pattern. High energy means the model considers it unlikely. Training pushes observed data toward low energy and everything else toward high energy.
Think of a preference system. You have visible features (things you observe) and hidden features (explanations you infer). Each combination gets an energy score. The model learns which combinations to prefer.
| Visible Pattern | Hidden Pattern | Energy | Interpretation |
|---|---|---|---|
| [1, 1, 0, 0] | [1, 0] | -2.4 | Strongly preferred |
| [1, 0, 1, 0] | [0, 1] | -1.8 | Preferred |
| [0, 1, 0, 1] | [1, 1] | 0.3 | Slightly disfavored |
| [0, 0, 1, 1] | [0, 0] | 1.9 | Strongly disfavored |
Lower energy means higher probability. The model assigns the most probability mass to patterns it sees frequently during training.
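To make the table concrete, here is a quick sketch of how the Boltzmann distribution turns these energies into probabilities. For illustration it normalizes over only the four listed configurations; a real RBM normalizes over every possible configuration:

```python
import numpy as np

# Energies from the table above, one per joint configuration shown.
energies = np.array([-2.4, -1.8, 0.3, 1.9])

# Boltzmann distribution: P ∝ exp(-E). Normalize over the listed states only.
unnorm = np.exp(-energies)
probs = unnorm / unnorm.sum()

print(probs.round(3))   # lowest energy gets the most probability mass
```

The -2.4 configuration ends up with over 60% of the probability, while the 1.9 configuration gets under 1%.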
RBM structure: two layers, no within-layer connections
graph LR
    V["Visible Layer (observed data)"] <-->|"Every visible unit connects to every hidden unit"| H["Hidden Layer (learned features)"]
    style V fill:#fff3e6,stroke:#333,color:#000
    style H fill:#e6f3ff,stroke:#333,color:#000
No connections exist within a layer. This restriction is what makes the math tractable: given one layer, you can compute the other layer’s activations in parallel.
Now let’s formalize the energy function and derive the training algorithm.
Restricted Boltzmann Machines (RBMs) turn an energy function into a probability distribution. Low energy configurations are more probable. High energy configurations are less probable. That is the entire idea. The rest is working out how to train this efficiently.
Energy-based models
An energy-based model assigns a scalar energy $E(x)$ to every possible configuration $x$. The probability of a configuration is:

$$P(x) = \frac{e^{-E(x)}}{Z}$$
The normalizing constant $Z = \sum_{x'} e^{-E(x')}$ (called the partition function) sums over all possible configurations to make the probabilities add up to 1. This is borrowed from statistical physics, where the Boltzmann distribution describes the probability of a system being in a particular state.
The model learns by adjusting its parameters so that training data gets low energy (high probability) and everything else gets high energy (low probability).
RBM structure
An RBM has two layers of binary units:
- Visible units $v$: represent the observed data (pixels, features, etc.)
- Hidden units $h$: represent learned features
The “restricted” part is the key constraint: there are no connections within a layer. Visible units connect only to hidden units, and hidden units connect only to visible units. This bipartite structure makes inference tractable.
graph TD
subgraph Hidden["Hidden layer h"]
H1["h₁"]
H2["h₂"]
H3["h₃"]
end
subgraph Visible["Visible layer v"]
V1["v₁"]
V2["v₂"]
V3["v₃"]
V4["v₄"]
end
V1 --- H1
V1 --- H2
V1 --- H3
V2 --- H1
V2 --- H2
V2 --- H3
V3 --- H1
V3 --- H2
V3 --- H3
V4 --- H1
V4 --- H2
V4 --- H3
style Hidden fill:#e6f3ff,stroke:#333,color:#000
style Visible fill:#fff3e6,stroke:#333,color:#000
Figure 1: RBM bipartite graph. Every visible unit connects to every hidden unit. No connections exist within a layer. This restriction is what makes RBMs tractable.
The energy function
The energy of a joint configuration $(v, h)$ is:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j$$
where:
- $W$ is the weight matrix connecting visible to hidden units
- $a$ is the visible bias vector
- $b$ is the hidden bias vector
The joint probability is:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')}$$
And the probability of a visible configuration (what we actually care about) is:

$$P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
Example 1: Computing RBM energy
Given 2 visible and 2 hidden units:
Step 1: Visible bias term.
Step 2: Hidden bias term.
Step 3: Interaction term. We need :
Step 4: Total energy.
The negative energy means this configuration is relatively probable. Configurations with more negative energy get higher probability under the Boltzmann distribution.
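Example 1's arithmetic can be checked with a few lines of NumPy. The parameters below are hypothetical stand-ins chosen for illustration, not the worked example's exact values:

```python
import numpy as np

# Hypothetical parameters for a 2-visible, 2-hidden RBM (illustrative only).
W = np.array([[0.5, -0.2],
              [0.3,  0.8]])   # W[i, j]: weight between v_i and h_j
a = np.array([0.1, -0.1])     # visible biases
b = np.array([0.2,  0.0])     # hidden biases

def energy(v, h):
    """E(v, h) = -a·v - b·h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

v = np.array([1, 0])
h = np.array([1, 1])
print(energy(v, h))   # -0.1 - 0.2 - 0.3 = -0.6
```

The three terms correspond exactly to the visible bias, hidden bias, and interaction steps of the example.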
Conditional distributions
The bipartite structure gives us a huge advantage: given the visible units, all hidden units are conditionally independent (and vice versa). This means we can compute:

$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_i v_i W_{ij}\right)$$

$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_j W_{ij} h_j\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. Each unit's activation depends only on the units in the other layer, not on units in its own layer. This is what the "restricted" constraint buys us: we can sample an entire layer in parallel.
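A minimal sketch of these conditionals, assuming binary units and small random parameters; each helper samples an entire layer in one vectorized call:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b):
    """Sample every h_j in parallel from P(h_j = 1 | v) = sigmoid(b_j + (v W)_j)."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, a):
    """Sample every v_i in parallel from P(v_i = 1 | h) = sigmoid(a_i + (W h)_i)."""
    p = sigmoid(a + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

# Demo with small random parameters (4 visible units, 3 hidden units).
W = rng.normal(0, 0.5, size=(4, 3))
a, b = np.zeros(4), np.zeros(3)
h, ph = sample_hidden(np.array([1.0, 0.0, 1.0, 0.0]), W, b)
v, pv = sample_visible(h, W, a)
```

Note there is no loop over units: conditional independence within a layer is what makes the single matrix product sufficient.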
Gibbs sampling
To generate samples from an RBM, we use Gibbs sampling. We alternate between sampling hidden units given visible units, and visible units given hidden units:
graph LR
    V0["v⁰ (data)"] -->|"P(h|v)"| H0["h⁰"]
    H0 -->|"P(v|h)"| V1["v¹"]
    V1 -->|"P(h|v)"| H1["h¹"]
    H1 -->|"P(v|h)"| V2["v² ≈ sample"]
    style V0 fill:#fff3e6,stroke:#333,color:#000
    style V2 fill:#e6ffe6,stroke:#333,color:#000
Figure 2: Gibbs sampling alternates between sampling hidden units from visible and visible units from hidden. After enough steps, the samples approximate the model distribution.
After many iterations, the Markov chain converges to the model’s equilibrium distribution. In practice, we run a finite number of steps.
Gibbs sampling bounces between visible and hidden layers
graph TD
    V0["v: visible state"] -->|"Sample each hj from P(hj=1|v)"| H0["h: hidden state"]
    H0 -->|"Sample each vi from P(vi=1|h)"| V1["v: new visible state"]
    V1 -->|"Repeat"| H1["h: new hidden state"]
    H1 -->|"Keep bouncing"| V2["v: converges to model distribution"]
    style V0 fill:#fff3e6,stroke:#333,color:#000
    style V2 fill:#e6ffe6,stroke:#333,color:#000
Each bounce updates one layer while holding the other fixed. Because there are no within-layer connections, all units in a layer can be sampled simultaneously. After many bounces, the samples reflect the model’s learned distribution.
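The bouncing described above can be sketched as a short Gibbs chain for a toy RBM (random, untrained parameters, so the samples are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Small illustrative RBM with random parameters (4 visible, 3 hidden).
W = rng.normal(0, 0.5, size=(4, 3))
a = np.zeros(4)          # visible biases
b = np.zeros(3)          # hidden biases

def gibbs_step(v):
    """One bounce: sample h from P(h|v), then a new v from P(v|h)."""
    h = (rng.random(3) < sigmoid(b + v @ W)).astype(float)
    return (rng.random(4) < sigmoid(a + W @ h)).astype(float)

v = rng.integers(0, 2, size=4).astype(float)  # arbitrary starting state
for _ in range(100):     # more steps = closer to the equilibrium distribution
    v = gibbs_step(v)
# v is now an approximate sample from the model distribution
```

With trained parameters, states visited late in the chain concentrate on the low-energy patterns the model has learned.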
Contrastive divergence
The log-likelihood gradient for an RBM weight is:

$$\frac{\partial \log P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$

The first term (positive phase) is easy: clamp the visible units to a training example, compute $P(h_j = 1 \mid v)$, and take the expectation. The second term (negative phase) requires sampling from the model distribution, which means running Gibbs sampling to convergence. That is extremely expensive.
Contrastive Divergence (CD-$k$) is Hinton's practical shortcut. Instead of running Gibbs sampling to convergence, run just $k$ steps (usually $k = 1$):
- Start with a training example $v^{(0)}$.
- Sample $h^{(0)}$ from $P(h \mid v^{(0)})$.
- Sample $v^{(1)}$ from $P(v \mid h^{(0)})$.
- Compute $P(h_j = 1 \mid v^{(1)})$ (no need to sample).
The update rule becomes:

$$\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$
This is biased (we are not running the chain to convergence), but it works surprisingly well in practice.
Contrastive divergence: one step of learning
graph LR
    V0["Clamp training data v0"] -->|"P(h|v)"| H0["Compute hidden h0"]
    H0 -->|"P(v|h)"| V1["Reconstruct visible v1"]
    V1 -->|"P(h|v)"| H1["Compute hidden h1"]
    V0 -.->|"Positive phase: v0 * h0"| UPD["Update Weights"]
    V1 -.->|"Negative phase: v1 * h1"| UPD
    style V0 fill:#fff3e6,stroke:#333,color:#000
    style V1 fill:#e6f3ff,stroke:#333,color:#000
    style UPD fill:#51cf66,color:#fff
The positive phase captures what the data likes. The negative phase captures what the model currently likes. The weight update is the difference between them.
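The CD-1 procedure can be sketched as a single update function (a minimal, untuned NumPy implementation; the tiny training loop at the bottom only illustrates the call shape):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One CD-1 update on a single training vector v0 (modifies W, a, b in place)."""
    ph0 = sigmoid(b + v0 @ W)                  # positive phase: P(h|v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(a + W @ h0)                  # reconstruction: P(v|h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)                  # probabilities suffice, no sampling
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)

# Train a tiny RBM to prefer a single pattern.
W = rng.normal(0, 0.01, size=(4, 2))
a, b = np.zeros(4), np.zeros(2)
pattern = np.array([1.0, 1.0, 0.0, 0.0])
for _ in range(500):
    cd1_update(pattern, W, a, b)
```

After these updates the model reconstructs the trained pattern with high probability: the positive phase pulled the energy of `pattern` down, the negative phase pushed the model's own reconstructions up.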
Example 2: CD-1 weight update
Using the same parameters as Example 1:
Start with training example $v^{(0)}$.
Positive phase: compute $P(h_j = 1 \mid v^{(0)})$ for each hidden unit.
Suppose we sample $h^{(0)}$ with both hidden units on (both above their respective probabilities by chance).
Negative phase: compute $P(v_i = 1 \mid h^{(0)})$ for each visible unit.
Suppose we sample $v^{(1)}$.
Reconstruction phase: compute $P(h_j = 1 \mid v^{(1)})$.
Weight update for one of the weights:
The positive gradient (0.668) says “this weight should increase because the data likes this connection.” The negative gradient (0.622) says “but the model already partially explains it.” The small positive difference (0.046) means a slight increase to .
What RBMs learn
Each hidden unit becomes a feature detector. For image data, hidden units learn to detect edges, textures, and parts. For text data, they learn topic-like features. The weights encode which visible patterns activate which hidden features.
You can visualize what a hidden unit detects by looking at its weight vector reshaped to the input dimensions. For MNIST digits, hidden units learn stroke detectors, loops, and line segments.
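As a sketch of the reshaping idea (with toy random weights standing in for a trained model):

```python
import numpy as np

# Each hidden unit j owns one column of W. For 28x28 inputs (e.g. MNIST),
# reshaping that 784-dim column to 28x28 shows which input pattern excites it.
rng = np.random.default_rng(4)
W = rng.normal(0, 0.01, size=(784, 64))   # toy weights, 64 hidden units
filters = [W[:, j].reshape(28, 28) for j in range(64)]
print(filters[0].shape)   # (28, 28)
```

Plotting each of these 28x28 arrays as a grayscale image (e.g. with matplotlib's `imshow`) is the standard way to inspect what a trained RBM has learned.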
Energy landscape: valleys are learned patterns
graph TD
    HIGH["High Energy Region (unlikely patterns)"] --> BARRIER["Energy Barriers"]
    BARRIER --> LOW1["Low Energy Valley (learned pattern A)"]
    BARRIER --> LOW2["Low Energy Valley (learned pattern B)"]
    BARRIER --> LOW3["Low Energy Valley (learned pattern C)"]
    LOW1 -.- DESC1["Example: digit 3"]
    LOW2 -.- DESC2["Example: digit 7"]
    LOW3 -.- DESC3["Example: digit 9"]
    style HIGH fill:#ff6b6b,color:#fff
    style LOW1 fill:#51cf66,color:#fff
    style LOW2 fill:#51cf66,color:#fff
    style LOW3 fill:#51cf66,color:#fff
Training carves valleys in the energy surface. Each valley corresponds to a frequently observed pattern. Sampling from the model means descending into these valleys. The depth of a valley reflects how common that pattern is in the training data.
The partition function problem
Example 3: Why exact computation is intractable
With 2 visible and 2 hidden units, each binary, there are $2^4 = 16$ possible configurations. We can enumerate all of them:
| $v$ | $h$ | $E(v, h)$ | $e^{-E(v, h)}$ |
|---|---|---|---|
| [0,0] | [0,0] | 0.0 | 1.000 |
| [0,0] | [0,1] | -0.3 | 1.350 |
| [0,0] | [1,0] | -0.2 | 1.221 |
| [0,0] | [1,1] | -0.5 | 1.649 |
| [1,0] | [0,0] | -0.1 | 1.105 |
| [1,0] | [0,1] | -0.4 | 1.492 |
| [1,0] | [1,0] | -0.8 | 2.226 |
| [1,0] | [1,1] | -1.4 | 4.055 |
| [0,1] | [0,0] | 0.1 | 0.905 |
| [0,1] | [0,1] | -0.6 | 1.822 |
| [0,1] | [1,0] | 0.1 | 0.905 |
| [0,1] | [1,1] | -0.6 | 1.822 |
| [1,1] | [0,0] | 0.0 | 1.000 |
| [1,1] | [0,1] | -0.7 | 2.014 |
| [1,1] | [1,0] | -0.5 | 1.649 |
| [1,1] | [1,1] | -1.5 | 4.482 |
The partition function is $Z \approx 28.70$ (the sum of all values in the last column). With 16 states, this is trivial.
Now scale up. With 100 visible and 100 hidden units:

$$2^{100 + 100} = 2^{200} \approx 1.6 \times 10^{60} \text{ configurations}$$

That is an astronomically large number of states. You cannot enumerate them. You cannot compute $Z$ exactly. This is precisely why we need contrastive divergence: it avoids computing $Z$ altogether by using the difference between the positive and negative phase gradients.
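The brute-force sum is easy to write down for the toy case, and the exponent makes the scaling problem obvious (random parameters, for illustration):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n_v, n_h = 2, 2
W = rng.normal(0, 0.5, size=(n_v, n_h))
a = rng.normal(0, 0.1, size=n_v)
b = rng.normal(0, 0.1, size=n_h)

def energy(v, h):
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Brute force is feasible only because 2**(n_v + n_h) = 16 here.
Z = sum(np.exp(-energy(np.array(v, float), np.array(h, float)))
        for v in product([0, 1], repeat=n_v)
        for h in product([0, 1], repeat=n_h))

# With 100 + 100 units the same sum would need 2**200 terms.
n_states = 2 ** 200   # ≈ 1.6e60, hopeless to enumerate
```

Doubling the unit count squares the number of terms, so exact computation dies almost immediately as models grow.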
RBM hyperparameters
| Parameter | Effect | Typical Range |
|---|---|---|
| Hidden units | More units = more expressive features, but slower and risk of overfitting | 50 to 500 |
| Learning rate | Too high causes oscillation, too low causes slow training | 0.001 to 0.1 |
| CD-$k$ steps | More steps = less biased gradient, but slower per update | 1 to 10 (usually 1) |
| Mini-batch size | Larger batches give smoother gradients | 10 to 100 |
| Weight decay | Regularization to prevent large weights | $10^{-4}$ to $10^{-2}$ |
| Momentum | Accelerates training by accumulating gradient direction | 0.5 initially, 0.9 later |
Historical importance
RBMs were crucial to the deep learning revolution. In 2006, Hinton showed that stacking RBMs and training them greedily, layer by layer, could initialize deep networks far better than random initialization. Before this, deep networks were considered too hard to train.
This pre-training approach was eventually replaced by better initialization methods (Xavier, He), activation functions (ReLU), and normalization techniques (batch normalization). But RBMs proved that deep generative models could work, and they opened the door for everything that followed.
Today, RBMs are rarely used in production systems. But the concepts they introduced (energy-based modeling, contrastive learning, and unsupervised feature extraction) remain foundational.
Practical tips for training RBMs
Reconstruction error during RBM training (50 epochs)
Monitor reconstruction error. After each CD step, compare $v^{(0)}$ (the data) with $v^{(1)}$ (the reconstruction). The mean squared error should decrease over training. If it does not, your learning rate is probably too high or too low.
Initialize weights small. Start with weights drawn from a zero-mean Gaussian with a small standard deviation (0.01 is a common choice). Large initial weights can cause the sigmoid activations to saturate, making learning very slow.
Use persistent CD (PCD) for better gradients. Instead of starting the Gibbs chain from the data each time, maintain a persistent chain across updates. This gives a less biased estimate of the negative phase, especially later in training when the model distribution is harder to sample.
Sparsity regularization. If you want each hidden unit to activate for only a small subset of inputs, add a penalty that encourages the mean activation of each hidden unit to be near a target value (typically 0.05 to 0.1). This produces more interpretable features.
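A minimal sketch of such a penalty (the function name and target value are illustrative, not a standard API):

```python
import numpy as np

def sparsity_bias_grad(ph_batch, target=0.1):
    """Extra hidden-bias gradient pushing mean activations toward `target`.
    ph_batch: P(h_j=1|v) over a mini-batch, shape (batch_size, n_hidden)."""
    q = ph_batch.mean(axis=0)   # current mean activation per hidden unit
    return target - q           # add (scaled by a penalty weight) to b's update

ph = np.array([[0.9, 0.05],
               [0.8, 0.10]])
print(sparsity_bias_grad(ph))   # unit 0 is too active, so its gradient is negative
```

Adding this term to the hidden-bias update gently suppresses units that fire too often, which tends to yield cleaner, more localized features.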
Monitoring the free energy. The free energy of the visible units is $F(v) = -\log \sum_h e^{-E(v, h)}$. For an RBM with sigmoid hidden units:

$$F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{\,b_j + \sum_i v_i W_{ij}}\right)$$

Track the difference in free energy between training data and random samples. If training data has much lower free energy, the model is learning. If the gap shrinks to zero, the model may be overfitting.
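The sigmoid-hidden-unit formula translates directly to code (hypothetical parameters, using `np.logaddexp` for a numerically stable $\log(1 + e^x)$):

```python
import numpy as np

def free_energy(v, W, a, b):
    """F(v) = -a·v - sum_j log(1 + exp(b_j + (v W)_j))."""
    x = b + v @ W
    return -(a @ v) - np.sum(np.logaddexp(0.0, x))  # log(1 + e^x), stably

# Hypothetical parameters, just to show the call shape.
W = np.array([[0.5, -0.2],
              [0.3,  0.8]])
a = np.array([0.1, -0.1])
b = np.array([0.2,  0.0])
print(free_energy(np.array([1.0, 0.0]), W, a, b))
```

Because the sum over hidden states factorizes across units, this is exact and cheap to evaluate, which is what makes free energy a practical monitoring quantity even when $Z$ itself is intractable.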
Variants and extensions
Gaussian-Bernoulli RBM. The standard RBM uses binary visible units, but real-valued data (like image pixels in [0,1]) requires Gaussian visible units. The energy function changes to:

$$E(v, h) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_j b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j$$
This is harder to train but necessary for continuous data.
Conditional RBM. Adds visible context variables that influence the hidden units but are not generated by the model. Used for temporal data: condition on previous frames to predict the next one.
Convolutional RBM. Shares weights spatially, similar to a convolutional layer. Each hidden unit is a local feature detector rather than a global one. This reduces the parameter count and builds in spatial invariance.
What comes next
RBMs are powerful on their own, but their real impact came from stacking them into Deep Belief Networks. By training one RBM on top of another, you can build deep generative models that learn hierarchical features. That is exactly what we cover next.