Ensemble methods: Bagging and Random Forests

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

A single decision tree is unstable. Change one training point and the entire structure can shift. The splits change, the leaves change, and predictions flip. This is high variance, and it is the central weakness of decision trees.

The fix is surprisingly simple: build many trees and let them vote.

Wisdom of crowds: the intuition

Ask five people to guess the number of jellybeans in a jar. Each guess is off by a lot. But the average of all five guesses is often surprisingly close to the true number. Random errors in different directions cancel out.

The same principle applies to ML models. Here are five decision trees making predictions on the same five test points.

| Test point | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Average | True value |
|---|---|---|---|---|---|---|---|
| A | 95 | 110 | 88 | 105 | 102 | 100.0 | 100 |
| B | 155 | 140 | 165 | 148 | 142 | 150.0 | 150 |
| C | 205 | 190 | 210 | 195 | 200 | 200.0 | 200 |
| D | 72 | 85 | 68 | 78 | 82 | 77.0 | 75 |
| E | 128 | 115 | 135 | 120 | 122 | 124.0 | 125 |

No single tree nails every prediction. But the average is close to the true value every time. Individual trees overfit in different directions. Averaging cancels the noise.
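The cancellation is easy to verify. Here is a quick sketch, using the numbers from the table above, comparing the average error of a single tree against the error of the ensemble average:

```python
import numpy as np

# Tree predictions for test points A-E (rows), trees 1-5 (columns), from the table above
preds = np.array([
    [95, 110, 88, 105, 102],    # A, true value 100
    [155, 140, 165, 148, 142],  # B, true value 150
    [205, 190, 210, 195, 200],  # C, true value 200
    [72, 85, 68, 78, 82],       # D, true value 75
    [128, 115, 135, 120, 122],  # E, true value 125
])
true = np.array([100, 150, 200, 75, 125])

single_tree_mae = np.abs(preds - true[:, None]).mean()   # average error of one tree
ensemble_mae = np.abs(preds.mean(axis=1) - true).mean()  # error of the averaged prediction

print(f"Single tree MAE: {single_tree_mae:.1f}")  # 6.6
print(f"Ensemble MAE:    {ensemble_mae:.1f}")     # 0.6
```

The individual trees are off by 6.6 on average; the averaged prediction is off by 0.6.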

Bootstrap sampling: creating diverse training sets

graph TD
  A["Original dataset: N points"] --> B["Sample N points WITH replacement"]
  B --> C["Bootstrap sample 1: ~63% unique points"]
  A --> D["Sample N points WITH replacement"]
  D --> E["Bootstrap sample 2: different ~63%"]
  A --> F["Sample N points WITH replacement"]
  F --> G["Bootstrap sample 3: different ~63%"]
  C --> H["Train Tree 1"]
  E --> I["Train Tree 2"]
  G --> J["Train Tree 3"]

Now let’s see why averaging works mathematically, and how Random Forests take this idea even further.

Why ensembles work

Think about the bias-variance tradeoff. A deep decision tree has low bias (it can fit complex patterns) but high variance (it overfits to noise in the training data). Pruning or limiting depth reduces variance but increases bias. You are stuck trading one problem for another.

Ensembles break this tradeoff. Instead of building one careful tree, you build many rough trees and average their predictions. Each individual tree overfits in its own way, to its own noise. But when you combine them, the noise cancels out. The bias stays roughly the same (each tree is still expressive), but the variance drops.

Here is the key intuition. Suppose you have $B$ independent random variables, each with variance $\sigma^2$. The variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \frac{\sigma^2}{B}$$

More trees, less variance. That is the entire motivation behind ensemble methods.

Bootstrap sampling

You only have one training set. How do you train $B$ different trees? You create $B$ new datasets by bootstrap sampling: drawing $n$ samples from your $n$ training points, with replacement.

With replacement means the same point can appear multiple times in a single bootstrap sample. Some points show up twice or three times. Others do not appear at all.

What fraction of the original data appears in a single bootstrap sample? The probability that a specific point is NOT selected in any single draw is $(1 - 1/n)$. Over $n$ draws:

$$P(\text{point excluded}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$$

So roughly 63.2% of the original training points appear in each bootstrap sample, and about 36.8% are left out. This matters later when we discuss out-of-bag error.
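You can confirm the 63.2% figure with a short simulation (a sketch; the dataset size and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000           # size of the "training set"
trials = 100

fractions = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)            # draw n indices with replacement
    fractions.append(np.unique(sample).size / n)   # fraction of unique points

print(f"Mean unique fraction: {np.mean(fractions):.3f}")  # ~0.632
```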

Bagging: Bootstrap Aggregating

Bagging, short for Bootstrap Aggregating, is straightforward:

  1. Create $B$ bootstrap samples from the training data
  2. Train a full (unpruned) decision tree on each sample
  3. To predict, collect predictions from all $B$ trees
  4. For classification: take the majority vote
  5. For regression: take the average

That is it. No regularization tricks, no careful tuning. Each tree overfits, but the aggregate does not.
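The four steps map directly onto code. Here is a hand-rolled sketch of bagging for classification, using scikit-learn decision trees on a synthetic dataset (the dataset and the choice of 50 trees are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B = 50

# Steps 1 and 2: train one full tree per bootstrap sample
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Steps 3 and 4: collect predictions and take the majority vote
votes = np.stack([t.predict(X_test) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
bagged_acc = (bagged_pred == y_test).mean()
print(f"Single tree: {single_acc:.3f}  Bagged: {bagged_acc:.3f}")
```

scikit-learn packages the same logic as `BaggingClassifier`; the loop above just makes the mechanics explicit.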

Worked Example 1: Bootstrap sampling and bagging

Suppose we have 6 training points for a binary classification problem (predicting whether a student passes an exam):

| Point | Hours studied ($x$) | Pass? ($y$) |
|---|---|---|
| A | 1 | 0 |
| B | 2 | 0 |
| C | 3 | 0 |
| D | 5 | 1 |
| E | 6 | 1 |
| F | 8 | 1 |

We draw 3 bootstrap samples (each of size 6, with replacement):

Bootstrap sample 1: {A, A, C, D, E, F}
Bootstrap sample 2: {B, C, D, D, F, F}
Bootstrap sample 3: {A, B, B, C, E, F}

Now we train a decision stump (a tree with one split) on each.

Tree 1 (trained on {A, A, C, D, E, F}):

  • Points with $x \leq 3$: labels are {0, 0, 0}, predict 0
  • Points with $x > 3$: labels are {1, 1, 1}, predict 1
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 2 (trained on {B, C, D, D, F, F}):

  • Try split at $x \leq 2$: left = {0}, right = {0, 1, 1, 1, 1}. Misclassifications: 1 (C is class 0 on the right).
  • Try split at $x \leq 3$: left = {0, 0}, right = {1, 1, 1, 1}. Misclassifications: 0.
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 3 (trained on {A, B, B, C, E, F}):

  • Try split at $x \leq 3$: left = {0, 0, 0, 0}, right = {1, 1}. Misclassifications: 0.
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Now let’s predict for a new point: $x = 4$ (hours studied).

| Tree | Split rule | Prediction for $x = 4$ |
|---|---|---|
| 1 | $x \leq 3 \to 0$, else 1 | 1 |
| 2 | $x \leq 3 \to 0$, else 1 | 1 |
| 3 | $x \leq 3 \to 0$, else 1 | 1 |

Majority vote: 1 (pass). All three trees agree here.

Now predict for $x = 3$:

| Tree | Prediction for $x = 3$ |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |

Majority vote: 0 (fail).

In this simple example the trees agreed because they found the same split. In practice, with more features and deeper trees, each bootstrap sample produces a different tree structure, and the majority vote smooths out the disagreements.

A single tree trained on the full dataset would also split at $x \leq 3$. The real benefit of bagging shows up when the data is noisier and the trees are deeper, so they each overfit in different directions.

Why correlation is the enemy

The variance formula $\sigma^2 / B$ assumes the trees are independent. They are not. All $B$ trees are trained on overlapping data from the same original dataset. If a strong feature dominates, every tree will split on it first, making the trees correlated.

For $B$ identically distributed random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2$$

As $B \to \infty$, the second term vanishes. You are left with $\rho \sigma^2$. No matter how many trees you add, you cannot push the variance below this floor. The correlation $\rho$ between trees is the bottleneck.
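The formula is easy to check numerically. This sketch builds correlated variables with unit variance by mixing a shared component with independent noise (the specific values of rho and B are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, B, trials = 0.3, 200, 20_000

# X_i = sqrt(rho)*Z + sqrt(1-rho)*E_i has variance 1 and pairwise correlation rho
Z = rng.standard_normal(trials)             # shared component
E = rng.standard_normal((trials, B))        # independent components
X = np.sqrt(rho) * Z[:, None] + np.sqrt(1 - rho) * E

empirical = X.mean(axis=1).var()
predicted = rho + (1 - rho) / B             # rho*sigma^2 + (1-rho)/B * sigma^2, with sigma^2 = 1
print(f"Empirical: {empirical:.4f}  Formula: {predicted:.4f}")
```

Even with 200 averaged variables, the variance stays near the floor of 0.3 rather than shrinking toward zero.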

This is exactly the problem Random Forests solve.

Single tree vs forest: error decreases with more trees

graph LR
  A["1 tree: high variance"] --> B["10 trees: lower variance"]
  B --> C["100 trees: much lower"]
  C --> D["500 trees: diminishing returns"]
  D --> E["Variance floor = rho * sigma-squared"]

Prediction error across 100 test samples: the forest (orange) is smoother and consistently lower than a single tree (blue)

Random Forests: decorrelating the trees

Random Forests add one twist to bagging: feature subsampling at each split. When deciding the best split at a node, instead of considering all $p$ features, the algorithm randomly selects a subset of $m$ features and picks the best split among those.

The standard choices for mm:

  • Classification: m=pm = \lfloor\sqrt{p}\rfloor
  • Regression: m=p/3m = \lfloor p/3 \rfloor

Why does this help? If one feature is very strong, a bagged ensemble will use it at the root of almost every tree. The trees look similar and are highly correlated. By forcing each split to consider only a random subset of features, you prevent this. Sometimes the dominant feature is not in the subset, so the tree must use other features. The trees become more diverse, their correlation $\rho$ drops, and the variance floor $\rho\sigma^2$ drops with it.

The tradeoff: each individual tree is slightly worse (higher bias, since it cannot always pick the best feature). But the reduction in correlation more than compensates. The ensemble is stronger overall.

Random Forest: bootstrap samples, feature subsets, aggregate

graph TD
  A["Original data"] --> B1["Bootstrap sample 1"]
  A --> B2["Bootstrap sample 2"]
  A --> B3["Bootstrap sample B"]
  B1 --> T1["Tree 1: random feature subset at each split"]
  B2 --> T2["Tree 2: random feature subset at each split"]
  B3 --> T3["Tree B: random feature subset at each split"]
  T1 --> V["Aggregate: majority vote or average"]
  T2 --> V
  T3 --> V
  V --> P["Final prediction"]

Here is a minimal Random Forest in Python using scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features='sqrt',   # sqrt(p) features per split
    max_depth=None,         # grow full trees
    oob_score=True,         # compute OOB error
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Test accuracy:  {rf.score(X_test, y_test):.3f}")
print(f"OOB accuracy:   {rf.oob_score_:.3f}")

Out-of-Bag (OOB) error

Remember that each bootstrap sample leaves out about 36.8% of the training data. These left-out points are called out-of-bag (OOB) samples. For each training point, roughly one-third of the trees never saw it during training. You can use those trees to predict that point, giving you a free validation estimate.

The procedure:

  1. For each training point $x_i$, collect predictions only from trees where $x_i$ was NOT in the bootstrap sample
  2. Aggregate those predictions (majority vote or average)
  3. Compare to the true label $y_i$
  4. The fraction of incorrect OOB predictions is the OOB error

This closely approximates leave-one-out cross-validation, but you get it for free, with no extra computation.
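The procedure can be sketched by hand. In the manual bagging loop below (dataset and tree count are illustrative), each tree records which points it never saw and votes only on those:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n, B = len(X), 200

votes = np.zeros(n)    # sum of OOB predictions per training point
counts = np.zeros(n)   # number of trees for which the point was OOB
for _ in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)       # points left out of this sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

has_oob = counts > 0
oob_pred = (votes[has_oob] / counts[has_oob] >= 0.5).astype(int)
oob_error = (oob_pred != y[has_oob]).mean()
print(f"OOB error: {oob_error:.3f}")
```

With 200 trees, every point is OOB for roughly a third of them, so each point gets a meaningful vote.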

Worked Example 2: OOB error computation

Using the same 3 bootstrap samples and 6 data points from Example 1:

Bootstrap sample 1: {A, A, C, D, E, F}
Bootstrap sample 2: {B, C, D, D, F, F}
Bootstrap sample 3: {A, B, B, C, E, F}

First, identify which points are out-of-bag for each tree:

| Point | In sample 1? | In sample 2? | In sample 3? | OOB for trees |
|---|---|---|---|---|
| A | Yes | No | Yes | {2} |
| B | No | Yes | Yes | {1} |
| C | Yes | Yes | Yes | {} |
| D | Yes | Yes | No | {3} |
| E | Yes | No | Yes | {2} |
| F | Yes | Yes | Yes | {} |

Now compute OOB predictions. Recall all three trees learned the rule $x \leq 3 \to 0$, else $\to 1$.

Point A ($x = 1$, true label = 0): OOB trees = {Tree 2}. Tree 2 predicts: $1 \leq 3$, so 0. OOB prediction = 0. ✓ Correct.

Point B ($x = 2$, true label = 0): OOB trees = {Tree 1}. Tree 1 predicts: $2 \leq 3$, so 0. OOB prediction = 0. ✓ Correct.

Point C ($x = 3$, true label = 0): No OOB trees available. Skip this point.

Point D ($x = 5$, true label = 1): OOB trees = {Tree 3}. Tree 3 predicts: $5 > 3$, so 1. OOB prediction = 1. ✓ Correct.

Point E ($x = 6$, true label = 1): OOB trees = {Tree 2}. Tree 2 predicts: $6 > 3$, so 1. OOB prediction = 1. ✓ Correct.

Point F ($x = 8$, true label = 1): No OOB trees available. Skip this point.

OOB error = number of incorrect OOB predictions / number of points with OOB predictions:

$$\text{OOB error} = \frac{0}{4} = 0\%$$

All four evaluable points were classified correctly. In practice, with hundreds of trees, every training point will be OOB for many trees, so you get a reliable error estimate for all points.

⚠ With only 3 trees, some points (C and F here) have no OOB predictions. This is why you need a reasonable number of trees, typically 100 or more.

Feature importance

Random Forests give you two ways to measure how important each feature is.

Impurity-based importance

Every time a feature is used for a split, it reduces the impurity (Gini or entropy) by some amount. Sum up the total impurity reduction across all trees for each feature. Features that produce larger reductions are more important.

This is fast to compute since it falls out of the training process. But it has a known bias: it favors features with many possible split points (high cardinality features like IDs or continuous variables).
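In scikit-learn these scores are exposed as the feature_importances_ attribute. A small sketch on a synthetic dataset where, by construction, only the first five of twenty features are informative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 5 informative features in columns 0-4
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = rf.feature_importances_   # total impurity reduction, normalized to sum to 1
ranking = np.argsort(importances)[::-1]
print("Top 5 features:", ranking[:5])   # the informative columns should dominate
```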

Permutation importance

After training, take each feature one at a time and randomly shuffle its values across the test set. Measure how much the model’s accuracy drops. If shuffling feature $j$ causes a big drop, that feature was important.

This is more reliable than impurity-based importance because it measures actual predictive impact on held-out data. It is also model-agnostic; you can use it with any classifier.

Feature subsampling in Random Forests

graph TD
  A["All p features available"] --> B["At each split: randomly pick m features"]
  B --> C["Find best split among those m"]
  C --> D["Different splits use different subsets"]
  D --> E["Trees become diverse"]
  E --> F["Lower correlation between trees"]
  F --> G["Lower ensemble variance"]

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")

Feature importance scores from a Random Forest trained on census data (permutation method)

Key hyperparameters

Random Forests have a few knobs worth tuning:

Number of trees ($B$, or n_estimators): More trees is almost always better. Performance improves and then plateaus. It never gets worse (unlike boosting, which can overfit with too many rounds). Start with 200 to 500. If you have time, go higher. The only cost is computation.

Max features ($m$, or max_features): Controls the feature subset size at each split. Smaller $m$ means more randomness, lower correlation between trees, but higher bias per tree. The defaults ($\sqrt{p}$ for classification, $p/3$ for regression) work well in most cases. Tuning this via cross-validation or OOB error can squeeze out a little more performance.
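Since the OOB score comes for free, you can compare candidate values of max_features without a separate validation set. A sketch (the candidates and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

for m in ["sqrt", "log2", None]:  # None = consider all p features at each split
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=42).fit(X, y)
    print(f"max_features={m}: OOB accuracy {rf.oob_score_:.3f}")
```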

Max depth (max_depth): By default, trees are grown fully until leaves are pure or contain min_samples_split points. You can limit depth to speed up training, but full-depth trees usually work fine since bagging handles the variance.

Min samples per leaf (min_samples_leaf): Setting this to 5 or 10 instead of 1 can smooth predictions slightly and speed up training. Worth trying if you have noisy data.

Bootstrap (bootstrap): You can turn off bootstrap sampling (bootstrap=False) so that each tree trains on the full dataset. This removes the OOB error estimate and usually performs slightly worse. (The related variant that samples random subsets without replacement is called “pasting.”)

When Random Forests fail

Random Forests are reliable across a wide range of problems, but they have real limitations.

Extrapolation. A Random Forest predicts by averaging tree outputs. Each tree output is a value seen in training. For regression, this means the model literally cannot predict values outside the range of training targets. If your training labels are all between 0 and 100, the Random Forest will never predict 105. Linear models and neural networks can extrapolate; trees cannot.
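A tiny regression sketch makes the ceiling visible (the target function and query point are arbitrary choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X.ravel()          # training targets span 0 to 20

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict([[15.0]])[0]
print(f"Prediction at x=15: {pred:.2f} (true value 30, training max {y.max():.2f})")
```

The forest answers with something near 20, because every leaf stores a training target and no average of leaves can exceed the largest one.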

Very high-dimensional sparse data. Text data with thousands of sparse features (bag-of-words representations) is not a great fit. When pp is huge and each feature is rarely nonzero, the random feature subsets at each split are mostly zeros. Linear models or gradient-boosted trees with careful tuning tend to work better here.

Smooth functions. If the true relationship is a smooth, continuous function (like a sine wave), a Random Forest approximates it with a jagged step function. It works, but it needs a lot of data to get a good approximation. A model that assumes smoothness (like a Gaussian process or a neural network) will be more data-efficient.

Interpretability at scale. A single decision tree is easy to interpret. A forest of 500 trees is not. You can extract feature importances, but you lose the ability to trace a single decision path. If interpretability is critical, consider simpler models or use the forest alongside explanation tools like SHAP values.

Summary

| Concept | Key point |
|---|---|
| Bagging | Train $B$ trees on bootstrap samples, aggregate by vote/average |
| Bootstrap | Sample with replacement; ~63.2% of data appears in each sample |
| Variance reduction | Averaging $B$ uncorrelated predictors cuts variance by $1/B$ |
| Random Forest | Bagging + random feature subsets at each split |
| Feature subsampling | $\sqrt{p}$ for classification, $p/3$ for regression |
| OOB error | Free validation using the ~36.8% of points left out of each tree |
| Correlation floor | Variance cannot drop below $\rho\sigma^2$; decorrelation is key |

The core insight is worth repeating: the power of Random Forests does not come from building better individual trees. It comes from building many diverse trees and combining them. Diversity, not individual strength, drives the ensemble.

What comes next

Bagging reduces variance by building independent trees and averaging them. But what if, instead of treating all trees equally, you built each new tree to fix the mistakes of the previous ones? That is the idea behind boosting.

In the next post, Boosting: AdaBoost and Gradient Boosting, we cover how to sequentially build trees that focus on hard examples, producing ensembles that reduce bias as well as variance. Where bagging builds trees in parallel, boosting builds them in sequence, and the result is one of the most effective off-the-shelf methods in machine learning.
