Ensemble methods: Bagging and Random Forests

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

A single decision tree is unstable. Change one training point and the entire structure can shift. The splits change, the leaves change, and predictions flip. This is high variance, and it is the central weakness of decision trees.

The fix is surprisingly simple: build many trees and let them vote.

Wisdom of crowds: the intuition

Ask five people to guess the number of jellybeans in a jar. Each guess is off by a lot. But the average of all five guesses is often surprisingly close to the true number. Random errors in different directions cancel out.

The same principle applies to ML models. Here are five decision trees making predictions on the same five test points.

| Test point | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Average | True value |
|---|---|---|---|---|---|---|---|
| A | 95 | 110 | 88 | 105 | 102 | 100.0 | 100 |
| B | 155 | 140 | 165 | 148 | 142 | 150.0 | 150 |
| C | 205 | 190 | 210 | 195 | 200 | 200.0 | 200 |
| D | 72 | 85 | 68 | 78 | 82 | 77.0 | 75 |
| E | 128 | 115 | 135 | 120 | 122 | 124.0 | 125 |

No single tree nails every prediction. But the average is close to the true value every time. Individual trees overfit in different directions. Averaging cancels the noise.
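The cancellation is easy to verify. Here is a quick sketch, using the numbers from the table above, comparing the average error of a single tree against the error of the ensemble average:

```python
import numpy as np

# Tree predictions for test points A-E (rows), trees 1-5 (columns), from the table above
preds = np.array([
    [95, 110, 88, 105, 102],    # A, true value 100
    [155, 140, 165, 148, 142],  # B, true value 150
    [205, 190, 210, 195, 200],  # C, true value 200
    [72, 85, 68, 78, 82],       # D, true value 75
    [128, 115, 135, 120, 122],  # E, true value 125
])
true = np.array([100, 150, 200, 75, 125])

single_tree_mae = np.abs(preds - true[:, None]).mean()   # average error of one tree
ensemble_mae = np.abs(preds.mean(axis=1) - true).mean()  # error of the averaged prediction

print(f"Single tree MAE: {single_tree_mae:.1f}")  # 6.6
print(f"Ensemble MAE:    {ensemble_mae:.1f}")     # 0.6
```

The individual trees are off by 6.6 on average; the averaged prediction is off by 0.6.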

Bootstrap sampling: creating diverse training sets

graph TD
  A["Original dataset: N points"] --> B["Sample N points WITH replacement"]
  B --> C["Bootstrap sample 1: ~63% unique points"]
  A --> D["Sample N points WITH replacement"]
  D --> E["Bootstrap sample 2: different ~63%"]
  A --> F["Sample N points WITH replacement"]
  F --> G["Bootstrap sample 3: different ~63%"]
  C --> H["Train Tree 1"]
  E --> I["Train Tree 2"]
  G --> J["Train Tree 3"]

Now let’s see why averaging works mathematically, and how Random Forests take this idea even further.

Why ensembles work

Think about the bias-variance tradeoff. A deep decision tree has low bias (it can fit complex patterns) but high variance (it overfits to noise in the training data). Pruning or limiting depth reduces variance but increases bias. You are stuck trading one problem for another.

Ensembles break this tradeoff. Instead of building one careful tree, you build many rough trees and average their predictions. Each individual tree overfits in its own way, to its own noise. But when you combine them, the noise cancels out. The bias stays roughly the same (each tree is still expressive), but the variance drops.

Here is the key intuition. Suppose you have $B$ independent random variables, each with variance $\sigma^2$. The variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \frac{\sigma^2}{B}$$

More trees, less variance. That is the entire motivation behind ensemble methods.

Bootstrap sampling

You only have one training set. How do you train $B$ different trees? You create $B$ new datasets by bootstrap sampling: drawing $n$ samples from your $n$ training points, with replacement.

With replacement means the same point can appear multiple times in a single bootstrap sample. Some points show up twice or three times. Others do not appear at all.

What fraction of the original data appears in a single bootstrap sample? The probability that a specific point is NOT selected in any single draw is $(1 - 1/n)$. Over $n$ draws:

$$P(\text{point excluded}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$$

So roughly 63.2% of the original training points appear in each bootstrap sample, and about 36.8% are left out. This matters later when we discuss out-of-bag error.
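You can confirm the 63.2% figure with a short simulation (a sketch; the dataset size and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000           # size of the "training set"
trials = 100

fractions = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)            # draw n indices with replacement
    fractions.append(np.unique(sample).size / n)   # fraction of unique points

print(f"Mean unique fraction: {np.mean(fractions):.3f}")  # ~0.632
```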

Bagging: Bootstrap Aggregating

Bagging, short for Bootstrap Aggregating, is straightforward:

  1. Create $B$ bootstrap samples from the training data
  2. Train a full (unpruned) decision tree on each sample
  3. To predict, collect predictions from all $B$ trees
  4. For classification: take the majority vote
  5. For regression: take the average

That is it. No regularization tricks, no careful tuning. Each tree overfits, but the aggregate does not.
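The four steps map directly onto code. Here is a hand-rolled sketch of bagging for classification, using scikit-learn decision trees on a synthetic dataset (the dataset and the choice of 50 trees are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B = 50

# Steps 1 and 2: train one full tree per bootstrap sample
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Steps 3 and 4: collect predictions and take the majority vote
votes = np.stack([t.predict(X_test) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
bagged_acc = (bagged_pred == y_test).mean()
print(f"Single tree: {single_acc:.3f}  Bagged: {bagged_acc:.3f}")
```

scikit-learn packages the same logic as `BaggingClassifier`; the loop above just makes the mechanics explicit.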

Worked Example 1: Bootstrap sampling and bagging

Suppose we have 6 training points for a binary classification problem (predicting whether a student passes an exam):

| Point | Hours studied ($x$) | Pass? ($y$) |
|---|---|---|
| A | 1 | 0 |
| B | 2 | 0 |
| C | 3 | 0 |
| D | 5 | 1 |
| E | 6 | 1 |
| F | 8 | 1 |

We draw 3 bootstrap samples (each of size 6, with replacement):

Bootstrap sample 1: {A, A, C, D, E, F}
Bootstrap sample 2: {B, C, D, D, F, F}
Bootstrap sample 3: {A, B, B, C, E, F}

Now we train a decision stump (a tree with one split) on each.

Tree 1 (trained on {A, A, C, D, E, F}):

  • Points with $x \leq 3$: labels are {0, 0, 0}, predict 0
  • Points with $x > 3$: labels are {1, 1, 1}, predict 1
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 2 (trained on {B, C, D, D, F, F}):

  • Try split at $x \leq 2$: left = {0}, right = {0, 1, 1, 1, 1}. Misclassifications: 1 (C is class 0 on the right).
  • Try split at $x \leq 3$: left = {0, 0}, right = {1, 1, 1, 1}. Misclassifications: 0.
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 3 (trained on {A, B, B, C, E, F}):

  • Try split at $x \leq 3$: left = {0, 0, 0, 0}, right = {1, 1}. Misclassifications: 0.
  • Best split: $x \leq 3$ vs $x > 3$. Training error = 0/6 = 0.

Now let’s predict for a new point: $x = 4$ (hours studied).

| Tree | Split rule | Prediction for $x = 4$ |
|---|---|---|
| 1 | $x \leq 3 \to 0$, else 1 | 1 |
| 2 | $x \leq 3 \to 0$, else 1 | 1 |
| 3 | $x \leq 3 \to 0$, else 1 | 1 |

Majority vote: 1 (pass). All three trees agree here.

Now predict for $x = 3$:

| Tree | Prediction for $x = 3$ |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |

Majority vote: 0 (fail).

In this simple example the trees agreed because they found the same split. In practice, with more features and deeper trees, each bootstrap sample produces a different tree structure, and the majority vote smooths out the disagreements.

A single tree trained on the full dataset would also split at $x \leq 3$. The real benefit of bagging shows up when the data is noisier and the trees are deeper, so they each overfit in different directions.

Why correlation is the enemy

The variance formula $\sigma^2 / B$ assumes the trees are independent. They are not. All $B$ trees are trained on overlapping data from the same original dataset. If a strong feature dominates, every tree will split on it first, making the trees correlated.

For $B$ identically distributed random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2$$

As $B \to \infty$, the second term vanishes. You are left with $\rho \sigma^2$. No matter how many trees you add, you cannot push the variance below this floor. The correlation $\rho$ between trees is the bottleneck.
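The formula is easy to check numerically. This sketch builds correlated variables with unit variance by mixing a shared component with independent noise (the specific values of rho and B are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, B, trials = 0.3, 200, 20_000

# X_i = sqrt(rho)*Z + sqrt(1-rho)*E_i has variance 1 and pairwise correlation rho
Z = rng.standard_normal(trials)             # shared component
E = rng.standard_normal((trials, B))        # independent components
X = np.sqrt(rho) * Z[:, None] + np.sqrt(1 - rho) * E

empirical = X.mean(axis=1).var()
predicted = rho + (1 - rho) / B             # rho*sigma^2 + (1-rho)/B * sigma^2, with sigma^2 = 1
print(f"Empirical: {empirical:.4f}  Formula: {predicted:.4f}")
```

Even with 200 averaged variables, the variance stays near the floor of 0.3 rather than shrinking toward zero.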

This is exactly the problem Random Forests solve.

Single tree vs forest: error decreases with more trees

graph LR
  A["1 tree: high variance"] --> B["10 trees: lower variance"]
  B --> C["100 trees: much lower"]
  C --> D["500 trees: diminishing returns"]
  D --> E["Variance floor = rho * sigma-squared"]

Prediction error across 100 test samples: the forest (orange) is smoother and consistently lower than a single tree (blue)

Random Forests: decorrelating the trees

Random Forests add one twist to bagging: feature subsampling at each split. When deciding the best split at a node, instead of considering all $p$ features, the algorithm randomly selects a subset of $m$ features and picks the best split among those.

The standard choices for mm:

  • Classification: m=pm = \lfloor\sqrt{p}\rfloor
  • Regression: m=p/3m = \lfloor p/3 \rfloor

Why does this help? If one feature is very strong, a bagged ensemble will use it at the root of almost every tree. The trees look similar and are highly correlated. By forcing each split to consider only a random subset of features, you prevent this. Sometimes the dominant feature is not in the subset, so the tree must use other features. The trees become more diverse, their correlation $\rho$ drops, and the variance floor $\rho\sigma^2$ drops with it.

The tradeoff: each individual tree is slightly worse (higher bias, since it cannot always pick the best feature). But the reduction in correlation more than compensates. The ensemble is stronger overall.

Random Forest: bootstrap samples, feature subsets, aggregate

graph TD
  A["Original data"] --> B1["Bootstrap sample 1"]
  A --> B2["Bootstrap sample 2"]
  A --> B3["Bootstrap sample B"]
  B1 --> T1["Tree 1: random feature subset at each split"]
  B2 --> T2["Tree 2: random feature subset at each split"]
  B3 --> T3["Tree B: random feature subset at each split"]
  T1 --> V["Aggregate: majority vote or average"]
  T2 --> V
  T3 --> V
  V --> P["Final prediction"]

Here is a minimal Random Forest in Python using scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features='sqrt',   # sqrt(p) features per split
    max_depth=None,         # grow full trees
    oob_score=True,         # compute OOB error
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Test accuracy:  {rf.score(X_test, y_test):.3f}")
print(f"OOB accuracy:   {rf.oob_score_:.3f}")

Out-of-Bag (OOB) error

Remember that each bootstrap sample leaves out about 36.8% of the training data. These left-out points are called out-of-bag (OOB) samples. For each training point, roughly one-third of the trees never saw it during training. You can use those trees to predict that point, giving you a free validation estimate.

The procedure:

  1. For each training point $x_i$, collect predictions only from trees where $x_i$ was NOT in the bootstrap sample
  2. Aggregate those predictions (majority vote or average)
  3. Compare to the true label $y_i$
  4. The fraction of incorrect OOB predictions is the OOB error

This closely approximates leave-one-out cross-validation, but you get it for free, with no extra computation.
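The procedure can be sketched by hand. In the manual bagging loop below (dataset and tree count are illustrative), each tree records which points it never saw and votes only on those:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n, B = len(X), 200

votes = np.zeros(n)    # sum of OOB predictions per training point
counts = np.zeros(n)   # number of trees for which the point was OOB
for _ in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)       # points left out of this sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

has_oob = counts > 0
oob_pred = (votes[has_oob] / counts[has_oob] >= 0.5).astype(int)
oob_error = (oob_pred != y[has_oob]).mean()
print(f"OOB error: {oob_error:.3f}")
```

With 200 trees, every point is OOB for roughly a third of them, so each point gets a meaningful vote.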

Worked Example 2: OOB error computation

Using the same 3 bootstrap samples and 6 data points from Example 1:

Bootstrap sample 1: {A, A, C, D, E, F}
Bootstrap sample 2: {B, C, D, D, F, F}
Bootstrap sample 3: {A, B, B, C, E, F}

First, identify which points are out-of-bag for each tree:

| Point | In sample 1? | In sample 2? | In sample 3? | OOB for trees |
|---|---|---|---|---|
| A | Yes | No | Yes | {2} |
| B | No | Yes | Yes | {1} |
| C | Yes | Yes | Yes | {} |
| D | Yes | Yes | No | {3} |
| E | Yes | No | Yes | {2} |
| F | Yes | Yes | Yes | {} |

Now compute OOB predictions. Recall all three trees learned the rule $x \leq 3 \to 0$, else $\to 1$.

Point A ($x = 1$, true label = 0): OOB trees = {Tree 2}. Tree 2 predicts: $1 \leq 3$, so 0. OOB prediction = 0. ✓ Correct.

Point B ($x = 2$, true label = 0): OOB trees = {Tree 1}. Tree 1 predicts: $2 \leq 3$, so 0. OOB prediction = 0. ✓ Correct.

Point C ($x = 3$, true label = 0): No OOB trees available. Skip this point.

Point D ($x = 5$, true label = 1): OOB trees = {Tree 3}. Tree 3 predicts: $5 > 3$, so 1. OOB prediction = 1. ✓ Correct.

Point E ($x = 6$, true label = 1): OOB trees = {Tree 2}. Tree 2 predicts: $6 > 3$, so 1. OOB prediction = 1. ✓ Correct.

Point F ($x = 8$, true label = 1): No OOB trees available. Skip this point.

OOB error = number of incorrect OOB predictions / number of points with OOB predictions:

$$\text{OOB error} = \frac{0}{4} = 0\%$$

All four evaluable points were classified correctly. In practice, with hundreds of trees, every training point will be OOB for many trees, so you get a reliable error estimate for all points.

⚠ With only 3 trees, some points (C and F here) have no OOB predictions. This is why you need a reasonable number of trees, typically 100 or more.

Feature importance

Random Forests give you two ways to measure how important each feature is.

Impurity-based importance

Every time a feature is used for a split, it reduces the impurity (Gini or entropy) by some amount. Sum up the total impurity reduction across all trees for each feature. Features that produce larger reductions are more important.

This is fast to compute since it falls out of the training process. But it has a known bias: it favors features with many possible split points (high cardinality features like IDs or continuous variables).
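In scikit-learn these scores are exposed as the feature_importances_ attribute. A small sketch on a synthetic dataset where, by construction, only the first five of twenty features are informative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 5 informative features in columns 0-4
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = rf.feature_importances_   # total impurity reduction, normalized to sum to 1
ranking = np.argsort(importances)[::-1]
print("Top 5 features:", ranking[:5])   # the informative columns should dominate
```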

Permutation importance

After training, take each feature one at a time and randomly shuffle its values across the test set. Measure how much the model’s accuracy drops. If shuffling feature $j$ causes a big drop, that feature was important.

This is more reliable than impurity-based importance because it measures actual predictive impact on held-out data. It is also model-agnostic; you can use it with any classifier.

Feature subsampling in Random Forests

graph TD
  A["All p features available"] --> B["At each split: randomly pick m features"]
  B --> C["Find best split among those m"]
  C --> D["Different splits use different subsets"]
  D --> E["Trees become diverse"]
  E --> F["Lower correlation between trees"]
  F --> G["Lower ensemble variance"]

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")

Feature importance scores from a Random Forest trained on census data (permutation method)

Key hyperparameters

Random Forests have a few knobs worth tuning:

Number of trees ($B$, or n_estimators): More trees is almost always better. Performance improves and then plateaus. It never gets worse (unlike boosting, which can overfit with too many rounds). Start with 200 to 500. If you have time, go higher. The only cost is computation.

Max features ($m$, or max_features): Controls the feature subset size at each split. Smaller $m$ means more randomness, lower correlation between trees, but higher bias per tree. The defaults ($\sqrt{p}$ for classification, $p/3$ for regression) work well in most cases. Tuning this via cross-validation or OOB error can squeeze out a little more performance.
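Since the OOB score comes for free, you can compare candidate values of max_features without a separate validation set. A sketch (the candidates and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

for m in ["sqrt", "log2", None]:  # None = consider all p features at each split
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=42).fit(X, y)
    print(f"max_features={m}: OOB accuracy {rf.oob_score_:.3f}")
```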

Max depth (max_depth): By default, trees are grown fully until leaves are pure or contain min_samples_split points. You can limit depth to speed up training, but full-depth trees usually work fine since bagging handles the variance.

Min samples per leaf (min_samples_leaf): Setting this to 5 or 10 instead of 1 can smooth predictions slightly and speed up training. Worth trying if you have noisy data.

Bootstrap (bootstrap): You can turn off bootstrap sampling (bootstrap=False) so that each tree trains on the full dataset. This removes the OOB error estimate and usually performs slightly worse. (The related variant that samples random subsets without replacement is called “pasting.”)

When Random Forests fail

Random Forests are reliable across a wide range of problems, but they have real limitations.

Extrapolation. A Random Forest predicts by averaging tree outputs. Each tree output is a value seen in training. For regression, this means the model literally cannot predict values outside the range of training targets. If your training labels are all between 0 and 100, the Random Forest will never predict 105. Linear models and neural networks can extrapolate; trees cannot.
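A tiny regression sketch makes the ceiling visible (the target function and query point are arbitrary choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X.ravel()          # training targets span 0 to 20

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict([[15.0]])[0]
print(f"Prediction at x=15: {pred:.2f} (true value 30, training max {y.max():.2f})")
```

The forest answers with something near 20, because every leaf stores a training target and no average of leaves can exceed the largest one.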

Very high-dimensional sparse data. Text data with thousands of sparse features (bag-of-words representations) is not a great fit. When pp is huge and each feature is rarely nonzero, the random feature subsets at each split are mostly zeros. Linear models or gradient-boosted trees with careful tuning tend to work better here.

Smooth functions. If the true relationship is a smooth, continuous function (like a sine wave), a Random Forest approximates it with a jagged step function. It works, but it needs a lot of data to get a good approximation. A model that assumes smoothness (like a Gaussian process or a neural network) will be more data-efficient.

Interpretability at scale. A single decision tree is easy to interpret. A forest of 500 trees is not. You can extract feature importances, but you lose the ability to trace a single decision path. If interpretability is critical, consider simpler models or use the forest alongside explanation tools like SHAP values.

Summary

| Concept | Key point |
|---|---|
| Bagging | Train $B$ trees on bootstrap samples, aggregate by vote/average |
| Bootstrap | Sample with replacement; ~63.2% of data appears in each sample |
| Variance reduction | Averaging $B$ uncorrelated predictors cuts variance by $1/B$ |
| Random Forest | Bagging + random feature subsets at each split |
| Feature subsampling | $\sqrt{p}$ for classification, $p/3$ for regression |
| OOB error | Free validation using the ~36.8% of points left out of each tree |
| Correlation floor | Variance cannot drop below $\rho\sigma^2$; decorrelation is key |

The core insight is worth repeating: the power of Random Forests does not come from building better individual trees. It comes from building many diverse trees and combining them. Diversity, not individual strength, drives the ensemble.

What comes next

Bagging reduces variance by building independent trees and averaging them. But what if, instead of treating all trees equally, you built each new tree to fix the mistakes of the previous ones? That is the idea behind boosting.

In the next post, Boosting: AdaBoost and Gradient Boosting, we cover how to sequentially build trees that focus on hard examples, producing ensembles that reduce bias as well as variance. Where bagging builds trees in parallel, boosting builds them in sequence, and the result is one of the most effective off-the-shelf methods in machine learning.
