Ensemble methods: Bagging and Random Forests
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
A single decision tree is unstable. Change one training point and the entire structure can shift. The splits change, the leaves change, and predictions flip. This is high variance, and it is the central weakness of decision trees.
The fix is surprisingly simple: build many trees and let them vote.
Wisdom of crowds: the intuition
Ask five people to guess the number of jellybeans in a jar. Each guess is off by a lot. But the average of all five guesses is often surprisingly close to the true number. Random errors in different directions cancel out.
The same principle applies to ML models. Here are five decision trees making predictions on the same five test points.
| Test point | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Average | True value |
|---|---|---|---|---|---|---|---|
| A | 95 | 110 | 88 | 105 | 102 | 100.0 | 100 |
| B | 155 | 140 | 165 | 148 | 142 | 150.0 | 150 |
| C | 205 | 190 | 210 | 195 | 200 | 200.0 | 200 |
| D | 72 | 85 | 68 | 78 | 82 | 77.0 | 75 |
| E | 128 | 115 | 135 | 120 | 122 | 124.0 | 125 |
No single tree nails every prediction. But the average is close to the true value every time. Individual trees overfit in different directions. Averaging cancels the noise.
Bootstrap sampling: creating diverse training sets
```mermaid
graph TD
    A["Original dataset: N points"] --> B["Sample N points WITH replacement"]
    B --> C["Bootstrap sample 1: ~63% unique points"]
    A --> D["Sample N points WITH replacement"]
    D --> E["Bootstrap sample 2: different ~63%"]
    A --> F["Sample N points WITH replacement"]
    F --> G["Bootstrap sample 3: different ~63%"]
    C --> H["Train Tree 1"]
    E --> I["Train Tree 2"]
    G --> J["Train Tree 3"]
```
Now let’s see why averaging works mathematically, and how Random Forests take this idea even further.
Why ensembles work
Think about the bias-variance tradeoff. A deep decision tree has low bias (it can fit complex patterns) but high variance (it overfits to noise in the training data). Pruning or limiting depth reduces variance but increases bias. You are stuck trading one problem for another.
Ensembles break this tradeoff. Instead of building one careful tree, you build many rough trees and average their predictions. Each individual tree overfits in its own way, to its own noise. But when you combine them, the noise cancels out. The bias stays roughly the same (each tree is still expressive), but the variance drops.
Here is the key intuition. Suppose you have $B$ independent random variables $X_1, \dots, X_B$, each with variance $\sigma^2$. The variance of their average is:

$$\mathrm{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \frac{\sigma^2}{B}$$
More trees, less variance. That is the entire motivation behind ensemble methods.
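To see the $\sigma^2 / B$ formula in action, here is a quick simulation (the values $B = 50$ and $\sigma^2 = 4$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 50          # number of "trees" (independent predictors)
sigma2 = 4.0    # variance of each individual predictor
trials = 20000  # repeat the experiment many times

# Each row is one experiment: B independent predictions, each with variance sigma2
preds = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, B))
avg = preds.mean(axis=1)

print(f"Variance of one predictor: {preds[:, 0].var():.3f}")  # ~4.0
print(f"Variance of the average:   {avg.var():.3f}")          # ~4.0 / 50 = 0.08
```

The average of 50 noisy predictors is 50 times less variable than any one of them, exactly as the formula predicts.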
Bootstrap sampling
You only have one training set. How do you train different trees? You create new datasets by bootstrap sampling: drawing $N$ samples from your $N$ training points, with replacement.
With replacement means the same point can appear multiple times in a single bootstrap sample. Some points show up twice or three times. Others do not appear at all.
What fraction of the original data appears in a single bootstrap sample? The probability that a specific point is NOT selected in any single draw is $1 - \frac{1}{N}$. Over $N$ draws:

$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368 \quad \text{for large } N$$
So roughly 63.2% of the original training points appear in each bootstrap sample, and about 36.8% are left out. This matters later when we discuss out-of-bag error.
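You can verify the 63.2% figure empirically in a couple of lines (the dataset size here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # dataset size

# One bootstrap sample: N draws with replacement from indices 0..N-1
sample = rng.integers(0, N, size=N)
unique_frac = np.unique(sample).size / N

print(f"Unique points in bootstrap sample: {unique_frac:.3f}")  # ~0.632
print(f"Theoretical limit 1 - 1/e:         {1 - np.exp(-1):.3f}")
```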
Bagging: Bootstrap Aggregating
Bagging, short for Bootstrap Aggregating, is straightforward:
- Create $B$ bootstrap samples from the training data
- Train a full (unpruned) decision tree on each sample
- To predict, collect predictions from all trees
- For classification: take the majority vote
- For regression: take the average
That is it. No regularization tricks, no careful tuning. Each tree overfits, but the aggregate does not.
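As a sketch of those four steps, here is bagging from scratch on a synthetic dataset (all parameters illustrative; in practice you would reach for scikit-learn's BaggingClassifier or a Random Forest):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

B = 50
trees = []
for _ in range(B):
    # Step 1: bootstrap sample (N draws with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: full unpruned tree on that sample
    trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# Steps 3-4: collect predictions from all trees, take the majority vote
votes = np.stack([t.predict(X_te) for t in trees])   # shape (B, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote (ties -> 1)

acc = (y_pred == y_te).mean()
print(f"Bagged accuracy: {acc:.3f}")
```

Note that each tree is deliberately left unpruned: the individual trees overfit their bootstrap samples, and the vote does the smoothing.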
Worked Example 1: Bootstrap sampling and bagging
Suppose we have 6 training points for a binary classification problem (predicting whether a student passes an exam):
| Point | Hours studied ($x$) | Pass? ($y$) |
|---|---|---|
| A | 1 | 0 |
| B | 2 | 0 |
| C | 3 | 0 |
| D | 5 | 1 |
| E | 6 | 1 |
| F | 8 | 1 |
We draw 3 bootstrap samples (each of size 6, with replacement):
- Bootstrap sample 1: {A, A, C, D, E, F}
- Bootstrap sample 2: {B, C, D, D, F, F}
- Bootstrap sample 3: {A, B, B, C, E, F}
Now we train a decision stump (a tree with one split) on each.
Tree 1 (trained on {A, A, C, D, E, F}):
- Points with $x \le 3$: labels are {0, 0, 0}, predict 0
- Points with $x > 3$: labels are {1, 1, 1}, predict 1
- Best split: $x \le 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 2 (trained on {B, C, D, D, F, F}):
- Try split at $x \le 2$: left = {0}, right = {0, 1, 1, 1, 1}. Misclassifications: 1 (C is class 0 on the right).
- Try split at $x \le 3$: left = {0, 0}, right = {1, 1, 1, 1}. Misclassifications: 0.
- Best split: $x \le 3$ vs $x > 3$. Training error = 0/6 = 0.

Tree 3 (trained on {A, B, B, C, E, F}):
- Try split at $x \le 3$: left = {0, 0, 0, 0}, right = {1, 1}. Misclassifications: 0.
- Best split: $x \le 3$ vs $x > 3$. Training error = 0/6 = 0.
Now let’s predict for a new point: $x = 4$ (hours studied).
| Tree | Split rule | Prediction for $x = 4$ |
|---|---|---|
| 1 | 0 if $x \le 3$, else 1 | 1 |
| 2 | 0 if $x \le 3$, else 1 | 1 |
| 3 | 0 if $x \le 3$, else 1 | 1 |
Majority vote: 1 (pass). All three trees agree here.
Now predict for $x = 2$:

| Tree | Prediction for $x = 2$ |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
Majority vote: 0 (fail).
In this simple example the trees agreed because they found the same split. In practice, with more features and deeper trees, each bootstrap sample produces a different tree structure, and the majority vote smooths out the disagreements.
A single tree trained on the full dataset would also split at $x \le 3$. The real benefit of bagging shows up when the data is noisier and the trees are deeper, so they each overfit in different directions.
Why correlation is the enemy
The variance formula assumes the trees are independent. They are not. All trees are trained on overlapping data from the same original dataset. If a strong feature dominates, every tree will split on it first, making the trees correlated.
For $B$ identically distributed random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\mathrm{Var}\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$
As $B \to \infty$, the second term vanishes. You are left with $\rho \sigma^2$. No matter how many trees you add, you cannot push the variance below this floor. The correlation between trees is the bottleneck.
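A simulation makes the floor visible. Here correlated predictors are built from a shared component, so that each has variance $\sigma^2$ and pairwise correlation $\rho$ (values chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
B, rho, sigma = 100, 0.3, 1.0
trials = 20000

# Build B predictors with variance sigma^2 and pairwise correlation rho
# via a shared component: X_i = sqrt(rho)*Z + sqrt(1-rho)*eps_i
Z = rng.normal(0, sigma, size=(trials, 1))      # shared across all B predictors
eps = rng.normal(0, sigma, size=(trials, B))    # independent per predictor
X = np.sqrt(rho) * Z + np.sqrt(1 - rho) * eps

var_avg = X.mean(axis=1).var()
floor = rho * sigma**2 + (1 - rho) / B * sigma**2
print(f"Variance of average:                  {var_avg:.4f}")
print(f"rho*sigma^2 + (1-rho)/B * sigma^2 =   {floor:.4f}")  # 0.3 + 0.7/100 = 0.307
```

Even with 100 predictors, the variance of the average sits near $\rho \sigma^2 = 0.3$, nowhere near the $\sigma^2 / B = 0.01$ you would get from independent predictors.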
This is exactly the problem Random Forests solve.
Single tree vs forest: error decreases with more trees
```mermaid
graph LR
    A["1 tree: high variance"] --> B["10 trees: lower variance"]
    B --> C["100 trees: much lower"]
    C --> D["500 trees: diminishing returns"]
    D --> E["Variance floor = rho * sigma-squared"]
```
Prediction error across 100 test samples: the forest (orange) is smoother and consistently lower than a single tree (blue)
Random Forests: decorrelating the trees
Random Forests add one twist to bagging: feature subsampling at each split. When deciding the best split at a node, instead of considering all features, the algorithm randomly selects a subset of features and picks the best split among those.
The standard choices for $m$, the number of features considered at each split (out of $p$ total):
- Classification: $m = \sqrt{p}$
- Regression: $m = p/3$
Why does this help? If one feature is very strong, a bagged ensemble will use it at the root of almost every tree. The trees look similar and are highly correlated. By forcing each split to consider only a random subset of features, you prevent this. Sometimes the dominant feature is not in the subset, so the tree must use other features. The trees become more diverse, their correlation drops, and the variance floor drops with it.
The tradeoff: each individual tree is slightly worse (higher bias, since it cannot always pick the best feature). But the reduction in correlation more than compensates. The ensemble is stronger overall.
Random Forest: bootstrap samples, feature subsets, aggregate
```mermaid
graph TD
    A["Original data"] --> B1["Bootstrap sample 1"]
    A --> B2["Bootstrap sample 2"]
    A --> B3["Bootstrap sample B"]
    B1 --> T1["Tree 1: random feature subset at each split"]
    B2 --> T2["Tree 2: random feature subset at each split"]
    B3 --> T3["Tree B: random feature subset at each split"]
    T1 --> V["Aggregate: majority vote or average"]
    T2 --> V
    T3 --> V
    V --> P["Final prediction"]
```
Here is a minimal Random Forest in Python using scikit-learn:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features='sqrt',   # sqrt(p) features per split
    max_depth=None,        # grow full trees
    oob_score=True,        # compute OOB error
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```
Out-of-Bag (OOB) error
Remember that each bootstrap sample leaves out about 36.8% of the training data. These left-out points are called out-of-bag (OOB) samples. For each training point, roughly one-third of the trees never saw it during training. You can use those trees to predict that point, giving you a free validation estimate.
The procedure:
- For each training point $x_i$, collect predictions only from trees where $x_i$ was NOT in the bootstrap sample
- Aggregate those predictions (majority vote or average)
- Compare to the true label
- The fraction of incorrect OOB predictions is the OOB error
This behaves much like cross-validation as an estimate of generalization error, but you get it essentially for free: no held-out data and no extra training runs.
Worked Example 2: OOB error computation
Using the same 3 bootstrap samples and 6 data points from Example 1:
- Bootstrap sample 1: {A, A, C, D, E, F}
- Bootstrap sample 2: {B, C, D, D, F, F}
- Bootstrap sample 3: {A, B, B, C, E, F}
First, identify which points are out-of-bag for each tree:
| Point | In sample 1? | In sample 2? | In sample 3? | OOB for trees |
|---|---|---|---|---|
| A | ✓ | ✗ | ✓ | {2} |
| B | ✗ | ✓ | ✓ | {1} |
| C | ✓ | ✓ | ✓ | {} |
| D | ✓ | ✓ | ✗ | {3} |
| E | ✓ | ✗ | ✓ | {2} |
| F | ✓ | ✓ | ✓ | {} |
Now compute OOB predictions. Recall all three trees learned the same rule: predict 0 if $x \le 3$, else predict 1.
Point A ($x = 1$, true label = 0): OOB trees = {Tree 2}. Tree 2 predicts: $1 \le 3$, so 0. OOB prediction = 0. ✓ Correct.
Point B ($x = 2$, true label = 0): OOB trees = {Tree 1}. Tree 1 predicts: $2 \le 3$, so 0. OOB prediction = 0. ✓ Correct.
Point C ($x = 3$, true label = 0): No OOB trees available. Skip this point.
Point D ($x = 5$, true label = 1): OOB trees = {Tree 3}. Tree 3 predicts: $5 > 3$, so 1. OOB prediction = 1. ✓ Correct.
Point E ($x = 6$, true label = 1): OOB trees = {Tree 2}. Tree 2 predicts: $6 > 3$, so 1. OOB prediction = 1. ✓ Correct.
Point F ($x = 8$, true label = 1): No OOB trees available. Skip this point.
OOB error = number of incorrect OOB predictions / number of points with OOB predictions:

$$\text{OOB error} = \frac{0}{4} = 0$$
All four evaluable points were classified correctly. In practice, with hundreds of trees, every training point will be OOB for many trees, so you get a reliable error estimate for all points.
⚠ With only 3 trees, some points (C and F here) have no OOB predictions. This is why you need a reasonable number of trees, typically 100 or more.
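The same bookkeeping scales to real data. Here is a from-scratch sketch of OOB error with 100 trees on a synthetic dataset (parameters illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
N, B = len(X), 100

votes = np.zeros((N, 2))  # OOB vote counts per point, per class
for _ in range(B):
    idx = rng.integers(0, N, size=N)         # bootstrap sample
    oob = np.setdiff1d(np.arange(N), idx)    # points NOT in this sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    pred = tree.predict(X[oob])
    votes[oob, pred] += 1                    # only OOB trees vote for a point

has_oob = votes.sum(axis=1) > 0
oob_pred = votes[has_oob].argmax(axis=1)     # majority vote among OOB trees
oob_error = (oob_pred != y[has_oob]).mean()
print(f"Points with OOB votes: {has_oob.sum()}/{N}")
print(f"OOB error: {oob_error:.3f}")
```

With 100 trees, the chance that a point is in-bag for every single tree is about $0.632^{100}$, i.e. negligible, so every training point gets an OOB prediction.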
Feature importance
Random Forests give you two ways to measure how important each feature is.
Impurity-based importance
Every time a feature is used for a split, it reduces the impurity (Gini or entropy) by some amount. Sum up the total impurity reduction across all trees for each feature. Features that produce larger reductions are more important.
This is fast to compute since it falls out of the training process. But it has a known bias: it favors features with many possible split points (high cardinality features like IDs or continuous variables).
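In scikit-learn, these impurity-based scores are exposed on any fitted forest as feature_importances_, normalized to sum to 1 (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity reduction per feature, summed over all splits and trees,
# then normalized so the importances sum to 1
imp = rf.feature_importances_
for i in np.argsort(imp)[::-1][:3]:
    print(f"Feature {i}: {imp[i]:.3f}")
```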
Permutation importance
After training, take each feature one at a time and randomly shuffle its values across the test set. Measure how much the model’s accuracy drops. If shuffling feature $j$ causes a big drop, that feature was important.
This is more reliable than impurity-based importance because it measures actual predictive impact on held-out data. It is also model-agnostic; you can use it with any classifier.
Feature subsampling in Random Forests
```mermaid
graph TD
    A["All p features available"] --> B["At each split: randomly pick m features"]
    B --> C["Find best split among those m"]
    C --> D["Different splits use different subsets"]
    D --> E["Trees become diverse"]
    E --> F["Lower correlation between trees"]
    F --> G["Lower ensemble variance"]
```
```python
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```
Feature importance scores from a Random Forest trained on census data (permutation method)
Key hyperparameters
Random Forests have a few knobs worth tuning:
Number of trees ($B$, or n_estimators): More trees is almost always better. Performance improves and then plateaus. It never gets worse (unlike boosting, which can overfit with too many rounds). Start with 200 to 500. If you have time, go higher. The only cost is computation.
Max features ($m$, or max_features): Controls the feature subset size at each split. Smaller $m$ means more randomness, lower correlation between trees, but higher bias per tree. The defaults ($m = \sqrt{p}$ for classification, $m = p/3$ for regression) work well in most cases. Tuning this via cross-validation or OOB error can squeeze out a little more performance.
Max depth (max_depth): By default, trees are grown fully until leaves are pure or contain min_samples_split points. You can limit depth to speed up training, but full-depth trees usually work fine since bagging handles the variance.
Min samples per leaf (min_samples_leaf): Setting this to 5 or 10 instead of 1 can smooth predictions slightly and speed up training. Worth trying if you have noisy data.
Bootstrap (bootstrap): You can turn off bootstrap sampling so that every tree trains on the full dataset. This removes the OOB error estimate and usually performs slightly worse, since the trees lose a source of diversity. (A related variant, “pasting,” trains each tree on a random subset drawn without replacement.)
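Because the OOB score is free, you can use it to compare hyperparameter settings without a cross-validation loop. A sketch comparing a few max_features values on synthetic data (all values illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Compare candidate max_features values by OOB accuracy:
# no held-out set and no cross-validation loop needed
scores = {}
for m in [2, 4, 'sqrt', 8, None]:   # None means all p features
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    scores[m] = rf.oob_score_
    print(f"max_features={m}: OOB accuracy = {rf.oob_score_:.3f}")

best = max(scores, key=scores.get)
print(f"Best by OOB: max_features={best}")
```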
When Random Forests fail
Random Forests are reliable across a wide range of problems, but they have real limitations.
Extrapolation. A Random Forest predicts by averaging tree outputs, and each tree outputs an average of training targets from one of its leaves. For regression, this means the model literally cannot predict values outside the range of the training targets. If your training labels are all between 0 and 100, the Random Forest will never predict 105. Linear models and neural networks can extrapolate; trees cannot.
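A tiny experiment makes the extrapolation failure concrete (synthetic data; the linear model is included for contrast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train on x in [0, 10] with y = 10x, so targets span [0, 100]
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 10 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_out = np.array([[15.0]])  # well outside the training range
print(f"Linear model at x=15:  {lin.predict(X_out)[0]:.1f}")  # ~150
print(f"Random forest at x=15: {rf.predict(X_out)[0]:.1f}")   # cannot exceed 100
```

The forest routes x = 15 into its rightmost leaves, which hold the largest training targets, so its prediction saturates at the training maximum while the linear model continues the trend.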
Very high-dimensional sparse data. Text data with thousands of sparse features (bag-of-words representations) is not a great fit. When is huge and each feature is rarely nonzero, the random feature subsets at each split are mostly zeros. Linear models or gradient-boosted trees with careful tuning tend to work better here.
Smooth functions. If the true relationship is a smooth, continuous function (like a sine wave), a Random Forest approximates it with a jagged step function. It works, but it needs a lot of data to get a good approximation. A model that assumes smoothness (like a Gaussian process or a neural network) will be more data-efficient.
Interpretability at scale. A single decision tree is easy to interpret. A forest of 500 trees is not. You can extract feature importances, but you lose the ability to trace a single decision path. If interpretability is critical, consider simpler models or use the forest alongside explanation tools like SHAP values.
Summary
| Concept | Key point |
|---|---|
| Bagging | Train trees on bootstrap samples, aggregate by vote/average |
| Bootstrap | Sample with replacement; ~63.2% of data appears in each sample |
| Variance reduction | Averaging $B$ uncorrelated predictors cuts variance by a factor of $B$ |
| Random Forest | Bagging + random feature subsets at each split |
| Feature subsampling | $m = \sqrt{p}$ for classification, $m = p/3$ for regression |
| OOB error | Free validation using the ~36.8% of points left out of each tree |
| Correlation floor | Variance cannot drop below $\rho \sigma^2$; decorrelation is key |
The core insight is worth repeating: the power of Random Forests does not come from building better individual trees. It comes from building many diverse trees and combining them. Diversity, not individual strength, drives the ensemble.
What comes next
Bagging reduces variance by building independent trees and averaging them. But what if, instead of treating all trees equally, you built each new tree to fix the mistakes of the previous ones? That is the idea behind boosting.
In the next post, Boosting: AdaBoost and Gradient Boosting, we cover how to sequentially build trees that focus on hard examples, producing ensembles that reduce bias as well as variance. Where bagging builds trees in parallel, boosting builds them in sequence, and the result is one of the most effective off-the-shelf methods in machine learning.