Boosting: AdaBoost and Gradient Boosting
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Boosting turns a collection of mediocre models into one strong model. The core idea: train models sequentially, where each new model pays extra attention to the examples the previous models got wrong. Over rounds, the ensemble zeros in on the hard cases and the combined prediction becomes highly accurate.
This is fundamentally different from bagging (like Random Forests), which trains models in parallel on random subsets of data. Bagging reduces variance. Boosting reduces bias. Bagging says “let’s average out the noise.” Boosting says “let’s fix the mistakes.”
A committee of weak experts
Imagine three friends who are mediocre at predicting sports outcomes. Alone, each one is barely better than flipping a coin.
| Game | Friend A | Friend B | Friend C | Actual | Majority vote |
|---|---|---|---|---|---|
| 1 | Win | Lose | Win | Win | Win, correct |
| 2 | Lose | Win | Win | Win | Win, correct |
| 3 | Lose | Lose | Win | Lose | Lose, correct |
| 4 | Win | Lose | Lose | Lose | Lose, correct |
| 5 | Win | Win | Lose | Win | Win, correct |
Friends A and C each get 3 of the 5 games right, and Friend B gets 4. None of them is reliable alone, yet their majority vote gets all 5 correct. Combining weak predictors produces a strong one.
Boosting takes this further. Instead of treating all friends equally, it makes each new friend focus on the games the previous friends got wrong. Friend B studies the games Friend A missed. Friend C studies the games both A and B missed. Each round targets the hardest cases.
Boosting builds models sequentially, each fixing prior mistakes
```mermaid
graph LR
    A["Model 1: learns from data"] --> B["Find mistakes"]
    B --> C["Model 2: focuses on mistakes"]
    C --> D["Find remaining mistakes"]
    D --> E["Model 3: focuses on those"]
    E --> F["Combine all models with weights"]
```
Now let’s define weak learners formally and walk through the AdaBoost algorithm step by step.
Weak Learners
A weak learner is any model that performs just slightly better than random guessing. For binary classification, that means accuracy above 50%. That’s a low bar, and that’s the point. Boosting doesn’t need strong base models. It builds strength through combination.
The most common weak learner is a decision stump, a decision tree with a single split. One feature, one threshold, two leaves. Alone, a stump is almost useless. But when you combine dozens or hundreds of them, each one trained to correct the errors of the ones before it, the result can be remarkably powerful.
AdaBoost
AdaBoost (Adaptive Boosting) was one of the first practical boosting algorithms. Here is how it works:
- Initialize weights. Give every training sample an equal weight: $w_i = \frac{1}{N}$, where $N$ is the number of samples.
- Train a weak learner. Fit a decision stump (or other weak model) using the weighted samples. The learner tries to minimize weighted classification error.
- Compute weighted error. Calculate the weighted error rate $\epsilon_t$, which is the sum of weights for misclassified samples: $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i$
- Compute learner weight. Calculate how much say this learner gets in the final vote: $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
When $\epsilon_t$ is small (few errors), $\alpha_t$ is large, meaning the model gets a strong vote. When $\epsilon_t$ is close to 0.5 (near random), $\alpha_t$ is close to 0.
- Update sample weights. Increase weights for misclassified samples, decrease for correct ones: $w_i \leftarrow w_i \cdot e^{-\alpha_t y_i h_t(x_i)}$
Here $y_i \in \{-1, +1\}$ and $h_t(x_i) \in \{-1, +1\}$. When the prediction is wrong, $y_i h_t(x_i) = -1$, so the exponent is positive and the weight goes up. When correct, the weight goes down.
- Normalize weights. Divide all weights by their sum so they add to 1.
- Repeat for $T$ rounds.
- Final prediction. The ensemble classifies by a weighted majority vote: $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
AdaBoost weight update cycle
```mermaid
graph TD
    A["Initialize equal sample weights"] --> B["Train weak learner on weighted data"]
    B --> C["Compute weighted error"]
    C --> D["Calculate learner weight alpha"]
    D --> E["Increase weights on misclassified samples"]
    E --> F["Decrease weights on correct samples"]
    F --> G["Normalize weights"]
    G --> B
    G --> H["After T rounds: weighted vote of all learners"]
```
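The whole loop is short enough to sketch in NumPy. This is a minimal illustration, not a production implementation: the brute-force stump fitter, the function names, and the epsilon clipping (to avoid division by zero when a stump is perfect) are our own choices.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: try every threshold and polarity,
    keep the one with the lowest weighted 0/1 error."""
    best_err, best = np.inf, None
    for thr in np.unique(X):
        for pol in (1, -1):
            pred = np.where(X < thr, pol, -pol)
            err = w[pred != y].sum()
            if err < best_err:
                best_err, best = err, (thr, pol)
    thr, pol = best
    return lambda X, thr=thr, pol=pol: np.where(X < thr, pol, -pol)

def adaboost(X, y, T):
    """AdaBoost with decision stumps. Labels must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # 1. equal initial weights
    learners, alphas = [], []
    for _ in range(T):
        h = fit_stump(X, y, w)                   # 2. weak learner on weighted data
        pred = h(X)
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # 3. weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)    # 4. learner weight
        w = w * np.exp(-alpha * y * pred)        # 5. reweight samples
        w = w / w.sum()                          # 6. normalize
        learners.append(h)
        alphas.append(alpha)
    # 7. final prediction: sign of the alpha-weighted vote
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, learners)))

# The toy dataset used in the worked example below
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, -1])
model = adaboost(X, y, T=3)
```

Each numbered comment maps to a step of the algorithm above; the returned closure is the weighted majority vote.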
Worked Example: 3 Rounds of AdaBoost
Let’s walk through AdaBoost on a tiny dataset with 5 points and 1 feature.
| Point | $x$ | $y$ |
|---|---|---|
| 1 | 1.0 | +1 |
| 2 | 2.0 | +1 |
| 3 | 3.0 | -1 |
| 4 | 4.0 | -1 |
| 5 | 5.0 | -1 |
Initialization: All weights are $w_i = \frac{1}{5} = 0.2$.
Round 1
We try all possible stumps. Suppose the best stump is: “if $x < 2.5$, predict $+1$; else predict $-1$.” This classifies all 5 points correctly.
Weighted error: $\epsilon_1 = 0$
When $\epsilon_t = 0$, $\alpha_t$ goes to infinity, which means this one stump is perfect and we could stop. But that’s not very instructive, so let’s pick a slightly worse stump to show the real mechanics.
Let’s say we use the stump: “if $x < 1.5$, predict $+1$; else predict $-1$.”
| Point | $x$ | $y$ | Prediction | Correct? |
|---|---|---|---|---|
| 1 | 1.0 | +1 | +1 | ✓ |
| 2 | 2.0 | +1 | -1 | ✗ |
| 3 | 3.0 | -1 | -1 | ✓ |
| 4 | 4.0 | -1 | -1 | ✓ |
| 5 | 5.0 | -1 | -1 | ✓ |
Weighted error: $\epsilon_1 = w_2 = 0.2$
Learner weight: $\alpha_1 = \frac{1}{2}\ln\!\left(\frac{1 - 0.2}{0.2}\right) = \frac{1}{2}\ln 4 \approx 0.693$
Update weights. For correctly classified points ($y_i h_1(x_i) = +1$): $w_i \leftarrow 0.2 \cdot e^{-0.693} \approx 0.2 \cdot 0.5 = 0.1$
For the misclassified point 2 ($y_2 h_1(x_2) = -1$): $w_2 \leftarrow 0.2 \cdot e^{0.693} \approx 0.2 \cdot 2 = 0.4$
Unnormalized weights: $(0.1, 0.4, 0.1, 0.1, 0.1)$. Sum = 0.8.
Normalized weights: $(0.125, 0.5, 0.125, 0.125, 0.125)$
Point 2 now has weight 0.5. The next learner will focus heavily on getting it right.
Round 2
With the new weights, the best stump to minimize weighted error might be: “if $x < 2.5$, predict $+1$; else predict $-1$.”
| Point | $x$ | $y$ | Prediction | Correct? | Weight |
|---|---|---|---|---|---|
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | -1 | ✓ | 0.125 |
| 4 | 4.0 | -1 | -1 | ✓ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |
Weighted error: $\epsilon_2 = 0$. This stump nails everything. The formula would send $\alpha_2$ to infinity, so in practice we’d cap it at a large value. Let’s say $\alpha_2 = 2.0$ to keep things moving.
Since all points are correct, every weight shrinks by the same factor: $w_i \leftarrow w_i \cdot e^{-2.0}$
Multiplying every weight by the same constant and then renormalizing leaves the distribution unchanged, so the weights stay at $(0.125, 0.5, 0.125, 0.125, 0.125)$.
Round 3
Let’s say this round picks the stump: “if $x < 4.5$, predict $+1$; else predict $-1$.”
| Point | $x$ | $y$ | Prediction | Correct? | Weight |
|---|---|---|---|---|---|
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | +1 | ✗ | 0.125 |
| 4 | 4.0 | -1 | +1 | ✗ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |
Weighted error: $\epsilon_3 = 0.125 + 0.125 = 0.25$
Learner weight: $\alpha_3 = \frac{1}{2}\ln\!\left(\frac{0.75}{0.25}\right) = \frac{1}{2}\ln 3 \approx 0.549$
This is the weakest stump of the three, so it gets the smallest $\alpha$.
Final Ensemble
The final classifier combines all three stumps:
$H(x) = \operatorname{sign}\big(0.693\, h_1(x) + 2.0\, h_2(x) + 0.549\, h_3(x)\big)$
Let’s classify each point:
| Point | $h_1(x)$ | $h_2(x)$ | $h_3(x)$ | Score | Prediction | True $y$ |
|---|---|---|---|---|---|---|
| 1 | +1 | +1 | +1 | +3.242 | +1 | +1 ✓ |
| 2 | -1 | +1 | +1 | +1.856 | +1 | +1 ✓ |
| 3 | -1 | -1 | +1 | -2.144 | -1 | -1 ✓ |
| 4 | -1 | -1 | +1 | -2.144 | -1 | -1 ✓ |
| 5 | -1 | -1 | -1 | -3.242 | -1 | -1 ✓ |
All 5 points classified correctly. Three weak stumps, none of which was perfect alone, combined into a perfect classifier.
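As a sanity check, the final vote can be replayed in a few lines of NumPy. The stump thresholds and learner weights here are the ones from the walkthrough above (with the capped $\alpha$ for round 2).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, -1])

# The three stumps from the walkthrough, as predict functions
h1 = lambda x: np.where(x < 1.5, 1, -1)   # round 1
h2 = lambda x: np.where(x < 2.5, 1, -1)   # round 2
h3 = lambda x: np.where(x < 4.5, 1, -1)   # round 3
alphas = (0.693, 2.0, 0.549)              # learner weights from the rounds above

score = alphas[0] * h1(x) + alphas[1] * h2(x) + alphas[2] * h3(x)
prediction = np.sign(score)
```

The sign of the weighted score matches the true label for every point, reproducing the table above.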
Gradient Boosting
Gradient Boosting reframes the boosting idea using gradient descent. Instead of reweighting samples, we train each new model on the residuals of the current ensemble. The residuals are the negative gradient of the loss function with respect to the predictions.
Here is the general algorithm for gradient boosting:
- Start with an initial prediction $F_0(x)$, usually the mean of the target (for regression) or the log-odds (for classification).
- For each round $t = 1, \dots, T$:
- Compute the negative gradient of the loss (the “pseudo-residuals”): $r_i^{(t)} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{t-1}}$
- Fit a new tree $h_t(x)$ to the pseudo-residuals $r_i^{(t)}$.
- Update the model: $F_t(x) = F_{t-1}(x) + \eta\, h_t(x)$, where $\eta$ is the learning rate.
- Final model: $F_T(x) = F_0(x) + \eta \sum_{t=1}^{T} h_t(x)$.
This is gradient descent, but instead of updating parameters, we are updating the function itself. Each tree takes a small step in the direction that reduces the loss the most.
Gradient boosting: fit residuals, add to ensemble, repeat
```mermaid
graph TD
    A["Start with initial prediction F0"] --> B["Compute residuals: y - F0"]
    B --> C["Fit new tree to residuals"]
    C --> D["Update: F1 = F0 + learning_rate * tree"]
    D --> E["Compute new residuals: y - F1"]
    E --> F["Fit next tree to new residuals"]
    F --> G["Repeat for T rounds"]
```
Gradient Boosting for Regression (MSE Loss)
When the loss function is squared error, written per-sample with a convenient $\frac{1}{2}$: $L(y, F(x)) = \frac{1}{2}\big(y - F(x)\big)^2$
The negative gradient is simply the residual: $r_i = -\dfrac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$
This is the most intuitive case. The pseudo-residuals are just the errors. Each new tree literally learns to predict “how far off are we?”
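Since the pseudo-residuals are plain residuals here, the whole algorithm fits in a few lines. A minimal sketch, assuming scikit-learn’s `DecisionTreeRegressor` as the base learner; the function name and default hyperparameters are our own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_mse(X, y, n_rounds=100, lr=0.1, max_depth=1):
    """Gradient boosting for squared error: each tree fits the current
    residuals y - F, and the ensemble takes a small step toward them."""
    F = np.full(len(y), y.mean())           # F0: constant initial prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                   # negative gradient of MSE loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        F = F + lr * tree.predict(X)        # shrunken update
        trees.append(tree)

    def predict(X_new):
        return y.mean() + lr * sum(t.predict(X_new) for t in trees)

    return predict

# The 5-point dataset from the regression example below
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([4.0, 7.0, 5.0, 10.0, 12.0])
model = gradient_boost_mse(X, y)
train_mse = np.mean((model(X) - y) ** 2)
```

With 100 stumps and a learning rate of 0.1, training MSE drops well below the single-round result worked out below.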
Gradient Boosting for Classification (Log Loss)
For binary classification with log loss (cross-entropy), the model outputs log-odds $F(x)$ and the loss is: $L(y, F) = -\big[y \log p + (1 - y)\log(1 - p)\big]$
where $p = \sigma(F(x)) = \dfrac{1}{1 + e^{-F(x)}}$. The negative gradient (using the chain rule) turns out to be: $-\dfrac{\partial L}{\partial F} = y - p$
The pseudo-residual is the difference between the true label and the predicted probability. If a point has label 1 but we predict probability 0.3, the residual is $1 - 0.3 = 0.7$, pushing the prediction higher. The math works out cleanly, but the residuals are more subtle than the regression case because we are working in log-odds space and converting to probabilities.
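The chain-rule claim is easy to verify numerically. This check uses illustrative helper names and a finite-difference gradient, not any library’s internals.

```python
import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

def log_loss(y, F):
    """Cross-entropy as a function of the log-odds F."""
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Check at one point: label 1, predicted probability 0.3
y, F = 1.0, np.log(0.3 / 0.7)             # log-odds that give p = 0.3
eps = 1e-6
# Central-difference estimate of dL/dF
num_grad = (log_loss(y, F + eps) - log_loss(y, F - eps)) / (2 * eps)
pseudo_residual = y - sigmoid(F)          # claimed negative gradient: y - p
```

The numerical gradient agrees with $-(y - p)$, confirming the pseudo-residual of 0.7 from the example.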
Worked Example: One Round of Gradient Boosting (Regression)
Consider 5 data points:
| Point | $x$ | $y$ (true) |
|---|---|---|
| 1 | 1 | 4 |
| 2 | 2 | 7 |
| 3 | 3 | 5 |
| 4 | 4 | 10 |
| 5 | 5 | 12 |
Step 1: Initial prediction. Start with the mean of $y$: $F_0 = \frac{4 + 7 + 5 + 10 + 12}{5} = \frac{38}{5} = 7.6$
Every point gets the same initial prediction of 7.6.
Step 2: Compute residuals.
| Point | $y$ | $F_0$ | Residual $y - F_0$ |
|---|---|---|---|
| 1 | 4 | 7.6 | $-3.6$ |
| 2 | 7 | 7.6 | $-0.6$ |
| 3 | 5 | 7.6 | $-2.6$ |
| 4 | 10 | 7.6 | $2.4$ |
| 5 | 12 | 7.6 | $4.4$ |
Step 3: Fit a tree to the residuals. We fit a decision stump. Suppose the best split is $x < 3.5$:
- Left leaf (points 1, 2, 3): average residual $= \frac{-3.6 - 0.6 - 2.6}{3} \approx -2.267$
- Right leaf (points 4, 5): average residual $= \frac{2.4 + 4.4}{2} = 3.4$
Step 4: Update predictions with learning rate $\eta = 0.3$.
| Point | $F_0$ | Tree output | $F_1 = F_0 + 0.3 \times \text{output}$ |
|---|---|---|---|
| 1 | 7.6 | $-2.267$ | $7.6 - 0.680 = 6.920$ |
| 2 | 7.6 | $-2.267$ | $6.920$ |
| 3 | 7.6 | $-2.267$ | $6.920$ |
| 4 | 7.6 | $3.4$ | $7.6 + 1.020 = 8.620$ |
| 5 | 7.6 | $3.4$ | $8.620$ |
Step 5: Verify MSE decreased.
Initial MSE: $\frac{(-3.6)^2 + (-0.6)^2 + (-2.6)^2 + (2.4)^2 + (4.4)^2}{5} = \frac{45.2}{5} = 9.04$
New residuals after round 1:
| Point | $y$ | $F_1$ | New residual $y - F_1$ |
|---|---|---|---|
| 1 | 4 | 6.920 | $-2.920$ |
| 2 | 7 | 6.920 | $0.080$ |
| 3 | 5 | 6.920 | $-1.920$ |
| 4 | 10 | 8.620 | $1.380$ |
| 5 | 12 | 8.620 | $3.380$ |
New MSE: $\frac{(-2.920)^2 + (0.080)^2 + (-1.920)^2 + (1.380)^2 + (3.380)^2}{5} = \frac{25.548}{5} \approx 5.110$
MSE dropped from 9.04 to about 5.11. One tree, one small step, real improvement. Each subsequent round would fit a new tree to the updated residuals and push MSE down further.
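The arithmetic above can be replayed directly; the split point and learning rate follow the walkthrough.

```python
import numpy as np

y = np.array([4.0, 7.0, 5.0, 10.0, 12.0])
F0 = np.full(5, y.mean())                  # 7.6 everywhere
residuals = y - F0                         # [-3.6, -0.6, -2.6, 2.4, 4.4]

# Stump split at x < 3.5: each leaf outputs the mean residual on its side
left = residuals[:3].mean()                # points 1-3, about -2.267
right = residuals[3:].mean()               # points 4-5, exactly 3.4
tree_out = np.array([left, left, left, right, right])

lr = 0.3
F1 = F0 + lr * tree_out                    # shrunken update
mse0 = np.mean((y - F0) ** 2)
mse1 = np.mean((y - F1) ** 2)
```

Running this reproduces the 9.04 → 5.11 drop and the updated predictions 6.920 and 8.620.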
Learning Rate and Shrinkage
You might wonder: why use a small learning rate $\eta < 1$ instead of $\eta = 1$? Wouldn’t that converge faster?
It would converge faster on the training set. But it would also overfit faster. A small learning rate combined with more trees gives better generalization. This is called shrinkage. Each tree contributes only a fraction of its full correction, forcing the ensemble to build up its predictions gradually. Think of it like regularization: you are constraining how much each individual tree can influence the final model.
In practice, learning rates of 0.01 to 0.1 with hundreds or thousands of trees tend to work best. There is a direct tradeoff: a smaller $\eta$ needs more trees, which means more computation, but usually gives better test performance.
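One way to watch this tradeoff is to monitor test accuracy round by round with scikit-learn’s `staged_predict`. The dataset and hyperparameters here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after each round,
# so we can watch test accuracy evolve as trees are added
test_acc = [np.mean(pred == y_te) for pred in model.staged_predict(X_te)]
best_round = int(np.argmax(test_acc)) + 1
```

Plotting `test_acc` typically shows accuracy climbing, plateauing, and sometimes sagging as later rounds start to overfit; `best_round` is where you would stop.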
Other forms of regularization in gradient boosting include:
- Max tree depth: Limiting how deep each tree can grow (depths of 3 to 8 are common).
- Subsampling: Training each tree on a random subset of the data (this borrows from bagging and reduces variance).
- Min samples per leaf: Requiring a minimum number of samples in each leaf node.
Training error vs test error as boosting rounds increase
XGBoost and LightGBM
The gradient boosting idea spawned several high-performance libraries that dominate tabular data competitions and production systems.
XGBoost (Extreme Gradient Boosting) adds a regularization term directly to the objective function. It penalizes both the number of leaves and the magnitude of leaf weights: $\text{Obj} = \sum_i L(y_i, \hat{y}_i) + \sum_t \Omega(f_t), \quad \Omega(f) = \gamma T_f + \frac{1}{2}\lambda \sum_j w_j^2$
where $T_f$ is the number of leaves in tree $f$ and $w_j$ are the leaf weights. XGBoost also uses a second-order Taylor expansion of the loss (using the Hessian), which gives it better split-finding accuracy.
LightGBM speeds things up with histogram-based splitting and a leaf-wise growth strategy instead of level-wise. It handles large datasets efficiently and supports categorical features natively.
Both libraries share the same core idea: gradient boosting with trees. The differences are in engineering optimizations and regularization details. If you understand the algorithm we covered above, you understand the foundation of both.
Here is a quick example using scikit-learn’s gradient boosting:
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```
Bagging vs Boosting: When to Use Which
| Aspect | Bagging (Random Forest) | Boosting (GBM, XGBoost) |
|---|---|---|
| Training | Parallel | Sequential |
| Reduces | Variance | Bias |
| Overfitting risk | Low | Higher (needs tuning) |
| Sensitivity to noise | Robust | Can overfit noisy labels |
| Typical base learner | Full trees | Shallow trees (stumps) |
| Tuning effort | Minimal | More hyperparameters |
Bagging vs Boosting: parallel vs sequential
```mermaid
graph TD
    subgraph Bagging["Bagging: parallel"]
        B1["Bootstrap sample 1"] --> T1["Tree 1"]
        B2["Bootstrap sample 2"] --> T2["Tree 2"]
        B3["Bootstrap sample 3"] --> T3["Tree 3"]
        T1 --> V["Average or vote"]
        T2 --> V
        T3 --> V
    end
    subgraph Boosting["Boosting: sequential"]
        S1["Tree 1"] --> R1["Residuals"]
        R1 --> S2["Tree 2"]
        S2 --> R2["Residuals"]
        R2 --> S3["Tree 3"]
        S1 --> W["Weighted sum"]
        S2 --> W
        S3 --> W
    end
```
Use bagging when you have noisy data and want a model that works well out of the box with minimal tuning.
Use boosting when you need maximum predictive accuracy and are willing to spend time tuning learning rate, tree depth, and number of rounds. Boosting usually wins on clean, structured data, which is why it dominates competitions on tabular datasets. But it can chase noise if you are not careful with regularization and early stopping.
In practice, the best approach is often to try both. Train a Random Forest as a strong baseline, then see if gradient boosting can beat it with proper tuning.
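In code, that bake-off is only a few lines. The synthetic dataset and hyperparameters below are just starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)

# Random Forest: parallel bagging of deep trees, little tuning needed
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Gradient boosting: sequential shallow trees with shrinkage
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
gb_score = cross_val_score(gb, X, y, cv=5).mean()
```

Which model wins depends on the data; cross-validated scores like these are the fair way to compare before investing in deeper tuning.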
What Comes Next
Boosting handles structured tabular data extremely well, but some problems need a different geometric approach. Next up: Support Vector Machines, which find the optimal separating boundary by maximizing the margin between classes.