
Boosting: AdaBoost and Gradient Boosting

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Boosting turns a collection of mediocre models into one strong model. The core idea: train models sequentially, where each new model pays extra attention to the examples the previous models got wrong. Over rounds, the ensemble zeros in on the hard cases and the combined prediction becomes highly accurate.

This is fundamentally different from bagging (like Random Forests), which trains models in parallel on random subsets of data. Bagging reduces variance. Boosting reduces bias. Bagging says “let’s average out the noise.” Boosting says “let’s fix the mistakes.”

A committee of weak experts

Imagine three friends who are mediocre at predicting sports outcomes. Alone, none of them is much better than flipping a coin.

| Game | Friend A | Friend B | Friend C | Actual | Majority vote |
| --- | --- | --- | --- | --- | --- |
| 1 | Win | Win | Lose | Win | Win, correct |
| 2 | Lose | Win | Win | Win | Win, correct |
| 3 | Lose | Win | Lose | Lose | Lose, correct |
| 4 | Win | Lose | Lose | Lose | Lose, correct |
| 5 | Win | Win | Lose | Win | Win, correct |

Friends A and C each get 3 of 5 games right; Friend B gets 4 of 5. But their majority vote gets all 5 correct. Combining weak predictors produces a strong one.

Boosting takes this further. Instead of treating all friends equally, it makes each new friend focus on the games the previous friends got wrong. Friend B studies the games Friend A missed. Friend C studies the games both A and B missed. Each round targets the hardest cases.

Boosting builds models sequentially, each fixing prior mistakes

```mermaid
graph LR
  A["Model 1: learns from data"] --> B["Find mistakes"]
  B --> C["Model 2: focuses on mistakes"]
  C --> D["Find remaining mistakes"]
  D --> E["Model 3: focuses on those"]
  E --> F["Combine all models with weights"]
```

Now let’s define weak learners formally and walk through the AdaBoost algorithm step by step.

Weak Learners

A weak learner is any model that performs just slightly better than random guessing. For binary classification, that means accuracy above 50%. That’s a low bar, and that’s the point. Boosting doesn’t need strong base models. It builds strength through combination.

The most common weak learner is a decision stump, a decision tree with a single split. One feature, one threshold, two leaves. Alone, a stump is almost useless. But when you combine dozens or hundreds of them, each one trained to correct the errors of the ones before it, the result can be remarkably powerful.
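A stump is simple enough to implement directly. Here is a minimal sketch in plain Python (`fit_stump` is my own helper, not a library API) that finds the single-feature threshold minimizing weighted error:

```python
def fit_stump(x, y, w):
    """Find the threshold split of a single feature that minimizes
    weighted 0/1 error. y is in {-1, +1}; w are sample weights.

    Returns (threshold, polarity, error): the stump predicts `polarity`
    for x <= threshold and `-polarity` otherwise.
    """
    best = (None, 1, float("inf"))
    for t in sorted(set(x)):
        for polarity in (+1, -1):
            preds = [polarity if xi <= t else -polarity for xi in x]
            err = sum(wi for wi, yi, pi in zip(w, y, preds) if yi != pi)
            if err < best[2]:
                best = (t, polarity, err)
    return best

# On a tiny 1-D dataset the stump finds the perfect cut at x <= 2.0:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [+1, +1, -1, -1, -1]
print(fit_stump(x, y, [0.2] * 5))  # (2.0, 1, 0)
```

This brute-force search over thresholds and polarities is exactly what a depth-1 tree does internally, just written out by hand.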

AdaBoost

AdaBoost (Adaptive Boosting) was one of the first practical boosting algorithms. Here is how it works:

  1. Initialize weights. Give every training sample an equal weight: $w_i = \frac{1}{N}$, where $N$ is the number of samples.
  2. Train a weak learner. Fit a decision stump (or other weak model) using the weighted samples. The learner tries to minimize weighted classification error.
  3. Compute weighted error. Calculate the weighted error rate $\epsilon_t$, which is the sum of weights for misclassified samples:

$$\epsilon_t = \frac{\sum_{i=1}^{N} w_i \cdot \mathbb{1}(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} w_i}$$

  4. Compute learner weight. Calculate how much say this learner gets in the final vote:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

When $\epsilon_t$ is small (few errors), $\alpha_t$ is large, meaning the model gets a strong vote. When $\epsilon_t$ is close to 0.5 (near random), $\alpha_t$ is close to 0.
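You can see this behavior by evaluating the formula directly; a quick check in plain Python (`learner_weight` is my own name for it):

```python
import math

def learner_weight(eps):
    """AdaBoost learner weight: alpha_t = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1 - eps) / eps)

print(learner_weight(0.10))  # accurate learner -> large vote (about 1.10)
print(learner_weight(0.40))  # weak learner -> small vote (about 0.20)
print(learner_weight(0.50))  # coin flip -> zero vote (0.0)
```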

  5. Update sample weights. Increase weights for misclassified samples, decrease for correct ones:

$$w_i \leftarrow w_i \cdot \exp(-\alpha_t \cdot y_i \cdot h_t(x_i))$$

Here $y_i \in \{-1, +1\}$ and $h_t(x_i) \in \{-1, +1\}$. When the prediction is wrong, $y_i \cdot h_t(x_i) = -1$, so the exponent is positive and the weight goes up. When correct, the weight goes down.

  6. Normalize weights. Divide all weights by their sum so they add to 1.

  7. Repeat for $T$ rounds.

  8. Final prediction. The ensemble classifies by a weighted majority vote:

$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t(x)\right)$$
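The eight steps above translate almost line for line into code. A minimal sketch using scikit-learn's `DecisionTreeClassifier` as the stump (the function names are mine; labels must be in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Minimal AdaBoost. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # step 1: equal weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)             # step 2: weighted fit
        pred = stump.predict(X)
        eps = w[pred != y].sum()                     # step 3: weighted error
        eps = float(np.clip(eps, 1e-10, 1 - 1e-10))  # guard the log
        alpha = 0.5 * np.log((1 - eps) / eps)        # step 4: learner weight
        w = w * np.exp(-alpha * y * pred)            # step 5: reweight samples
        w = w / w.sum()                              # step 6: normalize
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Step 8: sign of the alpha-weighted vote."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(score)

# A toy 1-D dataset: +1 for small x, -1 for large x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([+1, +1, -1, -1, -1])
learners, alphas = adaboost_fit(X, y, T=5)
print(adaboost_predict(X, learners, alphas))  # recovers all five labels
```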

AdaBoost weight update cycle

```mermaid
graph TD
  A["Initialize equal sample weights"] --> B["Train weak learner on weighted data"]
  B --> C["Compute weighted error"]
  C --> D["Calculate learner weight alpha"]
  D --> E["Increase weights on misclassified samples"]
  E --> F["Decrease weights on correct samples"]
  F --> G["Normalize weights"]
  G --> B
  G --> H["After T rounds: weighted vote of all learners"]
```

Worked Example: 3 Rounds of AdaBoost

Let’s walk through AdaBoost on a tiny dataset with 5 points and 1 feature.

| Point | $x$ | $y$ |
| --- | --- | --- |
| 1 | 1.0 | +1 |
| 2 | 2.0 | +1 |
| 3 | 3.0 | -1 |
| 4 | 4.0 | -1 |
| 5 | 5.0 | -1 |

Initialization: All weights are $w_i = \frac{1}{5} = 0.2$.


Round 1

We try all possible stumps. Suppose the best stump is: “if $x \leq 2.5$, predict $+1$; else predict $-1$.” This classifies all 5 points correctly.

Weighted error:

$$\epsilon_1 = 0 \quad \text{(no misclassifications)}$$

When $\epsilon = 0$, $\alpha$ goes to infinity, which means this one stump is perfect and we could stop. But that’s not very instructive, so let’s pick a slightly worse stump to show the real mechanics.

Let’s say we use the stump: “if $x \leq 1.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ |
| 2 | 2.0 | +1 | -1 | ✗ |
| 3 | 3.0 | -1 | -1 | ✓ |
| 4 | 4.0 | -1 | -1 | ✓ |
| 5 | 5.0 | -1 | -1 | ✓ |

Weighted error:

$$\epsilon_1 = w_2 = 0.2$$

Learner weight:

$$\alpha_1 = \frac{1}{2} \ln\left(\frac{1 - 0.2}{0.2}\right) = \frac{1}{2} \ln(4) = \frac{1}{2} \times 1.386 = 0.693$$

Update weights. For correctly classified points ($y_i \cdot h_t(x_i) = +1$):

$$w_i \leftarrow 0.2 \times \exp(-0.693) = 0.2 \times 0.5 = 0.1$$

For the misclassified point 2 ($y_i \cdot h_t(x_i) = -1$):

$$w_2 \leftarrow 0.2 \times \exp(0.693) = 0.2 \times 2.0 = 0.4$$

Unnormalized weights: $[0.1, 0.4, 0.1, 0.1, 0.1]$. Sum = 0.8.

Normalized weights:

$$[0.125, 0.5, 0.125, 0.125, 0.125]$$

Point 2 now has weight 0.5. The next learner will focus heavily on getting it right.


Round 2

With the new weights, the best stump to minimize weighted error might be: “if $x \leq 2.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? | Weight |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | -1 | ✓ | 0.125 |
| 4 | 4.0 | -1 | -1 | ✓ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |

Weighted error: $\epsilon_2 = 0$. This stump nails everything. In practice, we’d cap $\alpha_2$ at a large value. Let’s say $\epsilon_2 = 0.01$ to keep things moving:

$$\alpha_2 = \frac{1}{2} \ln\left(\frac{0.99}{0.01}\right) = \frac{1}{2} \times 4.595 = 2.298$$

Since all points are correct, every weight is multiplied by the same factor:

$$w_i \leftarrow w_i \times \exp(-2.298)$$

Because the scaling is uniform, normalization restores the previous distribution: $[0.125, 0.5, 0.125, 0.125, 0.125]$.


Round 3

Let’s say this round picks the stump: “if $x \leq 4.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? | Weight |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | +1 | ✗ | 0.125 |
| 4 | 4.0 | -1 | +1 | ✗ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |

Weighted error:

$$\epsilon_3 = 0.125 + 0.125 = 0.25$$

Learner weight:

$$\alpha_3 = \frac{1}{2} \ln\left(\frac{0.75}{0.25}\right) = \frac{1}{2} \ln(3) = \frac{1}{2} \times 1.099 = 0.549$$

This stump makes real mistakes, so it gets a much smaller vote than the second one.


Final Ensemble

The final classifier combines all three stumps:

$$H(x) = \text{sign}(0.693 \cdot h_1(x) + 2.298 \cdot h_2(x) + 0.549 \cdot h_3(x))$$

Let’s classify each point:

| Point | $h_1$ | $h_2$ | $h_3$ | Score $= 0.693 h_1 + 2.298 h_2 + 0.549 h_3$ | $\text{sign}$ | True $y$ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | +1 | +1 | +1 | $0.693 + 2.298 + 0.549 = 3.540$ | +1 | +1 ✓ |
| 2 | -1 | +1 | +1 | $-0.693 + 2.298 + 0.549 = 2.154$ | +1 | +1 ✓ |
| 3 | -1 | -1 | +1 | $-0.693 - 2.298 + 0.549 = -2.442$ | -1 | -1 ✓ |
| 4 | -1 | -1 | +1 | $-0.693 - 2.298 + 0.549 = -2.442$ | -1 | -1 ✓ |
| 5 | -1 | -1 | -1 | $-0.693 - 2.298 - 0.549 = -3.540$ | -1 | -1 ✓ |

All 5 points classified correctly. Three weak stumps, none of which was perfect alone, combined into a perfect classifier.

Gradient Boosting

Gradient Boosting reframes the boosting idea using gradient descent. Instead of reweighting samples, we train each new model on the residuals of the current ensemble. The residuals are the negative gradient of the loss function with respect to the predictions.

Here is the general algorithm for gradient boosting:

  1. Start with an initial prediction $F_0(x)$, usually the mean of the target (for regression) or the log-odds (for classification).
  2. For each round $t = 1, 2, \ldots, T$:
    • Compute the negative gradient of the loss (the “pseudo-residuals”): $r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$
    • Fit a new tree $h_t$ to the pseudo-residuals $r_i$.
    • Update the model: $F_t(x) = F_{t-1}(x) + \eta \cdot h_t(x)$, where $\eta$ is the learning rate.
  3. Final model: $F_T(x) = F_0(x) + \eta \sum_{t=1}^{T} h_t(x)$.

This is gradient descent, but instead of updating parameters, we are updating the function itself. Each tree takes a small step in the direction that reduces the loss the most.
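The loop is short enough to write out. A minimal sketch for regression with MSE loss, using scikit-learn's `DecisionTreeRegressor` as the base learner (`gb_fit` and `gb_predict` are my own names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, T=300, eta=0.1, max_depth=1):
    """Minimal gradient boosting for regression (MSE loss)."""
    F0 = y.mean()                        # step 1: initial prediction
    F = np.full(len(y), F0)
    trees = []
    for _ in range(T):
        r = y - F                        # negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                   # fit a tree to the residuals
        F = F + eta * tree.predict(X)    # small step toward the targets
        trees.append(tree)
    return F0, trees

def gb_predict(X, F0, trees, eta=0.1):
    """eta must match the value used in gb_fit."""
    return F0 + eta * sum(t.predict(X) for t in trees)

# A 5-point toy regression set.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([4.0, 7.0, 5.0, 10.0, 12.0])
F0, trees = gb_fit(X, y)
mse = np.mean((y - gb_predict(X, F0, trees)) ** 2)
print(f"training MSE after 300 stumps: {mse:.4f}")
```

Each iteration refits only the leftover error, which is why the training MSE shrinks steadily as trees accumulate.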

Gradient boosting: fit residuals, add to ensemble, repeat

```mermaid
graph TD
  A["Start with initial prediction F0"] --> B["Compute residuals: y - F0"]
  B --> C["Fit new tree to residuals"]
  C --> D["Update: F1 = F0 + learning_rate * tree"]
  D --> E["Compute new residuals: y - F1"]
  E --> F["Fit next tree to new residuals"]
  F --> G["Repeat for T rounds"]
```

Gradient Boosting for Regression (MSE Loss)

When the loss function is mean squared error:

$$L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$$

The negative gradient is simply the residual:

$$r_i = -\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$$

This is the most intuitive case. The pseudo-residuals are just the errors. Each new tree literally learns to predict “how far off are we?”

Gradient Boosting for Classification (Log Loss)

For binary classification with log loss (cross-entropy), the model outputs log-odds and the loss is:

$$L(y_i, F(x_i)) = -\left[y_i \cdot F(x_i) - \ln(1 + e^{F(x_i)})\right]$$

where $y_i \in \{0, 1\}$. The negative gradient (using the chain rule) turns out to be:

$$r_i = y_i - p_i \quad \text{where } p_i = \frac{1}{1 + e^{-F(x_i)}}$$

The pseudo-residual is the difference between the true label and the predicted probability. If a point has label 1 but we predict probability 0.3, the residual is $1 - 0.3 = 0.7$, pushing the prediction higher. The math works out cleanly, but the residuals are more subtle than the regression case because we are working in log-odds space and converting to probabilities.

Worked Example: One Round of Gradient Boosting (Regression)

Consider 5 data points:

| Point | $x$ | $y$ (true) |
| --- | --- | --- |
| 1 | 1 | 4 |
| 2 | 2 | 7 |
| 3 | 3 | 5 |
| 4 | 4 | 10 |
| 5 | 5 | 12 |

Step 1: Initial prediction. Start with the mean of $y$:

$$F_0 = \frac{4 + 7 + 5 + 10 + 12}{5} = \frac{38}{5} = 7.6$$

Every point gets the same initial prediction of 7.6.

Step 2: Compute residuals.

| Point | $y$ | $F_0$ | Residual $r = y - F_0$ |
| --- | --- | --- | --- |
| 1 | 4 | 7.6 | -3.6 |
| 2 | 7 | 7.6 | -0.6 |
| 3 | 5 | 7.6 | -2.6 |
| 4 | 10 | 7.6 | +2.4 |
| 5 | 12 | 7.6 | +4.4 |

Step 3: Fit a tree to the residuals. We fit a decision stump. Suppose the best split is $x \leq 3.5$:

  • Left leaf (points 1, 2, 3): average residual $= \frac{-3.6 + (-0.6) + (-2.6)}{3} = \frac{-6.8}{3} = -2.267$
  • Right leaf (points 4, 5): average residual $= \frac{2.4 + 4.4}{2} = \frac{6.8}{2} = 3.4$

Step 4: Update predictions with learning rate $\eta = 0.3$.

$$F_1(x) = F_0(x) + 0.3 \cdot h_1(x)$$

| Point | $F_0$ | $h_1(x)$ | $F_1 = F_0 + 0.3 \cdot h_1$ |
| --- | --- | --- | --- |
| 1 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 2 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 3 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 4 | 7.6 | 3.4 | $7.6 + 0.3(3.4) = 8.620$ |
| 5 | 7.6 | 3.4 | $7.6 + 0.3(3.4) = 8.620$ |

Step 5: Verify MSE decreased.

Initial MSE:

$$\text{MSE}_0 = \frac{(-3.6)^2 + (-0.6)^2 + (-2.6)^2 + (2.4)^2 + (4.4)^2}{5} = \frac{12.96 + 0.36 + 6.76 + 5.76 + 19.36}{5} = \frac{45.2}{5} = 9.04$$

New residuals after round 1:

| Point | $y$ | $F_1$ | New residual |
| --- | --- | --- | --- |
| 1 | 4 | 6.920 | -2.920 |
| 2 | 7 | 6.920 | +0.080 |
| 3 | 5 | 6.920 | -1.920 |
| 4 | 10 | 8.620 | +1.380 |
| 5 | 12 | 8.620 | +3.380 |

New MSE:

$$\text{MSE}_1 = \frac{(-2.92)^2 + (0.08)^2 + (-1.92)^2 + (1.38)^2 + (3.38)^2}{5} = \frac{8.5264 + 0.0064 + 3.6864 + 1.9044 + 11.4244}{5} = \frac{25.548}{5} \approx 5.11$$

MSE dropped from 9.04 to 5.11. One tree, one small step, real improvement. Each subsequent round would fit a new tree to the updated residuals and push MSE down further.
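The arithmetic of the round can be checked in a few lines of plain Python:

```python
y = [4, 7, 5, 10, 12]
F0 = sum(y) / len(y)                    # initial prediction: 7.6
residuals = [yi - F0 for yi in y]       # [-3.6, -0.6, -2.6, 2.4, 4.4]

# Stump split at x <= 3.5: points 1-3 on the left, 4-5 on the right.
left_mean = sum(residuals[:3]) / 3      # about -2.267
right_mean = sum(residuals[3:]) / 2     # 3.4
h1 = [left_mean] * 3 + [right_mean] * 2

eta = 0.3
F1 = [F0 + eta * h for h in h1]

mse0 = sum(r ** 2 for r in residuals) / len(y)
mse1 = sum((yi - fi) ** 2 for yi, fi in zip(y, F1)) / len(y)
print(f"MSE: {mse0:.2f} -> {mse1:.2f}")  # MSE: 9.04 -> 5.11
```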

Learning Rate and Shrinkage

You might wonder: why use a small learning rate $\eta = 0.3$ instead of $\eta = 1.0$? Wouldn’t that converge faster?

It would converge faster on the training set. But it would also overfit faster. A small learning rate combined with more trees gives better generalization. This is called shrinkage. Each tree contributes only a fraction of its full correction, forcing the ensemble to build up its predictions gradually. Think of it like regularization: you are constraining how much each individual tree can influence the final model.

In practice, learning rates of 0.01 to 0.1 with hundreds or thousands of trees tend to work best. There is a direct tradeoff: smaller $\eta$ needs more trees, which means more computation, but usually gives better test performance.

Other forms of regularization in gradient boosting include:

  • Max tree depth: Limiting how deep each tree can grow (depths of 3 to 8 are common).
  • Subsampling: Training each tree on a random subset of the data (this borrows from bagging and reduces variance).
  • Min samples per leaf: Requiring a minimum number of samples in each leaf node.
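All three knobs are exposed as constructor parameters in scikit-learn's `GradientBoostingClassifier`. A sketch on synthetic data (the specific values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,   # shrinkage: small steps, more trees
    max_depth=3,          # shallow trees
    subsample=0.8,        # each tree sees a random 80% of the data
    min_samples_leaf=5,   # no tiny leaves
    random_state=0,
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Setting `subsample` below 1.0 gives stochastic gradient boosting, which borrows bagging's variance reduction on top of boosting's bias reduction.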

Training error vs test error as boosting rounds increase

XGBoost and LightGBM

The gradient boosting idea spawned several high-performance libraries that dominate tabular data competitions and production systems.

XGBoost (Extreme Gradient Boosting) adds a regularization term directly to the objective function. It penalizes both the number of leaves and the magnitude of leaf weights:

$$\text{Obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{t} \left(\gamma T_t + \frac{1}{2}\lambda \|w_t\|^2\right)$$

where $T_t$ is the number of leaves in tree $t$ and $w_t$ are the leaf weights. XGBoost also uses a second-order Taylor expansion of the loss (using the Hessian), which gives it better split-finding accuracy.

LightGBM speeds things up with histogram-based splitting and a leaf-wise growth strategy instead of level-wise. It handles large datasets efficiently and supports categorical features natively.

Both libraries share the same core idea: gradient boosting with trees. The differences are in engineering optimizations and regularization details. If you understand the algorithm we covered above, you understand the foundation of both.

Here is a quick example using scikit-learn’s gradient boosting:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Bagging vs Boosting: When to Use Which

| Aspect | Bagging (Random Forest) | Boosting (GBM, XGBoost) |
| --- | --- | --- |
| Training | Parallel | Sequential |
| Reduces | Variance | Bias |
| Overfitting risk | Low | Higher (needs tuning) |
| Sensitivity to noise | Robust | Can overfit noisy labels |
| Typical base learner | Full trees | Shallow trees (stumps) |
| Tuning effort | Minimal | More hyperparameters |

Bagging vs Boosting: parallel vs sequential

```mermaid
graph TD
  subgraph Bagging["Bagging: parallel"]
      B1["Bootstrap sample 1"] --> T1["Tree 1"]
      B2["Bootstrap sample 2"] --> T2["Tree 2"]
      B3["Bootstrap sample 3"] --> T3["Tree 3"]
      T1 --> V["Average or vote"]
      T2 --> V
      T3 --> V
  end
  subgraph Boosting["Boosting: sequential"]
      S1["Tree 1"] --> R1["Residuals"]
      R1 --> S2["Tree 2"]
      S2 --> R2["Residuals"]
      R2 --> S3["Tree 3"]
      S1 --> W["Weighted sum"]
      S2 --> W
      S3 --> W
  end
```

Use bagging when you have noisy data and want a model that works well out of the box with minimal tuning.

Use boosting when you need maximum predictive accuracy and are willing to spend time tuning learning rate, tree depth, and number of rounds. Boosting usually wins on clean, structured data, which is why it dominates competitions on tabular datasets. But it can chase noise if you are not careful with regularization and early stopping.

In practice, the best approach is often to try both. Train a Random Forest as a strong baseline, then see if gradient boosting can beat it with proper tuning.
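That head-to-head takes only a few lines with scikit-learn; a sketch on synthetic data (scores will vary with the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validated scores, rather than a single train/test split, make the comparison less sensitive to one lucky or unlucky partition.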

What Comes Next

Boosting handles structured tabular data extremely well, but some problems need a different geometric approach. Next up: Support Vector Machines, which find the optimal separating boundary by maximizing the margin between classes.
