
Boosting: AdaBoost and Gradient Boosting

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Boosting turns a collection of mediocre models into one strong model. The core idea: train models sequentially, where each new model pays extra attention to the examples the previous models got wrong. Over rounds, the ensemble zeros in on the hard cases and the combined prediction becomes highly accurate.

This is fundamentally different from bagging (like Random Forests), which trains models in parallel on random subsets of data. Bagging reduces variance. Boosting reduces bias. Bagging says “let’s average out the noise.” Boosting says “let’s fix the mistakes.”

A committee of weak experts

Imagine three friends who are mediocre at predicting sports outcomes. Alone, none of them is much better than flipping a coin.

| Game | Friend A | Friend B | Friend C | Actual | Majority vote |
| --- | --- | --- | --- | --- | --- |
| 1 | Win | Win | Lose | Win | Win, correct |
| 2 | Lose | Win | Win | Win | Win, correct |
| 3 | Lose | Win | Lose | Lose | Lose, correct |
| 4 | Win | Lose | Lose | Lose | Lose, correct |
| 5 | Win | Win | Lose | Win | Win, correct |

Friends A and C each get 3 of 5 games right; Friend B gets 4 of 5. But their majority vote gets all 5 correct. Combining weak predictors produces a strong one.

Boosting takes this further. Instead of treating all friends equally, it makes each new friend focus on the games the previous friends got wrong. Friend B studies the games Friend A missed. Friend C studies the games both A and B missed. Each round targets the hardest cases.

Boosting builds models sequentially, each fixing prior mistakes

```mermaid
graph LR
  A["Model 1: learns from data"] --> B["Find mistakes"]
  B --> C["Model 2: focuses on mistakes"]
  C --> D["Find remaining mistakes"]
  D --> E["Model 3: focuses on those"]
  E --> F["Combine all models with weights"]
```

Now let’s define weak learners formally and walk through the AdaBoost algorithm step by step.

Weak Learners

A weak learner is any model that performs just slightly better than random guessing. For binary classification, that means accuracy above 50%. That’s a low bar, and that’s the point. Boosting doesn’t need strong base models. It builds strength through combination.

The most common weak learner is a decision stump, a decision tree with a single split. One feature, one threshold, two leaves. Alone, a stump is almost useless. But when you combine dozens or hundreds of them, each one trained to correct the errors of the ones before it, the result can be remarkably powerful.
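A stump is simple enough to implement directly. Here is a minimal sketch in plain Python (`fit_stump` is my own helper, not a library API) that finds the single-feature threshold minimizing weighted error:

```python
def fit_stump(x, y, w):
    """Find the threshold split of a single feature that minimizes
    weighted 0/1 error. y is in {-1, +1}; w are sample weights.

    Returns (threshold, polarity, error): the stump predicts `polarity`
    for x <= threshold and `-polarity` otherwise.
    """
    best = (None, 1, float("inf"))
    for t in sorted(set(x)):
        for polarity in (+1, -1):
            preds = [polarity if xi <= t else -polarity for xi in x]
            err = sum(wi for wi, yi, pi in zip(w, y, preds) if yi != pi)
            if err < best[2]:
                best = (t, polarity, err)
    return best

# On a tiny 1-D dataset the stump finds the perfect cut at x <= 2.0:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [+1, +1, -1, -1, -1]
print(fit_stump(x, y, [0.2] * 5))  # (2.0, 1, 0)
```

This brute-force search over thresholds and polarities is exactly what a depth-1 tree does internally, just written out by hand.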

AdaBoost

AdaBoost (Adaptive Boosting) was one of the first practical boosting algorithms. Here is how it works:

  1. Initialize weights. Give every training sample an equal weight: $w_i = \frac{1}{N}$, where $N$ is the number of samples.
  2. Train a weak learner. Fit a decision stump (or other weak model) using the weighted samples. The learner tries to minimize weighted classification error.
  3. Compute weighted error. Calculate the weighted error rate $\epsilon_t$, which is the sum of weights for misclassified samples:

$$\epsilon_t = \frac{\sum_{i=1}^{N} w_i \cdot \mathbb{1}(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} w_i}$$

  4. Compute learner weight. Calculate how much say this learner gets in the final vote:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

When $\epsilon_t$ is small (few errors), $\alpha_t$ is large, meaning the model gets a strong vote. When $\epsilon_t$ is close to 0.5 (near random), $\alpha_t$ is close to 0.
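You can see this behavior by evaluating the formula directly; a quick check in plain Python (`learner_weight` is my own name for it):

```python
import math

def learner_weight(eps):
    """AdaBoost learner weight: alpha_t = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1 - eps) / eps)

print(learner_weight(0.10))  # accurate learner -> large vote (about 1.10)
print(learner_weight(0.40))  # weak learner -> small vote (about 0.20)
print(learner_weight(0.50))  # coin flip -> zero vote (0.0)
```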

  5. Update sample weights. Increase weights for misclassified samples, decrease for correct ones:

$$w_i \leftarrow w_i \cdot \exp(-\alpha_t \cdot y_i \cdot h_t(x_i))$$

Here $y_i \in \{-1, +1\}$ and $h_t(x_i) \in \{-1, +1\}$. When the prediction is wrong, $y_i \cdot h_t(x_i) = -1$, so the exponent is positive and the weight goes up. When correct, the weight goes down.

  6. Normalize weights. Divide all weights by their sum so they add to 1.

  7. Repeat for $T$ rounds.

  8. Final prediction. The ensemble classifies by a weighted majority vote:

$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t(x)\right)$$
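The eight steps above translate almost line for line into code. A minimal sketch using scikit-learn's `DecisionTreeClassifier` as the stump (the function names are mine; labels must be in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Minimal AdaBoost. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # step 1: equal weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)             # step 2: weighted fit
        pred = stump.predict(X)
        eps = w[pred != y].sum()                     # step 3: weighted error
        eps = float(np.clip(eps, 1e-10, 1 - 1e-10))  # guard the log
        alpha = 0.5 * np.log((1 - eps) / eps)        # step 4: learner weight
        w = w * np.exp(-alpha * y * pred)            # step 5: reweight samples
        w = w / w.sum()                              # step 6: normalize
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Step 8: sign of the alpha-weighted vote."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(score)

# A toy 1-D dataset: +1 for small x, -1 for large x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([+1, +1, -1, -1, -1])
learners, alphas = adaboost_fit(X, y, T=5)
print(adaboost_predict(X, learners, alphas))  # recovers all five labels
```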

AdaBoost weight update cycle

```mermaid
graph TD
  A["Initialize equal sample weights"] --> B["Train weak learner on weighted data"]
  B --> C["Compute weighted error"]
  C --> D["Calculate learner weight alpha"]
  D --> E["Increase weights on misclassified samples"]
  E --> F["Decrease weights on correct samples"]
  F --> G["Normalize weights"]
  G --> B
  G --> H["After T rounds: weighted vote of all learners"]
```

Worked Example: 3 Rounds of AdaBoost

Let’s walk through AdaBoost on a tiny dataset with 5 points and 1 feature.

| Point | $x$ | $y$ |
| --- | --- | --- |
| 1 | 1.0 | +1 |
| 2 | 2.0 | +1 |
| 3 | 3.0 | -1 |
| 4 | 4.0 | -1 |
| 5 | 5.0 | -1 |

Initialization: All weights are $w_i = \frac{1}{5} = 0.2$.


Round 1

We try all possible stumps. Suppose the best stump is: “if $x \leq 2.5$, predict $+1$; else predict $-1$.” This classifies all 5 points correctly.

Weighted error:

$$\epsilon_1 = 0 \quad \text{(no misclassifications)}$$

When $\epsilon = 0$, $\alpha$ goes to infinity, which means this one stump is perfect and we could stop. But that’s not very instructive, so let’s pick a slightly worse stump to show the real mechanics.

Let’s say we use the stump: “if $x \leq 1.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ |
| 2 | 2.0 | +1 | -1 | ✗ |
| 3 | 3.0 | -1 | -1 | ✓ |
| 4 | 4.0 | -1 | -1 | ✓ |
| 5 | 5.0 | -1 | -1 | ✓ |

Weighted error:

$$\epsilon_1 = w_2 = 0.2$$

Learner weight:

$$\alpha_1 = \frac{1}{2} \ln\left(\frac{1 - 0.2}{0.2}\right) = \frac{1}{2} \ln(4) = \frac{1}{2} \times 1.386 = 0.693$$

Update weights. For correctly classified points ($y_i \cdot h_t(x_i) = +1$):

$$w_i \leftarrow 0.2 \times \exp(-0.693) = 0.2 \times 0.5 = 0.1$$

For the misclassified point 2 ($y_i \cdot h_t(x_i) = -1$):

$$w_2 \leftarrow 0.2 \times \exp(0.693) = 0.2 \times 2.0 = 0.4$$

Unnormalized weights: $[0.1, 0.4, 0.1, 0.1, 0.1]$. Sum = 0.8.

Normalized weights:

$$[0.125, 0.5, 0.125, 0.125, 0.125]$$

Point 2 now has weight 0.5. The next learner will focus heavily on getting it right.


Round 2

With the new weights, the best stump to minimize weighted error might be: “if $x \leq 2.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? | Weight |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | -1 | ✓ | 0.125 |
| 4 | 4.0 | -1 | -1 | ✓ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |

Weighted error: $\epsilon_2 = 0$. This stump nails everything. In practice, we’d cap $\alpha_2$ at a large value. Let’s say $\epsilon_2 = 0.01$ to keep things moving:

$$\alpha_2 = \frac{1}{2} \ln\left(\frac{0.99}{0.01}\right) = \frac{1}{2} \times 4.595 = 2.298$$

Since all points are correct, every weight is multiplied by the same factor:

$$w_i \leftarrow w_i \times \exp(-2.298)$$

Because the scaling is uniform, normalization restores the previous distribution: $[0.125, 0.5, 0.125, 0.125, 0.125]$.


Round 3

Let’s say this round picks the stump: “if $x \leq 4.5$, predict $+1$; else predict $-1$.”

| Point | $x$ | $y$ | Prediction | Correct? | Weight |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | +1 | +1 | ✓ | 0.125 |
| 2 | 2.0 | +1 | +1 | ✓ | 0.5 |
| 3 | 3.0 | -1 | +1 | ✗ | 0.125 |
| 4 | 4.0 | -1 | +1 | ✗ | 0.125 |
| 5 | 5.0 | -1 | -1 | ✓ | 0.125 |

Weighted error:

$$\epsilon_3 = 0.125 + 0.125 = 0.25$$

Learner weight:

$$\alpha_3 = \frac{1}{2} \ln\left(\frac{0.75}{0.25}\right) = \frac{1}{2} \ln(3) = \frac{1}{2} \times 1.099 = 0.549$$

This stump makes real mistakes, so it gets a much smaller vote than the second one.


Final Ensemble

The final classifier combines all three stumps:

$$H(x) = \text{sign}(0.693 \cdot h_1(x) + 2.298 \cdot h_2(x) + 0.549 \cdot h_3(x))$$

Let’s classify each point:

| Point | $h_1$ | $h_2$ | $h_3$ | Score $= 0.693 h_1 + 2.298 h_2 + 0.549 h_3$ | $\text{sign}$ | True $y$ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | +1 | +1 | +1 | $0.693 + 2.298 + 0.549 = 3.540$ | +1 | +1 ✓ |
| 2 | -1 | +1 | +1 | $-0.693 + 2.298 + 0.549 = 2.154$ | +1 | +1 ✓ |
| 3 | -1 | -1 | +1 | $-0.693 - 2.298 + 0.549 = -2.442$ | -1 | -1 ✓ |
| 4 | -1 | -1 | +1 | $-0.693 - 2.298 + 0.549 = -2.442$ | -1 | -1 ✓ |
| 5 | -1 | -1 | -1 | $-0.693 - 2.298 - 0.549 = -3.540$ | -1 | -1 ✓ |

All 5 points classified correctly. Three weak stumps, none of which was perfect alone, combined into a perfect classifier.

Gradient Boosting

Gradient Boosting reframes the boosting idea using gradient descent. Instead of reweighting samples, we train each new model on the residuals of the current ensemble. The residuals are the negative gradient of the loss function with respect to the predictions.

Here is the general algorithm for gradient boosting:

  1. Start with an initial prediction $F_0(x)$, usually the mean of the target (for regression) or the log-odds (for classification).
  2. For each round $t = 1, 2, \ldots, T$:
    • Compute the negative gradient of the loss (the “pseudo-residuals”): $r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$
    • Fit a new tree $h_t$ to the pseudo-residuals $r_i$.
    • Update the model: $F_t(x) = F_{t-1}(x) + \eta \cdot h_t(x)$, where $\eta$ is the learning rate.
  3. Final model: $F_T(x) = F_0(x) + \eta \sum_{t=1}^{T} h_t(x)$.

This is gradient descent, but instead of updating parameters, we are updating the function itself. Each tree takes a small step in the direction that reduces the loss the most.
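The loop is short enough to write out. A minimal sketch for regression with MSE loss, using scikit-learn's `DecisionTreeRegressor` as the base learner (`gb_fit` and `gb_predict` are my own names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, T=300, eta=0.1, max_depth=1):
    """Minimal gradient boosting for regression (MSE loss)."""
    F0 = y.mean()                        # step 1: initial prediction
    F = np.full(len(y), F0)
    trees = []
    for _ in range(T):
        r = y - F                        # negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                   # fit a tree to the residuals
        F = F + eta * tree.predict(X)    # small step toward the targets
        trees.append(tree)
    return F0, trees

def gb_predict(X, F0, trees, eta=0.1):
    """eta must match the value used in gb_fit."""
    return F0 + eta * sum(t.predict(X) for t in trees)

# A 5-point toy regression set.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([4.0, 7.0, 5.0, 10.0, 12.0])
F0, trees = gb_fit(X, y)
mse = np.mean((y - gb_predict(X, F0, trees)) ** 2)
print(f"training MSE after 300 stumps: {mse:.4f}")
```

Each iteration refits only the leftover error, which is why the training MSE shrinks steadily as trees accumulate.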

Gradient boosting: fit residuals, add to ensemble, repeat

```mermaid
graph TD
  A["Start with initial prediction F0"] --> B["Compute residuals: y - F0"]
  B --> C["Fit new tree to residuals"]
  C --> D["Update: F1 = F0 + learning_rate * tree"]
  D --> E["Compute new residuals: y - F1"]
  E --> F["Fit next tree to new residuals"]
  F --> G["Repeat for T rounds"]
```

Gradient Boosting for Regression (MSE Loss)

When the loss function is mean squared error:

$$L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$$

The negative gradient is simply the residual:

$$r_i = -\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$$

This is the most intuitive case. The pseudo-residuals are just the errors. Each new tree literally learns to predict “how far off are we?”

Gradient Boosting for Classification (Log Loss)

For binary classification with log loss (cross-entropy), the model outputs log-odds and the loss is:

$$L(y_i, F(x_i)) = -\left[y_i \cdot F(x_i) - \ln(1 + e^{F(x_i)})\right]$$

where $y_i \in \{0, 1\}$. The negative gradient (using the chain rule) turns out to be:

$$r_i = y_i - p_i \quad \text{where } p_i = \frac{1}{1 + e^{-F(x_i)}}$$

The pseudo-residual is the difference between the true label and the predicted probability. If a point has label 1 but we predict probability 0.3, the residual is $1 - 0.3 = 0.7$, pushing the prediction higher. The math works out cleanly, but the residuals are more subtle than the regression case because we are working in log-odds space and converting to probabilities.

Worked Example: One Round of Gradient Boosting (Regression)

Consider 5 data points:

| Point | $x$ | $y$ (true) |
| --- | --- | --- |
| 1 | 1 | 4 |
| 2 | 2 | 7 |
| 3 | 3 | 5 |
| 4 | 4 | 10 |
| 5 | 5 | 12 |

Step 1: Initial prediction. Start with the mean of $y$:

$$F_0 = \frac{4 + 7 + 5 + 10 + 12}{5} = \frac{38}{5} = 7.6$$

Every point gets the same initial prediction of 7.6.

Step 2: Compute residuals.

| Point | $y$ | $F_0$ | Residual $r = y - F_0$ |
| --- | --- | --- | --- |
| 1 | 4 | 7.6 | -3.6 |
| 2 | 7 | 7.6 | -0.6 |
| 3 | 5 | 7.6 | -2.6 |
| 4 | 10 | 7.6 | +2.4 |
| 5 | 12 | 7.6 | +4.4 |

Step 3: Fit a tree to the residuals. We fit a decision stump. Suppose the best split is $x \leq 3.5$:

  • Left leaf (points 1, 2, 3): average residual $= \frac{-3.6 + (-0.6) + (-2.6)}{3} = \frac{-6.8}{3} = -2.267$
  • Right leaf (points 4, 5): average residual $= \frac{2.4 + 4.4}{2} = \frac{6.8}{2} = 3.4$

Step 4: Update predictions with learning rate $\eta = 0.3$.

$$F_1(x) = F_0(x) + 0.3 \cdot h_1(x)$$

| Point | $F_0$ | $h_1(x)$ | $F_1 = F_0 + 0.3 \cdot h_1$ |
| --- | --- | --- | --- |
| 1 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 2 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 3 | 7.6 | -2.267 | $7.6 + 0.3(-2.267) = 6.920$ |
| 4 | 7.6 | 3.4 | $7.6 + 0.3(3.4) = 8.620$ |
| 5 | 7.6 | 3.4 | $7.6 + 0.3(3.4) = 8.620$ |

Step 5: Verify MSE decreased.

Initial MSE:

$$\text{MSE}_0 = \frac{(-3.6)^2 + (-0.6)^2 + (-2.6)^2 + (2.4)^2 + (4.4)^2}{5} = \frac{12.96 + 0.36 + 6.76 + 5.76 + 19.36}{5} = \frac{45.2}{5} = 9.04$$

New residuals after round 1:

| Point | $y$ | $F_1$ | New residual |
| --- | --- | --- | --- |
| 1 | 4 | 6.920 | -2.920 |
| 2 | 7 | 6.920 | +0.080 |
| 3 | 5 | 6.920 | -1.920 |
| 4 | 10 | 8.620 | +1.380 |
| 5 | 12 | 8.620 | +3.380 |

New MSE:

$$\text{MSE}_1 = \frac{(-2.92)^2 + (0.08)^2 + (-1.92)^2 + (1.38)^2 + (3.38)^2}{5} = \frac{8.5264 + 0.0064 + 3.6864 + 1.9044 + 11.4244}{5} = \frac{25.548}{5} \approx 5.11$$

MSE dropped from 9.04 to 5.11. One tree, one small step, real improvement. Each subsequent round would fit a new tree to the updated residuals and push MSE down further.
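The arithmetic of the round can be checked in a few lines of plain Python:

```python
y = [4, 7, 5, 10, 12]
F0 = sum(y) / len(y)                    # initial prediction: 7.6
residuals = [yi - F0 for yi in y]       # [-3.6, -0.6, -2.6, 2.4, 4.4]

# Stump split at x <= 3.5: points 1-3 on the left, 4-5 on the right.
left_mean = sum(residuals[:3]) / 3      # about -2.267
right_mean = sum(residuals[3:]) / 2     # 3.4
h1 = [left_mean] * 3 + [right_mean] * 2

eta = 0.3
F1 = [F0 + eta * h for h in h1]

mse0 = sum(r ** 2 for r in residuals) / len(y)
mse1 = sum((yi - fi) ** 2 for yi, fi in zip(y, F1)) / len(y)
print(f"MSE: {mse0:.2f} -> {mse1:.2f}")  # MSE: 9.04 -> 5.11
```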

Learning Rate and Shrinkage

You might wonder: why use a small learning rate $\eta = 0.3$ instead of $\eta = 1.0$? Wouldn’t that converge faster?

It would converge faster on the training set. But it would also overfit faster. A small learning rate combined with more trees gives better generalization. This is called shrinkage. Each tree contributes only a fraction of its full correction, forcing the ensemble to build up its predictions gradually. Think of it like regularization: you are constraining how much each individual tree can influence the final model.

In practice, learning rates of 0.01 to 0.1 with hundreds or thousands of trees tend to work best. There is a direct tradeoff: smaller $\eta$ needs more trees, which means more computation, but usually gives better test performance.

Other forms of regularization in gradient boosting include:

  • Max tree depth: Limiting how deep each tree can grow (depths of 3 to 8 are common).
  • Subsampling: Training each tree on a random subset of the data (this borrows from bagging and reduces variance).
  • Min samples per leaf: Requiring a minimum number of samples in each leaf node.
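All three knobs are exposed as constructor parameters in scikit-learn's `GradientBoostingClassifier`. A sketch on synthetic data (the specific values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,   # shrinkage: small steps, more trees
    max_depth=3,          # shallow trees
    subsample=0.8,        # each tree sees a random 80% of the data
    min_samples_leaf=5,   # no tiny leaves
    random_state=0,
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Setting `subsample` below 1.0 gives stochastic gradient boosting, which borrows bagging's variance reduction on top of boosting's bias reduction.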

Training error vs test error as boosting rounds increase

XGBoost and LightGBM

The gradient boosting idea spawned several high-performance libraries that dominate tabular data competitions and production systems.

XGBoost (Extreme Gradient Boosting) adds a regularization term directly to the objective function. It penalizes both the number of leaves and the magnitude of leaf weights:

$$\text{Obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{t} \left(\gamma T_t + \frac{1}{2}\lambda \|w_t\|^2\right)$$

where $T_t$ is the number of leaves in tree $t$ and $w_t$ are the leaf weights. XGBoost also uses a second-order Taylor expansion of the loss (using the Hessian), which gives it better split-finding accuracy.

LightGBM speeds things up with histogram-based splitting and a leaf-wise growth strategy instead of level-wise. It handles large datasets efficiently and supports categorical features natively.

Both libraries share the same core idea: gradient boosting with trees. The differences are in engineering optimizations and regularization details. If you understand the algorithm we covered above, you understand the foundation of both.

Here is a quick example using scikit-learn’s gradient boosting:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Bagging vs Boosting: When to Use Which

| Aspect | Bagging (Random Forest) | Boosting (GBM, XGBoost) |
| --- | --- | --- |
| Training | Parallel | Sequential |
| Reduces | Variance | Bias |
| Overfitting risk | Low | Higher (needs tuning) |
| Sensitivity to noise | Robust | Can overfit noisy labels |
| Typical base learner | Full trees | Shallow trees (stumps) |
| Tuning effort | Minimal | More hyperparameters |

Bagging vs Boosting: parallel vs sequential

```mermaid
graph TD
  subgraph Bagging["Bagging: parallel"]
      B1["Bootstrap sample 1"] --> T1["Tree 1"]
      B2["Bootstrap sample 2"] --> T2["Tree 2"]
      B3["Bootstrap sample 3"] --> T3["Tree 3"]
      T1 --> V["Average or vote"]
      T2 --> V
      T3 --> V
  end
  subgraph Boosting["Boosting: sequential"]
      S1["Tree 1"] --> R1["Residuals"]
      R1 --> S2["Tree 2"]
      S2 --> R2["Residuals"]
      R2 --> S3["Tree 3"]
      S1 --> W["Weighted sum"]
      S2 --> W
      S3 --> W
  end
```

Use bagging when you have noisy data and want a model that works well out of the box with minimal tuning.

Use boosting when you need maximum predictive accuracy and are willing to spend time tuning learning rate, tree depth, and number of rounds. Boosting usually wins on clean, structured data, which is why it dominates competitions on tabular datasets. But it can chase noise if you are not careful with regularization and early stopping.

In practice, the best approach is often to try both. Train a Random Forest as a strong baseline, then see if gradient boosting can beat it with proper tuning.
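That head-to-head takes only a few lines with scikit-learn; a sketch on synthetic data (scores will vary with the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validated scores, rather than a single train/test split, make the comparison less sensitive to one lucky or unlucky partition.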

What Comes Next

Boosting handles structured tabular data extremely well, but some problems need a different geometric approach. Next up: Support Vector Machines, which find the optimal separating boundary by maximizing the margin between classes.
