Decision trees
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
A decision tree splits your data into smaller and smaller groups by asking yes/no questions about the features. Each question divides the data at an internal node. When you reach a leaf, you get a prediction. That’s it. No matrix algebra, no gradient descent, no loss surface to navigate. Just a sequence of if-else rules learned from the data.
The simplicity is real, but so is the power. Decision trees are the building blocks of random forests and gradient-boosted trees, two of the most effective algorithms in applied ML. Understanding how a single tree works gives you the foundation for everything that follows.
Loan approval: a decision in plain language
A bank reviews loan applications. Here are 8 recent applicants.
| Applicant | Income | Credit score | Employed | Approved |
|---|---|---|---|---|
| 1 | 45k | 620 | Yes | No |
| 2 | 80k | 720 | Yes | Yes |
| 3 | 30k | 580 | No | No |
| 4 | 65k | 680 | Yes | Yes |
| 5 | 50k | 700 | Yes | Yes |
| 6 | 35k | 600 | No | No |
| 7 | 90k | 750 | Yes | Yes |
| 8 | 40k | 650 | Yes | No |
A decision tree asks yes/no questions to split these applicants into pure groups. It might first ask: “Is the credit score above 660?” Applicants above that threshold mostly get approved. Those below mostly get rejected. Then it refines each group with more questions.
A simple loan approval tree
graph TD
    A["Credit score > 660?"] -->|Yes| B["Income > 50k?"]
    A -->|No| C["Rejected"]
    B -->|Yes| D["Approved"]
    B -->|No| E["Borderline: check employment"]
The tree learns these questions automatically from data. It tries every possible feature and every possible split point, then picks the one that separates the classes most cleanly. The measure of “cleanly” is called impurity, and reducing impurity is the entire optimization objective.
Now let’s define impurity formally and see exactly how the algorithm chooses splits.
Recursive partitioning
Building a decision tree is a recursive process:
- Look at all your data at the current node.
- Try every feature and every possible split point.
- Pick the split that best separates the target classes (or best reduces prediction error for regression).
- Create two child nodes, one for each side of the split.
- Repeat on each child until a stopping condition is met.
The result is a binary tree where internal nodes are questions (“Is age > 30?”), branches are answers (yes/no), and leaves are predictions.
Recursive splitting: how a tree grows
graph TD
    A["Start: all data at root"] --> B["Find best feature and threshold"]
    B --> C["Split into left and right child"]
    C --> D["Left child: repeat"]
    C --> E["Right child: repeat"]
    D --> F["Stop if pure or max depth"]
    E --> F
Impurity measures
To pick the “best” split, you need a way to measure how mixed a node is. A node that contains only one class is pure. A node with a 50/50 mix is maximally impure. Two common measures handle this: Gini impurity and entropy.
Gini impurity
For a node with $C$ classes, Gini impurity is:

$$G = 1 - \sum_{k=1}^{C} p_k^2$$

where $p_k$ is the fraction of samples belonging to class $k$.

If a node has 6 positive and 4 negative samples:

$$G = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48$$

A pure node has $G = 0$. The worst case for binary classification is $G = 0.5$ (perfectly balanced classes).
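The formula is a few lines of NumPy. Here is a minimal sketch (the function name `gini` is my own) that checks the 6/4 example:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

node = [1] * 6 + [0] * 4           # 6 positive, 4 negative samples
print(round(gini(node), 2))        # 0.48
print(gini([1, 1, 1]))             # 0.0: a pure node
print(gini([1, 1, 0, 0]))          # 0.5: worst case for binary labels
```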
Entropy and information gain
Entropy comes from information theory and measures the average surprise in a distribution:

$$H = -\sum_{k=1}^{C} p_k \log_2 p_k$$

For the same 6/4 split:

$$H = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.971$$

Information gain is the reduction in entropy after a split:

$$IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n}\, H(\text{child}_j)$$

where $n_j$ is the number of samples in child $j$ and $n$ is the total.
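Both formulas translate directly into code. A minimal sketch (helper names are my own) that reproduces the 6/4 entropy:

```python
import numpy as np

def entropy(labels):
    """Average surprise: -sum of p_k * log2(p_k) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

node = [1] * 6 + [0] * 4
print(round(entropy(node), 3))     # 0.971
```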
Gini vs entropy
In practice, Gini and entropy almost always pick the same split. Gini is slightly faster to compute because it avoids the logarithm. Scikit-learn uses Gini by default. Entropy can be more sensitive to class balance in edge cases. Pick either one; the difference is rarely meaningful.
Gini vs Entropy: two ways to measure node impurity
graph TD
subgraph Gini["Gini Impurity"]
G1["Pure node: G = 0"]
G2["50/50 split: G = 0.5"]
G3["No logarithm needed"]
end
subgraph Entropy["Entropy"]
E1["Pure node: H = 0"]
E2["50/50 split: H = 1.0"]
E3["Uses log base 2"]
end
N["Both pick the same split in most cases"]
The splitting algorithm
Here is the procedure for finding the best split at a node:
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity(y_left, y_right):
    """Size-weighted impurity of the two children."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def find_best_split(X, y):
    best_score = float('inf')
    best_feature = None
    best_threshold = None
    for feature in range(X.shape[1]):
        thresholds = sorted(set(X[:, feature]))
        # Candidate thresholds sit midway between consecutive unique values
        for i in range(len(thresholds) - 1):
            t = (thresholds[i] + thresholds[i + 1]) / 2
            left_mask = X[:, feature] <= t
            right_mask = ~left_mask
            score = weighted_impurity(y[left_mask], y[right_mask])
            if score < best_score:
                best_score = score
                best_feature = feature
                best_threshold = t
    return best_feature, best_threshold
For each feature, you sort its unique values and try thresholds between consecutive values. For each threshold, you compute the weighted impurity of the resulting left and right children. The split with the lowest weighted impurity wins.
The weighted impurity for a split is:

$$\text{score} = \frac{n_L}{n}\, G(\text{left}) + \frac{n_R}{n}\, G(\text{right})$$

where $n_L$ and $n_R$ are the sizes of the left and right children, and $n = n_L + n_R$.
Worked example 1: building a tree by hand with Gini impurity
Consider this small dataset. We want to predict whether someone buys a product based on Age and Income.
| Row | Age | Income | Buys |
|---|---|---|---|
| 1 | 25 | 40k | No |
| 2 | 30 | 50k | No |
| 3 | 35 | 60k | Yes |
| 4 | 40 | 55k | Yes |
| 5 | 45 | 70k | Yes |
| 6 | 50 | 45k | No |
| 7 | 28 | 65k | Yes |
| 8 | 33 | 42k | No |
We have 4 Yes and 4 No, so the root Gini is:

$$G = 1 - (0.5^2 + 0.5^2) = 0.5$$

Try splitting on Income with threshold 52.5k (between 50k and 55k):

Left (Income ≤ 52.5k): rows 1, 2, 6, 8 with labels [No, No, No, No]

Right (Income > 52.5k): rows 3, 4, 5, 7 with labels [Yes, Yes, Yes, Yes]

Weighted impurity:

$$\tfrac{4}{8} \cdot 0 + \tfrac{4}{8} \cdot 0 = 0$$
This is a perfect split. But let’s also check another split for comparison.
Try splitting on Age with threshold 32.5 (between 30 and 33):

Left (Age ≤ 32.5): rows 1, 2, 7 with labels [No, No, Yes]

Right (Age > 32.5): rows 3, 4, 5, 6, 8 with labels [Yes, Yes, Yes, No, No]

The left child has Gini $1 - \left(\left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2\right) \approx 0.444$ and the right child has $1 - (0.6^2 + 0.4^2) = 0.48$, so the weighted impurity is:

$$\tfrac{3}{8} \cdot 0.444 + \tfrac{5}{8} \cdot 0.48 \approx 0.467$$

The Income split (0.0) beats the Age split (0.467), so Income ≤ 52.5k is our root split.
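As a sanity check, both candidate splits can be recomputed in a few lines of NumPy (a sketch applying the Gini formula from earlier; helper names are my own):

```python
import numpy as np

def gini(labels):
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def weighted_gini(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# The 8-row buys-a-product dataset from the worked example
age    = np.array([25, 30, 35, 40, 45, 50, 28, 33])
income = np.array([40, 50, 60, 55, 70, 45, 65, 42])   # in thousands
buys   = np.array([0, 0, 1, 1, 1, 0, 1, 0])           # 1 = Yes

mask = income <= 52.5
print(weighted_gini(buys[mask], buys[~mask]))          # 0.0: a perfect split

mask = age <= 32.5
print(round(weighted_gini(buys[mask], buys[~mask]), 3))  # 0.467: much worse
```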
Since both children are already pure, no further splitting is needed. Here is the resulting tree:
graph TD
    A["Income ≤ 52.5k?"]
    A -->|Yes| B["🔴 No<br/>4/4 No"]
    A -->|No| C["🟢 Yes<br/>4/4 Yes"]
    style B fill:#ffcccc,stroke:#cc0000,color:#000
    style C fill:#ccffcc,stroke:#00cc00,color:#000
Decision boundary regions for a 2D classification problem
That was a clean example. Real data is messier, and you rarely get a single perfect split. In those cases, you keep splitting each child node recursively.
Worked example 2: information gain calculation
Let’s use a slightly different dataset to compare splits using entropy.
| Row | Feature A | Feature B | Label |
|---|---|---|---|
| 1 | 1 | 0 | + |
| 2 | 1 | 1 | + |
| 3 | 0 | 1 | + |
| 4 | 0 | 0 | - |
| 5 | 1 | 0 | - |
| 6 | 0 | 0 | - |
The root has 3 positive and 3 negative samples.
Root entropy:

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0$$

Maximum entropy for binary classification. Totally mixed.
Split on Feature A (A = 1 vs A = 0):
Left (A = 1): rows 1, 2, 5 with labels [+, +, -]

Right (A = 0): rows 3, 4, 6 with labels [+, -, -]

Each child has entropy $-\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918$. Information gain for Feature A:

$$IG_A = 1.0 - \left(\tfrac{3}{6} \cdot 0.918 + \tfrac{3}{6} \cdot 0.918\right) \approx 0.082$$
Split on Feature B (B = 1 vs B = 0):
Left (B = 1): rows 2, 3 with labels [+, +]

Right (B = 0): rows 1, 4, 5, 6 with labels [+, -, -, -]

The left child is pure ($H = 0$), and the right child has entropy $-\left(\tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{3}{4}\log_2\tfrac{3}{4}\right) \approx 0.811$. Information gain for Feature B:

$$IG_B = 1.0 - \left(\tfrac{2}{6} \cdot 0 + \tfrac{4}{6} \cdot 0.811\right) \approx 0.459$$
Feature B wins with information gain of 0.459 versus 0.082 for Feature A. Feature B produces a pure left child (both positive), which drops the overall entropy substantially. Feature A barely separates the classes at all.
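The comparison is easy to verify in code. A short sketch (helper names are my own) that reproduces both information gains:

```python
import math

def entropy(labels):
    """Shannon entropy in bits over the classes present."""
    n = len(labels)
    out = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        out -= p * math.log2(p)
    return out

def info_gain(parent, left, right):
    n = len(parent)
    return entropy(parent) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

# Rows 1-6 of the worked example: labels and the two binary features
labels = ['+', '+', '+', '-', '-', '-']
A      = [1, 1, 0, 0, 1, 0]
B      = [0, 1, 1, 0, 0, 0]

split = lambda feat: ([l for l, f in zip(labels, feat) if f == 1],
                      [l for l, f in zip(labels, feat) if f == 0])

print(round(info_gain(labels, *split(A)), 3))   # 0.082
print(round(info_gain(labels, *split(B)), 3))   # 0.459
```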
This is the core idea: the best split is the one that gives you the most information about the target variable.
Building the tree recursively
Once you pick the best split for the root, you repeat the entire process on each child node. The left child becomes a new “mini-dataset,” and you search for the best split within it. Same for the right child. You keep going until one of these stopping conditions is met:
- The node is pure (all samples have the same label).
- The node has fewer samples than some minimum threshold.
- The tree has reached a maximum depth.
- No split improves impurity beyond a minimum threshold.
Without stopping conditions, the tree will keep splitting until every leaf contains exactly one sample. That’s a problem.
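The whole recursive procedure fits in a page of code. Here is a minimal sketch (all names are my own; Gini impurity, midpoint thresholds, and a `max_depth` stopping rule as described above), tested on the buys-a-product dataset from the first worked example:

```python
import numpy as np

def gini(y):
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over features and midpoint thresholds."""
    best = (None, None, float('inf'))
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    # Stop if pure, too small, or too deep; the leaf predicts the majority class.
    values, counts = np.unique(y, return_counts=True)
    if len(values) == 1 or len(y) < min_samples or depth == max_depth:
        return {'leaf': values[np.argmax(counts)]}
    f, t, _ = best_split(X, y)
    if f is None:
        return {'leaf': values[np.argmax(counts)]}
    mask = X[:, f] <= t
    return {'feature': f, 'threshold': t,
            'left':  build_tree(X[mask],  y[mask],  depth + 1, max_depth, min_samples),
            'right': build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict_one(node, x):
    while 'leaf' not in node:
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node['leaf']

# Columns: Age, Income (thousands); target: buys (1 = Yes)
X = np.array([[25, 40], [30, 50], [35, 60], [40, 55],
              [45, 70], [50, 45], [28, 65], [33, 42]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

tree = build_tree(X, y)
print([int(predict_one(tree, row)) for row in X])   # matches y on this toy set
```

The root split it finds is Income ≤ 52.5, the same perfect split derived by hand earlier.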
Overfitting and the bias-variance tradeoff
A deep, unrestricted tree will fit the training data perfectly. Every training example gets its own leaf. Training accuracy: 100%. But the tree has memorized the noise in the training data, and it will perform badly on new data.
This is the classic bias-variance tradeoff. A deep tree has low bias (it can represent complex decision boundaries) but high variance (small changes in the training data lead to completely different trees). A shallow tree has higher bias but lower variance.
You can spot overfitting when there is a large gap between training accuracy and validation accuracy. If your tree gets 99% on training but 72% on validation, it is almost certainly overfitting.
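You can see that gap directly with a quick experiment: fit an unrestricted tree on noisy synthetic data and compare scores (a sketch using scikit-learn; `flip_y` injects label noise, and the exact numbers depend on the random seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 20% of labels are randomly flipped, so noise is there to memorize
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)              # no limits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print(f"deep:    train={deep.score(X_tr, y_tr):.2f}  val={deep.score(X_val, y_val):.2f}")
print(f"shallow: train={shallow.score(X_tr, y_tr):.2f}  val={shallow.score(X_val, y_val):.2f}")
```

The unrestricted tree hits 100% training accuracy while its validation score lags far behind; the depth-limited tree shows a much smaller gap.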
Pruning
Pruning reduces tree complexity to fight overfitting. It is a form of regularization, specific to tree-based models. There are two approaches.
Pruning: simplify the tree to fight overfitting
graph LR
    A["Full unpruned tree"] --> B["Many leaves, fits noise"]
    B --> C["Prune: remove weak splits"]
    C --> D["Simpler tree, fewer leaves"]
    D --> E["Better generalization"]
Pre-pruning (early stopping)
You set constraints before building the tree:
- max_depth: limit how deep the tree can grow.
- min_samples_split: require at least that many samples at a node before attempting a split.
- min_samples_leaf: require at least that many samples in every leaf.
- min_impurity_decrease: only split if the impurity drops by at least some threshold.
These are hyperparameters you tune with cross-validation.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=4
)
tree.fit(X_train, y_train)
Pre-pruning is simple and fast. The downside is that it might stop splitting too early. A split that looks useless on its own might enable a very useful split one level deeper. Pre-pruning cannot see that.
Post-pruning (cost-complexity pruning)
Post-pruning builds the full tree first, then trims it back. The most common method is cost-complexity pruning, which defines a cost:

$$R_\alpha(T) = R(T) + \alpha |T|$$

where $R(T)$ is the misclassification rate of tree $T$, $|T|$ is the number of leaves, and $\alpha$ is a complexity parameter. A larger $\alpha$ penalizes bigger trees more heavily, resulting in more aggressive pruning.
You sweep over values of $\alpha$ and pick the one that minimizes validation error. Scikit-learn supports this directly:

from sklearn.tree import DecisionTreeClassifier

# tree is a full, unpruned DecisionTreeClassifier already fit on the training data
path = tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

best_alpha = None
best_score = 0
for a in alphas:
    t = DecisionTreeClassifier(ccp_alpha=a)
    t.fit(X_train, y_train)
    score = t.score(X_val, y_val)
    if score > best_score:
        best_score = score
        best_alpha = a
Handling continuous vs categorical features
Continuous features are straightforward. Sort the values and consider thresholds between consecutive unique values. For a feature with $v$ unique values, you have $v - 1$ candidate thresholds.

Categorical features are trickier. If a feature has $k$ categories, there are $2^{k-1} - 1$ possible binary partitions. For a feature with 10 categories, that is 511 possible splits. This gets expensive.

Some implementations (like scikit-learn) require you to one-hot encode categorical features. Others (like LightGBM) handle categories natively using an optimal algorithm that sorts categories by their average target value, then tries splits on this sorted order. This reduces the problem to $k - 1$ candidate splits, the same as a continuous feature.
⚠ One-hot encoding deep categorical features (like zip codes with thousands of values) can degrade tree performance. Consider target encoding or using a library with native categorical support in those cases.
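The mean-target ordering trick is easy to sketch: sort the categories by their average label, then treat that order like a continuous feature. This is a simplified illustration of the idea (function name is my own), not LightGBM's actual implementation:

```python
import numpy as np

def categorical_split_candidates(cats, y):
    """Order categories by mean target, then emit prefix/suffix partitions."""
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    order = sorted(np.unique(cats), key=lambda c: y[cats == c].mean())
    # k categories -> only k - 1 candidate binary partitions to evaluate
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]

cats = ['red', 'red', 'blue', 'blue', 'green', 'green']
y    = [0, 0, 1, 1, 1, 0]   # mean target: red 0.0, green 0.5, blue 1.0
for left, right in categorical_split_candidates(cats, y):
    print(left, '|', right)
```

With 3 categories this yields 2 candidates instead of the $2^{2} - 1 = 3$ exhaustive partitions; the savings grow exponentially with $k$.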
Decision trees for regression
Everything above applies to classification. For regression, the idea is the same, but the impurity measure and prediction change.
Instead of Gini or entropy, you use mean squared error (MSE) as the splitting criterion:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean of the target values in the node.
Each leaf predicts the mean of the target values that fall into it. The best split is the one that minimizes the weighted MSE of the children.
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(max_depth=4)
reg_tree.fit(X_train, y_train)
predictions = reg_tree.predict(X_test)
Regression trees have the same overfitting problems as classification trees. A tree with enough depth will memorize every training target value. The same pruning techniques apply.
One important limitation: regression trees predict constant values within each leaf region. They cannot extrapolate beyond the range of training data. If your training targets range from 0 to 100, the tree will never predict 105. Keep this in mind when your data has trends that extend beyond the training range.
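The no-extrapolation behavior is easy to demonstrate: train on targets in a known range and predict outside it (a quick sketch; the exact leaf means depend on the tree depth):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# y = 2x on x in [0, 50], so training targets range from 0 to 100
X = np.arange(0, 51).reshape(-1, 1).astype(float)
y = 2.0 * X.ravel()

tree = DecisionTreeRegressor(max_depth=6).fit(X, y)

print(tree.predict([[25.0]])[0])    # inside the training range: fine
print(tree.predict([[80.0]])[0])    # outside: capped by the rightmost leaf mean
print(tree.predict([[200.0]])[0])   # same leaf, same prediction
```

Every split threshold lies between two training values of x, so any query above the largest threshold falls into the same rightmost leaf and gets that leaf's mean, never more than 100.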
Strengths and limitations
Strengths:
- ✓ Easy to interpret and visualize. You can show a decision tree to a non-technical stakeholder and they will understand it.
- ✓ No feature scaling needed. Trees only care about the rank order of values, not their magnitude.
- ✓ Handle both numerical and categorical data.
- ✓ Capture non-linear relationships and feature interactions naturally.
Limitations:
- ✗ Unstable. Small changes in data can produce completely different trees.
- ✗ Greedy algorithm. Each split is locally optimal, not globally optimal.
- ✗ Axis-aligned splits only. A single tree cannot efficiently represent diagonal decision boundaries.
- ✗ Prone to overfitting without pruning or depth limits.
These limitations are exactly why ensemble methods exist. By combining many trees, you get the best of both worlds: the expressiveness of trees with the stability of averaging.
What comes next
A single decision tree is powerful but fragile. The natural next step is to combine many trees to create something more robust. In Ensemble Methods: Bagging and Random Forests, we build on everything here to show how averaging over hundreds of trees reduces variance while keeping bias low. If you understood how a single tree works, ensembles will click quickly.