
Model selection and cross-validation

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

You have trained three models on the same data. Which one should you deploy? You cannot just pick the one with the lowest training error, because that model might be overfitting. You need a way to estimate how each model will perform on data it has never seen. Cross-validation gives you that estimate.

Prerequisites: You should understand the bias-variance tradeoff.

The problem with a single train/test split

The simplest approach is to split your data into a training set and a test set (say, 80/20). Train on the 80%, evaluate on the 20%. But this has problems:

  • Variance: Your estimate depends heavily on which points ended up in the test set. A different random split might give a very different score.
  • Wasted data: You are not using 20% of your data for training. With small datasets, this hurts.
  • No tuning data: If you use the test set to choose hyperparameters, you are leaking information and your final estimate is optimistic.
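The variance problem is easy to demonstrate: fit the same model on several different random 80/20 splits of the same data and watch the test error move. A small NumPy sketch on synthetic data (the dataset and seeds are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=40)   # noisy line

test_errors = []
for seed in range(5):
    idx = np.random.default_rng(seed).permutation(len(x))
    train, test = idx[:32], idx[32:]              # a different 80/20 split each time
    w, b = np.polyfit(x[train], y[train], deg=1)  # least-squares line fit
    test_errors.append(np.mean((y[test] - (w * x[test] + b)) ** 2))

# test_errors differs from split to split even though the data never changed
```

The model and the data are identical in every iteration; only the split changes, yet the test error does not stay put.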

K-fold cross-validation

K-fold CV solves these problems by rotating which data serves as the test set.

  1. Split the data into K roughly equal folds (partitions)
  2. For each fold k = 1, \ldots, K:
    • Train on all data except fold k
    • Evaluate on fold k to get error e_k
  3. Average the errors: \text{CV} = \frac{1}{K}\sum_{k=1}^K e_k

Every point is used for testing exactly once and for training K-1 times. Common choices are K = 5 or K = 10.

graph TB
  subgraph "K=5 fold CV"
      F1["Fold 1: TEST | Train | Train | Train | Train"]
      F2["Fold 2: Train | TEST | Train | Train | Train"]
      F3["Fold 3: Train | Train | TEST | Train | Train"]
      F4["Fold 4: Train | Train | Train | TEST | Train"]
      F5["Fold 5: Train | Train | Train | Train | TEST"]
  end
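The three-step recipe above fits in a few lines of NumPy. This is a minimal sketch, not a library API: the `k_fold_cv` helper, the synthetic dataset, and the 1-D line-fit model are all made up for the example.

```python
import numpy as np

def k_fold_cv(x, y, K, seed=0):
    """Minimal K-fold CV for a 1-D least-squares line fit (np.polyfit)."""
    idx = np.random.default_rng(seed).permutation(len(x))  # shuffle before splitting
    folds = np.array_split(idx, K)                 # K roughly equal folds
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w, b = np.polyfit(x[train], y[train], deg=1)   # train on the other K-1 folds
        errors.append(np.mean((y[test] - (w * x[test] + b)) ** 2))  # held-out MSE
    return np.mean(errors), errors                 # CV estimate, per-fold errors

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)    # noisy synthetic line
cv_mse, fold_errors = k_fold_cv(x, y, K=5)
```

Every index lands in exactly one test fold, so each point is tested once and trained on K-1 times, as described above.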

Leave-one-out CV

When K = n (the number of data points), you get leave-one-out cross-validation (LOOCV). Each fold has exactly one test point. This uses the maximum amount of training data and has low bias, but:

  • It is expensive: you train n models
  • The estimates are highly correlated (any two training sets share n-2 of their n-1 points), which can increase variance

For most problems, 5-fold or 10-fold CV gives a good balance between bias and variance of the estimate.
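LOOCV is just the K-fold recipe with one test point per fold. A minimal sketch on a hypothetical synthetic dataset, refitting a least-squares line on the remaining n-1 points each time:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=20)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=20)   # noisy synthetic line

n = len(x)
errors = []
for i in range(n):
    mask = np.arange(n) != i                      # leave point i out
    w, b = np.polyfit(x[mask], y[mask], deg=1)    # train on the other n-1 points
    errors.append((y[i] - (w * x[i] + b)) ** 2)   # squared error on the one test point

loocv_mse = np.mean(errors)                       # average over all n models
```

Note the cost: the loop fits n models, which is why LOOCV gets expensive for large datasets.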

Example 1: 3-fold CV on a small regression dataset

Consider 6 data points and a simple linear regression model \hat{y} = wx + b:

| i | x | y |
| --- | --- | --- |
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |
| 6 | 6 | 12.0 |

We use K = 3 folds, each with 2 points.

Fold assignment: Fold 1 = {1, 2}, Fold 2 = {3, 4}, Fold 3 = {5, 6}

Round 1: test on Fold 1, train on Folds 2+3

Training data: (3, 6.2), (4, 7.9), (5, 10.1), (6, 12.0)

Fit a line using the training points. The least-squares formulas give us:

\bar{x} = \frac{3+4+5+6}{4} = 4.5, \quad \bar{y} = \frac{6.2+7.9+10.1+12.0}{4} = 9.05

w = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

\sum (x_i - \bar{x})(y_i - \bar{y}) = (-1.5)(-2.85) + (-0.5)(-1.15) + (0.5)(1.05) + (1.5)(2.95) = 4.275 + 0.575 + 0.525 + 4.425 = 9.80

\sum (x_i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5.0

w = \frac{9.80}{5.0} = 1.96, \quad b = 9.05 - 1.96 \times 4.5 = 0.23

Model: \hat{y} = 1.96x + 0.23

Test on Fold 1:

\hat{y}_1 = 1.96(1) + 0.23 = 2.19, \quad e_1 = (2.1 - 2.19)^2 = 0.0081

\hat{y}_2 = 1.96(2) + 0.23 = 4.15, \quad e_2 = (3.8 - 4.15)^2 = 0.1225

\text{MSE}_1 = \frac{0.0081 + 0.1225}{2} = 0.0653

Round 2: test on Fold 2, train on Folds 1+3

Training data: (1, 2.1), (2, 3.8), (5, 10.1), (6, 12.0)

\bar{x} = 3.5, \quad \bar{y} = 7.0

\sum (x_i - \bar{x})(y_i - \bar{y}) = (-2.5)(-4.9) + (-1.5)(-3.2) + (1.5)(3.1) + (2.5)(5.0) = 12.25 + 4.80 + 4.65 + 12.50 = 34.20

\sum (x_i - \bar{x})^2 = 6.25 + 2.25 + 2.25 + 6.25 = 17.0

w = \frac{34.20}{17.0} = 2.012, \quad b = 7.0 - 2.012 \times 3.5 = -0.042

Test on Fold 2:

\hat{y}_3 = 2.012(3) - 0.042 = 5.994, \quad e_3 = (6.2 - 5.994)^2 = 0.0424

\hat{y}_4 = 2.012(4) - 0.042 = 8.006, \quad e_4 = (7.9 - 8.006)^2 = 0.0112

\text{MSE}_2 = \frac{0.0424 + 0.0112}{2} = 0.0268

Round 3: test on Fold 3, train on Folds 1+2

Training data: (1, 2.1), (2, 3.8), (3, 6.2), (4, 7.9)

\bar{x} = 2.5, \quad \bar{y} = 5.0

\sum (x_i - \bar{x})(y_i - \bar{y}) = (-1.5)(-2.9) + (-0.5)(-1.2) + (0.5)(1.2) + (1.5)(2.9) = 4.35 + 0.60 + 0.60 + 4.35 = 9.90

\sum (x_i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5.0

w = \frac{9.90}{5.0} = 1.98, \quad b = 5.0 - 1.98 \times 2.5 = 0.05

Test on Fold 3:

\hat{y}_5 = 1.98(5) + 0.05 = 9.95, \quad e_5 = (10.1 - 9.95)^2 = 0.0225

\hat{y}_6 = 1.98(6) + 0.05 = 11.93, \quad e_6 = (12.0 - 11.93)^2 = 0.0049

\text{MSE}_3 = \frac{0.0225 + 0.0049}{2} = 0.0137

Final CV estimate

\text{CV-MSE} = \frac{0.0653 + 0.0268 + 0.0137}{3} = 0.0353

The average MSE across folds is approximately 0.035. This is our estimate of the model's generalization error.
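As a sanity check, the entire worked example can be reproduced in a few lines of NumPy, using the same contiguous fold assignment:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])
folds = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # Folds 1, 2, 3

fold_mse = []
for k, test in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    w, b = np.polyfit(x[train], y[train], deg=1)   # least-squares fit on 2 folds
    fold_mse.append(np.mean((y[test] - (w * x[test] + b)) ** 2))

cv_mse = np.mean(fold_mse)   # ≈ 0.0353, matching the hand calculation
```

The per-fold MSEs come out to roughly 0.0653, 0.0268, and 0.0137, matching the three rounds above.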

Example 2: variance of the CV estimate

The three fold errors were {0.0653, 0.0268, 0.0137}. How stable is our CV estimate?

Mean: \bar{e} = 0.0353

Standard deviation of fold errors:

s=(0.06530.0353)2+(0.02680.0353)2+(0.01370.0353)231s = \sqrt{\frac{(0.0653 - 0.0353)^2 + (0.0268 - 0.0353)^2 + (0.0137 - 0.0353)^2}{3 - 1}}

=(0.0300)2+(0.0085)2+(0.0216)22= \sqrt{\frac{(0.0300)^2 + (-0.0085)^2 + (-0.0216)^2}{2}}

=0.000900+0.000072+0.0004672=0.0014392=0.0007200.0268= \sqrt{\frac{0.000900 + 0.000072 + 0.000467}{2}} = \sqrt{\frac{0.001439}{2}} = \sqrt{0.000720} \approx 0.0268

Standard error of the mean:

SE=sK=0.026830.0155\text{SE} = \frac{s}{\sqrt{K}} = \frac{0.0268}{\sqrt{3}} \approx 0.0155

So our CV estimate is 0.035 ± 0.016 (one standard error). The variance is substantial because we only have 3 folds. With K = 10 folds, the standard error would typically be smaller.

This is why reporting the standard error alongside the CV score matters. A model with CV error 0.035 ± 0.016 and a model with 0.042 ± 0.015 might not be meaningfully different.
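The same numbers fall out of NumPy directly; `ddof=1` gives the sample standard deviation used in the formula above:

```python
import numpy as np

fold_errors = np.array([0.0653, 0.0268, 0.0137])

mean = fold_errors.mean()            # ≈ 0.0353
s = fold_errors.std(ddof=1)          # sample standard deviation ≈ 0.0268
se = s / np.sqrt(len(fold_errors))   # standard error of the mean ≈ 0.0155
```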

Hyperparameter tuning

Most models have hyperparameters that you set before training: the regularization strength λ, the number of neighbors k in KNN, or the kernel width in an SVM. Cross-validation tells you which hyperparameter value gives the best generalization.

Grid search: Define a grid of hyperparameter values and run CV for each one. Pick the value with the lowest CV error.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Example: tuning regularization strength with 5-fold CV
# (X, y: your feature matrix and target vector)
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for lam in lambdas:
    fold_errors = []
    for train_idx, val_idx in kfold.split(X):
        model = Ridge(alpha=lam)
        model.fit(X[train_idx], y[train_idx])
        error = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
        fold_errors.append(error)
    cv_scores.append(np.mean(fold_errors))

best_lambda = lambdas[np.argmin(cv_scores)]

Random search: Instead of exhaustively searching a grid, sample hyperparameters randomly from a distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search when only a few hyperparameters actually matter, because it explores more distinct values of the important ones.

The one-standard-error rule: Instead of picking the hyperparameter with the absolute lowest CV error, pick the simplest model whose CV error is within one standard error of the minimum. This favors simpler models and reduces overfitting.
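A sketch of the one-standard-error rule, assuming per-candidate mean CV errors and standard errors are already computed (the numbers below are made up; for Ridge, a larger λ means a simpler, more heavily regularized model):

```python
import numpy as np

# Hypothetical CV results: mean error and standard error per lambda
lambdas = np.array([0.001, 0.01, 0.1, 1.0, 10.0])
cv_scores = np.array([0.048, 0.042, 0.040, 0.041, 0.055])
cv_ses = np.array([0.006, 0.005, 0.004, 0.004, 0.007])

best = np.argmin(cv_scores)
threshold = cv_scores[best] + cv_ses[best]     # minimum error + one SE

# Among candidates within the threshold, pick the simplest (largest lambda)
within = np.flatnonzero(cv_scores <= threshold)
one_se_lambda = lambdas[within].max()
```

Here the minimum CV error is at λ = 0.1, but λ = 1.0 is within one standard error of it, so the rule selects the more regularized λ = 1.0.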

(Figure: validation accuracy as a function of a hyperparameter k. Accuracy peaks near k = 6 and drops for both very small k (overfitting) and very large k (underfitting); error bars show standard deviation across folds.)

Nested cross-validation

There is a subtle trap: if you use CV to choose hyperparameters, your CV score is now optimistically biased. You selected the best hyperparameter based on the CV scores, so the reported score is not a fair estimate of generalization performance.

Nested CV fixes this with two loops:

  • Outer loop (e.g., 5-fold): Estimates generalization error
  • Inner loop (e.g., 5-fold on the outer training set): Selects hyperparameters
graph TD
  A[Full dataset] --> B[Outer fold 1: 80% train, 20% test]
  A --> C[Outer fold 2: 80% train, 20% test]
  A --> D[...]
  B --> E[Inner CV on 80%: tune hyperparameters]
  E --> F[Train best model on 80%]
  F --> G[Evaluate on 20% test]

For each outer fold:

  1. Split training data again for inner CV
  2. Run grid search on the inner folds to pick the best hyperparameter
  3. Retrain on the full outer training set with that hyperparameter
  4. Evaluate on the outer test fold

The average of the outer fold scores is an unbiased estimate of generalization error for the entire model selection process.

Nested CV is expensive (e.g., 5 outer × 5 inner = 25 model fits per hyperparameter setting), but it is the correct way to report performance when hyperparameter tuning is part of your pipeline.
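With scikit-learn, the inner loop can be a GridSearchCV passed as the estimator to cross_val_score, which then acts as the outer loop. A sketch on a synthetic dataset (the parameter grid and dataset sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)   # selects alpha
outer = KFold(n_splits=5, shuffle=True, random_state=2)   # estimates error

grid = GridSearchCV(Ridge(), {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
                    cv=inner, scoring="neg_mean_squared_error")
outer_scores = cross_val_score(grid, X, y, cv=outer,
                               scoring="neg_mean_squared_error")

nested_mse = -outer_scores.mean()   # estimate of generalization MSE
```

Each outer fold triggers a full inner grid search, so the 25-fits-per-setting cost mentioned above happens automatically inside cross_val_score.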

Stratified CV

For classification problems, random splits can create folds where one class is underrepresented. Stratified K-fold ensures each fold has roughly the same proportion of each class as the full dataset.

This is especially important when classes are imbalanced. If 5% of your data is positive and a fold happens to have 0% positives, the error estimate for that fold will be meaningless.
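A quick check of what stratification buys you, using scikit-learn's StratifiedKFold on a synthetic label vector (the 5%-positive setup mirrors the example above and is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 5% positive class: 10 positives out of 200 samples
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((len(y), 1))            # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rates = [y[test].mean() for _, test in skf.split(X, y)]
# every fold preserves the 5% positive rate of the full dataset
```

A plain KFold on the same labels could easily produce a fold with zero positives; StratifiedKFold distributes the 10 positives evenly, 2 per fold.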

Practical advice

  • Use 5 or 10 folds for most problems. 5-fold is faster; 10-fold has slightly less bias.
  • Shuffle your data before splitting into folds, unless the data has a natural time ordering (then use time-series CV).
  • Repeat CV with different random splits and average the results for a more stable estimate.
  • Use nested CV when reporting final results in a paper or comparison.
  • Never touch the test set until you have made all model choices. The test set is for a final, unbiased evaluation only.
  • Stratify for classification.

Summary

| Concept | Key idea |
| --- | --- |
| K-fold CV | Rotate test set across K partitions; average errors |
| LOOCV | K = n; low bias, high variance, expensive |
| Standard error | Report SE = s/√K alongside CV score |
| Hyperparameter tuning | Grid search or random search over CV scores |
| Nested CV | Outer loop estimates error; inner loop tunes hyperparameters |
| One-SE rule | Pick simplest model within one SE of the minimum CV error |

What comes next

Cross-validation tells you how well a model performs, but the quality of your features determines the ceiling. No amount of model tuning can fix bad features. In the next post on feature engineering and selection, you will learn how to create, transform, and select the features that matter most.
