Model selection and cross-validation
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
You have trained three models on the same data. Which one should you deploy? You cannot just pick the one with the lowest training error, because that model might be overfitting. You need a way to estimate how each model will perform on data it has never seen. Cross-validation gives you that estimate.
Prerequisites: You should understand the bias-variance tradeoff.
The problem with a single train/test split
The simplest approach is to split your data into a training set and a test set (say, 80/20). Train on the 80%, evaluate on the 20%. But this has problems:
- Variance: Your estimate depends heavily on which points ended up in the test set. A different random split might give a very different score.
- Wasted data: You are not using 20% of your data for training. With small datasets, this hurts.
- No tuning data: If you use the test set to choose hyperparameters, you are leaking information and your final estimate is optimistic.
K-fold cross-validation
K-fold CV solves these problems by rotating which data serves as the test set.
- Split the data into $K$ roughly equal folds (partitions)
- For each fold $k = 1, \dots, K$:
  - Train on all data except fold $k$
  - Evaluate on fold $k$ to get error $E_k$
- Average the errors: $E_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} E_k$

Every point is used for testing exactly once and for training $K - 1$ times. Common choices are $K = 5$ or $K = 10$.
graph TB
subgraph "K=5 fold CV"
F1["Fold 1: TEST | Train | Train | Train | Train"]
F2["Fold 2: Train | TEST | Train | Train | Train"]
F3["Fold 3: Train | Train | TEST | Train | Train"]
F4["Fold 4: Train | Train | Train | TEST | Train"]
F5["Fold 5: Train | Train | Train | Train | TEST"]
end
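The procedure above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the `fit` and `predict` callables are assumptions introduced here so any model can be plugged in.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, K=5, seed=0):
    """Plain K-fold CV: returns the list of per-fold MSEs.
    `fit(X, y)` returns a model; `predict(model, X)` returns predictions.
    (Both callables are illustrative assumptions.)"""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)  # shuffle before splitting
    folds = np.array_split(idx, K)                    # K roughly equal folds
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - predict(model, X[test_idx])
        errors.append(np.mean(resid ** 2))            # fold MSE
    return errors

# Usage with least-squares linear regression on a tiny dataset:
X = np.arange(1, 7, dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda beta, X: np.c_[np.ones(len(X)), X] @ beta
errors = k_fold_cv(X, y, fit, predict, K=3)
print(np.mean(errors))  # small, since the data is nearly linear
```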
Leave-one-out CV
When $K = n$ (the number of data points), you get leave-one-out cross-validation (LOOCV). Each fold has exactly one test point. This uses the maximum amount of training data and has low bias, but:
- It is expensive: you train $n$ models
- The estimates are highly correlated (any two training sets share $n - 2$ of their $n - 1$ points), which can increase variance
For most problems, 5-fold or 10-fold CV gives a good balance between bias and variance of the estimate.
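As a quick illustration, scikit-learn treats LOOCV as just another CV splitter. This sketch reuses the six-point dataset from the example below; the negative-MSE sign convention is scikit-learn's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.arange(1, 7, dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])

# LOOCV = K-fold with K = n: one model is trained per data point
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(len(scores))     # n = 6 models trained
print(-scores.mean())  # LOOCV estimate of the MSE
```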
Example 1: 3-fold CV on a small regression dataset
Consider 6 data points with a simple linear regression model $\hat y = \beta_0 + \beta_1 x$:
| $i$ | $x_i$ | $y_i$ |
|---|---|---|
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |
| 6 | 6 | 12.0 |
We use $K = 3$ folds, each with 2 points.
Fold assignment: Fold 1 = $\{1, 2\}$, Fold 2 = $\{3, 4\}$, Fold 3 = $\{5, 6\}$ (point indices $i$)
Round 1: test on Fold 1, train on Folds 2+3
Training data: $(3, 6.2), (4, 7.9), (5, 10.1), (6, 12.0)$
Fit a line using the training points. The normal equations give us $\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{9.8}{5} = 1.96$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 9.05 - 1.96 \cdot 4.5 = 0.23$.
Model: $\hat y = 0.23 + 1.96x$
Test on Fold 1: $\hat y(1) = 2.19$ vs. $2.1$ and $\hat y(2) = 4.15$ vs. $3.8$, so $E_1 = \frac{(2.1 - 2.19)^2 + (3.8 - 4.15)^2}{2} \approx 0.0653$
Round 2: test on Fold 2, train on Folds 1+3
Training data: $(1, 2.1), (2, 3.8), (5, 10.1), (6, 12.0)$, giving the model $\hat y = -0.041 + 2.012x$
Test on Fold 2: $\hat y(3) \approx 5.994$ vs. $6.2$ and $\hat y(4) \approx 8.006$ vs. $7.9$, so $E_2 \approx 0.0268$
Round 3: test on Fold 3, train on Folds 1+2
Training data: $(1, 2.1), (2, 3.8), (3, 6.2), (4, 7.9)$, giving the model $\hat y = 0.05 + 1.98x$
Test on Fold 3: $\hat y(5) = 9.95$ vs. $10.1$ and $\hat y(6) = 11.93$ vs. $12.0$, so $E_3 \approx 0.0137$
Final CV estimate
The average MSE across folds is $E_{\text{CV}} = \frac{0.0653 + 0.0268 + 0.0137}{3} \approx 0.035$. This is our estimate of the model's generalization error.
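The arithmetic above can be checked in a few lines of NumPy, assuming the contiguous fold assignment (Fold 1 = points 1–2, Fold 2 = points 3–4, Fold 3 = points 5–6):

```python
import numpy as np

x = np.arange(1, 7, dtype=float)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])
folds = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # zero-based indices

errors = []
for k, test in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    A = np.c_[np.ones(len(train)), x[train]]          # design matrix [1, x]
    b0, b1 = np.linalg.lstsq(A, y[train], rcond=None)[0]
    errors.append(np.mean((y[test] - (b0 + b1 * x[test])) ** 2))

print([round(e, 4) for e in errors])  # → [0.0653, 0.0268, 0.0137]
print(round(np.mean(errors), 4))      # → 0.0353
```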
Example 2: variance of the CV estimate
The three fold errors were $0.0653, 0.0268, 0.0137$. How stable is our CV estimate?
Mean: $\bar E = \frac{0.0653 + 0.0268 + 0.0137}{3} \approx 0.0353$
Standard deviation of fold errors: $s = \sqrt{\frac{1}{K-1} \sum_k (E_k - \bar E)^2} \approx 0.0268$
Standard error of the mean: $\text{SE} = \frac{s}{\sqrt{K}} \approx 0.0155$
So our CV estimate is $0.035 \pm 0.016$ (one standard error). The variance is substantial because we only have 3 folds. With $K = 10$ folds, the standard error would typically be smaller.
This is why reporting the standard error alongside the CV score matters. A model with CV error $0.035 \pm 0.015$ and a model with $0.045 \pm 0.015$ might not be meaningfully different.
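The same standard-error computation, as a quick NumPy check on the fold errors from Example 1:

```python
import numpy as np

fold_errors = np.array([0.0653, 0.0268, 0.0137])  # per-fold MSEs from Example 1

mean = fold_errors.mean()
std = fold_errors.std(ddof=1)          # sample standard deviation (K - 1 in the denominator)
se = std / np.sqrt(len(fold_errors))   # standard error of the mean

print(round(mean, 4), round(std, 4), round(se, 4))  # → 0.0353 0.0268 0.0155
```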
Hyperparameter tuning
Most models have hyperparameters that you set before training: the regularization strength $\lambda$, the number of neighbors $k$ in KNN, or the kernel width in an SVM. Cross-validation tells you which hyperparameter value gives the best generalization.
Grid search: Define a grid of hyperparameter values and run CV for each one. Pick the value with the lowest CV error.
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Example: tuning regularization strength lambda with K-fold CV
# (assumes X with shape (n, d) and y with shape (n,) are already defined)
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = []
for lam in lambdas:
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam)
        model.fit(X[train_idx], y[train_idx])
        error = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
        fold_errors.append(error)
    cv_scores.append(np.mean(fold_errors))

best_lambda = lambdas[np.argmin(cv_scores)]
```
Random search: Instead of exhaustively searching a grid, sample hyperparameters randomly from a distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search when only a few hyperparameters actually matter, because it explores more distinct values of the important ones.
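A random search can reuse the same CV machinery; the only change is how candidates are generated. This sketch samples $\lambda$ log-uniformly on synthetic data (the dataset here is a hypothetical stand-in):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical regression data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Sample lambda log-uniformly in [1e-3, 10] instead of using a fixed grid
candidates = 10 ** rng.uniform(-3, 1, size=20)
scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
          for lam in candidates]
best_lambda = candidates[int(np.argmax(scores))]  # highest neg-MSE = lowest MSE
print(best_lambda)
```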
The one-standard-error rule: Instead of picking the hyperparameter with the absolute lowest CV error, pick the simplest model whose CV error is within one standard error of the minimum. This favors simpler models and reduces overfitting.
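The one-standard-error rule takes a couple of lines once the per-value CV means and standard errors are in hand. The numbers below are hypothetical CV results, ordered from most complex (small $\lambda$) to simplest (large $\lambda$):

```python
import numpy as np

# Hypothetical CV results for five lambda values (larger lambda = simpler model)
lambdas  = np.array([0.001, 0.01, 0.1, 1.0, 10.0])
cv_means = np.array([0.210, 0.180, 0.175, 0.190, 0.300])
cv_ses   = np.array([0.020, 0.015, 0.015, 0.018, 0.030])

best = np.argmin(cv_means)
threshold = cv_means[best] + cv_ses[best]      # minimum CV error + one SE
eligible = np.where(cv_means <= threshold)[0]  # all models within one SE
one_se_lambda = lambdas[eligible.max()]        # largest lambda = simplest eligible model
print(one_se_lambda)  # → 1.0, even though 0.1 has the lowest CV error
```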
(Figure: validation accuracy vs. the number of neighbors $k$ in KNN. Accuracy peaks near $k = 6$ and drops for both very small $k$ (overfitting) and very large $k$ (underfitting). Error bars show the standard deviation across folds.)
Nested cross-validation
There is a subtle trap: if you use CV to choose hyperparameters, your CV score is now optimistically biased. You selected the best hyperparameter based on the CV scores, so the reported score is not a fair estimate of generalization performance.
Nested CV fixes this with two loops:
- Outer loop (e.g., 5-fold): Estimates generalization error
- Inner loop (e.g., 5-fold on the outer training set): Selects hyperparameters
```mermaid
graph TD
    A[Full dataset] --> B[Outer fold 1: 80% train, 20% test]
    A --> C[Outer fold 2: 80% train, 20% test]
    A --> D[...]
    B --> E[Inner CV on 80%: tune hyperparameters]
    E --> F[Train best model on 80%]
    F --> G[Evaluate on 20% test]
```
For each outer fold:
- Split training data again for inner CV
- Run grid search on the inner folds to pick the best hyperparameter
- Retrain on the full outer training set with that hyperparameter
- Evaluate on the outer test fold
The average of the outer fold scores is an approximately unbiased estimate of generalization error for the entire model selection process.
Nested CV is expensive (e.g., 5 outer × 5 inner = 25 model fits per hyperparameter setting), but it is the correct way to report performance when hyperparameter tuning is part of your pipeline.
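In scikit-learn, nesting falls out naturally by passing a `GridSearchCV` object to `cross_val_score`: the inner search is rerun from scratch inside every outer training fold. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)  # inner loop: tunes lambda
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # outer loop: estimates error

tuner = GridSearchCV(Ridge(), {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
                     cv=inner, scoring="neg_mean_squared_error")

# Each outer fold refits the whole inner grid search on its training portion
scores = cross_val_score(tuner, X, y, cv=outer, scoring="neg_mean_squared_error")
print(-scores.mean())  # nested-CV estimate of generalization MSE
```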
Stratified CV
For classification problems, random splits can create folds where one class is underrepresented. Stratified K-fold ensures each fold has roughly the same proportion of each class as the full dataset.
This is especially important when classes are imbalanced. If 5% of your data is positive and a fold happens to have 0% positives, the error estimate for that fold will be meaningless.
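A small sketch of this failure mode and its fix, using a hypothetical dataset with 5% positives. `StratifiedKFold` spreads the rare class evenly across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 100 points, only 5 positives (5%)
y = np.zeros(100, dtype=int)
y[:5] = 1
X = np.arange(100, dtype=float).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]
print(positives_per_fold)  # each fold keeps the 5% positive rate: [1, 1, 1, 1, 1]
```

An unstratified split of the same data could easily leave some folds with zero positives.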
Practical advice
- Use 5 or 10 folds for most problems. 5-fold is faster; 10-fold has slightly less bias.
- Shuffle your data before splitting into folds, unless the data has a natural time ordering (then use time-series CV).
- Repeat CV with different random splits and average the results for a more stable estimate.
- Use nested CV when reporting final results in a paper or comparison.
- Never touch the test set until you have made all model choices. The test set is for a final, unbiased evaluation only.
- Stratify for classification.
Summary
| Concept | Key idea |
|---|---|
| K-fold CV | Rotate test set across K partitions; average errors |
| LOOCV | $K = n$; low bias, high variance, expensive |
| Standard error | Report alongside CV score |
| Hyperparameter tuning | Grid search or random search over CV scores |
| Nested CV | Outer loop estimates error; inner loop tunes hyperparameters |
| One-SE rule | Pick simplest model within one SE of the minimum CV error |
What comes next
Cross-validation tells you how well a model performs, but the quality of your features determines the ceiling. No amount of model tuning can fix bad features. In the next post on feature engineering and selection, you will learn how to create, transform, and select the features that matter most.