Model selection and cross-validation
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
You have trained three models on the same data. Which one should you deploy? You cannot just pick the one with the lowest training error, because that model might be overfitting. You need a way to estimate how each model will perform on data it has never seen. Cross-validation gives you that estimate.
Prerequisites: You should understand the bias-variance tradeoff.
The problem with a single train/test split
The simplest approach is to split your data into a training set and a test set (say, 80/20). Train on the 80%, evaluate on the 20%. But this has problems:
- Variance: Your estimate depends heavily on which points ended up in the test set. A different random split might give a very different score.
- Wasted data: You are not using 20% of your data for training. With small datasets, this hurts.
- No tuning data: If you use the test set to choose hyperparameters, you are leaking information and your final estimate is optimistic.
K-fold cross-validation
K-fold CV solves these problems by rotating which data serves as the test set.
- Split the data into $K$ roughly equal folds (partitions)
- For each fold $k = 1, \dots, K$:
  - Train on all data except fold $k$
  - Evaluate on fold $k$ to get error $E_k$
- Average the errors: $E_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} E_k$

Every point is used for testing exactly once and for training $K - 1$ times. Common choices are $K = 5$ or $K = 10$.
graph TB
subgraph "K=5 fold CV"
F1["Fold 1: TEST | Train | Train | Train | Train"]
F2["Fold 2: Train | TEST | Train | Train | Train"]
F3["Fold 3: Train | Train | TEST | Train | Train"]
F4["Fold 4: Train | Train | Train | TEST | Train"]
F5["Fold 5: Train | Train | Train | Train | TEST"]
end
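The procedure above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the `fit` and `predict` callables are assumptions introduced here so any model can be plugged in.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, K=5, seed=0):
    """Plain K-fold CV: returns the list of per-fold MSEs.
    `fit(X, y)` returns a model; `predict(model, X)` returns predictions.
    (Both callables are illustrative assumptions.)"""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)  # shuffle before splitting
    folds = np.array_split(idx, K)                    # K roughly equal folds
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - predict(model, X[test_idx])
        errors.append(np.mean(resid ** 2))            # fold MSE
    return errors

# Usage with least-squares linear regression on a tiny dataset:
X = np.arange(1, 7, dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda beta, X: np.c_[np.ones(len(X)), X] @ beta
errors = k_fold_cv(X, y, fit, predict, K=3)
print(np.mean(errors))  # small, since the data is nearly linear
```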
Leave-one-out CV
When $K = n$ (the number of data points), you get leave-one-out cross-validation (LOOCV). Each fold has exactly one test point. This uses the maximum amount of training data and has low bias, but:
- It is expensive: you train $n$ models
- The estimates are highly correlated (any two training sets share $n - 2$ of their $n - 1$ points), which can increase variance
For most problems, 5-fold or 10-fold CV gives a good balance between bias and variance of the estimate.
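As a quick illustration, scikit-learn treats LOOCV as just another CV splitter. This sketch reuses the six-point dataset from the example below; the negative-MSE sign convention is scikit-learn's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.arange(1, 7, dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])

# LOOCV = K-fold with K = n: one model is trained per data point
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(len(scores))     # n = 6 models trained
print(-scores.mean())  # LOOCV estimate of the MSE
```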
Example 1: 3-fold CV on a small regression dataset
Consider 6 data points with a simple linear regression model $\hat y = \beta_0 + \beta_1 x$:
| $i$ | $x_i$ | $y_i$ |
|---|---|---|
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |
| 6 | 6 | 12.0 |
We use $K = 3$ folds, each with 2 points.
Fold assignment: Fold 1 = $\{1, 2\}$, Fold 2 = $\{3, 4\}$, Fold 3 = $\{5, 6\}$ (point indices $i$)
Round 1: test on Fold 1, train on Folds 2+3
Training data: $(3, 6.2), (4, 7.9), (5, 10.1), (6, 12.0)$
Fit a line using the training points. The normal equations give us $\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{9.8}{5} = 1.96$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 9.05 - 1.96 \cdot 4.5 = 0.23$.
Model: $\hat y = 0.23 + 1.96x$
Test on Fold 1: $\hat y(1) = 2.19$ vs. $2.1$ and $\hat y(2) = 4.15$ vs. $3.8$, so $E_1 = \frac{(2.1 - 2.19)^2 + (3.8 - 4.15)^2}{2} \approx 0.0653$
Round 2: test on Fold 2, train on Folds 1+3
Training data: $(1, 2.1), (2, 3.8), (5, 10.1), (6, 12.0)$, giving the model $\hat y = -0.041 + 2.012x$
Test on Fold 2: $\hat y(3) \approx 5.994$ vs. $6.2$ and $\hat y(4) \approx 8.006$ vs. $7.9$, so $E_2 \approx 0.0268$
Round 3: test on Fold 3, train on Folds 1+2
Training data: $(1, 2.1), (2, 3.8), (3, 6.2), (4, 7.9)$, giving the model $\hat y = 0.05 + 1.98x$
Test on Fold 3: $\hat y(5) = 9.95$ vs. $10.1$ and $\hat y(6) = 11.93$ vs. $12.0$, so $E_3 \approx 0.0137$
Final CV estimate
The average MSE across folds is $E_{\text{CV}} = \frac{0.0653 + 0.0268 + 0.0137}{3} \approx 0.035$. This is our estimate of the model's generalization error.
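The arithmetic above can be checked in a few lines of NumPy, assuming the contiguous fold assignment (Fold 1 = points 1–2, Fold 2 = points 3–4, Fold 3 = points 5–6):

```python
import numpy as np

x = np.arange(1, 7, dtype=float)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 12.0])
folds = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # zero-based indices

errors = []
for k, test in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    A = np.c_[np.ones(len(train)), x[train]]          # design matrix [1, x]
    b0, b1 = np.linalg.lstsq(A, y[train], rcond=None)[0]
    errors.append(np.mean((y[test] - (b0 + b1 * x[test])) ** 2))

print([round(e, 4) for e in errors])  # → [0.0653, 0.0268, 0.0137]
print(round(np.mean(errors), 4))      # → 0.0353
```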
Example 2: variance of the CV estimate
The three fold errors were $0.0653, 0.0268, 0.0137$. How stable is our CV estimate?
Mean: $\bar E = \frac{0.0653 + 0.0268 + 0.0137}{3} \approx 0.0353$
Standard deviation of fold errors: $s = \sqrt{\frac{1}{K-1} \sum_k (E_k - \bar E)^2} \approx 0.0268$
Standard error of the mean: $\text{SE} = \frac{s}{\sqrt{K}} \approx 0.0155$
So our CV estimate is $0.035 \pm 0.016$ (one standard error). The variance is substantial because we only have 3 folds. With $K = 10$ folds, the standard error would typically be smaller.
This is why reporting the standard error alongside the CV score matters. A model with CV error $0.035 \pm 0.015$ and a model with $0.045 \pm 0.015$ might not be meaningfully different.
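The same standard-error computation, as a quick NumPy check on the fold errors from Example 1:

```python
import numpy as np

fold_errors = np.array([0.0653, 0.0268, 0.0137])  # per-fold MSEs from Example 1

mean = fold_errors.mean()
std = fold_errors.std(ddof=1)          # sample standard deviation (K - 1 in the denominator)
se = std / np.sqrt(len(fold_errors))   # standard error of the mean

print(round(mean, 4), round(std, 4), round(se, 4))  # → 0.0353 0.0268 0.0155
```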
Hyperparameter tuning
Most models have hyperparameters that you set before training: the regularization strength $\lambda$, the number of neighbors $k$ in KNN, or the kernel width in an SVM. Cross-validation tells you which hyperparameter value gives the best generalization.
Grid search: Define a grid of hyperparameter values and run CV for each one. Pick the value with the lowest CV error.
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Example: tuning regularization strength lambda with K-fold CV
# (assumes X with shape (n, d) and y with shape (n,) are already defined)
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = []
for lam in lambdas:
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam)
        model.fit(X[train_idx], y[train_idx])
        error = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
        fold_errors.append(error)
    cv_scores.append(np.mean(fold_errors))

best_lambda = lambdas[np.argmin(cv_scores)]
```
Random search: Instead of exhaustively searching a grid, sample hyperparameters randomly from a distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search when only a few hyperparameters actually matter, because it explores more distinct values of the important ones.
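A random search can reuse the same CV machinery; the only change is how candidates are generated. This sketch samples $\lambda$ log-uniformly on synthetic data (the dataset here is a hypothetical stand-in):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical regression data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Sample lambda log-uniformly in [1e-3, 10] instead of using a fixed grid
candidates = 10 ** rng.uniform(-3, 1, size=20)
scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
          for lam in candidates]
best_lambda = candidates[int(np.argmax(scores))]  # highest neg-MSE = lowest MSE
print(best_lambda)
```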
The one-standard-error rule: Instead of picking the hyperparameter with the absolute lowest CV error, pick the simplest model whose CV error is within one standard error of the minimum. This favors simpler models and reduces overfitting.
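The one-standard-error rule takes a couple of lines once the per-value CV means and standard errors are in hand. The numbers below are hypothetical CV results, ordered from most complex (small $\lambda$) to simplest (large $\lambda$):

```python
import numpy as np

# Hypothetical CV results for five lambda values (larger lambda = simpler model)
lambdas  = np.array([0.001, 0.01, 0.1, 1.0, 10.0])
cv_means = np.array([0.210, 0.180, 0.175, 0.190, 0.300])
cv_ses   = np.array([0.020, 0.015, 0.015, 0.018, 0.030])

best = np.argmin(cv_means)
threshold = cv_means[best] + cv_ses[best]      # minimum CV error + one SE
eligible = np.where(cv_means <= threshold)[0]  # all models within one SE
one_se_lambda = lambdas[eligible.max()]        # largest lambda = simplest eligible model
print(one_se_lambda)  # → 1.0, even though 0.1 has the lowest CV error
```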
(Figure: validation accuracy vs. the number of neighbors $k$ in KNN. Accuracy peaks near $k = 6$ and drops for both very small $k$ (overfitting) and very large $k$ (underfitting). Error bars show the standard deviation across folds.)
Nested cross-validation
There is a subtle trap: if you use CV to choose hyperparameters, your CV score is now optimistically biased. You selected the best hyperparameter based on the CV scores, so the reported score is not a fair estimate of generalization performance.
Nested CV fixes this with two loops:
- Outer loop (e.g., 5-fold): Estimates generalization error
- Inner loop (e.g., 5-fold on the outer training set): Selects hyperparameters
```mermaid
graph TD
    A[Full dataset] --> B[Outer fold 1: 80% train, 20% test]
    A --> C[Outer fold 2: 80% train, 20% test]
    A --> D[...]
    B --> E[Inner CV on 80%: tune hyperparameters]
    E --> F[Train best model on 80%]
    F --> G[Evaluate on 20% test]
```
For each outer fold:
- Split training data again for inner CV
- Run grid search on the inner folds to pick the best hyperparameter
- Retrain on the full outer training set with that hyperparameter
- Evaluate on the outer test fold
The average of the outer fold scores is an approximately unbiased estimate of generalization error for the entire model selection process.
Nested CV is expensive (e.g., 5 outer × 5 inner = 25 model fits per hyperparameter setting), but it is the correct way to report performance when hyperparameter tuning is part of your pipeline.
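In scikit-learn, nesting falls out naturally by passing a `GridSearchCV` object to `cross_val_score`: the inner search is rerun from scratch inside every outer training fold. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)  # inner loop: tunes lambda
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # outer loop: estimates error

tuner = GridSearchCV(Ridge(), {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
                     cv=inner, scoring="neg_mean_squared_error")

# Each outer fold refits the whole inner grid search on its training portion
scores = cross_val_score(tuner, X, y, cv=outer, scoring="neg_mean_squared_error")
print(-scores.mean())  # nested-CV estimate of generalization MSE
```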
Stratified CV
For classification problems, random splits can create folds where one class is underrepresented. Stratified K-fold ensures each fold has roughly the same proportion of each class as the full dataset.
This is especially important when classes are imbalanced. If 5% of your data is positive and a fold happens to have 0% positives, the error estimate for that fold will be meaningless.
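A small sketch of this failure mode and its fix, using a hypothetical dataset with 5% positives. `StratifiedKFold` spreads the rare class evenly across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 100 points, only 5 positives (5%)
y = np.zeros(100, dtype=int)
y[:5] = 1
X = np.arange(100, dtype=float).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]
print(positives_per_fold)  # each fold keeps the 5% positive rate: [1, 1, 1, 1, 1]
```

An unstratified split of the same data could easily leave some folds with zero positives.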
Practical advice
- Use 5 or 10 folds for most problems. 5-fold is faster; 10-fold has slightly less bias.
- Shuffle your data before splitting into folds, unless the data has a natural time ordering (then use time-series CV).
- Repeat CV with different random splits and average the results for a more stable estimate.
- Use nested CV when reporting final results in a paper or comparison.
- Never touch the test set until you have made all model choices. The test set is for a final, unbiased evaluation only.
- Stratify for classification.
Summary
| Concept | Key idea |
|---|---|
| K-fold CV | Rotate test set across K partitions; average errors |
| LOOCV | $K = n$; low bias, high variance, expensive |
| Standard error | Report alongside CV score |
| Hyperparameter tuning | Grid search or random search over CV scores |
| Nested CV | Outer loop estimates error; inner loop tunes hyperparameters |
| One-SE rule | Pick simplest model within one SE of the minimum CV error |
What comes next
Cross-validation tells you how well a model performs, but the quality of your features determines the ceiling. No amount of model tuning can fix bad features. In the next post on feature engineering and selection, you will learn how to create, transform, and select the features that matter most.