
Feature engineering and selection

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

The best algorithm with bad features will lose to a simple algorithm with good features. Feature engineering is the process of transforming raw data into inputs that make machine learning models work better. Feature selection is figuring out which of those inputs actually matter.

Prerequisites: You should be comfortable with the data pipeline and regularization.

Why features matter

A linear model can only learn linear relationships. If the true relationship is $y = x^2$, no amount of training will make $y = wx + b$ fit well. But if you create a new feature $x^2$ and feed the model $[x, x^2]$, it can learn $y = w_1 x + w_2 x^2 + b$ perfectly.

Feature engineering is how you encode your domain knowledge into the model. It bridges the gap between what the data looks like and what the model can learn.

Polynomial features

The simplest feature expansion takes a feature $x$ and creates powers of it: $x, x^2, x^3, \ldots, x^d$. For multiple features, you also include all cross-terms.

For two features $x_1, x_2$ with degree 2, the expanded feature set is:

$$[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$

The number of features grows fast. With $d$ original features and polynomial degree $p$, the number of expanded features (including the bias term) is $\binom{d + p}{p}$. For $d = 10$ and $p = 3$, that is 286 features.
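That count is easy to sanity-check with the standard library's `math.comb`:

```python
from math import comb

def n_poly_features(d: int, p: int) -> int:
    """Number of monomials of degree <= p in d variables, including the bias term."""
    return comb(d + p, p)

print(n_poly_features(2, 2))   # 6: [1, x1, x2, x1^2, x1*x2, x2^2]
print(n_poly_features(10, 3))  # 286
```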

Example 1: polynomial features by hand

Suppose you are predicting house price from two features: size (in hundreds of sqft) and age (in decades).

| House | Size ($x_1$) | Age ($x_2$) | Price ($y$, thousands) |
|-------|--------------|-------------|------------------------|
| A     | 1            | 3           | 150                    |
| B     | 2            | 1           | 280                    |
| C     | 3            | 2           | 400                    |
| D     | 4            | 1           | 500                    |

Step 1: Create degree-2 polynomial features.

We add $x_1^2$, $x_1 x_2$ (an interaction term), and $x_2^2$:

| House | $x_1$ | $x_2$ | $x_1^2$ | $x_1 x_2$ | $x_2^2$ |
|-------|-------|-------|---------|-----------|---------|
| A     | 1     | 3     | 1       | 3         | 9       |
| B     | 2     | 1     | 4       | 2         | 1       |
| C     | 3     | 2     | 9       | 6         | 4       |
| D     | 4     | 1     | 16      | 4         | 1       |

Step 2: Fit a linear model with original features only.

Using the normal equations, fit $\hat{y} = w_1 x_1 + w_2 x_2 + b$.

Solving $w = (X^T X)^{-1} X^T y$ gives:

$$w_1 \approx 115.3, \quad w_2 \approx -3.3, \quad b \approx 50.0$$

Model: $\hat{y} = 115.3 x_1 - 3.3 x_2 + 50.0$

| House | $y$ | $\hat{y}$ | Error |
|-------|-----|-----------|-------|
| A | 150 | $115.3(1) - 3.3(3) + 50.0 = 155.3$ | $-5.3$ |
| B | 280 | $115.3(2) - 3.3(1) + 50.0 = 277.3$ | $+2.7$ |
| C | 400 | $115.3(3) - 3.3(2) + 50.0 = 389.3$ | $+10.7$ |
| D | 500 | $115.3(4) - 3.3(1) + 50.0 = 508.0$ | $-8.0$ |

$$\text{MSE}_{\text{linear}} = \frac{5.33^2 + 2.67^2 + 10.67^2 + 8.00^2}{4} = \frac{28.4 + 7.1 + 113.8 + 64.0}{4} \approx 53.3$$
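The Step 2 numbers are easy to reproduce with a least-squares solver (a sketch; the column order `[1, x1, x2]` is my choice):

```python
import numpy as np

# House data: size x1, age x2, price y (in thousands)
x1 = np.array([1, 2, 3, 4])
x2 = np.array([3, 1, 2, 1])
y = np.array([150, 280, 400, 500])

X = np.column_stack([np.ones(4), x1, x2])  # bias column first
w = np.linalg.lstsq(X, y, rcond=None)[0]   # [b, w1, w2]
mse = np.mean((y - X @ w) ** 2)

print(w)    # approx [50.0, 115.33, -3.33]
print(mse)  # approx 53.3
```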

Step 3: Fit with polynomial features.

Now use all 5 features: $x_1, x_2, x_1^2, x_1 x_2, x_2^2$. With 4 data points and 6 parameters (including the bias), we have an underdetermined system, so the model can fit the data exactly.

$$\text{MSE}_{\text{poly}} = 0$$

This is a perfect fit, but with 6 parameters for 4 data points, we are almost certainly overfitting. This is where regularization becomes critical. A ridge regression penalty on the polynomial model would give a non-zero but much smaller MSE than the linear model.
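Both points can be checked numerically. Below, the unpenalized polynomial fit interpolates the four points, while a ridge penalty ($\lambda = 1.0$ here, an arbitrary choice that also penalizes the bias for simplicity) trades a small training error for smaller weights:

```python
import numpy as np

x1 = np.array([1, 2, 3, 4])
x2 = np.array([3, 1, 2, 1])
y = np.array([150, 280, 400, 500])

# Degree-2 design matrix: [1, x1, x2, x1^2, x1*x2, x2^2]
X = np.column_stack([np.ones(4), x1, x2, x1**2, x1 * x2, x2**2])

# Unpenalized least squares: 6 parameters, 4 points -> exact interpolation
w = np.linalg.lstsq(X, y, rcond=None)[0]
mse_poly = np.mean((y - X @ w) ** 2)
print(mse_poly)  # ~0

# Ridge closed form: w = (X^T X + lam*I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
mse_ridge = np.mean((y - X @ w_ridge) ** 2)
print(mse_ridge)  # non-zero: the penalty forbids an exact fit
```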

Takeaway: Polynomial features let a linear model capture nonlinear relationships, but they increase the risk of overfitting. Always pair them with regularization or cross-validation.

Interaction terms

An interaction term $x_1 x_2$ captures the idea that the effect of $x_1$ on $y$ depends on the value of $x_2$. In the house price example, the value of extra square footage might depend on the age of the house: extra square footage might be worth more in newer houses.

You do not always need full polynomial expansion. Sometimes adding just the interaction terms (without squared terms) is enough.

Encoding categorical variables

Machine learning models need numbers, not categories. If you have a feature “color” with values {red, blue, green}, you need to encode it.

One-hot encoding

Create a binary column for each category:

| Color | is_red | is_blue | is_green |
|-------|--------|---------|----------|
| red   | 1      | 0       | 0        |
| blue  | 0      | 1       | 0        |
| green | 0      | 0       | 1        |

Drop one column to avoid perfect multicollinearity (the “dummy variable trap”). If $\text{is\_red} = 0$ and $\text{is\_blue} = 0$, we know the color is green.
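A minimal one-hot encoder with the first category dropped (the category list and its order here are my choice):

```python
import numpy as np

def one_hot_drop_first(values, categories):
    """Binary indicator columns for categories[1:]; all zeros means categories[0]."""
    return np.array([[1 if v == c else 0 for c in categories[1:]] for v in values])

colors = ["red", "blue", "green", "red"]
encoded = one_hot_drop_first(colors, ["red", "blue", "green"])
print(encoded.tolist())  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```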

Label encoding

Assign an integer to each category: red = 0, blue = 1, green = 2. This is simple but dangerous for most models because it implies an ordering (blue is “between” red and green) and a distance (green is “twice as far” from red as blue). Reserve it for tree-based models, which split on thresholds and do not interpret the codes as distances.

Target encoding

Replace each category with the mean of the target variable for that category. For example, if the average price of red houses is $200k, encode red as 200. This is powerful but risks data leakage. Always compute target encodings on the training fold only, never on the test data.
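A sketch of leakage-safe target encoding; falling back to the global training mean for unseen categories is a common convention, not the only option:

```python
def fit_target_encoding(categories, targets):
    """Map each category to its mean target, computed on training data only."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoding = {c: sums[c] / counts[c] for c in sums}
    fallback = sum(targets) / len(targets)  # global mean, for unseen categories
    return encoding, fallback

train_colors = ["red", "red", "blue"]
train_prices = [200.0, 300.0, 100.0]  # thousands
enc, fallback = fit_target_encoding(train_colors, train_prices)

test_colors = ["red", "green"]  # "green" never appeared in training
print([enc.get(c, fallback) for c in test_colors])  # [250.0, 200.0]
```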

Feature scaling

Features on different scales cause problems. If feature $x_1$ ranges from 0 to 1 and $x_2$ ranges from 0 to 1000, models that use distance or gradient descent will be dominated by $x_2$.

Standardization (Z-score):

$$x' = \frac{x - \mu}{\sigma}$$

Centers at 0, scales to unit variance. Good for models that assume normality or use distance.

Min-max scaling:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Scales to [0, 1]. Sensitive to outliers.

Always fit the scaler on the training data and apply it to both training and test data. Never compute statistics from the test set.
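As a sketch, standardization with the statistics fitted on the training split only:

```python
import numpy as np

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mu = train.mean(axis=0)   # fitted on training data only
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuses the training statistics

print(train_scaled.mean(axis=0))  # ~[0, 0]
print(train_scaled.std(axis=0))   # [1, 1]
```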

Example 2: impact of feature engineering on model performance

Let’s see how feature engineering affects a regression model. We have 5 data points:

| $x$ | $y$ |
|-----|-----|
| 1   | 1   |
| 2   | 5   |
| 3   | 7   |
| 4   | 11  |
| 5   | 20  |

The relationship looks nonlinear (possibly quadratic or exponential).

Model A: linear regression on raw feature.

$$\bar{x} = 3, \quad \bar{y} = 8.8$$

$$w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{(-2)(-7.8) + (-1)(-3.8) + (0)(-1.8) + (1)(2.2) + (2)(11.2)}{4 + 1 + 0 + 1 + 4}$$

$$= \frac{15.6 + 3.8 + 0 + 2.2 + 22.4}{10} = \frac{44.0}{10} = 4.4$$

$$b = \bar{y} - w\bar{x} = 8.8 - 4.4 \times 3 = -4.4$$

Model A: $\hat{y} = 4.4x - 4.4$

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|-----|-----|-----------|-------------------|
| 1   | 1   | 0.0       | 1.00              |
| 2   | 5   | 4.4       | 0.36              |
| 3   | 7   | 8.8       | 3.24              |
| 4   | 11  | 13.2      | 4.84              |
| 5   | 20  | 17.6      | 5.76              |

$$\text{MSE}_A = \frac{1.00 + 0.36 + 3.24 + 4.84 + 5.76}{5} = \frac{15.20}{5} = 3.04$$

Model B: add a quadratic feature x2x^2.

Now fit $\hat{y} = w_1 x + w_2 x^2 + b$ using features $[x, x^2]$:

| $x$ | $x^2$ | $y$ |
|-----|-------|-----|
| 1   | 1     | 1   |
| 2   | 4     | 5   |
| 3   | 9     | 7   |
| 4   | 16    | 11  |
| 5   | 25    | 20  |

Solving the normal equations (three unknowns, five equations) gives:

$$b \approx 1.60, \quad w_1 \approx -0.74, \quad w_2 \approx 0.86$$

Model B: $\hat{y} = 0.86x^2 - 0.74x + 1.60$

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|-----|-----|-----------|-------------------|
| 1   | 1   | $0.86 - 0.74 + 1.60 = 1.71$   | 0.50 |
| 2   | 5   | $3.43 - 1.49 + 1.60 = 3.54$   | 2.13 |
| 3   | 7   | $7.71 - 2.23 + 1.60 = 7.09$   | 0.01 |
| 4   | 11  | $13.71 - 2.97 + 1.60 = 12.34$ | 1.80 |
| 5   | 20  | $21.43 - 3.71 + 1.60 = 19.31$ | 0.48 |

$$\text{MSE}_B = \frac{0.50 + 2.13 + 0.01 + 1.80 + 0.48}{5} = \frac{4.92}{5} \approx 0.98$$

The quadratic feature reduced MSE from 3.04 to 0.98, a 68% improvement. The model captures the upward curvature that the linear model misses.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 5, 7, 11, 20])

# Model A: linear
X_lin = np.column_stack([np.ones(5), x])
w_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]
mse_lin = np.mean((y - X_lin @ w_lin)**2)

# Model B: quadratic
X_quad = np.column_stack([np.ones(5), x, x**2])
w_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]
mse_quad = np.mean((y - X_quad @ w_quad)**2)

print(f"Linear MSE:    {mse_lin:.2f}")    # 3.04
print(f"Quadratic MSE: {mse_quad:.2f}")   # 0.98
```

Figure: feature importance before and after engineering. Raw Size dominates at first; after adding polynomial terms, log transforms, and proper encoding, the importance spreads more evenly and Location becomes a strong signal.

Feature selection methods

More features is not always better. Irrelevant features add noise, increase computation, and can cause overfitting. Feature selection picks the subset that matters.

Filter methods

Evaluate each feature independently, without training a model.

  • Correlation: Rank features by their absolute correlation with the target. Drop features with low correlation.
  • Mutual information: Measures how much knowing the feature reduces uncertainty about the target. Works for nonlinear relationships too.
  • Variance threshold: Drop features with near-zero variance (they carry no information).

Filter methods are fast but ignore interactions between features. A feature that is useless alone might be powerful when combined with another.
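As a sketch, a correlation filter on made-up data where only the first feature carries signal:

```python
import numpy as np

# Toy data: y is a linear function of feature 0; feature 1 is unrelated
X = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [3.0, 0.0],
              [4.0, 1.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Score each feature by |Pearson correlation| with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
print(scores)                # feature 0: 1.0, feature 1: 0.0
print(int(scores.argmax()))  # 0 -> keep feature 0, drop feature 1
```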

Wrapper methods

Train a model with different feature subsets and pick the best one.

  • Forward selection: Start with no features. Add the one that most improves the model. Repeat until adding features stops helping.
  • Backward elimination: Start with all features. Remove the one whose removal least hurts the model. Repeat.
  • Recursive feature elimination (RFE): Train a model, remove the feature with the smallest weight, and repeat.

Wrapper methods are more accurate than filter methods but expensive, especially with many features.
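Forward selection can be sketched in a few lines, using least-squares training MSE as the score (toy data; a real implementation would score candidate subsets with cross-validation instead):

```python
import numpy as np

def fit_mse(X, y):
    """Training MSE of a least-squares fit with a bias column."""
    Xb = np.column_stack([np.ones(len(y)), X])
    w = np.linalg.lstsq(Xb, y, rcond=None)[0]
    return np.mean((y - Xb @ w) ** 2)

def forward_select(X, y, k):
    """Greedily add the feature whose inclusion lowers MSE the most."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: fit_mse(X[:, selected + [j]], y))
        selected.append(best)
    return selected

x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x1 = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
x2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # irrelevant
X = np.column_stack([x0, x1, x2])
y = 3 * x0 + x1                            # depends on features 0 and 1 only

print(forward_select(X, y, 2))  # [0, 1]
```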

Embedded methods

Feature selection happens as part of model training.

  • Lasso (L1 regularization): Drives unimportant feature weights to exactly zero. The features with non-zero weights are selected automatically.
  • Tree-based importance: Decision trees and random forests compute feature importance based on how much each feature reduces impurity. No extra step needed.
  • Elastic net: Combines L1 and L2 regularization. Gets the sparsity of Lasso with the stability of Ridge.

Embedded methods are usually the best default choice. Lasso in particular is very popular because it does model fitting and feature selection simultaneously.
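To see L1 sparsity concretely, here is a minimal coordinate-descent Lasso on centered, orthogonal toy features (a sketch of the algorithm, not a production solver). The irrelevant feature's weight lands at exactly zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Coordinate descent for min_w 0.5*||y - Xw||^2 + lam*||w||_1."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            r = y - X @ w + X[:, j] * w[j]   # residual with feature j excluded
            rho = X[:, j] @ r
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

# Centered toy data: y depends on feature 0 only; the features are orthogonal
x0 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x1 = np.array([1.0, -1.0, 0.0, -1.0, 1.0])
X = np.column_stack([x0, x1])
y = 2 * x0

print(lasso_cd(X, y, lam=1.0))  # [1.9, 0.0]: feature 1 eliminated exactly
```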

Comparison

| Method   | Speed    | Handles interactions | Example                  |
|----------|----------|----------------------|--------------------------|
| Filter   | Fast     | No                   | Correlation, mutual info |
| Wrapper  | Slow     | Yes                  | Forward selection, RFE   |
| Embedded | Moderate | Yes                  | Lasso, tree importance   |

Dimensionality reduction as feature engineering

PCA creates new features that are linear combinations of the originals, keeping the ones with the most variance. This is an alternative to feature selection: instead of choosing a subset, you create a new, smaller set that captures most of the information.

PCA is especially useful when you have many correlated features. It transforms them into uncorrelated principal components, which often improves model stability.
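A PCA sketch via the SVD: two perfectly correlated toy features collapse onto a single principal component that holds essentially all the variance:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([t, 2 * t])  # two perfectly correlated features

Xc = X - X.mean(axis=0)          # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance ratio per principal component

print(explained)  # ~[1.0, 0.0]
```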

Practical tips

  • Start simple. Try raw features first. Add complexity only if performance is insufficient.
  • Use domain knowledge. A doctor knows that BMI (weight / height$^2$) matters more than weight and height separately. Encode that.
  • Log-transform skewed features (income, population, prices). This often helps linear models.
  • Create ratios when they make physical sense: price per square foot, clicks per impression.
  • Be careful with high-cardinality categoricals. A feature with 10,000 categories (like zip code) will create 10,000 binary columns with one-hot encoding. Use target encoding or embeddings instead.
  • Feature engineering is iterative. Create features, evaluate with cross-validation, refine.
  • Avoid data leakage. All feature engineering (scaling, encoding, polynomial expansion) must be fit on the training set only.

Summary

| Concept | Key idea |
|---------|----------|
| Polynomial features | Add $x^2, x^3, x_1 x_2$ to capture nonlinear patterns |
| Interaction terms | Capture “effect of A depends on B” relationships |
| One-hot encoding | Binary columns for categorical variables |
| Feature scaling | Standardize or normalize before distance-based models |
| Filter selection | Score features independently (fast, ignores interactions) |
| Wrapper selection | Evaluate feature subsets by training models (accurate, slow) |
| Embedded selection | Lasso, tree importance (best default choice) |

What comes next

This post wraps up the core machine learning series. You now have the tools to understand data, build models, evaluate them fairly, and engineer the features that feed them. The next frontier is deep learning, where models learn their own features automatically. Start with the introduction to neural networks to see how layers of simple functions can learn complex patterns without manual feature engineering.
