
Feature engineering and selection

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

The best algorithm with bad features will lose to a simple algorithm with good features. Feature engineering is the process of transforming raw data into inputs that make machine learning models work better. Feature selection is figuring out which of those inputs actually matter.

Prerequisites: You should be comfortable with the data pipeline and regularization.

Why features matter

A linear model can only learn linear relationships. If the true relationship is $y = x^2$, no amount of training will make $y = wx + b$ fit well. But if you create a new feature $x^2$ and feed the model $[x, x^2]$, it can learn $y = w_1 x + w_2 x^2 + b$ perfectly.

Feature engineering is how you encode your domain knowledge into the model. It bridges the gap between what the data looks like and what the model can learn.

Polynomial features

The simplest feature expansion takes a feature $x$ and creates powers of it: $x, x^2, x^3, \ldots, x^d$. For multiple features, you also include all cross-terms.

For two features $x_1, x_2$ with degree 2, the expanded feature set is:

$$[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$

The number of features grows fast. With $d$ original features and polynomial degree $p$, the number of expanded features (including the bias term) is $\binom{d + p}{p}$. For $d = 10$ and $p = 3$, that is 286 features.
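That count is easy to sanity-check with the standard library's `math.comb`:

```python
from math import comb

def n_poly_features(d: int, p: int) -> int:
    """Number of monomials of degree <= p in d variables, including the bias term."""
    return comb(d + p, p)

print(n_poly_features(2, 2))   # 6: [1, x1, x2, x1^2, x1*x2, x2^2]
print(n_poly_features(10, 3))  # 286
```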

Example 1: polynomial features by hand

Suppose you are predicting house price from two features: size (in hundreds of sqft) and age (in decades).

| House | Size ($x_1$) | Age ($x_2$) | Price ($y$, thousands) |
|-------|--------------|-------------|------------------------|
| A     | 1            | 3           | 150                    |
| B     | 2            | 1           | 280                    |
| C     | 3            | 2           | 400                    |
| D     | 4            | 1           | 500                    |

Step 1: Create degree-2 polynomial features.

We add $x_1^2$, $x_1 x_2$ (an interaction term), and $x_2^2$:

| House | $x_1$ | $x_2$ | $x_1^2$ | $x_1 x_2$ | $x_2^2$ |
|-------|-------|-------|---------|-----------|---------|
| A     | 1     | 3     | 1       | 3         | 9       |
| B     | 2     | 1     | 4       | 2         | 1       |
| C     | 3     | 2     | 9       | 6         | 4       |
| D     | 4     | 1     | 16      | 4         | 1       |

Step 2: Fit a linear model with original features only.

Using the normal equations, fit $\hat{y} = w_1 x_1 + w_2 x_2 + b$.

Solving $w = (X^T X)^{-1} X^T y$ gives:

$$w_1 \approx 115.3, \quad w_2 \approx -3.3, \quad b \approx 50.0$$

Model: $\hat{y} = 115.3 x_1 - 3.3 x_2 + 50.0$

| House | $y$ | $\hat{y}$ | Error |
|-------|-----|-----------|-------|
| A | 150 | $115.3(1) - 3.3(3) + 50.0 = 155.3$ | $-5.3$ |
| B | 280 | $115.3(2) - 3.3(1) + 50.0 = 277.3$ | $+2.7$ |
| C | 400 | $115.3(3) - 3.3(2) + 50.0 = 389.3$ | $+10.7$ |
| D | 500 | $115.3(4) - 3.3(1) + 50.0 = 508.0$ | $-8.0$ |

$$\text{MSE}_{\text{linear}} = \frac{5.33^2 + 2.67^2 + 10.67^2 + 8.00^2}{4} = \frac{28.4 + 7.1 + 113.8 + 64.0}{4} \approx 53.3$$
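The Step 2 numbers are easy to reproduce with a least-squares solver (a sketch; the column order `[1, x1, x2]` is my choice):

```python
import numpy as np

# House data: size x1, age x2, price y (in thousands)
x1 = np.array([1, 2, 3, 4])
x2 = np.array([3, 1, 2, 1])
y = np.array([150, 280, 400, 500])

X = np.column_stack([np.ones(4), x1, x2])  # bias column first
w = np.linalg.lstsq(X, y, rcond=None)[0]   # [b, w1, w2]
mse = np.mean((y - X @ w) ** 2)

print(w)    # approx [50.0, 115.33, -3.33]
print(mse)  # approx 53.3
```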

Step 3: Fit with polynomial features.

Now use all 5 features: $x_1, x_2, x_1^2, x_1 x_2, x_2^2$. With 4 data points and 6 parameters (including the bias), we have an underdetermined system, so the model can fit the data exactly.

$$\text{MSE}_{\text{poly}} = 0$$

This is a perfect fit, but with 6 parameters for 4 data points, we are almost certainly overfitting. This is where regularization becomes critical. A ridge regression penalty on the polynomial model would give a non-zero but much smaller MSE than the linear model.
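Both points can be checked numerically. Below, the unpenalized polynomial fit interpolates the four points, while a ridge penalty ($\lambda = 1.0$ here, an arbitrary choice that also penalizes the bias for simplicity) trades a small training error for smaller weights:

```python
import numpy as np

x1 = np.array([1, 2, 3, 4])
x2 = np.array([3, 1, 2, 1])
y = np.array([150, 280, 400, 500])

# Degree-2 design matrix: [1, x1, x2, x1^2, x1*x2, x2^2]
X = np.column_stack([np.ones(4), x1, x2, x1**2, x1 * x2, x2**2])

# Unpenalized least squares: 6 parameters, 4 points -> exact interpolation
w = np.linalg.lstsq(X, y, rcond=None)[0]
mse_poly = np.mean((y - X @ w) ** 2)
print(mse_poly)  # ~0

# Ridge closed form: w = (X^T X + lam*I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
mse_ridge = np.mean((y - X @ w_ridge) ** 2)
print(mse_ridge)  # non-zero: the penalty forbids an exact fit
```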

Takeaway: Polynomial features let a linear model capture nonlinear relationships, but they increase the risk of overfitting. Always pair them with regularization or cross-validation.

Interaction terms

An interaction term $x_1 x_2$ captures the idea that the effect of $x_1$ on $y$ depends on the value of $x_2$. In the house price example, the value of extra square footage might depend on the age of the house: extra square footage might be worth more in newer houses.

You do not always need full polynomial expansion. Sometimes adding just the interaction terms (without squared terms) is enough.

Encoding categorical variables

Machine learning models need numbers, not categories. If you have a feature “color” with values {red, blue, green}, you need to encode it.

One-hot encoding

Create a binary column for each category:

| Color | is_red | is_blue | is_green |
|-------|--------|---------|----------|
| red   | 1      | 0       | 0        |
| blue  | 0      | 1       | 0        |
| green | 0      | 0       | 1        |

Drop one column to avoid perfect multicollinearity (the “dummy variable trap”). If $\text{is\_red} = 0$ and $\text{is\_blue} = 0$, we know the color is green.
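A minimal one-hot encoder with the first category dropped (the category list and its order here are my choice):

```python
import numpy as np

def one_hot_drop_first(values, categories):
    """Binary indicator columns for categories[1:]; all zeros means categories[0]."""
    return np.array([[1 if v == c else 0 for c in categories[1:]] for v in values])

colors = ["red", "blue", "green", "red"]
encoded = one_hot_drop_first(colors, ["red", "blue", "green"])
print(encoded.tolist())  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```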

Label encoding

Assign an integer to each category: red = 0, blue = 1, green = 2. This is simple but dangerous for most models because it implies an ordering (blue is “between” red and green) and a distance (green is “twice as far” from red as blue). Reserve it for tree-based models, which split on thresholds and do not interpret the codes as distances.

Target encoding

Replace each category with the mean of the target variable for that category. For example, if the average price of red houses is $200k, encode red as 200. This is powerful but risks data leakage. Always compute target encodings on the training fold only, never on the test data.
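A sketch of leakage-safe target encoding; falling back to the global training mean for unseen categories is a common convention, not the only option:

```python
def fit_target_encoding(categories, targets):
    """Map each category to its mean target, computed on training data only."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoding = {c: sums[c] / counts[c] for c in sums}
    fallback = sum(targets) / len(targets)  # global mean, for unseen categories
    return encoding, fallback

train_colors = ["red", "red", "blue"]
train_prices = [200.0, 300.0, 100.0]  # thousands
enc, fallback = fit_target_encoding(train_colors, train_prices)

test_colors = ["red", "green"]  # "green" never appeared in training
print([enc.get(c, fallback) for c in test_colors])  # [250.0, 200.0]
```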

Feature scaling

Features on different scales cause problems. If feature $x_1$ ranges from 0 to 1 and $x_2$ ranges from 0 to 1000, models that use distance or gradient descent will be dominated by $x_2$.

Standardization (Z-score):

$$x' = \frac{x - \mu}{\sigma}$$

Centers at 0, scales to unit variance. Good for models that assume normality or use distance.

Min-max scaling:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Scales to [0, 1]. Sensitive to outliers.

Always fit the scaler on the training data and apply it to both training and test data. Never compute statistics from the test set.
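As a sketch, standardization with the statistics fitted on the training split only:

```python
import numpy as np

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mu = train.mean(axis=0)   # fitted on training data only
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuses the training statistics

print(train_scaled.mean(axis=0))  # ~[0, 0]
print(train_scaled.std(axis=0))   # [1, 1]
```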

Example 2: impact of feature engineering on model performance

Let’s see how feature engineering affects a regression model. We have 5 data points:

| $x$ | $y$ |
|-----|-----|
| 1   | 1   |
| 2   | 5   |
| 3   | 7   |
| 4   | 11  |
| 5   | 20  |

The relationship looks nonlinear (possibly quadratic or exponential).

Model A: linear regression on raw feature.

$$\bar{x} = 3, \quad \bar{y} = 8.8$$

$$w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{(-2)(-7.8) + (-1)(-3.8) + (0)(-1.8) + (1)(2.2) + (2)(11.2)}{4 + 1 + 0 + 1 + 4}$$

$$= \frac{15.6 + 3.8 + 0 + 2.2 + 22.4}{10} = \frac{44.0}{10} = 4.4$$

$$b = \bar{y} - w\bar{x} = 8.8 - 4.4 \times 3 = -4.4$$

Model A: $\hat{y} = 4.4x - 4.4$

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|-----|-----|-----------|-------------------|
| 1   | 1   | 0.0       | 1.00              |
| 2   | 5   | 4.4       | 0.36              |
| 3   | 7   | 8.8       | 3.24              |
| 4   | 11  | 13.2      | 4.84              |
| 5   | 20  | 17.6      | 5.76              |

$$\text{MSE}_A = \frac{1.00 + 0.36 + 3.24 + 4.84 + 5.76}{5} = \frac{15.20}{5} = 3.04$$

Model B: add a quadratic feature x2x^2.

Now fit $\hat{y} = w_1 x + w_2 x^2 + b$ using features $[x, x^2]$:

| $x$ | $x^2$ | $y$ |
|-----|-------|-----|
| 1   | 1     | 1   |
| 2   | 4     | 5   |
| 3   | 9     | 7   |
| 4   | 16    | 11  |
| 5   | 25    | 20  |

Solving the normal equations (three unknowns, five equations) gives:

$$b \approx 1.60, \quad w_1 \approx -0.74, \quad w_2 \approx 0.86$$

Model B: $\hat{y} = 0.86x^2 - 0.74x + 1.60$

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|-----|-----|-----------|-------------------|
| 1   | 1   | $0.86 - 0.74 + 1.60 = 1.71$   | 0.50 |
| 2   | 5   | $3.43 - 1.49 + 1.60 = 3.54$   | 2.13 |
| 3   | 7   | $7.71 - 2.23 + 1.60 = 7.09$   | 0.01 |
| 4   | 11  | $13.71 - 2.97 + 1.60 = 12.34$ | 1.80 |
| 5   | 20  | $21.43 - 3.71 + 1.60 = 19.31$ | 0.48 |

$$\text{MSE}_B = \frac{0.50 + 2.13 + 0.01 + 1.80 + 0.48}{5} = \frac{4.92}{5} \approx 0.98$$

The quadratic feature reduced MSE from 3.04 to 0.98, a 68% improvement. The model captures the upward curvature that the linear model misses.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 5, 7, 11, 20])

# Model A: linear
X_lin = np.column_stack([np.ones(5), x])
w_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]
mse_lin = np.mean((y - X_lin @ w_lin)**2)

# Model B: quadratic
X_quad = np.column_stack([np.ones(5), x, x**2])
w_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]
mse_quad = np.mean((y - X_quad @ w_quad)**2)

print(f"Linear MSE:    {mse_lin:.2f}")    # 3.04
print(f"Quadratic MSE: {mse_quad:.2f}")   # 0.98
```

Figure: feature importance before and after engineering. Raw Size dominates at first; after adding polynomial terms, log transforms, and proper encoding, the importance spreads more evenly and Location becomes a strong signal.

Feature selection methods

More features is not always better. Irrelevant features add noise, increase computation, and can cause overfitting. Feature selection picks the subset that matters.

Filter methods

Evaluate each feature independently, without training a model.

  • Correlation: Rank features by their absolute correlation with the target. Drop features with low correlation.
  • Mutual information: Measures how much knowing the feature reduces uncertainty about the target. Works for nonlinear relationships too.
  • Variance threshold: Drop features with near-zero variance (they carry no information).

Filter methods are fast but ignore interactions between features. A feature that is useless alone might be powerful when combined with another.
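As a sketch, a correlation filter on made-up data where only the first feature carries signal:

```python
import numpy as np

# Toy data: y is a linear function of feature 0; feature 1 is unrelated
X = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [3.0, 0.0],
              [4.0, 1.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Score each feature by |Pearson correlation| with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
print(scores)                # feature 0: 1.0, feature 1: 0.0
print(int(scores.argmax()))  # 0 -> keep feature 0, drop feature 1
```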

Wrapper methods

Train a model with different feature subsets and pick the best one.

  • Forward selection: Start with no features. Add the one that most improves the model. Repeat until adding features stops helping.
  • Backward elimination: Start with all features. Remove the one whose removal least hurts the model. Repeat.
  • Recursive feature elimination (RFE): Train a model, remove the feature with the smallest weight, and repeat.

Wrapper methods are more accurate than filter methods but expensive, especially with many features.
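Forward selection can be sketched in a few lines, using least-squares training MSE as the score (toy data; a real implementation would score candidate subsets with cross-validation instead):

```python
import numpy as np

def fit_mse(X, y):
    """Training MSE of a least-squares fit with a bias column."""
    Xb = np.column_stack([np.ones(len(y)), X])
    w = np.linalg.lstsq(Xb, y, rcond=None)[0]
    return np.mean((y - Xb @ w) ** 2)

def forward_select(X, y, k):
    """Greedily add the feature whose inclusion lowers MSE the most."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: fit_mse(X[:, selected + [j]], y))
        selected.append(best)
    return selected

x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x1 = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
x2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # irrelevant
X = np.column_stack([x0, x1, x2])
y = 3 * x0 + x1                            # depends on features 0 and 1 only

print(forward_select(X, y, 2))  # [0, 1]
```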

Embedded methods

Feature selection happens as part of model training.

  • Lasso (L1 regularization): Drives unimportant feature weights to exactly zero. The features with non-zero weights are selected automatically.
  • Tree-based importance: Decision trees and random forests compute feature importance based on how much each feature reduces impurity. No extra step needed.
  • Elastic net: Combines L1 and L2 regularization. Gets the sparsity of Lasso with the stability of Ridge.

Embedded methods are usually the best default choice. Lasso in particular is very popular because it does model fitting and feature selection simultaneously.
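To see L1 sparsity concretely, here is a minimal coordinate-descent Lasso on centered, orthogonal toy features (a sketch of the algorithm, not a production solver). The irrelevant feature's weight lands at exactly zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Coordinate descent for min_w 0.5*||y - Xw||^2 + lam*||w||_1."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            r = y - X @ w + X[:, j] * w[j]   # residual with feature j excluded
            rho = X[:, j] @ r
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

# Centered toy data: y depends on feature 0 only; the features are orthogonal
x0 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x1 = np.array([1.0, -1.0, 0.0, -1.0, 1.0])
X = np.column_stack([x0, x1])
y = 2 * x0

print(lasso_cd(X, y, lam=1.0))  # [1.9, 0.0]: feature 1 eliminated exactly
```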

Comparison

| Method   | Speed    | Handles interactions | Example                  |
|----------|----------|----------------------|--------------------------|
| Filter   | Fast     | No                   | Correlation, mutual info |
| Wrapper  | Slow     | Yes                  | Forward selection, RFE   |
| Embedded | Moderate | Yes                  | Lasso, tree importance   |

Dimensionality reduction as feature engineering

PCA creates new features that are linear combinations of the originals, keeping the ones with the most variance. This is an alternative to feature selection: instead of choosing a subset, you create a new, smaller set that captures most of the information.

PCA is especially useful when you have many correlated features. It transforms them into uncorrelated principal components, which often improves model stability.
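A PCA sketch via the SVD: two perfectly correlated toy features collapse onto a single principal component that holds essentially all the variance:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([t, 2 * t])  # two perfectly correlated features

Xc = X - X.mean(axis=0)          # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance ratio per principal component

print(explained)  # ~[1.0, 0.0]
```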

Practical tips

  • Start simple. Try raw features first. Add complexity only if performance is insufficient.
  • Use domain knowledge. A doctor knows that BMI (weight / height$^2$) matters more than weight and height separately. Encode that.
  • Log-transform skewed features (income, population, prices). This often helps linear models.
  • Create ratios when they make physical sense: price per square foot, clicks per impression.
  • Be careful with high-cardinality categoricals. A feature with 10,000 categories (like zip code) will create 10,000 binary columns with one-hot encoding. Use target encoding or embeddings instead.
  • Feature engineering is iterative. Create features, evaluate with cross-validation, refine.
  • Avoid data leakage. All feature engineering (scaling, encoding, polynomial expansion) must be fit on the training set only.

Summary

| Concept | Key idea |
|---------|----------|
| Polynomial features | Add $x^2, x^3, x_1 x_2$ to capture nonlinear patterns |
| Interaction terms | Capture “effect of A depends on B” relationships |
| One-hot encoding | Binary columns for categorical variables |
| Feature scaling | Standardize or normalize before distance-based models |
| Filter selection | Score features independently (fast, ignores interactions) |
| Wrapper selection | Evaluate feature subsets by training models (accurate, slow) |
| Embedded selection | Lasso, tree importance (best default choice) |

What comes next

This post wraps up the core machine learning series. You now have the tools to understand data, build models, evaluate them fairly, and engineer the features that feed them. The next frontier is deep learning, where models learn their own features automatically. Start with the introduction to neural networks to see how layers of simple functions can learn complex patterns without manual feature engineering.
