Feature engineering and selection
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
The best algorithm with bad features will lose to a simple algorithm with good features. Feature engineering is the process of transforming raw data into inputs that make machine learning models work better. Feature selection is figuring out which of those inputs actually matter.
Prerequisites: You should be comfortable with the data pipeline and regularization.
Why features matter
A linear model can only learn linear relationships. If the true relationship is $y = x^2$, no amount of training will make $\hat{y} = wx + b$ fit well. But if you create a new feature $z = x^2$ and feed the model $(x, z)$, it can learn the relationship perfectly.
Feature engineering is how you encode your domain knowledge into the model. It bridges the gap between what the data looks like and what the model can learn.
Polynomial features
The simplest feature expansion takes a feature $x$ and creates powers of it: $x, x^2, x^3, \dots, x^d$. For multiple features, you also include all cross-terms.
For two features $x_1, x_2$ with degree 2, the expanded feature set is: $\{1, x_1, x_2, x_1^2, x_1 x_2, x_2^2\}$
The number of features grows fast. With $n$ original features and polynomial degree $d$, the number of expanded features (including the bias term) is $\binom{n+d}{d}$. For $n = 10$ and $d = 3$, that is $\binom{13}{3} = 286$ features.
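To see how quickly the expansion blows up, here is a minimal sketch (the helper name `n_poly_features` is mine, not from any library):

```python
from math import comb

def n_poly_features(n, d):
    """Number of monomials of degree <= d in n variables, including the bias term."""
    return comb(n + d, d)

print(n_poly_features(10, 3))  # 286, as in the text
print(n_poly_features(2, 2))   # 6: 1, x1, x2, x1^2, x1*x2, x2^2
```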
Example 1: polynomial features by hand
Suppose you are predicting house price from two features: size (in hundreds of sqft) and age (in decades).
| House | Size ($x_1$) | Age ($x_2$) | Price ($y$, thousands) |
|---|---|---|---|
| A | 1 | 3 | 150 |
| B | 2 | 1 | 280 |
| C | 3 | 2 | 400 |
| D | 4 | 1 | 500 |
Step 1: Create degree-2 polynomial features.
We add $x_1^2$, $x_1 x_2$ (interaction term), and $x_2^2$:
| House | $x_1$ | $x_2$ | $x_1^2$ | $x_1 x_2$ | $x_2^2$ |
|---|---|---|---|---|---|
| A | 1 | 3 | 1 | 3 | 9 |
| B | 2 | 1 | 4 | 2 | 1 |
| C | 3 | 2 | 9 | 6 | 4 |
| D | 4 | 1 | 16 | 4 | 1 |
Step 2: Fit a linear model with original features only.
Using the normal equations, fit $\hat{y} = w_0 + w_1 x_1 + w_2 x_2$.
Solving gives $w_0 = 50$, $w_1 \approx 115.33$, $w_2 \approx -3.33$.
Model: $\hat{y} = 50 + 115.33\,x_1 - 3.33\,x_2$
| House | $y$ | $\hat{y}$ | Error |
|---|---|---|---|
| A | 150 | 155.33 | $-5.33$ |
| B | 280 | 277.33 | $2.67$ |
| C | 400 | 389.33 | $10.67$ |
| D | 500 | 508.00 | $-8.00$ |
The mean squared error is about 53.3.
Step 3: Fit with polynomial features.
Now use all 5 features: $x_1, x_2, x_1^2, x_1 x_2, x_2^2$. With 4 data points and 6 parameters (including bias), we have an underdetermined system, so the model can fit the data exactly.
This is a perfect fit, but with 6 parameters for 4 data points, we are almost certainly overfitting. This is where regularization becomes critical. A ridge regression penalty on the polynomial model would give a non-zero but much smaller MSE than the linear model.
Takeaway: Polynomial features let a linear model capture nonlinear relationships, but they increase the risk of overfitting. Always pair them with regularization or cross-validation.
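To make the regularization point concrete, here is a minimal numpy sketch that fits ridge regression on the degree-2 expansion of the house data above. The penalty strength `lam = 1.0` is an arbitrary illustrative choice, and the bias term is left unpenalized, as is conventional:

```python
import numpy as np

# House data from the example: size, age -> price (thousands)
x1 = np.array([1., 2., 3., 4.])
x2 = np.array([3., 1., 2., 1.])
y = np.array([150., 280., 400., 500.])

# Degree-2 polynomial design matrix: bias, x1, x2, x1^2, x1*x2, x2^2
X = np.column_stack([np.ones(4), x1, x2, x1**2, x1 * x2, x2**2])

lam = 1.0                          # ridge strength (illustrative, not tuned)
penalty = lam * np.eye(X.shape[1])
penalty[0, 0] = 0.0                # do not penalize the bias term
w = np.linalg.solve(X.T @ X + penalty, X.T @ y)
mse = np.mean((y - X @ w) ** 2)    # non-zero: ridge no longer interpolates
```

Unlike the unregularized 6-parameter fit, the ridge solution does not pass through all 4 points exactly; the shrinkage trades a small training error for smaller, more stable weights.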
Interaction terms
An interaction term $x_1 x_2$ captures the idea that the effect of $x_1$ on $y$ depends on the value of $x_2$. In the house price example, the value of extra square footage might depend on the age of the house: new square footage might be worth more in newer houses.
You do not always need full polynomial expansion. Sometimes adding just the interaction terms (without squared terms) is enough.
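An interaction-only expansion is easy to build by hand. This sketch (the helper name `add_interactions` is mine) appends every pairwise product without the squared terms:

```python
import numpy as np
from itertools import combinations

def add_interactions(X):
    """Append pairwise products x_i * x_j (i < j) to a feature matrix."""
    cols = [X]
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

X = np.array([[1., 3.], [2., 1.], [3., 2.]])
X_int = add_interactions(X)  # columns: x1, x2, x1*x2
```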
Encoding categorical variables
Machine learning models need numbers, not categories. If you have a feature “color” with values {red, blue, green}, you need to encode it.
One-hot encoding
Create a binary column for each category:
| Color | is_red | is_blue | is_green |
|---|---|---|---|
| red | 1 | 0 | 0 |
| blue | 0 | 1 | 0 |
| green | 0 | 0 | 1 |
Drop one column to avoid perfect multicollinearity (the “dummy variable trap”). If is_red = 0 and is_blue = 0, we know the color is green.
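A minimal hand-rolled encoder (the function name `one_hot` is mine) shows the dropped-column convention, with the first category as the implicit baseline:

```python
import numpy as np

def one_hot(values, categories, drop_first=True):
    """One-hot encode; drop the first category to avoid the dummy variable trap."""
    cats = categories[1:] if drop_first else categories
    return np.array([[1.0 if v == c else 0.0 for c in cats] for v in values])

colors = ["red", "blue", "green", "blue"]
encoded = one_hot(colors, ["red", "blue", "green"])
# columns: is_blue, is_green; red is the baseline row (0, 0)
```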
Label encoding
Assign an integer to each category: red = 0, blue = 1, green = 2. This is simple but dangerous for most models because it implies an ordering (blue is “between” red and green) and a distance (green is “twice as far” from red as blue). Only use this for tree-based models that split on thresholds.
Target encoding
Replace each category with the mean of the target variable for that category. For example, if the average price of red houses is $200k, encode red as 200. This is powerful but risks data leakage. Always compute target encodings on the training fold only, never on the test data.
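A leakage-safe sketch of the idea: the encoding is learned from training data only, with the global mean as a fallback for categories never seen in training (the helper names and the tiny dataset are mine, for illustration):

```python
import numpy as np

def fit_target_encoding(categories, targets):
    """Learn category -> mean(target) from TRAINING data only."""
    enc = {c: float(np.mean([t for cat, t in zip(categories, targets) if cat == c]))
           for c in set(categories)}
    enc["__global__"] = float(np.mean(targets))  # fallback for unseen categories
    return enc

def apply_target_encoding(categories, enc):
    return np.array([enc.get(c, enc["__global__"]) for c in categories])

train_colors = ["red", "red", "blue"]
train_price = [180., 220., 300.]
enc = fit_target_encoding(train_colors, train_price)      # red -> 200, blue -> 300
test_encoded = apply_target_encoding(["red", "green"], enc)  # green unseen -> global mean
```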
Feature scaling
Features on different scales cause problems. If feature $x_1$ ranges from 0 to 1 and $x_2$ ranges from 0 to 1000, models that use distance or gradient descent will be dominated by $x_2$.
Standardization (Z-score): $x' = \frac{x - \mu}{\sigma}$
Centers at 0, scales to unit variance. Good for models that assume normality or use distance.
Min-max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
Scales to [0, 1]. Sensitive to outliers.
Always fit the scaler on the training data and apply it to both training and test data. Never compute statistics from the test set.
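The fit-on-train-only rule looks like this in numpy (the toy matrices are mine, for illustration):

```python
import numpy as np

X_train = np.array([[1., 200.], [2., 400.], [3., 600.]])
X_test = np.array([[4., 800.]])

# Standardization: statistics come from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma  # test values may land outside [-1, 1]; that's fine
```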
Example 2: impact of feature engineering on model performance
Let’s see how feature engineering affects a regression model. We have 5 data points:
| $x$ | $y$ |
|---|---|
| 1 | 1 |
| 2 | 5 |
| 3 | 7 |
| 4 | 11 |
| 5 | 20 |
The relationship looks nonlinear (possibly quadratic or exponential).
Model A: linear regression on the raw feature.
Least squares gives $\hat{y} = -4.4 + 4.4x$.
| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| 1 | 1 | 0.0 | 1.00 |
| 2 | 5 | 4.4 | 0.36 |
| 3 | 7 | 8.8 | 3.24 |
| 4 | 11 | 13.2 | 4.84 |
| 5 | 20 | 17.6 | 5.76 |
MSE $= 15.2 / 5 = 3.04$.
Model B: add a quadratic feature $x^2$.
Now fit $\hat{y} = w_0 + w_1 x + w_2 x^2$ using features $(x, x^2)$:
| $x$ | $x^2$ | $y$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 4 | 5 |
| 3 | 9 | 7 |
| 4 | 16 | 11 |
| 5 | 25 | 20 |
Solving the normal equations (three unknowns, five equations) gives $w_0 = 1.6$, $w_1 \approx -0.74$, $w_2 \approx 0.86$.
Model B: $\hat{y} = 1.6 - 0.74x + 0.86x^2$
| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| 1 | 1 | 1.71 | 0.50 |
| 2 | 5 | 3.54 | 2.13 |
| 3 | 7 | 7.09 | 0.01 |
| 4 | 11 | 12.34 | 1.80 |
| 5 | 20 | 19.31 | 0.48 |
MSE $= 4.92 / 5 \approx 0.98$.
The quadratic feature reduced MSE from 3.04 to 0.98, a 68% improvement. The model captures the upward curvature that the linear model misses.
```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 5, 7, 11, 20])

# Model A: linear
X_lin = np.column_stack([np.ones(5), x])
w_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]
mse_lin = np.mean((y - X_lin @ w_lin)**2)

# Model B: quadratic
X_quad = np.column_stack([np.ones(5), x, x**2])
w_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]
mse_quad = np.mean((y - X_quad @ w_quad)**2)

print(f"Linear MSE: {mse_lin:.2f}")      # 3.04
print(f"Quadratic MSE: {mse_quad:.2f}")  # 0.98
```
Figure: feature importance before and after engineering. Raw Size dominates at first, but after adding polynomial terms, log-transforms, and proper encoding, the importance spreads more evenly and Location becomes a strong signal.
Feature selection methods
More features is not always better. Irrelevant features add noise, increase computation, and can cause overfitting. Feature selection picks the subset that matters.
Filter methods
Evaluate each feature independently, without training a model.
- Correlation: Rank features by their absolute correlation with the target. Drop features with low correlation.
- Mutual information: Measures how much knowing the feature reduces uncertainty about the target. Works for nonlinear relationships too.
- Variance threshold: Drop features with near-zero variance (they carry no information).
Filter methods are fast but ignore interactions between features. A feature that is useless alone might be powerful when combined with another.
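A numpy sketch of two filter methods on synthetic data (the variable names and thresholds are mine): a variance threshold first drops the constant column, then the survivors are ranked by absolute correlation with the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)           # truly predictive feature
x_noise = rng.normal(size=n)            # irrelevant feature
x_constant = np.full(n, 3.0)            # zero-variance feature
y = 2.0 * x_signal + 0.1 * rng.normal(size=n)

X = np.column_stack([x_signal, x_noise, x_constant])

# Variance threshold: drop (near-)constant columns
keep = X.var(axis=0) > 1e-8

# Correlation filter on the surviving columns
corrs = [abs(np.corrcoef(X[:, j], y)[0, 1])
         for j in range(X.shape[1]) if keep[j]]
# the signal feature scores near 1; the noise feature scores near 0
```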
Wrapper methods
Train a model with different feature subsets and pick the best one.
- Forward selection: Start with no features. Add the one that most improves the model. Repeat until adding features stops helping.
- Backward elimination: Start with all features. Remove the one whose removal least hurts the model. Repeat.
- Recursive feature elimination (RFE): Train a model, remove the feature with the smallest weight, and repeat.
Wrapper methods are more accurate than filter methods but expensive, especially with many features.
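Forward selection can be sketched in a few lines. This version (helper names `loo_mse` and `forward_selection` are mine) scores each candidate subset by leave-one-out MSE of a least-squares fit and stops when no candidate improves the score:

```python
import numpy as np

def loo_mse(X, y):
    """Leave-one-out MSE of least squares on the given feature columns."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        w, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        pred = np.concatenate([[1.0], X[i]]) @ w
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

def forward_selection(X, y):
    """Greedily add the feature that most improves leave-one-out MSE."""
    selected, remaining, best = [], list(range(X.shape[1])), np.inf
    while remaining:
        scores = {j: loo_mse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the model; stop
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 3.0 * X[:, 1] + 0.1 * rng.normal(size=40)
picked = forward_selection(X, y)  # feature 1 should be picked first
```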
Embedded methods
Feature selection happens as part of model training.
- Lasso (L1 regularization): Drives unimportant feature weights to exactly zero. The features with non-zero weights are selected automatically.
- Tree-based importance: Decision trees and random forests compute feature importance based on how much each feature reduces impurity. No extra step needed.
- Elastic net: Combines L1 and L2 regularization. Gets the sparsity of Lasso with the stability of Ridge.
Embedded methods are usually the best default choice. Lasso in particular is very popular because it does model fitting and feature selection simultaneously.
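To show the sparsity Lasso produces, here is a minimal coordinate-descent Lasso written from scratch (the implementation is a teaching sketch, not production code; it assumes standardized features and a centered target, so there is no bias term):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via coordinate descent: minimize (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]       # residual with feature j excluded
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft threshold
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 4.0 * X[:, 0] + 0.05 * rng.normal(size=100)
y = y - y.mean()
w = lasso_cd(X, y, lam=0.1)
# w[0] stays large; the four irrelevant weights are driven to (near) zero
```

The soft-threshold step is exactly where the selection happens: any feature whose partial correlation with the residual falls below `lam` gets a weight of exactly zero.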
Comparison
| Method | Speed | Handles interactions | Example |
|---|---|---|---|
| Filter | Fast | No | Correlation, mutual info |
| Wrapper | Slow | Yes | Forward selection, RFE |
| Embedded | Moderate | Yes | Lasso, tree importance |
Dimensionality reduction as feature engineering
PCA creates new features that are linear combinations of the originals, keeping the ones with the most variance. This is an alternative to feature selection: instead of choosing a subset, you create a new, smaller set that captures most of the information.
PCA is especially useful when you have many correlated features. It transforms them into uncorrelated principal components, which often improves model stability.
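A compact way to see this is PCA via the SVD of centered data (the synthetic dataset, with two nearly duplicate columns, is mine): the correlated pair collapses into one dominant component.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
# Two strongly correlated features plus one independent noise feature
X = np.hstack([t, t + 0.01 * rng.normal(size=(100, 1)), rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)                  # PCA requires centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # fraction of variance per component
X_reduced = Xc @ Vt[:2].T                # project onto the top 2 components
```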
Practical tips
- Start simple. Try raw features first. Add complexity only if performance is insufficient.
- Use domain knowledge. A doctor knows that BMI (weight/height) matters more than weight and height separately. Encode that.
- Log-transform skewed features (income, population, prices). This often helps linear models.
- Create ratios when they make physical sense: price per square foot, clicks per impression.
- Be careful with high-cardinality categoricals. A feature with 10,000 categories (like zip code) will create 10,000 binary columns with one-hot encoding. Use target encoding or embeddings instead.
- Feature engineering is iterative. Create features, evaluate with cross-validation, refine.
- Avoid data leakage. All feature engineering (scaling, encoding, polynomial expansion) must be fit on the training set only.
Summary
| Concept | Key idea |
|---|---|
| Polynomial features | Add $x^2, x^3, \dots$ and cross-terms to capture nonlinear patterns |
| Interaction terms | Capture “effect of A depends on B” relationships |
| One-hot encoding | Binary columns for categorical variables |
| Feature scaling | Standardize or normalize before distance-based models |
| Filter selection | Score features independently (fast, ignores interactions) |
| Wrapper selection | Evaluate feature subsets by training models (accurate, slow) |
| Embedded selection | Lasso, tree importance (best default choice) |
What comes next
This post wraps up the core machine learning series. You now have the tools to understand data, build models, evaluate them fairly, and engineer the features that feed them. The next frontier is deep learning, where models learn their own features automatically. Start with the introduction to neural networks to see how layers of simple functions can learn complex patterns without manual feature engineering.