
Data, features, and the ML pipeline

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: This article assumes you’ve read What is machine learning.

Your model is only as good as your data. You can have the fanciest algorithm in the world, but if your data is messy, leaked, or poorly scaled, your results will be garbage. This article covers the practical foundations: splitting data, engineering features, and scaling them properly.

A house-price dataset, end to end

Before any formulas, here is a concrete scenario. You have a spreadsheet of 8 houses. The first five columns describe each house. The last column is what you want to predict.

| House | Size (sq ft) | Bedrooms | Age (years) | Garage | Neighborhood | Price ($k) |
|-------|--------------|----------|-------------|--------|--------------|------------|
| 1     | 1200         | 2        | 15          | No     | Downtown     | 240        |
| 2     | 1800         | 3        | 5           | Yes    | Suburbs      | 350        |
| 3     | 900          | 1        | 30          | No     | Downtown     | 150        |
| 4     | 2200         | 4        | 2           | Yes    | Suburbs      | 450        |
| 5     | 1500         | 3        | 10          | No     | Suburbs      | 280        |
| 6     | 1100         | 2        | 20          | No     | Rural        | 180        |
| 7     | 2500         | 4        | 1           | Yes    | Suburbs      | 520        |
| 8     | 1600         | 3        | 8           | Yes    | Downtown     | 310        |

Size, Bedrooms, Age, Garage, and Neighborhood are features. Price is the target. Notice that some features are numbers (Size, Age) while others are categories (Garage, Neighborhood). ML models need everything as numbers, so categorical features require conversion. More on that below.

Number of distinct values per feature. Numerical features (Size, Age) vary widely, while categorical features (Garage, Neighborhood) have only a few levels.

The full ML pipeline from raw data to deployment

graph LR
  A["Collect data"] --> B["Clean and preprocess"]
  B --> C["Split into train/test"]
  C --> D["Scale features"]
  D --> E["Train model"]
  E --> F["Evaluate on test set"]
  F --> G["Deploy"]

Features feed into the model to predict the target

graph LR
  F1["Size"] --> M["ML Model"]
  F2["Bedrooms"] --> M
  F3["Age"] --> M
  F4["Garage"] --> M
  F5["Neighborhood"] --> M
  M --> T["Price prediction"]

Now let’s look at each step in detail, starting with how to organize features and targets mathematically.

Features and targets

A dataset is a table. Each row is one example (also called a sample or observation). Each column is a feature (also called a variable or attribute). One special column is the target, the thing you want to predict.

| Size (sq ft) | Bedrooms | Age (years) | Price ($k) |
|--------------|----------|-------------|------------|
| 1200         | 2        | 15          | 240        |
| 1800         | 3        | 5           | 350        |
| 900          | 1        | 30          | 150        |
| 2200         | 4        | 2           | 450        |

Here, Size, Bedrooms, and Age are features. Price is the target. In math notation:

$$X = \begin{bmatrix} 1200 & 2 & 15 \\ 1800 & 3 & 5 \\ 900 & 1 & 30 \\ 2200 & 4 & 2 \end{bmatrix}, \quad y = \begin{bmatrix} 240 \\ 350 \\ 150 \\ 450 \end{bmatrix}$$

Each row of $X$ is a feature vector $x_i \in \mathbb{R}^d$ where $d = 3$ (three features).
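As a quick sketch, the same four houses can be held in a NumPy feature matrix and target vector:

```python
import numpy as np

# Feature matrix: one row per house, columns = [Size, Bedrooms, Age]
X = np.array([
    [1200, 2, 15],
    [1800, 3,  5],
    [ 900, 1, 30],
    [2200, 4,  2],
])

# Target vector: price in $k for each house
y = np.array([240, 350, 150, 450])

print(X.shape)  # (4, 3): 4 examples, d = 3 features
print(X[0])     # feature vector of the first house
```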

Train, validation, and test splits

You never evaluate a model on the data it trained on. That’s like grading a student using the exact questions they practiced. You need held-out data to measure how well the model generalizes.

The standard split:

  • Training set (60-80%): the model learns from this.
  • Validation set (10-20%): you use this to tune hyperparameters and choose between models.
  • Test set (10-20%): you touch this once, at the very end, to report final performance.
graph LR
  D[Full Dataset] --> Train["Training Set (70%)"]
  D --> Val["Validation Set (15%)"]
  D --> Test["Test Set (15%)"]
  Train --> Model[Train Model]
  Val --> Tune[Tune Hyperparams]
  Test --> Final[Final Evaluation]

For small datasets, use k-fold cross-validation instead of a fixed validation set. Split the training data into $k$ equal parts, train on $k-1$ parts, validate on the remaining one, and rotate. This gives you $k$ validation scores whose average is more reliable.
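A minimal sketch of the rotating folds using scikit-learn's KFold, with toy data standing in for the housing table:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy examples, 2 features
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each fold trains on 8 examples and validates on the remaining 2
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

Averaging the five validation scores gives a steadier estimate than any single 80/20 split.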

Common train/validation/test split ratios

graph TD
  A["Full Dataset: 100%"] --> B["Training: 70%"]
  A --> C["Validation: 15%"]
  A --> D["Test: 15%"]
  B --> E["Model learns patterns"]
  C --> F["Tune hyperparameters"]
  D --> G["Final one-time evaluation"]
from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Second split: train and validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42)
# 0.176 of 85% ≈ 15% of total

Data leakage

Data leakage is when information from outside the training set sneaks into your model during training. It makes your model look great on paper but fail in production.

Common causes:

  1. Scaling before splitting. If you compute the mean and standard deviation on the entire dataset (including test data) and then split, your training features contain information about the test set.
  2. Using future data. If you’re predicting stock prices, and your features include tomorrow’s trading volume, that’s leakage.
  3. Target leakage. A feature that’s derived from the target or is a proxy for it. Example: a “has_been_treated” column when predicting “needs_treatment.”

The rule is simple: anything you compute from data must come from the training set only. Means, standard deviations, min/max values, vocabulary for text, all of it.
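To make the rule concrete, here is a small sketch contrasting the leaky order (fit the scaler on everything) with the correct one (fit on the training split only); the synthetic data is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=20, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# WRONG: statistics computed on the full dataset leak test info into training
leaky = StandardScaler().fit(X)           # this scaler has seen the test rows

# RIGHT: fit on the training split only, then transform both splits with it
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(leaky.mean_, scaler.mean_)  # the two learned means differ
```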

Feature scaling

Different features live on different scales. House size might range from 500 to 5000, while number of bedrooms ranges from 1 to 6. This mismatch causes problems.

Why does scale matter? Algorithms that use distances or gradients are sensitive to feature magnitude. Gradient descent will take huge steps in the direction of large-magnitude features and tiny steps for small ones, making convergence slow and erratic.

Two main approaches: normalization and standardization.

Normalization (min-max scaling)

Rescales each feature to $[0, 1]$:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Good when you know the bounds of your data and the distribution isn’t strongly skewed.
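A quick sketch applying min-max scaling to the Size values from the housing table:

```python
import numpy as np

sizes = np.array([1200.0, 1800.0, 900.0, 2200.0])

# Min-max scaling: map the minimum to 0 and the maximum to 1
x_min, x_max = sizes.min(), sizes.max()
sizes_norm = (sizes - x_min) / (x_max - x_min)  # ≈ [0.231, 0.692, 0.0, 1.0]

print(sizes_norm)
```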

Standardization (z-score scaling)

Centers each feature at mean 0 with standard deviation 1:

$$x_{\text{std}} = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation. This is usually the default choice. It handles outliers better than normalization because it doesn’t squash everything into a fixed range.

Example 1: standardizing a small dataset by hand

Let’s standardize the Size feature from our housing data: $[1200, 1800, 900, 2200]$.

Step 1: compute the mean.

$$\mu = \frac{1200 + 1800 + 900 + 2200}{4} = \frac{6100}{4} = 1525$$

Step 2: compute the standard deviation.

First, the variance:

$$\sigma^2 = \frac{(1200 - 1525)^2 + (1800 - 1525)^2 + (900 - 1525)^2 + (2200 - 1525)^2}{4}$$

$$= \frac{(-325)^2 + (275)^2 + (-625)^2 + (675)^2}{4}$$

$$= \frac{105625 + 75625 + 390625 + 455625}{4}$$

$$= \frac{1027500}{4} = 256875$$

$$\sigma = \sqrt{256875} \approx 506.8$$

Step 3: standardize each value.

$$x_1 = \frac{1200 - 1525}{506.8} = \frac{-325}{506.8} \approx -0.641$$

$$x_2 = \frac{1800 - 1525}{506.8} = \frac{275}{506.8} \approx 0.543$$

$$x_3 = \frac{900 - 1525}{506.8} = \frac{-625}{506.8} \approx -1.233$$

$$x_4 = \frac{2200 - 1525}{506.8} = \frac{675}{506.8} \approx 1.332$$

The standardized values are approximately $[-0.641, 0.543, -1.233, 1.332]$. Notice they center around 0 with roughly unit spread.

The critical point: when you get new data at test time, you use the training set’s $\mu = 1525$ and $\sigma = 506.8$. You do NOT recompute these from the test set.

import numpy as np

# Training data
X_train_size = np.array([1200, 1800, 900, 2200])

# Fit on training data only
mu = X_train_size.mean()       # 1525.0
sigma = X_train_size.std()     # 506.8

# Transform training data
X_train_scaled = (X_train_size - mu) / sigma

# Transform test data using TRAINING stats
X_test_size = np.array([1500, 2000])
X_test_scaled = (X_test_size - mu) / sigma

Example 2: what happens without scaling

Consider a simple 2-feature problem where you’re running gradient descent to minimize:

$$L(w_1, w_2) = (w_1 \cdot x_1 + w_2 \cdot x_2 - y)^2$$

Suppose feature 1 (income) has values around $50{,}000$ and feature 2 (age) has values around $30$.

The gradient with respect to $w_1$ scales with $x_1$, and the gradient with respect to $w_2$ scales with $x_2$. So:

$$\frac{\partial L}{\partial w_1} \propto x_1 \approx 50{,}000$$

$$\frac{\partial L}{\partial w_2} \propto x_2 \approx 30$$

The gradient in the $w_1$ direction is roughly 1,667 times larger than in the $w_2$ direction. If you pick a learning rate $\alpha$ that works for $w_1$, it will be too small for $w_2$. If you pick one that works for $w_2$, it will overshoot $w_1$.

The loss surface looks like a long, narrow valley:

graph TD
  A["Unscaled: elongated ellipse"] --> B["Gradient descent zigzags"]
  C["Scaled: circular contours"] --> D["Gradient descent goes straight to minimum"]

After standardizing both features to mean 0 and standard deviation 1, the gradients are comparable in magnitude. The loss surface becomes more circular, and gradient descent converges much faster.

Let’s see it concretely. With $x_1 = 50{,}000$ and $x_2 = 30$, true weights $w_1^* = 0.002$ and $w_2^* = 5$, and target $y = 250$:

Without scaling, learning rate $\alpha = 0.0000000001$ (very tiny to avoid divergence):

Step 0: $w_1 = 0$, $w_2 = 0$, prediction $= 0$, error $= -250$

$$\frac{\partial L}{\partial w_1} = 2 \cdot (-250) \cdot 50000 = -25{,}000{,}000$$

$$\frac{\partial L}{\partial w_2} = 2 \cdot (-250) \cdot 30 = -15{,}000$$

$$w_1 \leftarrow 0 - 0.0000000001 \cdot (-25{,}000{,}000) = 0.0025$$

$$w_2 \leftarrow 0 - 0.0000000001 \cdot (-15{,}000) = 0.0000015$$

After one step, $w_1$ has already jumped past its target of 0.002 (to 0.0025), but $w_2$ has barely budged from 0 toward its target of 5. It will take millions of steps for $w_2$ to catch up.

With scaling (both features standardized to similar ranges), a single learning rate works well for both weights simultaneously. That’s why you always scale your features before running gradient-based optimization.
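The experiment above can be sketched in a few lines; the gd_steps helper and the learning rates are illustrative choices, not library code:

```python
import numpy as np

# One training example: x1 = income, x2 = age, target y
x = np.array([50_000.0, 30.0])
y = 250.0

def gd_steps(x, y, lr, n_steps):
    """Run gradient descent on L(w) = (w·x - y)^2 and return final weights."""
    w = np.zeros(2)
    for _ in range(n_steps):
        error = w @ x - y
        w -= lr * 2 * error * x  # gradient of the squared error
    return w

# Unscaled: a rate small enough for w1 is far too small for w2
w_raw = gd_steps(x, y, lr=1e-10, n_steps=100)

# "Scaled": divide each feature by its typical magnitude first
scale = np.array([50_000.0, 30.0])
w_scaled = gd_steps(x / scale, y, lr=0.1, n_steps=100)

print(w_raw)     # w2 has barely moved from 0
print(w_scaled)  # both weights move together (in scaled units)
```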

Feature engineering

Raw data rarely goes directly into a model. You transform, combine, and create features to help the model learn.

Feature types and how to handle them

graph TD
  A["Feature Types"] --> B["Numerical"]
  A --> C["Categorical"]
  B --> D["Continuous: Size, Age"]
  B --> E["Discrete: Bedrooms, Floors"]
  C --> F["Nominal: Color, City"]
  C --> G["Ordinal: Low/Med/High"]
  D --> H["Scale directly"]
  E --> H
  F --> I["One-hot encode"]
  G --> J["Ordinal encode or one-hot"]

Common techniques:

One-hot encoding for categorical features

If a feature is “color” with values {red, green, blue}, you create three binary columns:

| color | is_red | is_green | is_blue |
|-------|--------|----------|---------|
| red   | 1      | 0        | 0       |
| green | 0      | 1        | 0       |
| blue  | 0      | 0        | 1       |

You can’t feed “red” as a string into a math equation. Numbers are required.
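A minimal sketch using pandas' get_dummies on a toy color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One binary column per category level (columns are sorted alphabetically)
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            0          1
```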

Polynomial features

If you suspect a nonlinear relationship, add powers and interactions:

$$x = [x_1, x_2] \rightarrow [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$$

This lets a linear model fit curves. A model using these features computes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + b$$

which is a second-degree polynomial surface, even though the model itself is linear in its parameters.

Log transforms

When a feature spans several orders of magnitude (population of cities, income), taking the log compresses the range and often makes the relationship more linear:

$$x' = \log(x + 1)$$

The $+1$ avoids $\log(0)$.
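NumPy ships log1p, which computes $\log(x + 1)$ directly; a quick sketch on illustrative population values:

```python
import numpy as np

# City populations spanning several orders of magnitude
pop = np.array([500, 12_000, 350_000, 8_000_000])

# log1p(x) = log(x + 1), so zero values are handled safely
pop_log = np.log1p(pop)
print(pop_log)  # roughly [6.2, 9.4, 12.8, 15.9]
```

Four orders of magnitude collapse into a range of about 6 to 16, which is far friendlier to a linear model.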

Handling missing data

Options, roughly ordered from simplest to most sophisticated:

  1. Drop rows with missing values (fine if few rows are affected).
  2. Impute with the mean or median of the training set.
  3. Add an indicator column: 1 if the value was missing, 0 otherwise. Then impute. This lets the model learn that “missingness” itself might be informative.
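Options 2 and 3 can be sketched with scikit-learn's SimpleImputer, whose add_indicator flag appends the missingness column; the tiny Size array is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1200.0], [np.nan], [900.0], [2200.0]])

# Impute with the training median; add_indicator appends a 0/1
# column marking which values were originally missing
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)

print(X_imputed)
# [[1200.    0.]
#  [1200.    1.]
#  [ 900.    0.]
#  [2200.    0.]]
```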

Putting it all together: the ML pipeline

A pipeline chains preprocessing and modeling steps into a single object. This guarantees that the same transformations applied during training are applied during prediction.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Fit on training data: scaler learns mean/std, model learns weights
pipeline.fit(X_train, y_train)

# Predict on test data: scaler uses TRAINING mean/std, then model predicts
predictions = pipeline.predict(X_test)

This prevents leakage automatically. The scaler fits only on training data, even though the predict call processes test data through the same transformations.
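One place this pays off is cross-validation: passing the whole pipeline to cross_val_score refits the scaler inside each fold, so no fold's validation rows ever influence the scaling. A sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: y is a linear function of X plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# The scaler is refit on each fold's training portion only
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print(scores.mean())
```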

flowchart LR
  Raw[Raw Data] --> Split[Train/Test Split]
  Split --> Scale["Scale (fit on train)"]
  Scale --> Eng[Feature Engineering]
  Eng --> Train[Train Model]
  Train --> Eval["Evaluate (on test)"]

Summary

| Concept | What it does | Watch out for |
|---------|--------------|---------------|
| Train/val/test split | Separates learning data from evaluation data | Don’t peek at the test set |
| Standardization | Centers features to mean 0, std 1 | Fit on train, transform both |
| Normalization | Scales features to [0, 1] | Sensitive to outliers |
| One-hot encoding | Converts categories to numbers | Can create many columns |
| Pipelines | Chains preprocessing + model | Always use them to prevent leakage |

What comes next

With clean, properly scaled data, you’re ready to build your first model. The next article, Linear regression, covers fitting a line to data using both the normal equations and gradient descent, with full derivations and code.
