Data, features, and the ML pipeline
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Prerequisites: This article assumes you’ve read What is machine learning.
Your model is only as good as your data. You can have the fanciest algorithm in the world, but if your data is messy, leaked, or poorly scaled, your results will be garbage. This article covers the practical foundations: splitting data, engineering features, and scaling them properly.
A house-price dataset, end to end
Before any formulas, here is a concrete scenario. You have a spreadsheet of 8 houses. After the house ID, the next five columns describe each house. The last column is what you want to predict.
| House | Size (sq ft) | Bedrooms | Age (years) | Garage | Neighborhood | Price ($k) |
|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 15 | No | Downtown | 240 |
| 2 | 1800 | 3 | 5 | Yes | Suburbs | 350 |
| 3 | 900 | 1 | 30 | No | Downtown | 150 |
| 4 | 2200 | 4 | 2 | Yes | Suburbs | 450 |
| 5 | 1500 | 3 | 10 | No | Suburbs | 280 |
| 6 | 1100 | 2 | 20 | No | Rural | 180 |
| 7 | 2500 | 4 | 1 | Yes | Suburbs | 520 |
| 8 | 1600 | 3 | 8 | Yes | Downtown | 310 |
Size, Bedrooms, Age, Garage, and Neighborhood are features. Price is the target. Notice that some features are numbers (Size, Age) while others are categories (Garage, Neighborhood). ML models need everything as numbers, so categorical features require conversion. More on that below.
*Figure: number of distinct values per feature. Numerical features (Size, Age) vary widely, while categorical features (Garage, Neighborhood) have only a few levels.*
The full ML pipeline from raw data to deployment
```mermaid
graph LR
    A["Collect data"] --> B["Clean and preprocess"]
    B --> C["Split into train/test"]
    C --> D["Scale features"]
    D --> E["Train model"]
    E --> F["Evaluate on test set"]
    F --> G["Deploy"]
```
Features feed into the model to predict the target
```mermaid
graph LR
    F1["Size"] --> M["ML Model"]
    F2["Bedrooms"] --> M
    F3["Age"] --> M
    F4["Garage"] --> M
    F5["Neighborhood"] --> M
    M --> T["Price prediction"]
```
Now let’s look at each step in detail, starting with how to organize features and targets mathematically.
Features and targets
A dataset is a table. Each row is one example (also called a sample or observation). Each column is a feature (also called a variable or attribute). One special column is the target, the thing you want to predict.
| Size (sq ft) | Bedrooms | Age (years) | Price ($k) |
|---|---|---|---|
| 1200 | 2 | 15 | 240 |
| 1800 | 3 | 5 | 350 |
| 900 | 1 | 30 | 150 |
| 2200 | 4 | 2 | 450 |
Here, Size, Bedrooms, and Age are features. Price is the target. In math notation, we stack the features into a matrix $X \in \mathbb{R}^{4 \times 3}$ (4 examples, 3 features) and the targets into a vector $y \in \mathbb{R}^4$:

$$X = \begin{bmatrix} 1200 & 2 & 15 \\ 1800 & 3 & 5 \\ 900 & 1 & 30 \\ 2200 & 4 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 240 \\ 350 \\ 150 \\ 450 \end{bmatrix}$$

Each row $x^{(i)}$ of $X$ is a feature vector, where $x^{(i)} \in \mathbb{R}^3$ (three features).
Train, validation, and test splits
You never evaluate a model on the data it trained on. That’s like grading a student using the exact questions they practiced. You need held-out data to measure how well the model generalizes.
The standard split:
- Training set (60-80%): the model learns from this.
- Validation set (10-20%): you use this to tune hyperparameters and choose between models.
- Test set (10-20%): you touch this once, at the very end, to report final performance.
```mermaid
graph LR
    D[Full Dataset] --> Train["Training Set (70%)"]
    D --> Val["Validation Set (15%)"]
    D --> Test["Test Set (15%)"]
    Train --> Model[Train Model]
    Val --> Tune[Tune Hyperparams]
    Test --> Final[Final Evaluation]
```
For small datasets, use k-fold cross-validation instead of a fixed validation set. Split the training data into $k$ equal parts, train on $k-1$ parts, validate on the remaining one, and rotate. This gives you $k$ validation scores whose average is more reliable than any single split.
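Scikit-learn's `KFold` handles the rotation for you. A minimal sketch with $k = 5$ on a toy array (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy examples, 2 features
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # each fold trains on 8 examples and validates on the held-out 2
    print(f"fold {fold}: train={train_idx.size} val={val_idx.size}")
```

In practice you would train a model inside the loop and average the five validation scores.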
Common train/validation/test split ratios
```mermaid
graph TD
    A["Full Dataset: 100%"] --> B["Training: 70%"]
    A --> C["Validation: 15%"]
    A --> D["Test: 15%"]
    B --> E["Model learns patterns"]
    C --> F["Tune hyperparameters"]
    D --> G["Final one-time evaluation"]
```
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Second split: train and validation
# 0.176 of the remaining 85% ≈ 15% of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42)
```
Data leakage
Data leakage is when information from outside the training set sneaks into your model during training. It makes your model look great on paper but fail in production.
Common causes:
- Scaling before splitting. If you compute the mean and standard deviation on the entire dataset (including test data) and then split, your training features contain information about the test set.
- Using future data. If you’re predicting stock prices, and your features include tomorrow’s trading volume, that’s leakage.
- Target leakage. A feature that’s derived from the target or is a proxy for it. Example: a “has_been_treated” column when predicting “needs_treatment.”
The rule is simple: any statistic you compute from the data (means, standard deviations, min/max values, text vocabularies) must come from the training set only.
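To make the scaling pitfall concrete, here is a sketch using the housing sizes from the example (the train/test split is chosen arbitrarily for illustration). Fitting the scaler on all rows leaks test-set statistics; fitting on the training rows alone does not:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

sizes = np.array([[1200.0], [1800.0], [900.0], [2200.0], [1500.0], [1100.0]])
train, test = sizes[:4], sizes[4:]

# WRONG: fitting on the full array bakes test-set statistics into the scaler
leaky_scaler = StandardScaler().fit(sizes)

# RIGHT: fit on the training rows only, then transform both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(leaky_scaler.mean_[0])  # 1450.0, influenced by the test rows
print(scaler.mean_[0])        # 1525.0, training rows only
```

The two means differ, which is exactly the information that would have leaked.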
Feature scaling
Different features live on different scales. House size might range from 500 to 5000, while number of bedrooms ranges from 1 to 6. This mismatch causes problems.
Why does scale matter? Algorithms that use distances or gradients are sensitive to feature magnitude. Gradient descent will take huge steps in the direction of large-magnitude features and tiny steps for small ones, making convergence slow and erratic.
Two main approaches: normalization and standardization.
Normalization (min-max scaling)
Rescales each feature to $[0, 1]$:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Good when you know the bounds of your data and the distribution isn’t strongly skewed.
Standardization (z-score scaling)
Centers each feature at mean 0 with standard deviation 1:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation. This is usually the default choice. It handles outliers better than normalization because it doesn’t squash everything into a fixed range.
Example 1: standardizing a small dataset by hand
Let’s standardize the Size feature from our housing data: $x = (1200, 1800, 900, 2200)$.

Step 1: compute the mean.

$$\mu = \frac{1200 + 1800 + 900 + 2200}{4} = \frac{6100}{4} = 1525$$

Step 2: compute the standard deviation.

First, the variance:

$$\sigma^2 = \frac{(-325)^2 + (275)^2 + (-625)^2 + (675)^2}{4} = \frac{1{,}027{,}500}{4} = 256{,}875$$

Then $\sigma = \sqrt{256{,}875} \approx 506.8$.

Step 3: standardize each value.

$$x' = \left(\frac{1200 - 1525}{506.8},\ \frac{1800 - 1525}{506.8},\ \frac{900 - 1525}{506.8},\ \frac{2200 - 1525}{506.8}\right) \approx (-0.64,\ 0.54,\ -1.23,\ 1.33)$$

The standardized values are approximately $(-0.64, 0.54, -1.23, 1.33)$. Notice they center around 0 with roughly unit spread.
The critical point: when you get new data at test time, you use the training set’s $\mu$ and $\sigma$. You do NOT recompute these from the test set.
```python
import numpy as np

# Training data
X_train_size = np.array([1200, 1800, 900, 2200])

# Fit on training data only
mu = X_train_size.mean()     # 1525.0
sigma = X_train_size.std()   # 506.8

# Transform training data
X_train_scaled = (X_train_size - mu) / sigma

# Transform test data using TRAINING stats
X_test_size = np.array([1500, 2000])
X_test_scaled = (X_test_size - mu) / sigma
```
Example 2: what happens without scaling
Consider a simple 2-feature problem where you’re running gradient descent to minimize the squared error:

$$L(w_1, w_2) = (w_1 x_1 + w_2 x_2 - y)^2$$

Suppose feature 1 (income) has values around $x_1 \approx 50{,}000$ and feature 2 (age) has values around $x_2 \approx 30$.

The gradient with respect to $w_1$ scales with $x_1$, and the gradient with respect to $w_2$ scales with $x_2$:

$$\frac{\partial L}{\partial w_1} = 2(w_1 x_1 + w_2 x_2 - y)\,x_1, \qquad \frac{\partial L}{\partial w_2} = 2(w_1 x_1 + w_2 x_2 - y)\,x_2$$

So:

$$\left|\frac{\partial L / \partial w_1}{\partial L / \partial w_2}\right| = \frac{x_1}{x_2} \approx \frac{50{,}000}{30} \approx 1{,}667$$

The gradient in the $w_1$ direction is roughly 1,667 times larger than in the $w_2$ direction. If you pick a learning rate that works for $w_1$, it will be too small for $w_2$. If you pick one that works for $w_2$, it will overshoot $w_1$.
The loss surface looks like a long, narrow valley:
```mermaid
graph TD
    A["Unscaled: elongated ellipse"] --> B["Gradient descent zigzags"]
    C["Scaled: circular contours"] --> D["Gradient descent goes straight to minimum"]
```
After standardizing both features to mean 0 and standard deviation 1, the gradients are comparable in magnitude. The loss surface becomes more circular, and gradient descent converges much faster.
Let’s see it concretely. With $x_1 = 50{,}000$ and $x_2 = 30$, true weights $w_1^* = 0.002$ and $w_2^* = 5$, and target $y = 0.002 \cdot 50{,}000 + 5 \cdot 30 = 250$:

Without scaling, learning rate $\eta = 5 \times 10^{-11}$ (very tiny to avoid divergence):

Step 0: $w_1 = w_2 = 0$, prediction $\hat{y} = 0$, error $\hat{y} - y = -250$. The updates are

$$\Delta w_1 = -\eta \cdot 2(-250)(50{,}000) = 1.25 \times 10^{-3}, \qquad \Delta w_2 = -\eta \cdot 2(-250)(30) = 7.5 \times 10^{-7}$$

After one step, $w_1$ has moved significantly toward 0.002, but $w_2$ has barely budged from 0 toward its target of 5. It will take millions of steps for $w_2$ to catch up.
With scaling (both features standardized to similar ranges), a single learning rate works well for both weights simultaneously. That’s why you always scale your features before running gradient-based optimization.
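You can check this yourself on synthetic data. The sketch below assumes incomes around 50,000 and ages around 30 (consistent with the 1,667 gradient ratio above) and compares plain gradient descent on raw versus standardized features, using the same number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)   # feature 1: large magnitude
age = rng.normal(30, 8, n)               # feature 2: small magnitude
X = np.column_stack([income, age])
y = 0.002 * income + 5 * age             # true weights from the example

def final_loss(X, y, lr, steps):
    """Run plain gradient descent on mean squared error, return final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (2 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# Unscaled: the learning rate must be tiny or the income direction diverges,
# so the age direction barely moves in 500 steps
loss_raw = final_loss(X, y, lr=1e-10, steps=500)

# Standardized: one moderate learning rate serves both directions
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
loss_scaled = final_loss(X_scaled, y - y.mean(), lr=0.1, steps=500)

print(f"unscaled MSE:     {loss_raw:.2f}")
print(f"standardized MSE: {loss_scaled:.2e}")
```

The standardized run drives the loss essentially to zero while the unscaled run is still far from converged, despite identical step budgets.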
Feature engineering
Raw data rarely goes directly into a model. You transform, combine, and create features to help the model learn.
Feature types and how to handle them
```mermaid
graph TD
    A["Feature Types"] --> B["Numerical"]
    A --> C["Categorical"]
    B --> D["Continuous: Size, Age"]
    B --> E["Discrete: Bedrooms, Floors"]
    C --> F["Nominal: Color, City"]
    C --> G["Ordinal: Low/Med/High"]
    D --> H["Scale directly"]
    E --> H
    F --> I["One-hot encode"]
    G --> J["Ordinal encode or one-hot"]
```
Common techniques:
One-hot encoding for categorical features
If a feature is “color” with values {red, green, blue}, you create three binary columns:
| color | is_red | is_green | is_blue |
|---|---|---|---|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
You can’t feed “red” as a string into a math equation. Numbers are required.
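A quick sketch with pandas (assuming it is available alongside scikit-learn), which builds exactly the binary columns in the table above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# one binary column per category; columns come out alphabetically:
# color_blue, color_green, color_red
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Scikit-learn's `OneHotEncoder` does the same job inside a pipeline, which is usually the better choice once you care about applying identical encodings at train and test time.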
Polynomial features
If you suspect a nonlinear relationship, add powers and interactions:

$$(x_1, x_2) \;\to\; (x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$$

This lets a linear model fit curves. A model using these features computes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + b$$

which is a second-degree polynomial surface, even though the model itself is linear in its parameters.
Log transforms
When a feature spans several orders of magnitude (population of cities, income), taking the log compresses the range and often makes the relationship more linear:

$$x' = \log(x + 1)$$

The $+1$ avoids $\log(0)$.
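NumPy ships this transform as `log1p`. A sketch on made-up population counts:

```python
import numpy as np

population = np.array([0, 1_000, 50_000, 8_000_000])
log_pop = np.log1p(population)   # log1p(x) = log(x + 1), safe at x = 0

# a range spanning 8 million is compressed to roughly 0..16
print(log_pop)
```

The inverse transform is `np.expm1`, which is handy when you need predictions back on the original scale.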
Handling missing data
Options, roughly ordered from simplest to most sophisticated:
- Drop rows with missing values (fine if few rows are affected).
- Impute with the mean or median of the training set.
- Add an indicator column: 1 if the value was missing, 0 otherwise. Then impute. This lets the model learn that “missingness” itself might be informative.
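The impute-plus-indicator option maps directly onto scikit-learn's `SimpleImputer`. A sketch on the Size column with one value artificially missing:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1200.0], [np.nan], [900.0], [2200.0]])

# strategy="median" fills NaNs with the training median;
# add_indicator=True appends a binary "was missing" column
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)

# median of [1200, 900, 2200] is 1200, so the NaN row becomes [1200.0, 1.0]
print(X_imputed)
```

Because the imputer is fit on training data, test-time NaNs get filled with the training median, keeping the leakage rule intact.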
Putting it all together: the ML pipeline
A pipeline chains preprocessing and modeling steps into a single object. This guarantees that the same transformations applied during training are applied during prediction.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Fit on training data: scaler learns mean/std, model learns weights
pipeline.fit(X_train, y_train)

# Predict on test data: scaler uses TRAINING mean/std, then model predicts
predictions = pipeline.predict(X_test)
```
This prevents leakage automatically. The scaler fits only on training data, even though the predict call processes test data through the same transformations.
```mermaid
flowchart LR
    Raw[Raw Data] --> Split[Train/Test Split]
    Split --> Scale["Scale (fit on train)"]
    Scale --> Eng[Feature Engineering]
    Eng --> Train[Train Model]
    Train --> Eval["Evaluate (on test)"]
```
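The housing table mixes numeric and categorical columns, and scikit-learn's `ColumnTransformer` lets one pipeline route each group through the right preprocessing. A sketch (lowercase column names are assumed here, mirroring the example table):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

houses = pd.DataFrame({
    "size":         [1200, 1800, 900, 2200],
    "bedrooms":     [2, 3, 1, 4],
    "age":          [15, 5, 30, 2],
    "garage":       ["No", "Yes", "No", "Yes"],
    "neighborhood": ["Downtown", "Suburbs", "Downtown", "Suburbs"],
})
prices = [240, 350, 150, 450]

preprocess = ColumnTransformer([
    # numeric columns get standardized
    ("num", StandardScaler(), ["size", "bedrooms", "age"]),
    # categorical columns get one-hot encoded
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["garage", "neighborhood"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

pipeline.fit(houses, prices)
predictions = pipeline.predict(houses)
```

Every statistic (means, stds, category lists) is learned inside `fit`, so the same leakage guarantees from the simpler pipeline carry over to mixed-type data.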
Summary
| Concept | What it does | Watch out for |
|---|---|---|
| Train/val/test split | Separates learning data from evaluation data | Don’t peek at the test set |
| Standardization | Centers features to mean 0, std 1 | Fit on train, transform both |
| Normalization | Scales features to [0, 1] | Sensitive to outliers |
| One-hot encoding | Converts categories to numbers | Can create many columns |
| Pipelines | Chains preprocessing + model | Always use them to prevent leakage |
What comes next
With clean, properly scaled data, you’re ready to build your first model. The next article, Linear regression, covers fitting a line to data using both the normal equations and gradient descent, with full derivations and code.