Data, features, and the ML pipeline
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Prerequisites: This article assumes you’ve read What is machine learning.
Your model is only as good as your data. You can have the fanciest algorithm in the world, but if your data is messy, leaked, or poorly scaled, your results will be garbage. This article covers the practical foundations: splitting data, engineering features, and scaling them properly.
A house-price dataset, end to end
Before any formulas, here is a concrete scenario. You have a spreadsheet of 8 houses. After the house ID, the next five columns describe each house. The last column is what you want to predict.
| House | Size (sq ft) | Bedrooms | Age (years) | Garage | Neighborhood | Price ($k) |
|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 15 | No | Downtown | 240 |
| 2 | 1800 | 3 | 5 | Yes | Suburbs | 350 |
| 3 | 900 | 1 | 30 | No | Downtown | 150 |
| 4 | 2200 | 4 | 2 | Yes | Suburbs | 450 |
| 5 | 1500 | 3 | 10 | No | Suburbs | 280 |
| 6 | 1100 | 2 | 20 | No | Rural | 180 |
| 7 | 2500 | 4 | 1 | Yes | Suburbs | 520 |
| 8 | 1600 | 3 | 8 | Yes | Downtown | 310 |
Size, Bedrooms, Age, Garage, and Neighborhood are features. Price is the target. Notice that some features are numbers (Size, Age) while others are categories (Garage, Neighborhood). ML models need everything as numbers, so categorical features require conversion. More on that below.
*Figure: number of distinct values per feature. Numerical features (Size, Age) vary widely, while categorical features (Garage, Neighborhood) have only a few levels.*
The full ML pipeline from raw data to deployment
```mermaid
graph LR
    A["Collect data"] --> B["Clean and preprocess"]
    B --> C["Split into train/test"]
    C --> D["Scale features"]
    D --> E["Train model"]
    E --> F["Evaluate on test set"]
    F --> G["Deploy"]
```
Features feed into the model to predict the target
```mermaid
graph LR
    F1["Size"] --> M["ML Model"]
    F2["Bedrooms"] --> M
    F3["Age"] --> M
    F4["Garage"] --> M
    F5["Neighborhood"] --> M
    M --> T["Price prediction"]
```
Now let’s look at each step in detail, starting with how to organize features and targets mathematically.
Features and targets
A dataset is a table. Each row is one example (also called a sample or observation). Each column is a feature (also called a variable or attribute). One special column is the target, the thing you want to predict.
| Size (sq ft) | Bedrooms | Age (years) | Price ($k) |
|---|---|---|---|
| 1200 | 2 | 15 | 240 |
| 1800 | 3 | 5 | 350 |
| 900 | 1 | 30 | 150 |
| 2200 | 4 | 2 | 450 |
Here, Size, Bedrooms, and Age are features. Price is the target. In math notation, we stack the features into a matrix $X \in \mathbb{R}^{4 \times 3}$ (4 examples, 3 features) and the targets into a vector $y \in \mathbb{R}^4$:

$$X = \begin{bmatrix} 1200 & 2 & 15 \\ 1800 & 3 & 5 \\ 900 & 1 & 30 \\ 2200 & 4 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 240 \\ 350 \\ 150 \\ 450 \end{bmatrix}$$

Each row $x^{(i)}$ of $X$ is a feature vector, where $x^{(i)} \in \mathbb{R}^3$ (three features).
Train, validation, and test splits
You never evaluate a model on the data it trained on. That’s like grading a student using the exact questions they practiced. You need held-out data to measure how well the model generalizes.
The standard split:
- Training set (60-80%): the model learns from this.
- Validation set (10-20%): you use this to tune hyperparameters and choose between models.
- Test set (10-20%): you touch this once, at the very end, to report final performance.
```mermaid
graph LR
    D[Full Dataset] --> Train["Training Set (70%)"]
    D --> Val["Validation Set (15%)"]
    D --> Test["Test Set (15%)"]
    Train --> Model[Train Model]
    Val --> Tune[Tune Hyperparams]
    Test --> Final[Final Evaluation]
```
For small datasets, use k-fold cross-validation instead of a fixed validation set. Split the training data into $k$ equal parts, train on $k-1$ parts, validate on the remaining one, and rotate. This gives you $k$ validation scores whose average is more reliable than any single split.
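Scikit-learn's `KFold` handles the rotation for you. A minimal sketch with $k = 5$ on a toy array (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy examples, 2 features
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # each fold trains on 8 examples and validates on the held-out 2
    print(f"fold {fold}: train={train_idx.size} val={val_idx.size}")
```

In practice you would train a model inside the loop and average the five validation scores.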
Common train/validation/test split ratios
```mermaid
graph TD
    A["Full Dataset: 100%"] --> B["Training: 70%"]
    A --> C["Validation: 15%"]
    A --> D["Test: 15%"]
    B --> E["Model learns patterns"]
    C --> F["Tune hyperparameters"]
    D --> G["Final one-time evaluation"]
```
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Second split: train and validation
# 0.176 of the remaining 85% ≈ 15% of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42)
```
Data leakage
Data leakage is when information from outside the training set sneaks into your model during training. It makes your model look great on paper but fail in production.
Common causes:
- Scaling before splitting. If you compute the mean and standard deviation on the entire dataset (including test data) and then split, your training features contain information about the test set.
- Using future data. If you’re predicting stock prices, and your features include tomorrow’s trading volume, that’s leakage.
- Target leakage. A feature that’s derived from the target or is a proxy for it. Example: a “has_been_treated” column when predicting “needs_treatment.”
The rule is simple: any statistic you compute from the data (means, standard deviations, min/max values, text vocabularies) must come from the training set only.
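To make the scaling pitfall concrete, here is a sketch using the housing sizes from the example (the train/test split is chosen arbitrarily for illustration). Fitting the scaler on all rows leaks test-set statistics; fitting on the training rows alone does not:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

sizes = np.array([[1200.0], [1800.0], [900.0], [2200.0], [1500.0], [1100.0]])
train, test = sizes[:4], sizes[4:]

# WRONG: fitting on the full array bakes test-set statistics into the scaler
leaky_scaler = StandardScaler().fit(sizes)

# RIGHT: fit on the training rows only, then transform both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(leaky_scaler.mean_[0])  # 1450.0, influenced by the test rows
print(scaler.mean_[0])        # 1525.0, training rows only
```

The two means differ, which is exactly the information that would have leaked.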
Feature scaling
Different features live on different scales. House size might range from 500 to 5000, while number of bedrooms ranges from 1 to 6. This mismatch causes problems.
Why does scale matter? Algorithms that use distances or gradients are sensitive to feature magnitude. Gradient descent will take huge steps in the direction of large-magnitude features and tiny steps for small ones, making convergence slow and erratic.
Two main approaches: normalization and standardization.
Normalization (min-max scaling)
Rescales each feature to $[0, 1]$:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Good when you know the bounds of your data and the distribution isn’t strongly skewed.
Standardization (z-score scaling)
Centers each feature at mean 0 with standard deviation 1:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation. This is usually the default choice. It handles outliers better than normalization because it doesn’t squash everything into a fixed range.
Example 1: standardizing a small dataset by hand
Let’s standardize the Size feature from our housing data: $x = (1200, 1800, 900, 2200)$.

Step 1: compute the mean.

$$\mu = \frac{1200 + 1800 + 900 + 2200}{4} = \frac{6100}{4} = 1525$$

Step 2: compute the standard deviation.

First, the variance:

$$\sigma^2 = \frac{(-325)^2 + (275)^2 + (-625)^2 + (675)^2}{4} = \frac{1{,}027{,}500}{4} = 256{,}875$$

Then $\sigma = \sqrt{256{,}875} \approx 506.8$.

Step 3: standardize each value.

$$x' = \left(\frac{1200 - 1525}{506.8},\ \frac{1800 - 1525}{506.8},\ \frac{900 - 1525}{506.8},\ \frac{2200 - 1525}{506.8}\right) \approx (-0.64,\ 0.54,\ -1.23,\ 1.33)$$

The standardized values are approximately $(-0.64, 0.54, -1.23, 1.33)$. Notice they center around 0 with roughly unit spread.
The critical point: when you get new data at test time, you use the training set’s $\mu$ and $\sigma$. You do NOT recompute these from the test set.
```python
import numpy as np

# Training data
X_train_size = np.array([1200, 1800, 900, 2200])

# Fit on training data only
mu = X_train_size.mean()     # 1525.0
sigma = X_train_size.std()   # 506.8

# Transform training data
X_train_scaled = (X_train_size - mu) / sigma

# Transform test data using TRAINING stats
X_test_size = np.array([1500, 2000])
X_test_scaled = (X_test_size - mu) / sigma
```
Example 2: what happens without scaling
Consider a simple 2-feature problem where you’re running gradient descent to minimize the squared error:

$$L(w_1, w_2) = (w_1 x_1 + w_2 x_2 - y)^2$$

Suppose feature 1 (income) has values around $x_1 \approx 50{,}000$ and feature 2 (age) has values around $x_2 \approx 30$.

The gradient with respect to $w_1$ scales with $x_1$, and the gradient with respect to $w_2$ scales with $x_2$:

$$\frac{\partial L}{\partial w_1} = 2(w_1 x_1 + w_2 x_2 - y)\,x_1, \qquad \frac{\partial L}{\partial w_2} = 2(w_1 x_1 + w_2 x_2 - y)\,x_2$$

So:

$$\left|\frac{\partial L / \partial w_1}{\partial L / \partial w_2}\right| = \frac{x_1}{x_2} \approx \frac{50{,}000}{30} \approx 1{,}667$$

The gradient in the $w_1$ direction is roughly 1,667 times larger than in the $w_2$ direction. If you pick a learning rate that works for $w_1$, it will be too small for $w_2$. If you pick one that works for $w_2$, it will overshoot $w_1$.
The loss surface looks like a long, narrow valley:
```mermaid
graph TD
    A["Unscaled: elongated ellipse"] --> B["Gradient descent zigzags"]
    C["Scaled: circular contours"] --> D["Gradient descent goes straight to minimum"]
```
After standardizing both features to mean 0 and standard deviation 1, the gradients are comparable in magnitude. The loss surface becomes more circular, and gradient descent converges much faster.
Let’s see it concretely. With $x_1 = 50{,}000$ and $x_2 = 30$, true weights $w_1^* = 0.002$ and $w_2^* = 5$, and target $y = 0.002 \cdot 50{,}000 + 5 \cdot 30 = 250$:

Without scaling, learning rate $\eta = 5 \times 10^{-11}$ (very tiny to avoid divergence):

Step 0: $w_1 = w_2 = 0$, prediction $\hat{y} = 0$, error $\hat{y} - y = -250$. The updates are

$$\Delta w_1 = -\eta \cdot 2(-250)(50{,}000) = 1.25 \times 10^{-3}, \qquad \Delta w_2 = -\eta \cdot 2(-250)(30) = 7.5 \times 10^{-7}$$

After one step, $w_1$ has moved significantly toward 0.002, but $w_2$ has barely budged from 0 toward its target of 5. It will take millions of steps for $w_2$ to catch up.
With scaling (both features standardized to similar ranges), a single learning rate works well for both weights simultaneously. That’s why you always scale your features before running gradient-based optimization.
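You can check this yourself on synthetic data. The sketch below assumes incomes around 50,000 and ages around 30 (consistent with the 1,667 gradient ratio above) and compares plain gradient descent on raw versus standardized features, using the same number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)   # feature 1: large magnitude
age = rng.normal(30, 8, n)               # feature 2: small magnitude
X = np.column_stack([income, age])
y = 0.002 * income + 5 * age             # true weights from the example

def final_loss(X, y, lr, steps):
    """Run plain gradient descent on mean squared error, return final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (2 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# Unscaled: the learning rate must be tiny or the income direction diverges,
# so the age direction barely moves in 500 steps
loss_raw = final_loss(X, y, lr=1e-10, steps=500)

# Standardized: one moderate learning rate serves both directions
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
loss_scaled = final_loss(X_scaled, y - y.mean(), lr=0.1, steps=500)

print(f"unscaled MSE:     {loss_raw:.2f}")
print(f"standardized MSE: {loss_scaled:.2e}")
```

The standardized run drives the loss essentially to zero while the unscaled run is still far from converged, despite identical step budgets.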
Feature engineering
Raw data rarely goes directly into a model. You transform, combine, and create features to help the model learn.
Feature types and how to handle them
```mermaid
graph TD
    A["Feature Types"] --> B["Numerical"]
    A --> C["Categorical"]
    B --> D["Continuous: Size, Age"]
    B --> E["Discrete: Bedrooms, Floors"]
    C --> F["Nominal: Color, City"]
    C --> G["Ordinal: Low/Med/High"]
    D --> H["Scale directly"]
    E --> H
    F --> I["One-hot encode"]
    G --> J["Ordinal encode or one-hot"]
```
Common techniques:
One-hot encoding for categorical features
If a feature is “color” with values {red, green, blue}, you create three binary columns:
| color | is_red | is_green | is_blue |
|---|---|---|---|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
You can’t feed “red” as a string into a math equation. Numbers are required.
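A quick sketch with pandas (assuming it is available alongside scikit-learn), which builds exactly the binary columns in the table above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# one binary column per category; columns come out alphabetically:
# color_blue, color_green, color_red
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Scikit-learn's `OneHotEncoder` does the same job inside a pipeline, which is usually the better choice once you care about applying identical encodings at train and test time.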
Polynomial features
If you suspect a nonlinear relationship, add powers and interactions:

$$(x_1, x_2) \;\to\; (x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$$

This lets a linear model fit curves. A model using these features computes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + b$$

which is a second-degree polynomial surface, even though the model itself is linear in its parameters.
Log transforms
When a feature spans several orders of magnitude (population of cities, income), taking the log compresses the range and often makes the relationship more linear:

$$x' = \log(x + 1)$$

The $+1$ avoids $\log(0)$.
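NumPy ships this transform as `log1p`. A sketch on made-up population counts:

```python
import numpy as np

population = np.array([0, 1_000, 50_000, 8_000_000])
log_pop = np.log1p(population)   # log1p(x) = log(x + 1), safe at x = 0

# a range spanning 8 million is compressed to roughly 0..16
print(log_pop)
```

The inverse transform is `np.expm1`, which is handy when you need predictions back on the original scale.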
Handling missing data
Options, roughly ordered from simplest to most sophisticated:
- Drop rows with missing values (fine if few rows are affected).
- Impute with the mean or median of the training set.
- Add an indicator column: 1 if the value was missing, 0 otherwise. Then impute. This lets the model learn that “missingness” itself might be informative.
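The impute-plus-indicator option maps directly onto scikit-learn's `SimpleImputer`. A sketch on the Size column with one value artificially missing:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1200.0], [np.nan], [900.0], [2200.0]])

# strategy="median" fills NaNs with the training median;
# add_indicator=True appends a binary "was missing" column
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)

# median of [1200, 900, 2200] is 1200, so the NaN row becomes [1200.0, 1.0]
print(X_imputed)
```

Because the imputer is fit on training data, test-time NaNs get filled with the training median, keeping the leakage rule intact.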
Putting it all together: the ML pipeline
A pipeline chains preprocessing and modeling steps into a single object. This guarantees that the same transformations applied during training are applied during prediction.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Fit on training data: scaler learns mean/std, model learns weights
pipeline.fit(X_train, y_train)

# Predict on test data: scaler uses TRAINING mean/std, then model predicts
predictions = pipeline.predict(X_test)
```
This prevents leakage automatically. The scaler fits only on training data, even though the predict call processes test data through the same transformations.
```mermaid
flowchart LR
    Raw[Raw Data] --> Split[Train/Test Split]
    Split --> Scale["Scale (fit on train)"]
    Scale --> Eng[Feature Engineering]
    Eng --> Train[Train Model]
    Train --> Eval["Evaluate (on test)"]
```
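The housing table mixes numeric and categorical columns, and scikit-learn's `ColumnTransformer` lets one pipeline route each group through the right preprocessing. A sketch (lowercase column names are assumed here, mirroring the example table):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

houses = pd.DataFrame({
    "size":         [1200, 1800, 900, 2200],
    "bedrooms":     [2, 3, 1, 4],
    "age":          [15, 5, 30, 2],
    "garage":       ["No", "Yes", "No", "Yes"],
    "neighborhood": ["Downtown", "Suburbs", "Downtown", "Suburbs"],
})
prices = [240, 350, 150, 450]

preprocess = ColumnTransformer([
    # numeric columns get standardized
    ("num", StandardScaler(), ["size", "bedrooms", "age"]),
    # categorical columns get one-hot encoded
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["garage", "neighborhood"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

pipeline.fit(houses, prices)
predictions = pipeline.predict(houses)
```

Every statistic (means, stds, category lists) is learned inside `fit`, so the same leakage guarantees from the simpler pipeline carry over to mixed-type data.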
Summary
| Concept | What it does | Watch out for |
|---|---|---|
| Train/val/test split | Separates learning data from evaluation data | Don’t peek at the test set |
| Standardization | Centers features to mean 0, std 1 | Fit on train, transform both |
| Normalization | Scales features to [0, 1] | Sensitive to outliers |
| One-hot encoding | Converts categories to numbers | Can create many columns |
| Pipelines | Chains preprocessing + model | Always use them to prevent leakage |
What comes next
With clean, properly scaled data, you’re ready to build your first model. The next article, Linear regression, covers fitting a line to data using both the normal equations and gradient descent, with full derivations and code.