What is machine learning: a map of the field

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Machine learning is programming where you don’t write the rules. You give the computer data, tell it what the right answers look like, and it figures out the rules on its own. That single idea powers everything from spam filters to self-driving cars.

Traditional programming vs machine learning

In traditional programming, you write explicit rules:

```python
def classify_email(email):
    if "viagra" in email.lower():
        return "spam"
    if "winner" in email.lower() and "click here" in email.lower():
        return "spam"
    return "not spam"
```

You sit down, think about patterns, and encode them by hand. This works fine when you have a small, well-understood set of rules. But what happens when spammers change their wording? You end up updating rules forever.

Machine learning flips this around. Instead of writing rules, you provide examples:

| Email text | Label |
| --- | --- |
| "Buy cheap medicine now" | spam |
| "Meeting at 3pm tomorrow" | not spam |
| "You won a free iPhone" | spam |
| "Quarterly report attached" | not spam |

The ML algorithm looks at these examples and learns a function $f$ that maps email text to labels. When a new email arrives, $f$ produces a prediction without you writing a single if statement.
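The idea can be sketched in a few lines of Python: a toy "classifier" that learns word counts per class from the four labeled emails above and scores new emails against them. This is a deliberately naive sketch to show "learning from examples" rather than hand-written rules; it is nothing like a production spam filter.

```python
# Toy sketch of learning a spam rule from labeled examples instead
# of hand-writing if statements. "Training" just counts which class
# each word appeared in.
from collections import Counter

train = [
    ("buy cheap medicine now", "spam"),
    ("meeting at 3pm tomorrow", "not spam"),
    ("you won a free iphone", "spam"),
    ("quarterly report attached", "not spam"),
]

# Training: count word occurrences per class.
word_counts = {"spam": Counter(), "not spam": Counter()}
for text, label in train:
    word_counts[label].update(text.lower().split())

def classify(text):
    # Prediction: score each class by how many of its training-set
    # words appear in the new email; pick the higher-scoring class.
    words = text.lower().split()
    scores = {
        label: sum(counts[w] for w in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

print(classify("cheap iphone now"))  # flagged, though this exact email was never seen
```

No if statement mentions "cheap" or "iphone"; those associations came entirely from the examples.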

Example 1: learning a simple rule from data

Suppose you have four houses with their sizes and prices:

| Size (sq ft) | Price ($k) |
| --- | --- |
| 800 | 160 |
| 1000 | 200 |
| 1200 | 240 |
| 1500 | 300 |

A traditional programmer might eyeball this and write price = 0.2 * size. An ML algorithm does the same thing, but automatically. It starts with a guess, say $f(x) = w \cdot x$, and adjusts $w$ until predictions match the data.

Let’s check: if $w = 0.2$:

$$f(800) = 0.2 \times 800 = 160 \;✓$$
$$f(1000) = 0.2 \times 1000 = 200 \;✓$$
$$f(1200) = 0.2 \times 1200 = 240 \;✓$$
$$f(1500) = 0.2 \times 1500 = 300 \;✓$$

The algorithm found $w = 0.2$ by minimizing prediction errors across all examples. You never told it the rule. It discovered the rule from data.
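For this one-parameter model, "adjust $w$ until errors shrink" even has a closed-form answer, $w = \sum x y / \sum x^2$, which a few lines of Python can verify on the table above. This is a sketch of the end result, not of the iterative search a real trainer runs:

```python
# Find the w minimizing squared error for f(x) = w * x.
# For this one-parameter model the minimizer has a closed form:
# w = sum(x * y) / sum(x * x).
sizes  = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

w = sum(x * y for x, y in zip(sizes, prices)) / sum(x * x for x in sizes)
print(w)  # 0.2 — exactly the rule a programmer would have eyeballed
```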

Example 2: why rules break down

Now imagine classifying handwritten digits (0 through 9). Try writing if statements for that. What makes a “7” a “7”? A horizontal stroke at the top and a diagonal stroke going down-left? Some people cross their sevens. Some write them with serifs. The number of rules explodes.

ML handles this naturally. You show the algorithm 60,000 labeled images of digits. Each image is a grid of pixel values, say $28 \times 28 = 784$ numbers. The algorithm learns a function:

$$f: \mathbb{R}^{784} \rightarrow \{0, 1, 2, \ldots, 9\}$$

This function maps 784 pixel intensities to one of 10 digit classes. No hand-written rules needed.
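To make "784 pixel intensities" concrete, here is a toy sketch of flattening a 28×28 grid into the input vector a model would see. The grid below is fabricated for illustration, not a real digit image:

```python
# An image is just numbers: flatten a 28x28 grid of pixel
# intensities into the 784-dimensional input vector.
image = [[0] * 28 for _ in range(28)]  # stand-in for a digit image
image[10][14] = 255                    # one bright pixel

x = [pixel for row in image for pixel in row]  # the model's input
print(len(x))  # 784
```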

Where ML sits in the bigger picture

People use “AI,” “machine learning,” and “deep learning” interchangeably, but they are nested concepts:

  • Artificial Intelligence is the broadest term. Any system that performs tasks normally requiring human intelligence. This includes rule-based expert systems, search algorithms, and ML.
  • Machine Learning is a subset of AI. Systems that learn from data instead of being explicitly programmed.
  • Deep Learning is a subset of ML. It uses neural networks with many layers to learn complex patterns.
```mermaid
graph TD
  AI["Artificial Intelligence"] --> ML["Machine Learning"]
  ML --> DL["Deep Learning"]
  AI --> RuleBased["Rule-Based Systems"]
  AI --> Search["Search & Planning"]
  ML --> Classical["Classical ML (trees, SVMs, etc.)"]
```

ML also overlaps heavily with statistics. The difference is mostly cultural. Statisticians tend to focus on inference (understanding relationships in data), while ML practitioners focus on prediction (making accurate guesses on new data). The math is often identical.

The three branches of machine learning

Supervised learning

You give the algorithm input-output pairs. It learns the mapping. This is the most common type of ML in practice.

Two main flavors:

  • Regression: output is a continuous number. Predicting house prices, stock returns, temperature.
  • Classification: output is a category. Spam or not spam, cat or dog, digit 0 through 9.

The training data looks like $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ where $x_i$ is the input (features) and $y_i$ is the label (target).

The algorithm finds a function $\hat{f}$ that minimizes some measure of error between predictions $\hat{f}(x_i)$ and true labels $y_i$.

Common supervised algorithms include linear regression (for continuous targets), logistic regression (for categories), decision trees, support vector machines, and neural networks. We’ll cover each of these in depth throughout this series.

A simple 2D classification task. Points above the dashed line (high reputation, few suspicious words) are “not spam.” Points below (low reputation, many suspicious words) are “spam.” The classifier learns this boundary from labeled examples.

Unsupervised learning

You give the algorithm inputs only, no labels. It finds structure on its own.

Common tasks:

  • Clustering: group similar data points together. Customer segmentation, document grouping.
  • Dimensionality reduction: compress high-dimensional data into fewer dimensions while preserving structure. PCA is the classic example.
  • Density estimation: learn the underlying probability distribution of the data.

No one tells the algorithm “these customers belong together.” It discovers groupings based on patterns in the features.
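Clustering can be sketched with a minimal hand-rolled k-means for $k = 2$. This is a bare-bones illustration that, among other shortcuts, ignores the empty-cluster case; real code would use a library implementation:

```python
# Minimal k-means sketch (k = 2) on 2D points: no labels are given,
# the algorithm discovers the two groups on its own.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to the mean of its points.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters if c
        ]
    return centers, clusters

# Two obvious groups; the algorithm is never told which point belongs where.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # one center near each group
```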

Reinforcement learning

An agent takes actions in an environment, receives rewards or penalties, and learns a strategy (policy) to maximize total reward over time.

Think of it like training a dog. The dog (agent) tries actions (sit, roll over). You give treats (positive reward) or say “no” (negative reward). Over time the dog learns which actions lead to treats.

Formally, at each time step $t$:

  1. The agent observes state $s_t$
  2. Takes action $a_t$
  3. Receives reward $r_t$
  4. Transitions to new state $s_{t+1}$

The goal is to learn a policy $\pi(s) \rightarrow a$ that maximizes the expected cumulative reward:

$$\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

where $\gamma \in [0, 1]$ is a discount factor that makes the agent value immediate rewards more than distant ones.

Reinforcement learning powers game-playing agents (AlphaGo, Atari), robotics, and recommendation systems.
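The observe-act-reward loop can be sketched with tabular Q-learning on a made-up corridor world: four states, two actions, and a reward of 1 for reaching the last state. The environment and all constants here are invented for illustration:

```python
import random

# Tabular Q-learning sketch on a tiny corridor: states 0..3,
# actions 0 = left, 1 = right. Reaching state 3 pays reward 1.
N_STATES, GOAL = 4, 3
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit the current Q, sometimes explore.
        a = random.randrange(2) if random.random() < EPSILON else Q[s].index(max(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: nudge Q toward reward + discounted future value.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

policy = [q.index(max(q)) for q in Q]
print(policy[:GOAL])  # learned: go right from every non-goal state
```

No one tells the agent that "right" is good; it infers that from the rewards alone.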

Key terminology

Before going further, let’s nail down terms you’ll see everywhere:

  • Features ($x$): the input variables. House size, number of bedrooms, zip code.
  • Labels/targets ($y$): the thing you’re predicting. House price, spam/not-spam.
  • Training set: the data you learn from.
  • Test set: data held out to evaluate how well your model generalizes.
  • Model: the function $f$ with its learned parameters.
  • Parameters: the numbers the algorithm adjusts during training (weights $w$, bias $b$).
  • Hyperparameters: settings you choose before training (learning rate, number of layers, regularization strength).
  • Loss function: measures how bad your predictions are. Lower is better. You’ll see MSE for regression and cross-entropy for classification.
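The two loss functions mentioned above can be written out directly. This is a minimal sketch; library versions add probability clipping and other numerical safeguards:

```python
import math

# Two common loss functions, as plain functions. Lower is better.

def mse(y_true, y_pred):
    # Mean squared error: average squared gap, used for regression.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    # Cross-entropy for binary classification: punishes confident
    # wrong probabilities much harder than mildly wrong ones.
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, p_pred)
    ) / len(y_true)

print(mse([160, 200], [150, 210]))  # 100.0: off by 10 on each example
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # small: both guesses confident and right
```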

How a model actually learns

Let’s make the learning process concrete. Consider a linear model $\hat{y} = wx$ for a single feature. The model has one parameter, $w$. Training means finding the $w$ that minimizes the loss.

Start with a random $w$, compute the loss, then ask: “If I increase $w$ slightly, does the loss go up or down?” That’s the gradient. If the gradient is positive, decreasing $w$ reduces the loss. If negative, increasing $w$ helps.

The update rule for gradient descent is:

$$w \leftarrow w - \alpha \cdot \frac{\partial L}{\partial w}$$

where $\alpha$ is the learning rate, a small positive number that controls step size. Too large and you overshoot. Too small and training takes forever.

This loop repeats: predict, compute loss, compute gradient, update. Each pass through the full training set is called an epoch. Most models need hundreds or thousands of epochs to converge.
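Here is that loop for the house-price model, a minimal sketch in which each "epoch" is one gradient step over the full four-example training set. Note the tiny learning rate, needed because the raw feature values are large:

```python
# Gradient descent for f(x) = w * x on the house data:
# predict, compute loss, compute the gradient, update, repeat.
sizes  = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

w = 0.0       # initial guess
alpha = 1e-7  # learning rate, tiny because the x values are large

for epoch in range(200):
    # Gradient of the MSE loss: dL/dw = (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(sizes, prices)) / len(sizes)
    w -= alpha * grad  # step downhill

print(round(w, 4))  # ≈ 0.2, recovered from the data alone
```

Try setting alpha to 1e-6 and the updates diverge; set it to 1e-9 and 200 epochs are nowhere near enough. That sensitivity is why the learning rate is the first hyperparameter people tune.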

The ML workflow at a high level

Every ML project follows roughly the same steps:

  1. Define the problem. What are you predicting? What data do you have?
  2. Collect and prepare data. Clean it, handle missing values, split into train/test.
  3. Choose a model. Start simple (linear regression, logistic regression) and increase complexity as needed.
  4. Train. Feed data to the algorithm, let it adjust parameters.
  5. Evaluate. Check performance on held-out test data.
  6. Iterate. Try different features, models, hyperparameters.
```mermaid
flowchart LR
  A[Define Problem] --> B[Prepare Data]
  B --> C[Choose Model]
  C --> D[Train]
  D --> E[Evaluate]
  E -->|Not good enough| B
  E -->|Good enough| F[Deploy]
```
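The whole loop fits in a toy end-to-end sketch: split some fabricated house data, train the one-parameter model on the training split, and evaluate on held-out examples (steps 1 and 6 are left implicit):

```python
# End-to-end mini version of the workflow: prepare data, train on a
# training split, evaluate on data the model never saw.
data = [(800, 160), (1000, 200), (1200, 240),
        (1500, 300), (900, 180), (1100, 220)]

# 2. Prepare: split into train and test sets.
train, test = data[:4], data[4:]

# 3-4. Choose f(x) = w * x and train via the least-squares closed form.
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# 5. Evaluate: mean squared error on the held-out examples.
test_mse = sum((w * x - y) ** 2 for x, y in test) / len(test)
print(w, test_mse)  # w ≈ 0.2, test error ≈ 0 on this noiseless toy data
```

Real data is noisy, so the test error is never this clean; the train/test discipline is what lets you notice when a model memorizes instead of generalizes.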

What this series covers

This series builds ML from the ground up. Each article includes the math, worked examples, and code. Here’s the roadmap:

  1. What is ML (you are here)
  2. Data, features, and the ML pipeline - how to prepare data properly
  3. Linear regression - your first ML algorithm, solved two ways
  4. Bias and variance - why models fail and how to diagnose it
  5. Regularization - preventing overfitting with Ridge, Lasso, and ElasticNet
  6. Logistic regression - moving from regression to classification

Later articles will cover decision trees, SVMs, neural networks, and more. The math prerequisites are covered in companion series on calculus, linear algebra, and optimization.

What comes next

The next article, Data, features, and the ML pipeline, covers how to prepare your data for ML: splitting datasets, scaling features, and avoiding data leakage. Getting data right is half the battle.
