
But what is Machine Learning?


In this series (18 parts)
  1. But what is Machine Learning?
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Think of machine learning as a way to teach computers without explicitly writing instructions or programs. You give the computer data, tell it what the right answers look like, and it figures out the rules on its own. That single idea powers everything from spam filters to self-driving cars.


Machine learning is a subset of artificial intelligence (AI), focused on algorithms that learn from data. It’s the most popular and successful approach to AI today, but it’s just one piece of the puzzle. To understand where ML fits, we need to look at the broader AI landscape.

[Figure: the AI landscape. Labels: Artificial Intelligence (systems that mimic human intelligence); Machine Learning (learn from data); Deep Learning (neural networks); Generative AI (creates new content, e.g. GPT, DALL·E); NLP (language understanding); ChatGPT; LLMs.]

Traditional programming vs machine learning

Take an example of classifying emails as spam or not spam. In traditional programming, you write explicit rules:

def classify_email(email):
    if "lottery" in email.lower():
        return "spam"
    if "winner" in email.lower() and "click here" in email.lower():
        return "spam"
    return "not spam"

You sit down, think about patterns, and write them by hand. This works fine when you have a small, well-understood set of rules. But what happens when spammers change their wording? You update the rules again and again to catch new tricks. It’s a never-ending cat and mouse game.

Machine learning takes a different approach. Instead of writing rules, you provide labeled examples:

| Email text | Label (target) |
| --- | --- |
| Buy cheap medicine now | spam |
| Meeting at 3pm tomorrow | not spam |
| You won a free iPhone | spam |
| Quarterly report attached | not spam |

Spam detection training examples

The ML algorithm analyzes these examples and learns patterns that distinguish spam from not spam. It might learn that certain words (“cheap,” “free,” “winner”) are strong indicators of spam, while others (“meeting,” “report”) suggest not spam. This eliminates the need for you to write explicit if/else statements. The model generalizes from the examples to make predictions on new, unseen emails.
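
To make the contrast with the hand-written rules concrete, here is a minimal sketch that learns word counts from the four training examples above and uses them to score new emails. It is a deliberately naive scheme chosen for illustration, not how a production spam filter works, and the helper names `train` and `classify_email` are assumptions of this sketch.

```python
from collections import Counter

# The four labeled examples from the table above.
training_examples = [
    ("Buy cheap medicine now", "spam"),
    ("Meeting at 3pm tomorrow", "not spam"),
    ("You won a free iPhone", "spam"),
    ("Quarterly report attached", "not spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify_email(email, counts):
    """Label an email by which class its words were seen in more often."""
    words = email.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["not spam"][w] for w in words)
    # Ties (including emails with only unseen words) default to "not spam".
    return "spam" if spam_score > ham_score else "not spam"

word_counts = train(training_examples)
print(classify_email("Free medicine now", word_counts))
print(classify_email("Report for the meeting tomorrow", word_counts))
```

Notice that no keyword was written into the code by hand. Adding new labeled emails changes the behavior without touching `classify_email`.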

Example 1: learning a simple rule from data

Let’s take a simpler example: predicting house prices based on size. You have data like this:

| Size (sq ft) | Price ($k) (target) |
| --- | --- |
| 800 | 160 |
| 1000 | 200 |
| 1200 | 240 |
| 1500 | 300 |

House prices by size

A traditional programmer might just write price = 0.2 * size. That works here because the relationship is simple and linear: the price is directly proportional to the size. But what happens if the relationship is more complex, say price = 0.1 * size + 50? Or if there are multiple inputs like size, number of bedrooms, and location? Writing rules by hand becomes impractical.

An ML algorithm does the same thing as above, but automatically, by learning patterns from the data. It starts with a guess, say $f(x) = w \cdot x$, and adjusts $w$ until predictions match the data.

Let’s check what happens to the predictions if $w = 0.2$:

$$f(800) = w \cdot x = 0.2 \times 800 = 160 \; \checkmark$$

$$f(1000) = w \cdot x = 0.2 \times 1000 = 200 \; \checkmark$$

$$f(1200) = w \cdot x = 0.2 \times 1200 = 240 \; \checkmark$$

$$f(1500) = w \cdot x = 0.2 \times 1500 = 300 \; \checkmark$$

The parameter $w = 0.2$, also known as the weight, is what the ML algorithm finds by minimizing the error between predictions and actual prices. The algorithm iteratively adjusts $w$ until it finds the best fit for the data. We will see how this works later in this article.
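
As a rough illustration of "adjust the weight until predictions match", the sketch below tries a handful of candidate weights and keeps the one with the smallest total error on the four houses. The candidate grid and the helper name `total_error` are assumptions made for this example. Real algorithms search far more efficiently than this brute-force loop.

```python
# House data from the table above: size in sq ft, price in $k.
sizes = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

def total_error(w):
    """Sum of absolute differences between predicted and actual prices."""
    return sum(abs(w * x - y) for x, y in zip(sizes, prices))

# Hypothetical candidate weights: 0.05, 0.10, ..., 0.40.
candidates = [i / 100 for i in range(5, 45, 5)]

# Keep the candidate whose predictions are closest to the data.
best_w = min(candidates, key=total_error)
print(best_w, total_error(best_w))
```

With this data the search settles on a weight of 0.2, the same value checked by hand above.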

Example 2: number recognition

Now imagine classifying handwritten digits (0 through 9). Think of writing if statements for that. What makes a “7” a “7”? A horizontal stroke at the top and a diagonal stroke going down-left? Some people cross their sevens, some write them with serifs, and everyone has different handwriting. The number of rules you would need is effectively unbounded.

ML handles this naturally. You show the algorithm 60,000 labeled images of digits. Each image is a grid of pixel values, and the label is the digit it represents. The algorithm learns to map pixel patterns to digit classes without you having to write any rules.
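
The idea of mapping pixel grids to labels can be sketched on a toy scale. The example below uses made-up 3x3 binary "images" (real digit datasets use much larger grids and many more examples) and a nearest-neighbor lookup, which is just one simple way to learn from labeled examples. The data and the function names are hypothetical.

```python
# Hypothetical 3x3 "images", flattened row by row (1 = ink, 0 = blank).
labeled_images = [
    ([0, 1, 0,
      0, 1, 0,
      0, 1, 0], "1"),
    ([0, 0, 1,
      0, 0, 1,
      0, 0, 1], "1"),
    ([1, 1, 1,
      0, 0, 1,
      0, 1, 0], "7"),
    ([1, 1, 1,
      0, 1, 0,
      1, 0, 0], "7"),
]

def distance(a, b):
    """Squared Euclidean distance between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict_digit(image, examples):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, nearest_label = min(examples, key=lambda ex: distance(image, ex[0]))
    return nearest_label

# A slightly different "7" and "1" that are not in the labeled set.
new_seven = [1, 1, 0,
             0, 0, 1,
             0, 1, 0]
new_one = [0, 1, 0,
           0, 1, 0,
           0, 1, 1]
print(predict_digit(new_seven, labeled_images))
print(predict_digit(new_one, labeled_images))
```

No rule about strokes or serifs appears anywhere in the code. The behavior comes entirely from the labeled examples.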

Where ML sits in the bigger picture

People use “AI,” “machine learning,” and “deep learning” interchangeably, but they are nested concepts:

  • Artificial Intelligence is the broadest term. AI is any system that performs tasks normally requiring human intelligence. This includes rule-based expert systems, search algorithms, and ML.
  • Machine Learning is a subset of AI. These are systems that learn from data instead of being explicitly programmed.
  • Deep Learning is a subset of ML. It uses neural networks with many layers to learn complex patterns.
graph TD
  AI["Artificial Intelligence"] --> ML["Machine Learning"]
  ML --> DL["Deep Learning"]
  AI --> RuleBased["Rule-Based Systems"]
  AI --> Search["Search & Planning"]
  ML --> Classical["Classical ML (trees, SVMs, etc.)"]

ML also overlaps heavily with statistics. Statisticians tend to focus on inference (understanding relationships in data), while ML practitioners focus on prediction (making accurate guesses on new data). The math is often identical in both fields, but the final goal is different.

The three branches of machine learning

Supervised learning

Supervised learning is a machine learning technique where you provide the algorithm with labeled examples. The model learns to map inputs (features) to outputs (labels) based on this training data. The goal is to make accurate predictions on new, unseen data.

Supervised learning has two main types:

  • Regression: The predicted output is a continuous number, e.g., house prices, stock returns, or temperature.
  • Classification: The predicted output is a category, e.g., spam or not spam, cat or dog, digit 0 through 9.

Below is an interactive example of a regression problem — predicting house prices from size. Hit play to watch the model learn from labeled data:

Labeled data loaded
Each dot is a house with a known size and price — these are our labeled examples. The goal: learn a line that predicts price from size.

You can think of training data for supervised learning as tabular. In the house price example above, the size of the $i^{th}$ house is $x_i$ and its actual price is $y_i$. The same data can also be written as a set of pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the input (features) and $y_i$ is the label (target).

The algorithm finds a function $\hat{f}$ that minimizes some measure of error between the predictions $\hat{f}(x_i)$ and the true labels $y_i$. A true label is the actual correct answer; for the house example, it is the actual price of the house.
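
One common "measure of error" is the mean squared error. The sketch below computes it for two candidate weights on the house data, assuming the simple model "prediction = w * size" used earlier. The function name `mean_squared_error` is chosen for this example.

```python
# Training pairs (x_i, y_i): size in sq ft, actual price in $k.
sizes = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

def mean_squared_error(w, xs, ys):
    """Average of squared differences between predictions w * x and labels y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

loss_good = mean_squared_error(0.2, sizes, prices)  # predictions match the labels
loss_bad = mean_squared_error(0.1, sizes, prices)   # predictions are half the labels
print(loss_good, loss_bad)
```

The weight 0.2 gives an error of essentially zero, while 0.1 gives an error of about 13325, so a learning algorithm comparing the two would prefer 0.2.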

Common supervised algorithms include linear regression (for continuous targets), logistic regression (for categories), decision trees, support vector machines, and neural networks. We’ll cover each of these in depth throughout this series.

Below is a simple example of a classification problem, spam vs not-spam.

  • Each email is represented by two features: the number of suspicious words and the sender’s reputation score.
  • The labels (targets) are “spam” or “not spam.”
  • Points above the dashed line (high reputation, few suspicious words) are “not spam”; points below it (low reputation, many suspicious words) are “spam.” The classifier learns this boundary from labeled examples.
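
Once a boundary like this has been learned, using it is simple. The sketch below shows only the prediction step, with a hand-picked line standing in for the learned one. The slope, the intercept, and the 0 to 1 reputation scale are all assumptions made for illustration.

```python
# Hypothetical boundary: reputation = SLOPE * suspicious_words + INTERCEPT.
# A trained classifier would learn these two numbers from labeled emails.
SLOPE = 0.1
INTERCEPT = 0.2

def classify(suspicious_words, reputation):
    """Emails above the line are 'not spam'; emails on or below it are 'spam'."""
    boundary = SLOPE * suspicious_words + INTERCEPT
    return "not spam" if reputation > boundary else "spam"

print(classify(suspicious_words=1, reputation=0.9))  # trusted sender, few flags
print(classify(suspicious_words=8, reputation=0.3))  # many flags, poor reputation
```

Training a classifier amounts to choosing the slope and intercept so that as many labeled points as possible fall on the correct side.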

Unsupervised learning

Unsupervised learning is a machine learning technique where you provide the algorithm with unlabeled data, e.g., customer purchase history, website click data, or sensor readings. The model tries to find patterns, groupings, or structure in the data without explicit guidance such as “these are high-value customers” or “these are similar documents.” The goal is to discover hidden insights or organize data in a meaningful way.

Interact with the visualization below to see how an unsupervised learning algorithm discovers clusters in data without any labels.

Raw customer data — no labels
Each dot is a customer. We only know how often they visit and how much they spend — no labels telling us who's who.

Common tasks:

  • Clustering: Groups similar data points together, e.g., customer segmentation, document grouping.
  • Dimensionality reduction: Compresses high-dimensional data into fewer dimensions while preserving structure, e.g., combining many correlated features into a few components with PCA.
  • Density estimation: Learns the underlying probability distribution of the data.

The key idea here is: no one tells the algorithm “these customers belong together.” It discovers groupings based on patterns in the features.
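
A minimal clustering sketch makes this concrete. The code below runs a bare-bones k-means loop on made-up customer data (visits per month, monthly spend). The data, the choice of two clusters, and the fixed starting centroids are assumptions of this example. Practical implementations typically pick starting points randomly and scale the features first.

```python
# Hypothetical customers: (visits per month, monthly spend). No labels.
customers = [(2, 20), (3, 25), (2, 30), (12, 200), (14, 220), (13, 210)]

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        centroids = new_centroids
    return assignments, centroids

# Start from two of the data points (an arbitrary choice for reproducibility).
assignments, centroids = kmeans(customers, [customers[0], customers[3]])
print(assignments)
print(centroids)
```

The algorithm separates the occasional low spenders from the frequent high spenders purely from the two features, without ever being told which group a customer belongs to.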

Reinforcement learning

Reinforcement learning (RL) is a machine learning method where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns a strategy (policy) to maximize cumulative reward over time.

Think of it like training a dog. The dog (agent) tries actions (sit, roll over). You give treats (positive reward) or say “no” (negative reward). Over time the dog learns which actions lead to treats.

Formally, at each time step $t$:

  1. The agent observes state $s_t$
  2. Takes action $a_t$
  3. Receives reward $r_t$
  4. Transitions to new state $s_{t+1}$

The goal is to learn a policy $\pi(s) \rightarrow a$ that maximizes the expected cumulative reward:

$$\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

where $\gamma \in [0, 1]$ is a discount factor that makes the agent value immediate rewards more than distant ones.

Do not worry if the above formula looks intimidating. We’ll break it down in future articles. For now, just remember: the agent learns by trial and error, guided by rewards.
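
The sum inside the expectation is easier to read as code. The sketch below computes the discounted return for one episode, using an illustrative reward sequence (a penalty of 1 for each of three steps, then a reward of 10) and a discount factor of 0.9. Both are assumptions chosen for this example.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: three steps costing 1 each, then a reward of 10.
episode_rewards = [-1, -1, -1, 10]

print(discounted_return(episode_rewards, gamma=0.9))  # later rewards count less
print(discounted_return(episode_rewards, gamma=1.0))  # no discounting
```

With a discount factor of 0.9 the final reward of 10 is worth only 7.29 by the time it arrives, so the return is about 4.58 rather than the undiscounted 7. Shorter routes to the reward therefore score higher.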

Reinforcement learning powers game-playing agents (AlphaGo, Atari), robotics, and recommendation systems.

Click “Start Episode” below to see a simple RL agent learn to navigate a grid. The agent starts with no knowledge and learns to reach the goal while avoiding walls, guided only by rewards.

Episode 0 | Untrained agent
A delivery robot must learn the best route from Start to the Package 📦. It has no map — it learns by trial and error, receiving rewards (+10 for delivery, −1 per step, −5 for walls).

Key terminology

Before going further, let’s nail down terms you’ll see everywhere:

  • Features ($x$): the input variables, e.g., house size, number of bedrooms, zip code.
  • Labels/targets ($y$): the output you’re predicting, e.g., house price, spam/not-spam.
  • Training set: the dataset used to train the model.
  • Test set: data held out to evaluate how well your model generalizes.
  • Parameters: the variables that the algorithm adjusts during training (weights $w$, bias $b$).
  • Hyperparameters: settings you choose before training (learning rate, number of layers, regularization strength).
  • Model: a function $f$ with learned parameters.
  • Loss function: measures how bad your predictions are. Lower is better. You’ll see MSE (mean squared error) for regression and cross-entropy for classification.

How a model actually learns

Consider a linear model $\hat{y} = wx$ for a single feature. The model has one parameter, $w$. Training means finding the $w$ that minimizes the loss.

Start with a random $w$, compute the loss, then ask: “If I increase $w$ slightly, does the loss go up or down?” The answer is given by the gradient, the slope of the loss with respect to $w$. If the gradient is positive, decreasing $w$ reduces the loss; if it is negative, increasing $w$ helps.

Adjust the learning rate in the visualization below to see how the model converges to the optimal $w$ that minimizes the loss. To learn more about how this works, check out the article on gradient descent.

Random starting point
The curve shows Loss vs weight (w). The ball starts at a random position. Each step, we compute the gradient (slope) and move downhill toward the minimum.

The update rule for gradient descent is:

$$w \leftarrow w - \alpha \cdot \frac{\partial L}{\partial w}$$

where $\alpha$ is the learning rate, a small positive number that controls the step size (how much the parameters are adjusted in each update). If it is too small, the model learns slowly; if it is too large, the updates overshoot the minimum and can diverge.

This loop repeats: predict, compute loss, compute gradient, update. Each pass through the full training set is called an epoch. Depending on the model, the data, and the learning rate, convergence can take anywhere from a few epochs to thousands.
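
The loop can be sketched in a few lines for the one-parameter model on the house data from earlier, using mean squared error as the loss. The starting weight of 0.0 (instead of a random one, for reproducibility), the learning rates, and the epoch counts are assumptions of this example. The learning rate is tiny only because house sizes are in the hundreds and thousands, which makes the gradient very large.

```python
# House data: size in sq ft, price in $k.
sizes = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

def gradient(w):
    """Derivative of the MSE loss (1/n) * sum((w*x - y)^2) with respect to w."""
    n = len(sizes)
    return (2 / n) * sum((w * x - y) * x for x, y in zip(sizes, prices))

def train(alpha, epochs, w=0.0):
    """Repeat the update w <- w - alpha * dL/dw, one update per epoch."""
    for _ in range(epochs):
        w = w - alpha * gradient(w)
    return w

w_learned = train(alpha=3e-7, epochs=50)   # small enough step: converges
w_diverged = train(alpha=1e-6, epochs=20)  # step too large: overshoots and blows up
print(w_learned, w_diverged)
```

With the smaller learning rate the weight settles at roughly 0.2, the value verified by hand earlier. With the larger one each update overshoots the minimum by more than the previous error, so the weight moves further away on every step.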

The ML workflow at a high level

Every ML project follows roughly the same steps:

  1. Define the problem. What are you predicting? What data do you have?
  2. Collect and prepare data. Clean it, handle missing values, split into train/test.
  3. Choose a model. Start simple (linear regression, logistic regression) and increase complexity as needed.
  4. Train. Feed data to the algorithm, let it adjust parameters.
  5. Evaluate. Check performance on held-out test data.
  6. Iterate. Try different features, models, hyperparameters.
flowchart LR
  A[Define Problem] --> B[Prepare Data]
  B --> C[Choose Model]
  C --> D[Train]
  D --> E[Evaluate]
  E -->|Not good enough| B
  E -->|Good enough| F[Deploy]
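
The steps above can be sketched end to end on a small scale. The example below generates synthetic house data (an assumption, since no real dataset is given here), holds out a test set, fits a one-weight model using the closed-form least-squares formula for a line through the origin, and evaluates it on the held-out data. The 80/20 split and the noise level are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Steps 1-2: predict price ($k) from size (sq ft); synthetic data with small noise.
sizes = [rng.randint(600, 2000) for _ in range(40)]
data = [(x, 0.2 * x + rng.uniform(-2, 2)) for x in sizes]
rng.shuffle(data)
train_set, test_set = data[:32], data[32:]  # 80% train, 20% test

# Step 3: choose a simple model, y_hat = w * x.
# Step 4: train. For this model, least squares gives w = sum(x*y) / sum(x*x).
w = sum(x * y for x, y in train_set) / sum(x * x for x, _ in train_set)

# Step 5: evaluate on data the model has never seen.
test_mse = sum((w * x - y) ** 2 for x, y in test_set) / len(test_set)

print(f"learned w = {w:.4f}, test MSE = {test_mse:.2f}")
```

Step 6 would start from here: if the test error were too high, you would go back and try more features, a different model, or different hyperparameters.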

What this series covers

This series builds ML from the ground up. Each article includes the math, worked examples, and code. Here’s the roadmap:

  1. What is ML (you are here)
  2. Data, features, and the ML pipeline - how to prepare data properly
  3. Linear regression - your first ML algorithm, solved two ways
  4. Bias and variance - why models fail and how to diagnose it
  5. Regularization - preventing overfitting with Ridge, Lasso, and ElasticNet
  6. Logistic regression - moving from regression to classification

Later articles will cover decision trees, SVMs, neural networks, and more. The math prerequisites are covered in companion series on calculus, linear algebra, and optimization.

What comes next

The next article, Data, features, and the ML pipeline, covers how to prepare your data for ML: splitting datasets, scaling features, and avoiding data leakage. Getting data right is half the battle.
