But what is Machine Learning?
In this series (18 parts)
- But what is Machine Learning?
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Think of machine learning as a way to teach computers without explicitly writing instructions. You give the computer data, tell it what the right answers look like, and it figures out the rules on its own. That single idea powers everything from spam filters to self-driving cars.
Machine learning is a subset of artificial intelligence (AI), focused on algorithms that learn from data. It’s the most popular and successful approach to AI today, but it’s just one piece of the puzzle. To understand where ML fits, we need to look at the broader AI landscape.
Traditional programming vs machine learning
Take an example of classifying emails as spam or not spam. In traditional programming, you write explicit rules:
```python
def classify_email(email):
    text = email.lower()  # normalize case once so the checks are case-insensitive
    if "lottery" in text:
        return "spam"
    if "winner" in text and "click here" in text:
        return "spam"
    return "not spam"
```
You sit down, think about patterns, and write them by hand. This works fine when you have a small, well-understood set of rules. But what happens when spammers change their wording? You update the rules again and again to catch new tricks. It’s a never-ending cat and mouse game.
Machine learning takes a different approach. Instead of writing rules, you provide labeled examples:
| Email text | Label (target) |
|---|---|
| Buy cheap medicine now | spam |
| Meeting at 3pm tomorrow | not spam |
| You won a free iPhone | spam |
| Quarterly report attached | not spam |
Spam detection training examples
The ML algorithm analyzes these examples and learns patterns that distinguish spam from not spam. It might learn that certain words (“cheap,” “free,” “winner”) are strong indicators of spam, while others (“meeting,” “report”) suggest not spam. You no longer need to write explicit if/else statements: the model generalizes from the examples to make predictions on new, unseen emails.
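To make the idea of learning from examples concrete, here is a minimal sketch that uses the four emails from the table. It is not how production spam filters work: it simply counts which words appear under each label and scores new text by word overlap. The function names `train` and `classify` are illustrative choices, not part of any library.

```python
from collections import Counter

# Training examples copied from the table above.
examples = [
    ("Buy cheap medicine now", "spam"),
    ("Meeting at 3pm tomorrow", "not spam"),
    ("You won a free iPhone", "spam"),
    ("Quarterly report attached", "not spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["not spam"][w] for w in words)
    # Ties (including text with no known words) fall back to "not spam".
    return "spam" if spam_score > ham_score else "not spam"

counts = train(examples)
prediction_a = classify(counts, "Free medicine now")
prediction_b = classify(counts, "Report for tomorrow meeting")
```

Nothing in `classify` mentions "cheap" or "free" explicitly. Those associations come entirely from the examples, so adding new labeled emails changes the behavior without touching the code.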
Example 1: learning a simple rule from data
Let’s take a simpler example: predicting house prices based on size. You have data like this:
| Size (sq ft) | Price ($k, target) |
|---|---|
| 800 | 160 |
| 1000 | 200 |
| 1200 | 240 |
| 1500 | 300 |
House prices by size
A traditional programmer might just write price = 0.2 * size, because in this table the price is directly proportional to the size. But what happens if the relationship is more complex, say price = 0.1 * size + 50? Or if there are multiple inputs such as size, number of bedrooms, and location? Writing rules by hand quickly becomes impractical.
An ML algorithm arrives at the same rule automatically, by learning it from the data. It models the price as $\text{price} = w \times \text{size}$, starts with a guess for $w$, and adjusts it until the predictions match the data.
Comparing predictions with the table shows how good a guess is: any $w$ below 0.2 underestimates every price, any $w$ above 0.2 overestimates every price, and $w = 0.2$ reproduces the table exactly.
The parameter $w$, also known as the weight, is chosen by the ML algorithm to minimize the error between predictions and actual prices. The algorithm iteratively adjusts $w$ until it finds the best fit for the data. The section “How a model actually learns” later in this article explains how.
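As a rough illustration of "minimizing the error", the sketch below scores a few candidate weights against the house table using mean squared error. The candidate values are arbitrary choices made for this example, and a real algorithm would search continuously rather than try a fixed list.

```python
# House data from the table above: size in sq ft, price in $k.
sizes = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

def mean_squared_error(w):
    """Average squared gap between predicted (w * size) and actual prices."""
    return sum((w * x - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)

# Illustrative candidate weights (an assumption, not from the article).
candidates = [0.10, 0.15, 0.20, 0.25]
errors = {w: mean_squared_error(w) for w in candidates}

# The best candidate is the one with the smallest error.
best_w = min(errors, key=errors.get)

for w in candidates:
    print(f"w = {w:.2f}  error = {errors[w]:.1f}")
```

The error shrinks as the candidate approaches 0.2 and grows again past it, which is exactly the signal a learning algorithm follows.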
Example 2: number recognition
Now imagine classifying handwritten digits (0 through 9), and try writing if statements for that. What makes a “7” a “7”? A horizontal stroke at the top and a diagonal stroke going down-left? Some people cross their sevens, some write them with serifs, and everyone’s handwriting is different. The number of rules you would need is effectively unbounded.
ML handles this naturally. You show the algorithm 60,000 labeled images of digits. Each image is a grid of pixel values, and the label is the digit it represents. The algorithm learns to map pixel patterns to digit classes without you having to write any rules.
Where ML sits in the bigger picture
People use “AI,” “machine learning,” and “deep learning” interchangeably, but they are nested concepts:
- Artificial Intelligence is the broadest term. AI is any system that performs tasks normally requiring human intelligence. This includes rule-based expert systems, search algorithms, and ML.
- Machine Learning is a subset of AI. These are systems that learn from data instead of being explicitly programmed.
- Deep Learning is a subset of ML. It uses neural networks with many layers to learn complex patterns.
```mermaid
graph TD
    AI["Artificial Intelligence"] --> ML["Machine Learning"]
    ML --> DL["Deep Learning"]
    AI --> RuleBased["Rule-Based Systems"]
    AI --> Search["Search & Planning"]
    ML --> Classical["Classical ML (trees, SVMs, etc.)"]
```
ML also overlaps heavily with statistics. Statisticians tend to focus on inference (understanding relationships in data), while ML practitioners focus on prediction (making accurate guesses on new data). The math is often identical in both fields, but the final goal is different.
The three branches of machine learning
Supervised learning
Supervised learning is a machine learning technique where you provide the algorithm with labeled examples. The model learns to map inputs (features) to outputs (labels) based on this training data. The goal is to make accurate predictions on new, unseen data.
Supervised learning has two main types:
- Regression: the predicted output is a continuous number, e.g., house prices, stock returns, or temperature.
- Classification: the predicted output is a category, e.g., spam or not spam, cat or dog, digit 0 through 9.
[Interactive example: a regression model learning to predict house prices from size, using labeled data.]
You can think of training data for supervised learning as tabular (see the house price example above). House size can be represented as $x$ and the price to be predicted as $y$. The whole training set can then be written as pairs $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is the input (features) and $y_i$ is the label (target).
The algorithm finds a function $f$ that minimizes some measure of error between its predictions $f(x_i)$ and the true labels $y_i$ (the actual correct answers; in the house example, the actual price of each house).
Common supervised algorithms include linear regression (for continuous targets), logistic regression (for categories), decision trees, support vector machines, and neural networks. We’ll cover each of these in depth throughout this series.
Below is a simple example of a classification problem: spam vs. not spam.
- Each email is represented by two features: the number of suspicious words and the sender’s reputation score.
- The labels (targets) are “spam” or “not spam.”
- Points above the dashed line (high reputation, few suspicious words) are “not spam”; points below it (low reputation, many suspicious words) are “spam.”
- The classifier learns this boundary from labeled examples.
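One simple way to learn such a boundary is a nearest-centroid classifier, sketched below as a stand-in; the article does not say which classifier the visualization uses. All data points are hypothetical values invented for this example, with each email described by (suspicious word count, sender reputation).

```python
# Hypothetical labeled emails: (suspicious_word_count, sender_reputation) -> label.
training = [
    ((8, 0.20), "spam"),
    ((6, 0.10), "spam"),
    ((9, 0.30), "spam"),
    ((1, 0.90), "not spam"),
    ((0, 0.80), "not spam"),
    ((2, 0.70), "not spam"),
]

def fit_centroids(training):
    """Average the feature vectors of each class."""
    sums = {}
    for (words, reputation), label in training:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += words
        total[1] += reputation
        total[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict(centroids, point):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    def distance(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=distance)

centroids = fit_centroids(training)
label_a = predict(centroids, (7, 0.15))   # many suspicious words, low reputation
label_b = predict(centroids, (1, 0.85))   # few suspicious words, high reputation
```

The implied decision boundary is the straight line halfway between the two centroids. Because word counts span a much larger range than reputation scores here, they dominate the distance, which is one reason feature scaling matters in practice.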
Unsupervised learning
Unsupervised learning is a machine learning technique where you provide the algorithm with unlabeled data, e.g., customer purchase history, website click data, or sensor readings. The model tries to find patterns, groupings, or structure in the data without explicit guidance: nobody tells it “these are high-value customers” or “these are similar documents.” The goal is to discover hidden structure or organize the data in a meaningful way.
[Interactive visualization: an unsupervised learning algorithm discovering clusters in data without any labels.]
Common tasks:
- Clustering: groups similar data points together, e.g., customer segmentation, document grouping.
- Dimensionality reduction: compresses high-dimensional data into fewer dimensions while preserving structure, e.g., combining many correlated features into a few summary features.
- Density estimation: learns the underlying probability distribution of the data.
The key idea here is: no one tells the algorithm “these customers belong together.” It discovers groupings based on patterns in the features.
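The sketch below shows clustering in its simplest form: k-means on one-dimensional data with two clusters. The spending figures and the starting centroids are hypothetical, chosen only to keep the example small and deterministic.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; note there are no labels.
spend = [12, 15, 14, 95, 102, 99, 13, 101]
centroids, clusters = kmeans_1d(spend, centroids=[12.0, 15.0])
```

Even though both starting centroids sit among the low spenders, the loop separates the customers into a low-spend group and a high-spend group after a couple of iterations, without ever being told which customer belongs where.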
Reinforcement learning
Reinforcement learning (RL) is a machine learning method where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns a strategy (policy) to maximize cumulative reward over time.
Think of it like training a dog. The dog (agent) tries actions (sit, roll over). You give treats (positive reward) or say “no” (negative reward). Over time the dog learns which actions lead to treats.
Formally, at each time step $t$:
- The agent observes state $s_t$
- Takes action $a_t$
- Receives reward $r_t$
- Transitions to new state $s_{t+1}$
The goal is to learn a policy $\pi$ that maximizes the expected cumulative reward:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$$
where $\gamma \in [0, 1)$ is a discount factor that makes the agent value immediate rewards more than distant ones.
Do not worry if the above formula looks intimidating. We’ll break it down in future articles. For now, just remember: the agent learns by trial and error, guided by rewards.
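For a single finished episode, the discounted cumulative reward is just a weighted sum, which the short sketch below computes. The reward sequences are made-up examples, and `discounted_return` is an illustrative helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over the time steps of one episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical episodes: the same reward of 1 arrives immediately or two steps later.
reward_now = discounted_return([1, 0, 0], gamma=0.5)
reward_later = discounted_return([0, 0, 1], gamma=0.5)
```

With a discount factor of 0.5, the delayed reward is worth a quarter of the immediate one, which is what "valuing immediate rewards more" means in practice.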
Reinforcement learning powers game-playing agents (AlphaGo, Atari), robotics, and recommendation systems.
[Interactive demo: a simple RL agent learning to navigate a grid. The agent starts with no knowledge and learns to reach the goal while avoiding walls, guided only by rewards.]
Key terminology
Before going further, let’s nail down terms you’ll see everywhere:
- Features ($x$): the input variables, e.g., house size, number of bedrooms, zip code.
- Labels/targets ($y$): the output you’re predicting, e.g., house price, spam/not-spam.
- Training set: the dataset used to train the model.
- Test set: data held out to evaluate how well your model generalizes.
- Parameters: the variables that the algorithm adjusts during training (weights $w$, bias $b$).
- Hyperparameters: settings you choose before training (learning rate, number of layers, regularization strength).
- Model: a function that maps features to predictions using the learned parameters.
- Loss function: measures how bad your predictions are. Lower is better. You’ll see MSE for regression and cross-entropy for classification.
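The two loss functions named in the list can be written in a few lines. The sketch below is a plain-Python version for illustration, with made-up labels and predictions; libraries provide more robust implementations.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Average negative log-likelihood for 0/1 labels (classification).

    Each predicted probability must lie strictly between 0 and 1.
    """
    total = 0.0
    for t, p in zip(y_true, p_pred):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Regression: each prediction is off by 10, so the squared error averages 100.
mse = mean_squared_error([160, 200], [150, 210])

# Classification: confident correct predictions vs. confident wrong ones.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Cross-entropy punishes confident mistakes much more heavily than it rewards confident correct answers, which pushes a classifier toward well-calibrated probabilities.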
How a model actually learns
Consider a linear model $\hat{y} = w x$ for a single feature. The model has one parameter, $w$. Training means finding the $w$ that minimizes the loss $L(w)$.
Start with a random $w$, compute the loss, then ask: “If I increase $w$ slightly, does the loss go up or down?” The answer is given by the gradient $\frac{\partial L}{\partial w}$. If the gradient is positive, decreasing $w$ reduces the loss; if it is negative, increasing $w$ helps.
[Interactive visualization: adjust the learning rate to see how the model converges to the optimal $w$ that minimizes the loss.] To learn more about how this works, check out the article on gradient descent.
The update rule for gradient descent is:
$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$
where $\eta$ is the learning rate, a small positive number that controls the step size (how much the parameters are adjusted in each update). Too small a value means the model learns slowly; too large a value means it overshoots the minimum and may diverge.
This loop repeats: predict, compute loss, compute gradient, update. Each pass through the full training set is called an epoch. Depending on the model and the data, convergence can take anywhere from a few epochs to thousands.
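Here is a minimal sketch of that loop for the house price data, assuming a mean squared error loss and the one-parameter model price = w * size. The starting weight, learning rate, and epoch count are choices made for this example.

```python
# House data from earlier in the article: size in sq ft, price in $k.
sizes = [800, 1000, 1200, 1500]
prices = [160, 200, 240, 300]

def loss(w):
    """Mean squared error of the predictions w * size."""
    return sum((w * x - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)

def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(sizes, prices)) / len(sizes)

w = 0.0                 # initial guess
learning_rate = 1e-7    # tiny because the sizes are unscaled (hundreds to thousands)
history = []

for epoch in range(100):
    history.append(loss(w))              # predict and compute loss
    w -= learning_rate * gradient(w)     # compute gradient and update

print(f"learned w = {w:.4f}")
```

The learning rate looks unusually small because the raw sizes make the gradient very large. For this data, a rate above roughly 7.5e-7 would make every update overshoot by more than it corrects, so the loss would grow instead of shrink. Scaling the features first is the usual way to avoid such awkward values.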
The ML workflow at a high level
Every ML project follows roughly the same steps:
- Define the problem. What are you predicting? What data do you have?
- Collect and prepare data. Clean it, handle missing values, split into train/test.
- Choose a model. Start simple (linear regression, logistic regression) and increase complexity as needed.
- Train. Feed data to the algorithm, let it adjust parameters.
- Evaluate. Check performance on held-out test data.
- Iterate. Try different features, models, hyperparameters.
```mermaid
flowchart LR
    A[Define Problem] --> B[Prepare Data]
    B --> C[Choose Model]
    C --> D[Train]
    D --> E[Evaluate]
    E -->|Not good enough| B
    E -->|Good enough| F[Deploy]
```
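The workflow can be sketched end to end on the house price problem. In the code below, the first four rows come from the table earlier in this article and the last four are hypothetical values added so there is something to hold out. Training uses the closed-form least-squares solution for price = w * size, a stand-in for whatever training procedure a real project would use.

```python
# Prepare data: (size in sq ft, price in $k). Last four rows are hypothetical.
data = [
    (800, 160), (1000, 200), (1200, 240), (1500, 300),
    (900, 185), (1100, 215), (1300, 262), (1400, 278),
]
train, test = data[:6], data[6:]   # hold out the last two rows for evaluation

def fit(rows):
    """Train: least-squares weight for price = w * size (no intercept)."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def mean_absolute_error(w, rows):
    """Evaluate: average absolute gap between predicted and actual price."""
    return sum(abs(w * x - y) for x, y in rows) / len(rows)

w = fit(train)
train_mae = mean_absolute_error(w, train)
test_mae = mean_absolute_error(w, test)

print(f"w = {w:.4f}, train MAE = {train_mae:.2f}, test MAE = {test_mae:.2f}")
```

The number that matters is the error on the held-out rows, since the model never saw them during training. If it were much worse than the training error, that would be the cue to iterate on the data or the model.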
What this series covers
This series builds ML from the ground up. Each article includes the math, worked examples, and code. Here’s the roadmap:
- What is ML (you are here)
- Data, features, and the ML pipeline - how to prepare data properly
- Linear regression - your first ML algorithm, solved two ways
- Bias and variance - why models fail and how to diagnose it
- Regularization - preventing overfitting with Ridge, Lasso, and ElasticNet
- Logistic regression - moving from regression to classification
Later articles will cover decision trees, SVMs, neural networks, and more. The math prerequisites are covered in companion series on calculus, linear algebra, and optimization.
What comes next
The next article, Data, features, and the ML pipeline, covers how to prepare your data for ML: splitting datasets, scaling features, and avoiding data leakage. Getting data right is half the battle.