Classification - Predicting Categories
In this series (5 parts)
- Introduction to Machine Learning
- Supervised Learning - Learning from Labeled Data
- Regression - Predicting Continuous Values
- Classification - Predicting Categories
- Unsupervised Learning - Finding Hidden Patterns
In the previous post, we explored regression - predicting continuous values. Now we flip to the other side of supervised learning: classification, where the model predicts a category or class.
What is Classification?
Classification answers the question: “Which one?” or “Is it A or B?”
Given input features, a classification model outputs a discrete label - a category from a predefined set.
flowchart TD
    A["Input Features (Email text, links, sender)"] --> B["Classification Model"]
    B --> C["Category (Spam or Not Spam ✅)"]
    style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
    style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Types of Classification
flowchart TD
    CLS["Classification"] --> BIN["Binary Classification"]
    CLS --> MC["Multi-Class Classification"]
    CLS --> ML["Multi-Label Classification"]
    BIN --> B1["2 classes: Yes/No, Spam/Ham"]
    MC --> M1["3+ classes: Cat/Dog/Bird"]
    ML --> ML1["Multiple labels per input: Genre tags on a movie"]
    style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
    style BIN fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style MC fill:#e8f4fd,stroke:#1a5276,color:#1a5276
    style ML fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
| Type | # of Classes | Example | Output |
|---|---|---|---|
| Binary | 2 | Email spam detection | Spam or Not Spam |
| Multi-class | 3+ | Handwritten digit recognition | 0, 1, 2, … 9 |
| Multi-label | Multiple per input | Movie genre tagging | Action, Comedy, Sci-Fi |
Real-World Classification Problems
| Problem | Input Features | Classes | Type |
|---|---|---|---|
| Email spam filter | Words, links, sender domain | Spam / Ham | Binary |
| Disease diagnosis | Symptoms, lab results, history | Healthy / Sick | Binary |
| Handwriting recognition | Pixel values (28×28 image) | 0–9 digits | Multi-class |
| Sentiment analysis | Review text | Positive / Neutral / Negative | Multi-class |
| Credit risk | Income, debt, credit score | Low / Medium / High | Multi-class |
| Object detection | Image regions | Car, Person, Tree, … | Multi-class |
| Music genre tagging | Audio features | Rock, Pop, Jazz, … | Multi-label |
Logistic Regression: The Starting Point
Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.
The Sigmoid Function
Logistic regression wraps a linear function in the sigmoid function to squeeze the output between 0 and 1:

σ(z) = 1 / (1 + e^(−z))

where z = w·x + b (the same linear combination as in linear regression).
| z value | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very likely class 0 |
| -2 | 0.12 | Probably class 0 |
| 0 | 0.50 | Uncertain (decision boundary) |
| +2 | 0.88 | Probably class 1 |
| +5 | 0.993 | Very likely class 1 |
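The table values can be reproduced with a few lines of Python:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -2, 0, 2, 5):
    print(f"sigma({z:+d}) = {sigmoid(z):.3f}")
```

No matter how extreme `z` gets, the output never leaves (0, 1), which is what lets us read it as a probability.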
Decision Rule

If σ(z) ≥ 0.5, predict class 1; otherwise predict class 0. Since σ(z) = 0.5 exactly when z = 0, this is equivalent to checking the sign of z. The 0.5 threshold can be shifted when false positives and false negatives carry different costs.
Example: Student Exam Pass/Fail
| Hours Studied | Previous Score | Passed? |
|---|---|---|
| 2 | 40 | ❌ Fail |
| 3 | 55 | ❌ Fail |
| 4 | 50 | ❌ Fail |
| 5 | 65 | ✅ Pass |
| 6 | 60 | ✅ Pass |
| 7 | 70 | ✅ Pass |
| 8 | 80 | ✅ Pass |
| 9 | 85 | ✅ Pass |
A trained logistic regression might learn weights like these (illustrative values, not from a real fit):

z = 1.5 × (hours studied) + 0.05 × (previous score) − 10.54

For a student who studied 5.5 hours with a previous score of 60, z = 0.71 and σ(0.71) ≈ 0.67.

67% chance of passing → model predicts: Pass.
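The prediction can be reproduced with a short Python sketch; the weights here are hypothetical values chosen so the arithmetic works out, not output from a real training run:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters (illustrative, not from a real fit)
w_hours, w_score, b = 1.5, 0.05, -10.54

hours, prev_score = 5.5, 60
z = w_hours * hours + w_score * prev_score + b
p_pass = sigmoid(z)

prediction = "Pass" if p_pass >= 0.5 else "Fail"
print(f"z = {z:.2f}, P(pass) = {p_pass:.2f} -> {prediction}")
```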
The Cost Function: Cross-Entropy Loss
For classification, we use cross-entropy loss (also called log loss) instead of MSE:

J(w, b) = −(1/N) Σ [y·log(ŷ) + (1 − y)·log(1 − ŷ)]

where y is the true label (0 or 1) and ŷ = σ(z) is the predicted probability.
Why not MSE? The sigmoid function makes the MSE loss surface non-convex (multiple local minima). Cross-entropy gives us a nice convex surface with a single global minimum.
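A minimal sketch of binary cross-entropy in plain Python (the `eps` clamp is a standard guard against log(0), not part of the formula itself):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy (log loss) over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions give a small loss...
print(cross_entropy([1, 0], [0.9, 0.1]))   # ~0.105
# ...while a confident wrong prediction is punished heavily.
print(cross_entropy([1], [0.1]))           # ~2.303
```

Note how the penalty grows without bound as a confident prediction approaches the wrong label - this is what drives the gradient during training.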
flowchart TD
    A["Input x"] --> B["z = w·x + b"]
    B --> C["ŷ = σ(z)"]
    C --> D["Cross-Entropy Loss"]
    E["True label y"] --> D
    D --> F["Gradient Descent"]
    F --> G["Update w, b"]
    G --> A
    style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
    style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Decision Boundaries
The decision boundary is where the model switches from predicting one class to another. For logistic regression, this is a straight line (or hyperplane in higher dimensions).
flowchart TD
subgraph Linear["Linear Boundary"]
A["Logistic Regression
SVM (linear kernel)"]
end
subgraph NonLinear["Non-Linear Boundary"]
B["Decision Trees
SVM (RBF kernel)
Neural Networks"]
end
style Linear fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style NonLinear fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
The Confusion Matrix
The confusion matrix is the most important evaluation tool for classification. It shows exactly where your model succeeds and fails.
For binary classification:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
Example: Cancer Screening
Out of 1,000 patients (50 actually have cancer):
| | Predicted: Cancer | Predicted: Healthy |
|---|---|---|
| Actual: Cancer | TP = 45 | FN = 5 |
| Actual: Healthy | FP = 30 | TN = 920 |
Classification Metrics
| Metric | Formula | Meaning | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Balanced classes |
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct? | When FP is costly (spam filter) |
| Recall | TP / (TP + FN) | Of actual positives, how many did we catch? | When FN is costly (cancer detection) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision & recall | Imbalanced classes |
| Specificity | TN / (TN + FP) | Of actual negatives, how many correctly identified? | When TN matters |
From Our Cancer Example
| Metric | Value |
|---|---|
| Accuracy | (45 + 920) / 1000 = 96.5% |
| Precision | 45 / (45 + 30) = 60.0% |
| Recall | 45 / (45 + 5) = 90.0% |
| F1 Score | 2 × (0.60 × 0.90) / (0.60 + 0.90) = 72.0% |
Accuracy looks great at 96.5%, but precision is only 60% - 40% of the cancer predictions (30 of 75) are false alarms. This is why accuracy alone is misleading with imbalanced data.
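These numbers are easy to verify directly from the confusion-matrix counts:

```python
# Counts from the cancer-screening confusion matrix above
tp, fn, fp, tn = 45, 5, 30, 920

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.965
print(f"Precision: {precision:.3f}")  # 0.600
print(f"Recall:    {recall:.3f}")     # 0.900
print(f"F1:        {f1:.3f}")         # 0.720
```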
Common Classification Algorithms
flowchart TD
    CLS["Classification Algorithms"] --> LIN["Linear Models"]
    CLS --> TREE["Tree-Based"]
    CLS --> DIST["Distance-Based"]
    CLS --> PROB["Probabilistic"]
    CLS --> NN["Neural Networks"]
    LIN --> LR["Logistic Regression"]
    LIN --> SVM["SVM"]
    TREE --> DT["Decision Tree"]
    TREE --> RF["Random Forest"]
    TREE --> GB["Gradient Boosting"]
    DIST --> KNN["K-Nearest Neighbors"]
    PROB --> NB["Naive Bayes"]
    NN --> MLP["MLP / Deep Learning"]
    style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
Algorithm Comparison
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Logistic Regression | Simple, interpretable, fast | Linear boundaries only | Binary, linearly separable |
| Decision Tree | Intuitive, handles mixed features | Overfits easily | Interpretability needed |
| Random Forest | Robust, handles noise | Slower, less interpretable | General-purpose |
| SVM | Works in high dimensions | Slow on large datasets | Text classification |
| K-Nearest Neighbors | No training needed | Slow at prediction time | Small datasets |
| Naive Bayes | Fast, works with little data | Assumes feature independence | Text, spam filtering |
| Gradient Boosting | State-of-the-art accuracy | Complex to tune | Competitions, production |
| Neural Networks | Learns complex patterns | Needs lots of data | Images, text, audio |
Multi-Class Classification Strategies
When you have more than 2 classes, there are two common strategies:
One-vs-Rest (OvR)
Train one binary classifier per class - each one separates its class from all the others:
| Classifier | Positive Class | Negative Class |
|---|---|---|
| Classifier 1 | Cat | Dog, Bird |
| Classifier 2 | Dog | Cat, Bird |
| Classifier 3 | Bird | Cat, Dog |
Final prediction = class with highest confidence score.
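As a sketch, combining the per-classifier scores comes down to an argmax; the confidence values here are hypothetical:

```python
# Hypothetical confidence scores from three one-vs-rest classifiers,
# each estimating P(input belongs to "its" class vs. everything else)
ovr_scores = {"Cat": 0.81, "Dog": 0.35, "Bird": 0.12}

# Final prediction: the class whose classifier is most confident
prediction = max(ovr_scores, key=ovr_scores.get)
print(prediction)
```

Note the scores need not sum to 1, since each binary classifier is trained independently.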
Softmax (Direct Multi-Class)
Instead of sigmoid, use softmax to output probabilities across all classes:

P(class i) = e^(z_i) / Σ_j e^(z_j)
All probabilities sum to 1. The predicted class has the highest probability.
| Class | Raw Score (z) | Softmax Probability |
|---|---|---|
| Cat | 3.2 | 0.73 |
| Dog | 1.8 | 0.18 |
| Bird | 1.1 | 0.09 |

Prediction: Cat (73% confidence)
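A minimal softmax sketch in Python (subtracting the max score before exponentiating is a standard numerical-stability trick and doesn't change the result):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                              # stability shift
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Cat", "Dog", "Bird"]
probs = softmax([3.2, 1.8, 1.1])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")

best = labels[probs.index(max(probs))]
print(f"Prediction: {best}")
```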
Summary
| Concept | Key Takeaway |
|---|---|
| Classification | Predicts categories, not numbers |
| Binary vs Multi-class | 2 classes vs 3+ classes |
| Logistic regression | Sigmoid + cross-entropy loss |
| Decision boundary | Where the model switches predictions |
| Confusion matrix | TP, FP, TN, FN breakdown |
| Precision vs Recall | Trade-off depending on cost of errors |
| F1 Score | Balanced metric for imbalanced data |
What’s Next?
flowchart LR
    A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
    B --> C["✅ Regression"]
    C --> D["✅ Classification"]
    D --> E["Unsupervised Learning"]
    style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style E fill:#e8f4fd,stroke:#1a5276,color:#1a5276
We’ve now covered both sides of supervised learning. In the next and final post of this series, we’ll explore Unsupervised Learning - what happens when there are no labels, and the model has to find structure on its own.
See you in Part 5.