Classification - Predicting Categories
In this series (5 parts)
- Introduction to Machine Learning
- Supervised Learning - Learning from Labeled Data
- Regression - Predicting Continuous Values
- Classification - Predicting Categories
- Unsupervised Learning - Finding Hidden Patterns
In the previous post, we explored regression - predicting continuous values. Now we flip to the other side of supervised learning: classification, where the model predicts a category or class.
What is Classification?
Classification answers the question: “Which one?” or “Is it A or B?”
Given input features, a classification model outputs a discrete label - a category from a predefined set.
flowchart TD
    A["Input Features (Email text, links, sender)"] --> B["Classification Model"]
    B --> C["Category (Spam or Not Spam ✅)"]
    style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
    style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Types of Classification
flowchart TD
    CLS["Classification"] --> BIN["Binary Classification"]
    CLS --> MC["Multi-Class Classification"]
    CLS --> ML["Multi-Label Classification"]
    BIN --> B1["2 classes: Yes/No, Spam/Ham"]
    MC --> M1["3+ classes: Cat/Dog/Bird"]
    ML --> ML1["Multiple labels per input: Genre tags on a movie"]
    style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
    style BIN fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style MC fill:#e8f4fd,stroke:#1a5276,color:#1a5276
    style ML fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
| Type | # of Classes | Example | Output |
|---|---|---|---|
| Binary | 2 | Email spam detection | Spam or Not Spam |
| Multi-class | 3+ | Handwritten digit recognition | 0, 1, 2, … 9 |
| Multi-label | Multiple per input | Movie genre tagging | Action, Comedy, Sci-Fi |
Real-World Classification Problems
| Problem | Input Features | Classes | Type |
|---|---|---|---|
| Email spam filter | Words, links, sender domain | Spam / Ham | Binary |
| Disease diagnosis | Symptoms, lab results, history | Healthy / Sick | Binary |
| Handwriting recognition | Pixel values (28×28 image) | 0–9 digits | Multi-class |
| Sentiment analysis | Review text | Positive / Neutral / Negative | Multi-class |
| Credit risk | Income, debt, credit score | Low / Medium / High | Multi-class |
| Object detection | Image regions | Car, Person, Tree, … | Multi-class |
| Music genre tagging | Audio features | Rock, Pop, Jazz, … | Multi-label |
Logistic Regression: The Starting Point
Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.
The Sigmoid Function
Logistic regression wraps a linear function in the sigmoid function to squeeze the output between 0 and 1:

σ(z) = 1 / (1 + e^(−z))

where z = w·x + b (the same linear combination as in linear regression).
| z value | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very likely class 0 |
| -2 | 0.12 | Probably class 0 |
| 0 | 0.50 | Uncertain (decision boundary) |
| +2 | 0.88 | Probably class 1 |
| +5 | 0.993 | Very likely class 1 |
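The table values can be reproduced with a few lines of Python:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -2, 0, 2, 5):
    print(f"sigma({z:+d}) = {sigmoid(z):.3f}")
```

No matter how extreme `z` gets, the output never leaves (0, 1), which is what lets us read it as a probability.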
Decision Rule

If σ(z) ≥ 0.5, predict class 1; otherwise predict class 0. Since σ(z) = 0.5 exactly when z = 0, this is equivalent to checking the sign of z. The 0.5 threshold can be shifted when false positives and false negatives carry different costs.
Example: Student Exam Pass/Fail
| Hours Studied | Previous Score | Passed? |
|---|---|---|
| 2 | 40 | ❌ Fail |
| 3 | 55 | ❌ Fail |
| 4 | 50 | ❌ Fail |
| 5 | 65 | ✅ Pass |
| 6 | 60 | ✅ Pass |
| 7 | 70 | ✅ Pass |
| 8 | 80 | ✅ Pass |
| 9 | 85 | ✅ Pass |
A trained logistic regression might learn weights like these (illustrative values, not from a real fit):

z = 1.5 × (hours studied) + 0.05 × (previous score) − 10.54

For a student who studied 5.5 hours with a previous score of 60, z = 0.71 and σ(0.71) ≈ 0.67.

67% chance of passing → model predicts: Pass.
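The prediction can be reproduced with a short Python sketch; the weights here are hypothetical values chosen so the arithmetic works out, not output from a real training run:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters (illustrative, not from a real fit)
w_hours, w_score, b = 1.5, 0.05, -10.54

hours, prev_score = 5.5, 60
z = w_hours * hours + w_score * prev_score + b
p_pass = sigmoid(z)

prediction = "Pass" if p_pass >= 0.5 else "Fail"
print(f"z = {z:.2f}, P(pass) = {p_pass:.2f} -> {prediction}")
```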
The Cost Function: Cross-Entropy Loss
For classification, we use cross-entropy loss (also called log loss) instead of MSE:

J(w, b) = −(1/N) Σ [y·log(ŷ) + (1 − y)·log(1 − ŷ)]

where y is the true label (0 or 1) and ŷ = σ(z) is the predicted probability.
Why not MSE? The sigmoid function makes the MSE loss surface non-convex (multiple local minima). Cross-entropy gives us a nice convex surface with a single global minimum.
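A minimal sketch of binary cross-entropy in plain Python (the `eps` clamp is a standard guard against log(0), not part of the formula itself):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy (log loss) over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions give a small loss...
print(cross_entropy([1, 0], [0.9, 0.1]))   # ~0.105
# ...while a confident wrong prediction is punished heavily.
print(cross_entropy([1], [0.1]))           # ~2.303
```

Note how the penalty grows without bound as a confident prediction approaches the wrong label - this is what drives the gradient during training.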
flowchart TD
    A["Input x"] --> B["z = w·x + b"]
    B --> C["ŷ = σ(z)"]
    C --> D["Cross-Entropy Loss"]
    E["True label y"] --> D
    D --> F["Gradient Descent"]
    F --> G["Update w, b"]
    G --> A
    style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
    style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Decision Boundaries
The decision boundary is where the model switches from predicting one class to another. For logistic regression, this is a straight line (or hyperplane in higher dimensions).
flowchart TD
subgraph Linear["Linear Boundary"]
A["Logistic Regression
SVM (linear kernel)"]
end
subgraph NonLinear["Non-Linear Boundary"]
B["Decision Trees
SVM (RBF kernel)
Neural Networks"]
end
style Linear fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style NonLinear fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
The Confusion Matrix
The confusion matrix is the most important evaluation tool for classification. It shows exactly where your model succeeds and fails.
For binary classification:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
Example: Cancer Screening
Out of 1,000 patients (50 actually have cancer):
| | Predicted: Cancer | Predicted: Healthy |
|---|---|---|
| Actual: Cancer | TP = 45 | FN = 5 |
| Actual: Healthy | FP = 30 | TN = 920 |
Classification Metrics
| Metric | Formula | Meaning | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Balanced classes |
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct? | When FP is costly (spam filter) |
| Recall | TP / (TP + FN) | Of actual positives, how many did we catch? | When FN is costly (cancer detection) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision & recall | Imbalanced classes |
| Specificity | TN / (TN + FP) | Of actual negatives, how many correctly identified? | When TN matters |
From Our Cancer Example
| Metric | Value |
|---|---|
| Accuracy | (45 + 920) / 1000 = 96.5% |
| Precision | 45 / (45 + 30) = 60.0% |
| Recall | 45 / (45 + 5) = 90.0% |
| F1 Score | 2 × (0.60 × 0.90) / (0.60 + 0.90) = 72.0% |
Accuracy looks great at 96.5%, but precision is only 60% - 40% of the cancer predictions (30 of 75) are false alarms. This is why accuracy alone is misleading with imbalanced data.
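These numbers are easy to verify directly from the confusion-matrix counts:

```python
# Counts from the cancer-screening confusion matrix above
tp, fn, fp, tn = 45, 5, 30, 920

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.965
print(f"Precision: {precision:.3f}")  # 0.600
print(f"Recall:    {recall:.3f}")     # 0.900
print(f"F1:        {f1:.3f}")         # 0.720
```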
Common Classification Algorithms
flowchart TD
    CLS["Classification Algorithms"] --> LIN["Linear Models"]
    CLS --> TREE["Tree-Based"]
    CLS --> DIST["Distance-Based"]
    CLS --> PROB["Probabilistic"]
    CLS --> NN["Neural Networks"]
    LIN --> LR["Logistic Regression"]
    LIN --> SVM["SVM"]
    TREE --> DT["Decision Tree"]
    TREE --> RF["Random Forest"]
    TREE --> GB["Gradient Boosting"]
    DIST --> KNN["K-Nearest Neighbors"]
    PROB --> NB["Naive Bayes"]
    NN --> MLP["MLP / Deep Learning"]
    style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
Algorithm Comparison
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Logistic Regression | Simple, interpretable, fast | Linear boundaries only | Binary, linearly separable |
| Decision Tree | Intuitive, handles mixed features | Overfits easily | Interpretability needed |
| Random Forest | Robust, handles noise | Slower, less interpretable | General-purpose |
| SVM | Works in high dimensions | Slow on large datasets | Text classification |
| K-Nearest Neighbors | No training needed | Slow at prediction time | Small datasets |
| Naive Bayes | Fast, works with little data | Assumes feature independence | Text, spam filtering |
| Gradient Boosting | State-of-the-art accuracy | Complex to tune | Competitions, production |
| Neural Networks | Learns complex patterns | Needs lots of data | Images, text, audio |
Multi-Class Classification Strategies
When you have more than 2 classes, there are two common strategies:
One-vs-Rest (OvR)
Train one binary classifier per class - each one separates its class from all the others:
| Classifier | Positive Class | Negative Class |
|---|---|---|
| Classifier 1 | Cat | Dog, Bird |
| Classifier 2 | Dog | Cat, Bird |
| Classifier 3 | Bird | Cat, Dog |
Final prediction = class with highest confidence score.
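As a sketch, combining the per-classifier scores comes down to an argmax; the confidence values here are hypothetical:

```python
# Hypothetical confidence scores from three one-vs-rest classifiers,
# each estimating P(input belongs to "its" class vs. everything else)
ovr_scores = {"Cat": 0.81, "Dog": 0.35, "Bird": 0.12}

# Final prediction: the class whose classifier is most confident
prediction = max(ovr_scores, key=ovr_scores.get)
print(prediction)
```

Note the scores need not sum to 1, since each binary classifier is trained independently.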
Softmax (Direct Multi-Class)
Instead of sigmoid, use softmax to output probabilities across all classes:

P(class i) = e^(z_i) / Σ_j e^(z_j)
All probabilities sum to 1. The predicted class has the highest probability.
| Class | Raw Score (z) | Softmax Probability |
|---|---|---|
| Cat | 3.2 | 0.73 |
| Dog | 1.8 | 0.18 |
| Bird | 1.1 | 0.09 |

Prediction: Cat (73% confidence)
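A minimal softmax sketch in Python (subtracting the max score before exponentiating is a standard numerical-stability trick and doesn't change the result):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                              # stability shift
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Cat", "Dog", "Bird"]
probs = softmax([3.2, 1.8, 1.1])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")

best = labels[probs.index(max(probs))]
print(f"Prediction: {best}")
```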
Summary
| Concept | Key Takeaway |
|---|---|
| Classification | Predicts categories, not numbers |
| Binary vs Multi-class | 2 classes vs 3+ classes |
| Logistic regression | Sigmoid + cross-entropy loss |
| Decision boundary | Where the model switches predictions |
| Confusion matrix | TP, FP, TN, FN breakdown |
| Precision vs Recall | Trade-off depending on cost of errors |
| F1 Score | Balanced metric for imbalanced data |
What’s Next?
flowchart LR
    A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
    B --> C["✅ Regression"]
    C --> D["✅ Classification"]
    D --> E["Unsupervised Learning"]
    style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
    style E fill:#e8f4fd,stroke:#1a5276,color:#1a5276
We’ve now covered both sides of supervised learning. In the next and final post of this series, we’ll explore Unsupervised Learning - what happens when there are no labels, and the model has to find structure on its own.
See you in Part 5.