ML from Scratch · Part 4

Classification - Predicting Categories

In this series (5 parts)
  1. Introduction to Machine Learning
  2. Supervised Learning - Learning from Labeled Data
  3. Regression - Predicting Continuous Values
  4. Classification - Predicting Categories
  5. Unsupervised Learning - Finding Hidden Patterns

In the previous post, we explored regression - predicting continuous values. Now we flip to the other side of supervised learning: classification, where the model predicts a category or class.

What is Classification?

Classification answers the question: “Which one?” or “Is it A or B?”

Given input features, a classification model outputs a discrete label - a category from a predefined set.

```mermaid
flowchart LR
  A["Input Features
(Email text, links, sender)"] --> B["Classification Model"]
  B --> C["Category
(Spam ❌ or Not Spam ✅)"]
  style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
```

Types of Classification

```mermaid
flowchart TD
  CLS["Classification"] --> BIN["Binary Classification"]
  CLS --> MC["Multi-Class Classification"]
  CLS --> ML["Multi-Label Classification"]
  BIN --> B1["2 classes: Yes/No, Spam/Ham"]
  MC --> M1["3+ classes: Cat/Dog/Bird"]
  ML --> ML1["Multiple labels per input:
Genre tags on a movie"]
  style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
  style BIN fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style MC fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style ML fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
```
| Type | # of Classes | Example | Output |
|---|---|---|---|
| Binary | 2 | Email spam detection | Spam or Not Spam |
| Multi-class | 3+ | Handwritten digit recognition | 0, 1, 2, … 9 |
| Multi-label | Multiple per input | Movie genre tagging | Action, Comedy, Sci-Fi |

Real-World Classification Problems

| Problem | Input Features | Classes | Type |
|---|---|---|---|
| Email spam filter | Words, links, sender domain | Spam / Ham | Binary |
| Disease diagnosis | Symptoms, lab results, history | Healthy / Sick | Binary |
| Handwriting recognition | Pixel values (28×28 image) | 0–9 digits | Multi-class |
| Sentiment analysis | Review text | Positive / Neutral / Negative | Multi-class |
| Credit risk | Income, debt, credit score | Low / Medium / High | Multi-class |
| Object detection | Image regions | Car, Person, Tree, … | Multi-class |
| Music genre tagging | Audio features | Rock, Pop, Jazz, … | Multi-label |

Logistic Regression: The Starting Point

Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.

The Sigmoid Function

Logistic regression wraps a linear function in the sigmoid function to squeeze the output between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = \mathbf{w} \cdot \mathbf{x} + b$ (the same linear combination as in linear regression).

| z value | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very likely class 0 |
| -2 | 0.12 | Probably class 0 |
| 0 | 0.50 | Uncertain (decision boundary) |
| +2 | 0.88 | Probably class 1 |
| +5 | 0.993 | Very likely class 1 |

Decision Rule

$$\text{prediction} = \begin{cases} 1 & \text{if } \sigma(z) \geq 0.5 \\ 0 & \text{if } \sigma(z) < 0.5 \end{cases}$$
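
In code, the sigmoid and the 0.5 decision rule come out to just a few lines. This is a minimal Python sketch of the prediction step only, not a full training loop:

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the (0, 1) range."""
    return 1 / (1 + math.exp(-z))

def predict(z, threshold=0.5):
    """Decision rule: class 1 if sigmoid(z) >= threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))   # 0.5, right on the decision boundary
print(predict(2))   # 1  (sigmoid(2) ≈ 0.88)
print(predict(-2))  # 0  (sigmoid(-2) ≈ 0.12)
```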

Example: Student Exam Pass/Fail

| Hours Studied | Previous Score | Passed? |
|---|---|---|
| 2 | 40 | ❌ Fail |
| 3 | 55 | ❌ Fail |
| 4 | 50 | ❌ Fail |
| 5 | 65 | ✅ Pass |
| 6 | 60 | ✅ Pass |
| 7 | 70 | ✅ Pass |
| 8 | 80 | ✅ Pass |
| 9 | 85 | ✅ Pass |

A trained logistic regression might learn:

$$P(\text{pass}) = \sigma(0.8 \cdot \text{hours} + 0.03 \cdot \text{score} - 5.5)$$

For a student who studied 5.5 hours with a previous score of 60:

$$z = 0.8(5.5) + 0.03(60) - 5.5 = 0.7$$

$$P(\text{pass}) = \sigma(0.7) \approx 0.67$$

67% chance of passing → model predicts: Pass.
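
The same calculation in Python, using the illustrative weights from the formula above (0.8, 0.03, and the bias -5.5 are the made-up example values, not the output of a real training run):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Example weights from the formula above (illustrative, not trained)
w_hours, w_score, bias = 0.8, 0.03, -5.5

hours, prev_score = 5.5, 60
z = w_hours * hours + w_score * prev_score + bias  # 4.4 + 1.8 - 5.5 = 0.7
p_pass = sigmoid(z)

print(f"z = {z:.1f}, P(pass) = {p_pass:.2f}")  # z = 0.7, P(pass) = 0.67
```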

The Cost Function: Cross-Entropy Loss

For classification, we use cross-entropy loss (also called log loss) instead of MSE:

$$J = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Why not MSE? The sigmoid function makes the MSE loss surface non-convex (multiple local minima). Cross-entropy gives us a nice convex surface with a single global minimum.
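
Here is a direct Python translation of the formula. The small `eps` clamp is an addition of mine, a standard guard so that a prediction of exactly 0 or 1 can't hit log(0); it isn't part of the formula itself:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy (log loss) over n examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp away from exact 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident and correct predictions: small loss
print(round(cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]), 3))  # 0.145
# Confident but wrong prediction: the loss explodes
print(round(cross_entropy([1], [0.01]), 2))  # 4.61
```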

```mermaid
flowchart TD
  A["Input x"] --> B["z = w·x + b"]
  B --> C["ŷ = σ(z)"]
  C --> D["Cross-Entropy Loss"]
  E["True label y"] --> D
  D --> F["Gradient Descent"]
  F --> G["Update w, b"]
  G --> A
  style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
```

Decision Boundaries

The decision boundary is where the model switches from predicting one class to another. For logistic regression, this is a straight line (or hyperplane in higher dimensions).

```mermaid
flowchart TD
  subgraph Linear["Linear Boundary"]
      A["Logistic Regression
SVM (linear kernel)"]
  end
  subgraph NonLinear["Non-Linear Boundary"]
      B["Decision Trees
SVM (RBF kernel)
Neural Networks"]
  end
  style Linear fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style NonLinear fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
```

The Confusion Matrix

The confusion matrix is the most important evaluation tool for classification. It shows exactly where your model succeeds and fails.

For binary classification:

| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |

Example: Cancer Screening

Out of 1,000 patients (50 actually have cancer):

| | Predicted: Cancer | Predicted: Healthy |
|---|---|---|
| Actual: Cancer | TP = 45 | FN = 5 |
| Actual: Healthy | FP = 30 | TN = 920 |

Classification Metrics

| Metric | Formula | Meaning | When to Use |
|---|---|---|---|
| Accuracy | $\frac{TP + TN}{\text{Total}}$ | Overall correctness | Balanced classes |
| Precision | $\frac{TP}{TP + FP}$ | Of predicted positives, how many are correct? | When FP is costly (spam filter) |
| Recall | $\frac{TP}{TP + FN}$ | Of actual positives, how many did we catch? | When FN is costly (cancer detection) |
| F1 Score | $2 \cdot \frac{P \cdot R}{P + R}$ | Harmonic mean of precision & recall | Imbalanced classes |
| Specificity | $\frac{TN}{TN + FP}$ | Of actual negatives, how many correctly identified? | When TN matters |

From Our Cancer Example

| Metric | Value |
|---|---|
| Accuracy | (45 + 920) / 1000 = 96.5% |
| Precision | 45 / (45 + 30) = 60.0% |
| Recall | 45 / (45 + 5) = 90.0% |
| F1 Score | 2 × (0.60 × 0.90) / (0.60 + 0.90) = 72.0% |

Accuracy looks great at 96.5%, but precision is only 60% - two in five cancer predictions are false alarms. This is why accuracy alone is misleading with imbalanced data.
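
All four metrics fall straight out of the confusion-matrix counts; this short sketch reproduces the cancer-screening numbers:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the core binary-classification metrics from raw counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Cancer screening example: TP=45, FP=30, TN=920, FN=5
acc, prec, rec, f1 = classification_metrics(tp=45, fp=30, tn=920, fn=5)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%} f1={f1:.1%}")
# accuracy=96.5% precision=60.0% recall=90.0% f1=72.0%
```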

Common Classification Algorithms

```mermaid
flowchart TD
  CLS["Classification Algorithms"] --> LIN["Linear Models"]
  CLS --> TREE["Tree-Based"]
  CLS --> DIST["Distance-Based"]
  CLS --> PROB["Probabilistic"]
  CLS --> NN["Neural Networks"]
  LIN --> LR["Logistic Regression"]
  LIN --> SVM["SVM"]
  TREE --> DT["Decision Tree"]
  TREE --> RF["Random Forest"]
  TREE --> GB["Gradient Boosting"]
  DIST --> KNN["K-Nearest Neighbors"]
  PROB --> NB["Naive Bayes"]
  NN --> MLP["MLP / Deep Learning"]
  style CLS fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
```

Algorithm Comparison

| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Logistic Regression | Simple, interpretable, fast | Linear boundaries only | Binary, linearly separable |
| Decision Tree | Intuitive, handles mixed features | Overfits easily | Interpretability needed |
| Random Forest | Robust, handles noise | Slower, less interpretable | General-purpose |
| SVM | Works in high dimensions | Slow on large datasets | Text classification |
| K-Nearest Neighbors | No training needed | Slow at prediction time | Small datasets |
| Naive Bayes | Fast, works with little data | Assumes feature independence | Text, spam filtering |
| Gradient Boosting | State-of-the-art accuracy | Complex to tune | Competitions, production |
| Neural Networks | Learns complex patterns | Needs lots of data | Images, text, audio |

Multi-Class Classification Strategies

When you have more than two classes, there are two common strategies:

One-vs-Rest (OvR)

Train $K$ binary classifiers - each one separates one class from all others:

| Classifier | Positive Class | Negative Class |
|---|---|---|
| Classifier 1 | Cat | Dog, Bird |
| Classifier 2 | Dog | Cat, Bird |
| Classifier 3 | Bird | Cat, Dog |

Final prediction = class with highest confidence score.
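
A sketch of that final step in Python. The three "classifiers" here are hard-coded stand-ins returning fixed confidence scores, purely to illustrate the pick-the-highest-score logic; in practice each would be a trained binary model:

```python
def one_vs_rest_predict(classifiers, x):
    """Score x with every binary classifier; return the most confident class."""
    scores = {label: clf(x) for label, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Stand-in classifiers with fixed confidences (illustrative only)
classifiers = {
    "Cat":  lambda x: 0.70,  # P(Cat  vs. rest)
    "Dog":  lambda x: 0.20,  # P(Dog  vs. rest)
    "Bird": lambda x: 0.10,  # P(Bird vs. rest)
}
print(one_vs_rest_predict(classifiers, x=None))  # Cat
```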

Softmax (Direct Multi-Class)

Instead of sigmoid, use softmax to output probabilities across all classes:

$$P(\text{class}_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

All probabilities sum to 1. The predicted class has the highest probability.

| Class | Raw Score ($z$) | Softmax Probability |
|---|---|---|
| Cat | 3.2 | 0.73 |
| Dog | 1.8 | 0.18 |
| Bird | 1.1 | 0.09 |

Prediction: Cat (73% confidence)
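
Softmax translates directly into Python. Subtracting the maximum score before exponentiating is a standard numerical-stability trick (it avoids overflow for large scores) and does not change the result:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores from the table: Cat=3.2, Dog=1.8, Bird=1.1
probs = softmax([3.2, 1.8, 1.1])
print([round(p, 2) for p in probs])  # [0.73, 0.18, 0.09]
```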

Summary

| Concept | Key Takeaway |
|---|---|
| Classification | Predicts categories, not numbers |
| Binary vs Multi-class | 2 classes vs 3+ classes |
| Logistic regression | Sigmoid + cross-entropy loss |
| Decision boundary | Where the model switches predictions |
| Confusion matrix | TP, FP, TN, FN breakdown |
| Precision vs Recall | Trade-off depending on cost of errors |
| F1 Score | Balanced metric for imbalanced data |

What’s Next?

```mermaid
flowchart LR
  A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
  B --> C["✅ Regression"]
  C --> D["✅ Classification"]
  D --> E["Unsupervised Learning"]
  style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style E fill:#e8f4fd,stroke:#1a5276,color:#1a5276
```

We’ve now covered both sides of supervised learning. In the next and final post of this series, we’ll explore Unsupervised Learning - what happens when there are no labels, and the model has to find structure on its own.

See you in Part 5.
