Maths for ML · Part 14

Bayes’ theorem and its role in ML

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes’ theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Bayes’ theorem is a formula for reversing conditional probabilities. You know P(\text{evidence} \mid \text{hypothesis}) and you want P(\text{hypothesis} \mid \text{evidence}). That reversal is what lets ML models learn from data: you observe data, and you update your beliefs about the model’s parameters.

Prerequisites

You should be comfortable with conditional probability and the law of total probability before continuing.

The formula

Bayes’ theorem states:

P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}

Each piece has a name:

| Term | Name | What it represents |
| --- | --- | --- |
| P(H) | Prior | Your belief about H before seeing evidence |
| P(E \mid H) | Likelihood | How probable the evidence is if H is true |
| P(E) | Evidence (marginal likelihood) | Total probability of the evidence under all hypotheses |
| P(H \mid E) | Posterior | Your updated belief about H after seeing evidence |

The denominator P(E) acts as a normalizing constant that ensures the posterior sums to 1 across all hypotheses.
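As a quick sketch with made-up numbers, the normalization can be checked directly: multiply each prior by its likelihood, divide by their sum, and the resulting posteriors sum to 1.

```python
# Bayes' rule over a small discrete hypothesis space (illustrative numbers).
prior = [0.5, 0.3, 0.2]          # P(H_i) for three hypotheses
likelihood = [0.9, 0.4, 0.1]     # P(E | H_i)

joint = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(joint)            # P(E) via the law of total probability
posterior = [j / evidence for j in joint]

print(round(evidence, 2))                  # 0.59
print([round(p, 3) for p in posterior])    # posteriors sum to 1
```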

Why this matters for ML

In machine learning, the “hypothesis” is usually a model or a set of parameters \theta, and the “evidence” is the observed data D:

P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}

  • P(\theta): your prior belief about the parameters (e.g., “weights should be small”)
  • P(D \mid \theta): the likelihood of the data given those parameters (your model’s predictions)
  • P(\theta \mid D): the posterior, your updated belief about parameters after seeing data

Maximum a posteriori (MAP) estimation finds the \theta that maximizes the posterior. If you drop the prior (or use a uniform prior), MAP reduces to maximum likelihood estimation (MLE).
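A sketch of the MLE-vs-MAP relationship on a grid of parameter values; the coin data (7 heads in 10 flips) and the Beta(5, 5) prior are invented for illustration:

```python
import numpy as np

# MLE vs. MAP for a coin's heads probability theta (illustrative numbers).
heads, flips = 7, 10
theta = np.linspace(0.001, 0.999, 999)   # grid over (0, 1)

# Log-likelihood of 7 heads in 10 flips under each candidate theta
log_lik = heads * np.log(theta) + (flips - heads) * np.log(1 - theta)
mle = theta[np.argmax(log_lik)]          # maximizes likelihood alone: 0.7

# A Beta(5, 5) prior pulls the estimate toward 0.5
a, b = 5, 5
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
map_est = theta[np.argmax(log_lik + log_prior)]

print(f"MLE = {mle:.3f}, MAP = {map_est:.3f}")   # MAP lies between 0.5 and 0.7
```

With a uniform Beta(1, 1) prior, the log-prior term is zero everywhere and the MAP estimate coincides with the MLE, as the text says.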

Example 1: The medical test problem

This is the classic example that shows why Bayes’ theorem is counterintuitive.

Setup:

  • A disease affects 1 in 1000 people: P(\text{disease}) = 0.001
  • A test detects the disease 99% of the time: P(\text{positive} \mid \text{disease}) = 0.99
  • The test has a 5% false positive rate: P(\text{positive} \mid \text{no disease}) = 0.05

Question: You test positive. What is the probability you actually have the disease?

Step 1: Identify the pieces.

P(\text{disease}) = 0.001
P(\text{no disease}) = 0.999
P(\text{positive} \mid \text{disease}) = 0.99
P(\text{positive} \mid \text{no disease}) = 0.05

Step 2: Compute P(positive)P(\text{positive}) using total probability.

P(\text{positive}) = P(\text{pos} \mid \text{disease}) \cdot P(\text{disease}) + P(\text{pos} \mid \text{no disease}) \cdot P(\text{no disease})
= 0.99 \times 0.001 + 0.05 \times 0.999
= 0.00099 + 0.04995 = 0.05094

Step 3: Apply Bayes’ theorem.

P(\text{disease} \mid \text{positive}) = \frac{0.99 \times 0.001}{0.05094} = \frac{0.00099}{0.05094} = 0.0194

Only about 1.94%. A positive test still means there is less than a 2% chance you have the disease.

Why so low? The disease is rare (1 in 1000), so the false positives from the 999 healthy people vastly outnumber the true positives from the 1 sick person. Out of 1000 people tested:

  • 1 sick person, test catches them (0.99 true positives)
  • 999 healthy people, 5% test positive: 999 \times 0.05 \approx 50 false positives

So roughly 1 true positive out of 51 total positives. That is about 2%.
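The arithmetic above can be checked in a few lines:

```python
# Medical test example: Bayes' theorem with the numbers from the setup.
p_disease = 0.001
p_pos_given_disease = 0.99      # sensitivity
p_pos_given_healthy = 0.05      # false positive rate

# Law of total probability: P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(positive) = {p_pos:.5f}")                # 0.05094
print(f"P(disease | positive) = {posterior:.4f}")  # 0.0194
```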

Medical test example: why false positives dominate when the disease is rare:

graph TD
  Pop["1000 people tested"] --> Sick["1 sick"]
  Pop --> Healthy["999 healthy"]
  Sick -->|"99% detected"| TP["1 true positive"]
  Healthy -->|"5% false alarm"| FP["50 false positives"]
  TP --> Total["51 total positives"]
  FP --> Total
  Total --> Post["P(sick | positive) = 1/51 = about 2%"]

A positive test raises the probability of disease from the 0.1% prior to a posterior of only about 2%.

Example 2: Spam classification with numbers

Naive Bayes is one of the simplest and most practical applications of Bayes’ theorem in ML. Let’s work through a real example.

Training data: 100 emails, 40 spam and 60 not spam.

Word frequencies:

| Word | Appears in spam | Appears in not-spam |
| --- | --- | --- |
| “free” | 30 out of 40 | 6 out of 60 |
| “meeting” | 4 out of 40 | 36 out of 60 |

So:

P(\text{spam}) = \frac{40}{100} = 0.4, \quad P(\text{not spam}) = 0.6
P(\text{"free"} \mid \text{spam}) = \frac{30}{40} = 0.75, \quad P(\text{"free"} \mid \text{not spam}) = \frac{6}{60} = 0.10
P(\text{"meeting"} \mid \text{spam}) = \frac{4}{40} = 0.10, \quad P(\text{"meeting"} \mid \text{not spam}) = \frac{36}{60} = 0.60

Question: A new email contains “free” but not “meeting.” Is it spam?

Step 1: The Naive Bayes assumption says word occurrences are independent given the class. So:

P(\text{"free", not "meeting"} \mid \text{spam}) = P(\text{"free"} \mid \text{spam}) \times P(\text{not "meeting"} \mid \text{spam})
= 0.75 \times (1 - 0.10) = 0.75 \times 0.90 = 0.675
P(\text{"free", not "meeting"} \mid \text{not spam}) = 0.10 \times (1 - 0.60) = 0.10 \times 0.40 = 0.04

Step 2: Compute the evidence.

P(\text{evidence}) = 0.675 \times 0.4 + 0.04 \times 0.6 = 0.270 + 0.024 = 0.294

Step 3: Apply Bayes.

P(\text{spam} \mid \text{evidence}) = \frac{0.675 \times 0.4}{0.294} = \frac{0.270}{0.294} = 0.918

About 92% chance it is spam. The word “free” is a strong spam signal, and the absence of “meeting” reinforces that.

What if the email contained “meeting” but not “free”?

P(\text{not "free", "meeting"} \mid \text{spam}) = (1 - 0.75) \times 0.10 = 0.25 \times 0.10 = 0.025
P(\text{not "free", "meeting"} \mid \text{not spam}) = (1 - 0.10) \times 0.60 = 0.90 \times 0.60 = 0.54
P(\text{evidence}) = 0.025 \times 0.4 + 0.54 \times 0.6 = 0.010 + 0.324 = 0.334
P(\text{spam} \mid \text{evidence}) = \frac{0.010}{0.334} = 0.030

Only 3% chance of spam. The model correctly flips its prediction based on the words.

Example 3: Updating beliefs with multiple observations

Bayes’ theorem is naturally iterative. Today’s posterior becomes tomorrow’s prior.

Setup: A coin might be fair (p = 0.5) or biased (p = 0.8). You start with equal belief:

P(\text{fair}) = 0.5, \quad P(\text{biased}) = 0.5

Observation 1: Heads.

P(H \mid \text{fair}) = 0.5, \quad P(H \mid \text{biased}) = 0.8
P(H) = 0.5 \times 0.5 + 0.8 \times 0.5 = 0.25 + 0.40 = 0.65
P(\text{fair} \mid H) = \frac{0.5 \times 0.5}{0.65} = \frac{0.25}{0.65} = 0.385
P(\text{biased} \mid H) = \frac{0.8 \times 0.5}{0.65} = \frac{0.40}{0.65} = 0.615

After one heads, the biased hypothesis is slightly more likely.

Observation 2: Heads again. Now use the posterior from step 1 as the new prior.

P(H) = 0.5 \times 0.385 + 0.8 \times 0.615 = 0.1925 + 0.492 = 0.6845
P(\text{fair} \mid H, H) = \frac{0.5 \times 0.385}{0.6845} = \frac{0.1925}{0.6845} = 0.281
P(\text{biased} \mid H, H) = \frac{0.492}{0.6845} = 0.719

Observation 3: Tails. New prior: P(\text{fair}) = 0.281, P(\text{biased}) = 0.719.

P(T \mid \text{fair}) = 0.5, \quad P(T \mid \text{biased}) = 0.2
P(T) = 0.5 \times 0.281 + 0.2 \times 0.719 = 0.1405 + 0.1438 = 0.2843
P(\text{fair} \mid H, H, T) = \frac{0.5 \times 0.281}{0.2843} = \frac{0.1405}{0.2843} = 0.494

The tails observation pulls belief back toward fair. Each new data point adjusts the posterior. With enough data, the posterior concentrates on the true hypothesis.
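The three updates above can be run as a loop, reusing each posterior as the next prior:

```python
# Sequential Bayesian updating for the fair-vs-biased coin example.
p_heads = {"fair": 0.5, "biased": 0.8}
belief = {"fair": 0.5, "biased": 0.5}        # starting prior

for flip in ["H", "H", "T"]:
    # Likelihood of this flip under each hypothesis
    lik = {h: p if flip == "H" else 1 - p for h, p in p_heads.items()}
    evidence = sum(lik[h] * belief[h] for h in belief)
    # Posterior becomes the new prior for the next flip
    belief = {h: lik[h] * belief[h] / evidence for h in belief}
    print(flip, {h: round(b, 3) for h, b in belief.items()})
# After H, H, T: fair = 0.494, biased = 0.506
```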

Bayesian updating cycle: the posterior becomes the next prior:

graph LR
  A["Prior P(H)"] --> B["Observe data"]
  B --> C["Compute likelihood P(D|H)"]
  C --> D["Apply Bayes' theorem"]
  D --> E["Posterior P(H|D)"]
  E -->|"Posterior becomes new prior"| A

The Naive Bayes assumption

The “naive” in Naive Bayes refers to the assumption that features are conditionally independent given the class:

P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^n P(x_i \mid C)

This is almost never true in practice. Words in an email are not independent. Pixel values in an image are not independent. But the assumption makes computation tractable and, surprisingly, Naive Bayes classifiers often perform well despite the wrong assumption.

Why does it work? Because classification only needs the posterior to be highest for the correct class. Even if the probability values are wrong, the ranking can still be correct.

Prior, likelihood, posterior: a visual intuition

Think of it as a tug-of-war:

graph LR
  A["Prior P(H)"] -->|"× Likelihood P(E|H)"| B["Unnormalized posterior"]
  B -->|"÷ Evidence P(E)"| C["Posterior P(H|E)"]

  • A strong prior requires strong evidence to overcome it.
  • A weak prior gets easily swayed by data.
  • More data means the likelihood dominates and the prior becomes irrelevant.

This is why, with enough training data, Bayesian and frequentist approaches converge to the same answer.

Connection to regularization

In the ML parameter setting:

P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)

Taking the negative log:

-\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \text{const}

  • -\log P(D \mid \theta) is the loss function (e.g., cross-entropy for classification)
  • -\log P(\theta) is the regularization term

If P(\theta) = \mathcal{N}(0, \sigma^2 I) (a Gaussian prior), then -\log P(\theta) \propto \|\theta\|^2, which is L2 regularization. If P(\theta) is Laplace distributed, you get L1 regularization.

So every time you add a regularization term to your loss function, you are implicitly choosing a Bayesian prior on the parameters.
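A quick numerical sketch of the Gaussian case: the negative log-density of \mathcal{N}(0, \sigma^2) on a weight, measured relative to zero, is exactly \theta^2 / (2\sigma^2), an L2 penalty with weight 1 / (2\sigma^2).

```python
import numpy as np

# Negative log-density of a Gaussian prior N(0, sigma^2) on a single weight
def neg_log_gaussian(theta, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + theta**2 / (2 * sigma**2)

sigma = 1.0
thetas = np.array([0.0, 1.0, 2.0, 3.0])

# Subtracting the constant term leaves a pure L2 penalty: theta^2 / (2 sigma^2)
penalty = neg_log_gaussian(thetas, sigma) - neg_log_gaussian(0.0, sigma)
print(penalty)   # proportional to thetas**2
assert np.allclose(penalty, thetas**2 / (2 * sigma**2))
```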

Python: Naive Bayes from scratch

import numpy as np

# Training data: word counts per class
spam_counts = {"free": 30, "meeting": 4}
spam_total = 40
ham_counts = {"free": 6, "meeting": 36}
ham_total = 60
p_spam = 0.4
p_ham = 0.6

def classify(words_present, words_absent):
    log_spam = np.log(p_spam)
    log_ham = np.log(p_ham)

    for w in words_present:
        log_spam += np.log(spam_counts[w] / spam_total)
        log_ham += np.log(ham_counts[w] / ham_total)

    for w in words_absent:
        log_spam += np.log(1 - spam_counts[w] / spam_total)
        log_ham += np.log(1 - ham_counts[w] / ham_total)

    # Convert back from log space
    log_total = np.logaddexp(log_spam, log_ham)
    return np.exp(log_spam - log_total)

print(f"P(spam | 'free', no 'meeting') = {classify(['free'], ['meeting']):.3f}")
print(f"P(spam | 'meeting', no 'free') = {classify(['meeting'], ['free']):.3f}")

Notice we work in log space. This avoids numerical underflow when multiplying many small probabilities, which is critical in real Naive Bayes implementations.
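A small demonstration of that underflow, using an invented document of 500 words each with likelihood 0.001:

```python
import numpy as np

# Multiplying 500 likelihoods of 0.001 each underflows float64
# (smallest positive double is about 5e-324), but summing their
# logs is stable.
probs = np.full(500, 1e-3)

direct_product = np.prod(probs)     # 1e-1500 -> underflows to 0.0
log_sum = np.sum(np.log(probs))     # 500 * log(0.001), about -3453.9

print(direct_product)   # 0.0
print(log_sum)
```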

Common mistakes

Ignoring the base rate (prior): The medical test example shows this. A 99% accurate test still produces mostly false positives when the disease is rare. Always account for P(H).

Confusing likelihood and posterior: P(E \mid H) is not P(H \mid E). The likelihood of the evidence given a disease is not the probability of the disease given the evidence.

Treating Naive Bayes probabilities as calibrated: The independence assumption means the raw probability outputs are usually wrong. Use Naive Bayes for ranking/classification, but do not trust its probability values without calibration.

Summary

| Concept | Key formula | ML connection |
| --- | --- | --- |
| Bayes’ theorem | P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)} | Foundation of probabilistic ML |
| Prior | P(\theta) | Regularization |
| Likelihood | P(D \mid \theta) | Loss function (negative log) |
| Posterior | P(\theta \mid D) | What we want to compute |
| Naive Bayes | Features independent given class | Simple, effective classifier |

What comes next

The next article covers information theory, where we connect probability to the concepts of entropy, KL divergence, and cross-entropy loss, the loss function used in nearly every classification model.
