Maths for ML · Part 14

Bayes’ theorem and its role in ML

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes’ theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Bayes’ theorem is a formula for reversing conditional probabilities. You know P(\text{evidence} \mid \text{hypothesis}) and you want P(\text{hypothesis} \mid \text{evidence}). That reversal is what lets ML models learn from data: you observe data, and you update your beliefs about the model’s parameters.

Prerequisites

You should be comfortable with conditional probability and the law of total probability before continuing.

The formula

Bayes’ theorem states:

P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}

Each piece has a name:

| Term | Name | What it represents |
| --- | --- | --- |
| P(H) | Prior | Your belief about H before seeing evidence |
| P(E \mid H) | Likelihood | How probable the evidence is if H is true |
| P(E) | Evidence (marginal likelihood) | Total probability of the evidence under all hypotheses |
| P(H \mid E) | Posterior | Your updated belief about H after seeing evidence |

The denominator P(E) acts as a normalizing constant that ensures the posterior sums to 1 across all hypotheses.
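As a quick sketch with made-up numbers, the normalization can be checked directly: multiply each prior by its likelihood, divide by their sum, and the resulting posteriors sum to 1.

```python
# Bayes' rule over a small discrete hypothesis space (illustrative numbers).
prior = [0.5, 0.3, 0.2]          # P(H_i) for three hypotheses
likelihood = [0.9, 0.4, 0.1]     # P(E | H_i)

joint = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(joint)            # P(E) via the law of total probability
posterior = [j / evidence for j in joint]

print(round(evidence, 2))                  # 0.59
print([round(p, 3) for p in posterior])    # posteriors sum to 1
```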

Why this matters for ML

In machine learning, the “hypothesis” is usually a model or a set of parameters \theta, and the “evidence” is the observed data D:

P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}

  • P(\theta): your prior belief about the parameters (e.g., “weights should be small”)
  • P(D \mid \theta): the likelihood of the data given those parameters (your model’s predictions)
  • P(\theta \mid D): the posterior, your updated belief about parameters after seeing data

Maximum a posteriori (MAP) estimation finds the \theta that maximizes the posterior. If you drop the prior (or use a uniform prior), MAP reduces to maximum likelihood estimation (MLE).
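A sketch of the MLE-vs-MAP relationship on a grid of parameter values; the coin data (7 heads in 10 flips) and the Beta(5, 5) prior are invented for illustration:

```python
import numpy as np

# MLE vs. MAP for a coin's heads probability theta (illustrative numbers).
heads, flips = 7, 10
theta = np.linspace(0.001, 0.999, 999)   # grid over (0, 1)

# Log-likelihood of 7 heads in 10 flips under each candidate theta
log_lik = heads * np.log(theta) + (flips - heads) * np.log(1 - theta)
mle = theta[np.argmax(log_lik)]          # maximizes likelihood alone: 0.7

# A Beta(5, 5) prior pulls the estimate toward 0.5
a, b = 5, 5
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
map_est = theta[np.argmax(log_lik + log_prior)]

print(f"MLE = {mle:.3f}, MAP = {map_est:.3f}")   # MAP lies between 0.5 and 0.7
```

With a uniform Beta(1, 1) prior, the log-prior term is zero everywhere and the MAP estimate coincides with the MLE, as the text says.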

Example 1: The medical test problem

This is the classic example that shows why Bayes’ theorem is counterintuitive.

Setup:

  • A disease affects 1 in 1000 people: P(\text{disease}) = 0.001
  • A test detects the disease 99% of the time: P(\text{positive} \mid \text{disease}) = 0.99
  • The test has a 5% false positive rate: P(\text{positive} \mid \text{no disease}) = 0.05

Question: You test positive. What is the probability you actually have the disease?

Step 1: Identify the pieces.

P(\text{disease}) = 0.001
P(\text{no disease}) = 0.999
P(\text{positive} \mid \text{disease}) = 0.99
P(\text{positive} \mid \text{no disease}) = 0.05

Step 2: Compute P(positive)P(\text{positive}) using total probability.

P(\text{positive}) = P(\text{pos} \mid \text{disease}) \cdot P(\text{disease}) + P(\text{pos} \mid \text{no disease}) \cdot P(\text{no disease})
= 0.99 \times 0.001 + 0.05 \times 0.999
= 0.00099 + 0.04995 = 0.05094

Step 3: Apply Bayes’ theorem.

P(\text{disease} \mid \text{positive}) = \frac{0.99 \times 0.001}{0.05094} = \frac{0.00099}{0.05094} = 0.0194

Only about 1.94%. A positive test still means there is less than a 2% chance you have the disease.

Why so low? The disease is rare (1 in 1000), so the false positives from the 999 healthy people vastly outnumber the true positives from the 1 sick person. Out of 1000 people tested:

  • 1 sick person, test catches them (0.99 true positives)
  • 999 healthy people, 5% test positive: 999 \times 0.05 \approx 50 false positives

So roughly 1 true positive out of 51 total positives. That is about 2%.
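The arithmetic above can be checked in a few lines:

```python
# Medical test example: Bayes' theorem with the numbers from the setup.
p_disease = 0.001
p_pos_given_disease = 0.99      # sensitivity
p_pos_given_healthy = 0.05      # false positive rate

# Law of total probability: P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(positive) = {p_pos:.5f}")                # 0.05094
print(f"P(disease | positive) = {posterior:.4f}")  # 0.0194
```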

Medical test example: why false positives dominate when the disease is rare:

graph TD
  Pop["1000 people tested"] --> Sick["1 sick"]
  Pop --> Healthy["999 healthy"]
  Sick -->|"99% detected"| TP["1 true positive"]
  Healthy -->|"5% false alarm"| FP["50 false positives"]
  TP --> Total["51 total positives"]
  FP --> Total
  Total --> Post["P(sick | positive) = 1/51 = about 2%"]

A positive test raises the probability of disease from the 0.1% prior to a posterior of only about 2%.

Example 2: Spam classification with numbers

Naive Bayes is one of the simplest and most practical applications of Bayes’ theorem in ML. Let’s work through a real example.

Training data: 100 emails, 40 spam and 60 not spam.

Word frequencies:

| Word | Appears in spam | Appears in not-spam |
| --- | --- | --- |
| “free” | 30 out of 40 | 6 out of 60 |
| “meeting” | 4 out of 40 | 36 out of 60 |

So:

P(\text{spam}) = \frac{40}{100} = 0.4, \quad P(\text{not spam}) = 0.6
P(\text{"free"} \mid \text{spam}) = \frac{30}{40} = 0.75, \quad P(\text{"free"} \mid \text{not spam}) = \frac{6}{60} = 0.10
P(\text{"meeting"} \mid \text{spam}) = \frac{4}{40} = 0.10, \quad P(\text{"meeting"} \mid \text{not spam}) = \frac{36}{60} = 0.60

Question: A new email contains “free” but not “meeting.” Is it spam?

Step 1: The Naive Bayes assumption says word occurrences are independent given the class. So:

P(\text{"free", not "meeting"} \mid \text{spam}) = P(\text{"free"} \mid \text{spam}) \times P(\text{not "meeting"} \mid \text{spam})
= 0.75 \times (1 - 0.10) = 0.75 \times 0.90 = 0.675
P(\text{"free", not "meeting"} \mid \text{not spam}) = 0.10 \times (1 - 0.60) = 0.10 \times 0.40 = 0.04

Step 2: Compute the evidence.

P(\text{evidence}) = 0.675 \times 0.4 + 0.04 \times 0.6 = 0.270 + 0.024 = 0.294

Step 3: Apply Bayes.

P(\text{spam} \mid \text{evidence}) = \frac{0.675 \times 0.4}{0.294} = \frac{0.270}{0.294} = 0.918

About 92% chance it is spam. The word “free” is a strong spam signal, and the absence of “meeting” reinforces that.

What if the email contained “meeting” but not “free”?

P(\text{not "free", "meeting"} \mid \text{spam}) = (1 - 0.75) \times 0.10 = 0.25 \times 0.10 = 0.025
P(\text{not "free", "meeting"} \mid \text{not spam}) = (1 - 0.10) \times 0.60 = 0.90 \times 0.60 = 0.54
P(\text{evidence}) = 0.025 \times 0.4 + 0.54 \times 0.6 = 0.010 + 0.324 = 0.334
P(\text{spam} \mid \text{evidence}) = \frac{0.010}{0.334} = 0.030

Only 3% chance of spam. The model correctly flips its prediction based on the words.

Example 3: Updating beliefs with multiple observations

Bayes’ theorem is naturally iterative. Today’s posterior becomes tomorrow’s prior.

Setup: A coin might be fair (p = 0.5) or biased (p = 0.8). You start with equal belief:

P(\text{fair}) = 0.5, \quad P(\text{biased}) = 0.5

Observation 1: Heads.

P(H \mid \text{fair}) = 0.5, \quad P(H \mid \text{biased}) = 0.8
P(H) = 0.5 \times 0.5 + 0.8 \times 0.5 = 0.25 + 0.40 = 0.65
P(\text{fair} \mid H) = \frac{0.5 \times 0.5}{0.65} = \frac{0.25}{0.65} = 0.385
P(\text{biased} \mid H) = \frac{0.8 \times 0.5}{0.65} = \frac{0.40}{0.65} = 0.615

After one heads, the biased hypothesis is slightly more likely.

Observation 2: Heads again. Now use the posterior from step 1 as the new prior.

P(H) = 0.5 \times 0.385 + 0.8 \times 0.615 = 0.1925 + 0.492 = 0.6845
P(\text{fair} \mid H, H) = \frac{0.5 \times 0.385}{0.6845} = \frac{0.1925}{0.6845} = 0.281
P(\text{biased} \mid H, H) = \frac{0.492}{0.6845} = 0.719

Observation 3: Tails. New prior: P(\text{fair}) = 0.281, P(\text{biased}) = 0.719.

P(T \mid \text{fair}) = 0.5, \quad P(T \mid \text{biased}) = 0.2
P(T) = 0.5 \times 0.281 + 0.2 \times 0.719 = 0.1405 + 0.1438 = 0.2843
P(\text{fair} \mid H, H, T) = \frac{0.5 \times 0.281}{0.2843} = \frac{0.1405}{0.2843} = 0.494

The tails observation pulls belief back toward fair. Each new data point adjusts the posterior. With enough data, the posterior concentrates on the true hypothesis.
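The three updates above can be run as a loop, reusing each posterior as the next prior:

```python
# Sequential Bayesian updating for the fair-vs-biased coin example.
p_heads = {"fair": 0.5, "biased": 0.8}
belief = {"fair": 0.5, "biased": 0.5}        # starting prior

for flip in ["H", "H", "T"]:
    # Likelihood of this flip under each hypothesis
    lik = {h: p if flip == "H" else 1 - p for h, p in p_heads.items()}
    evidence = sum(lik[h] * belief[h] for h in belief)
    # Posterior becomes the new prior for the next flip
    belief = {h: lik[h] * belief[h] / evidence for h in belief}
    print(flip, {h: round(b, 3) for h, b in belief.items()})
# After H, H, T: fair = 0.494, biased = 0.506
```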

Bayesian updating cycle: the posterior becomes the next prior:

graph LR
  A["Prior P(H)"] --> B["Observe data"]
  B --> C["Compute likelihood P(D|H)"]
  C --> D["Apply Bayes' theorem"]
  D --> E["Posterior P(H|D)"]
  E -->|"Posterior becomes new prior"| A

The Naive Bayes assumption

The “naive” in Naive Bayes refers to the assumption that features are conditionally independent given the class:

P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^n P(x_i \mid C)

This is almost never true in practice. Words in an email are not independent. Pixel values in an image are not independent. But the assumption makes computation tractable and, surprisingly, Naive Bayes classifiers often perform well despite the wrong assumption.

Why does it work? Because classification only needs the posterior to be highest for the correct class. Even if the probability values are wrong, the ranking can still be correct.

Prior, likelihood, posterior: a visual intuition

Think of it as a tug-of-war:

graph LR
  A["Prior P(H)"] -->|"× Likelihood P(E|H)"| B["Unnormalized posterior"]
  B -->|"÷ Evidence P(E)"| C["Posterior P(H|E)"]

  • A strong prior requires strong evidence to overcome it.
  • A weak prior gets easily swayed by data.
  • More data means the likelihood dominates and the prior becomes irrelevant.

This is why, with enough training data, Bayesian and frequentist approaches converge to the same answer.

Connection to regularization

In the ML parameter setting:

P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)

Taking the negative log:

-\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \text{const}

  • -\log P(D \mid \theta) is the loss function (e.g., cross-entropy for classification)
  • -\log P(\theta) is the regularization term

If P(\theta) = \mathcal{N}(0, \sigma^2 I) (a Gaussian prior), then -\log P(\theta) \propto \|\theta\|^2, which is L2 regularization. If P(\theta) is Laplace distributed, you get L1 regularization.

So every time you add a regularization term to your loss function, you are implicitly choosing a Bayesian prior on the parameters.
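A quick numerical sketch of the Gaussian case: the negative log-density of \mathcal{N}(0, \sigma^2) on a weight, measured relative to zero, is exactly \theta^2 / (2\sigma^2), an L2 penalty with weight 1 / (2\sigma^2).

```python
import numpy as np

# Negative log-density of a Gaussian prior N(0, sigma^2) on a single weight
def neg_log_gaussian(theta, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + theta**2 / (2 * sigma**2)

sigma = 1.0
thetas = np.array([0.0, 1.0, 2.0, 3.0])

# Subtracting the constant term leaves a pure L2 penalty: theta^2 / (2 sigma^2)
penalty = neg_log_gaussian(thetas, sigma) - neg_log_gaussian(0.0, sigma)
print(penalty)   # proportional to thetas**2
assert np.allclose(penalty, thetas**2 / (2 * sigma**2))
```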

Python: Naive Bayes from scratch

import numpy as np

# Training data: word counts per class
spam_counts = {"free": 30, "meeting": 4}
spam_total = 40
ham_counts = {"free": 6, "meeting": 36}
ham_total = 60
p_spam = 0.4
p_ham = 0.6

def classify(words_present, words_absent):
    log_spam = np.log(p_spam)
    log_ham = np.log(p_ham)

    for w in words_present:
        log_spam += np.log(spam_counts[w] / spam_total)
        log_ham += np.log(ham_counts[w] / ham_total)

    for w in words_absent:
        log_spam += np.log(1 - spam_counts[w] / spam_total)
        log_ham += np.log(1 - ham_counts[w] / ham_total)

    # Convert back from log space
    log_total = np.logaddexp(log_spam, log_ham)
    return np.exp(log_spam - log_total)

print(f"P(spam | 'free', no 'meeting') = {classify(['free'], ['meeting']):.3f}")
print(f"P(spam | 'meeting', no 'free') = {classify(['meeting'], ['free']):.3f}")

Notice we work in log space. This avoids numerical underflow when multiplying many small probabilities, which is critical in real Naive Bayes implementations.
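A small demonstration of that underflow, using an invented document of 500 words each with likelihood 0.001:

```python
import numpy as np

# Multiplying 500 likelihoods of 0.001 each underflows float64
# (smallest positive double is about 5e-324), but summing their
# logs is stable.
probs = np.full(500, 1e-3)

direct_product = np.prod(probs)     # 1e-1500 -> underflows to 0.0
log_sum = np.sum(np.log(probs))     # 500 * log(0.001), about -3453.9

print(direct_product)   # 0.0
print(log_sum)
```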

Common mistakes

Ignoring the base rate (prior): The medical test example shows this. A 99% accurate test still produces mostly false positives when the disease is rare. Always account for P(H).

Confusing likelihood and posterior: P(E \mid H) is not P(H \mid E). The likelihood of the evidence given a disease is not the probability of the disease given the evidence.

Treating Naive Bayes probabilities as calibrated: The independence assumption means the raw probability outputs are usually wrong. Use Naive Bayes for ranking/classification, but do not trust its probability values without calibration.

Summary

| Concept | Key formula | ML connection |
| --- | --- | --- |
| Bayes’ theorem | P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)} | Foundation of probabilistic ML |
| Prior | P(\theta) | Regularization |
| Likelihood | P(D \mid \theta) | Loss function (negative log) |
| Posterior | P(\theta \mid D) | What we want to compute |
| Naive Bayes | Features independent given class | Simple, effective classifier |

What comes next

The next article covers information theory, where we connect probability to the concepts of entropy, KL divergence, and cross-entropy loss, the loss function used in nearly every classification model.
