# Bayes' theorem and its role in ML
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes' theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Bayes' theorem is a formula for reversing conditional probabilities: you know $P(E \mid H)$ and you want $P(H \mid E)$. That reversal is what lets ML models learn from data: you observe data, and you update your beliefs about the model's parameters.
## Prerequisites
You should be comfortable with conditional probability and the law of total probability before continuing.
## The formula

Bayes' theorem states:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Each piece has a name:
| Term | Name | What it represents |
|---|---|---|
| $P(H)$ | Prior | Your belief about $H$ before seeing evidence |
| $P(E \mid H)$ | Likelihood | How probable the evidence is if $H$ is true |
| $P(E)$ | Evidence (marginal likelihood) | Total probability of the evidence under all hypotheses |
| $P(H \mid E)$ | Posterior | Your updated belief about $H$ after seeing evidence |
The denominator $P(E)$ acts as a normalizing constant that ensures the posterior sums to 1 across all hypotheses.
## Why this matters for ML

In machine learning, the "hypothesis" is usually a model or a set of parameters $\theta$, and the "evidence" is the observed data $D$:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

- $P(\theta)$: your prior belief about the parameters (e.g., "weights should be small")
- $P(D \mid \theta)$: the likelihood of the data given those parameters (your model's predictions)
- $P(\theta \mid D)$: the posterior, your updated belief about the parameters after seeing data

Maximum a posteriori (MAP) estimation finds the $\theta$ that maximizes the posterior. If you drop the prior (or use a uniform prior), MAP reduces to maximum likelihood estimation (MLE).
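To make the MAP-vs-MLE distinction concrete, here is a small hypothetical sketch: estimating a coin's heads probability from 10 flips, once with no prior (MLE) and once with a Beta(5, 5) prior that favors a roughly fair coin (MAP). The counts and prior here are made up for illustration.

```python
# Hypothetical coin-bias estimation: 7 heads in 10 flips.
n, k = 10, 7

# MLE: maximize P(D | theta) alone -> the sample frequency.
theta_mle = k / n  # 0.7

# MAP with a Beta(5, 5) prior (a belief that the coin is roughly fair).
# The posterior is Beta(k + 5, n - k + 5); its mode is the MAP estimate.
a, b = 5, 5
theta_map = (k + a - 1) / (n + a + b - 2)  # 11/18, about 0.611

print(f"MLE: {theta_mle:.3f}, MAP: {theta_map:.3f}")
```

With a uniform Beta(1, 1) prior the MAP formula collapses to $k/n$, recovering the MLE, exactly as described above.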
## Example 1: The medical test problem
This is the classic example that shows why Bayes’ theorem is counterintuitive.
Setup:
- A disease affects 1 in 1000 people: $P(\text{disease}) = 0.001$
- A test detects the disease 99% of the time: $P(+ \mid \text{disease}) = 0.99$
- The test has a 5% false positive rate: $P(+ \mid \text{no disease}) = 0.05$
Question: You test positive. What is the probability you actually have the disease?
Step 1: Identify the pieces.

$$P(\text{disease}) = 0.001, \qquad P(+ \mid \text{disease}) = 0.99, \qquad P(+ \mid \text{no disease}) = 0.05$$

Step 2: Compute $P(+)$ using total probability.

$$P(+) = 0.99 \times 0.001 + 0.05 \times 0.999 = 0.00099 + 0.04995 = 0.05094$$

Step 3: Apply Bayes' theorem.

$$P(\text{disease} \mid +) = \frac{0.99 \times 0.001}{0.05094} \approx 0.0194$$
Only about 1.94%. A positive test still means there is less than a 2% chance you have the disease.
Why so low? The disease is rare (1 in 1000), so the false positives from the 999 healthy people vastly outnumber the true positives from the 1 sick person. Out of 1000 people tested:
- 1 sick person, test catches them (0.99 true positives)
- 999 healthy people, 5% test positive: $999 \times 0.05 \approx 50$ false positives
So roughly 1 true positive out of 51 total positives. That is about 2%.
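The arithmetic above is easy to verify in a few lines, using the values from the setup:

```python
# Medical test example: P(disease | positive) via Bayes' theorem.
p_disease = 0.001           # prior: 1 in 1000
p_pos_given_disease = 0.99  # sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of testing positive (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")  # 0.0194
```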
Medical test example: why false positives dominate when the disease is rare:
```mermaid
graph TD
    Pop["1000 people tested"] --> Sick["1 sick"]
    Pop --> Healthy["999 healthy"]
    Sick -->|"99% detected"| TP["1 true positive"]
    Healthy -->|"5% false alarm"| FP["50 false positives"]
    TP --> Total["51 total positives"]
    FP --> Total
    Total --> Post["P(sick | positive) = 1/51 = about 2%"]
```
Prior vs. posterior probabilities after a positive test: the prior of 0.1% becomes a posterior of only about 1.94%.
## Example 2: Spam classification with numbers
Naive Bayes is one of the simplest and most practical applications of Bayes’ theorem in ML. Let’s work through a real example.
Training data: 100 emails, 40 spam and 60 not spam.
Word frequencies:
| Word | Appears in spam | Appears in not-spam |
|---|---|---|
| "free" | 30 out of 40 | 6 out of 60 |
| "meeting" | 4 out of 40 | 36 out of 60 |
So:

$$P(\text{free} \mid \text{spam}) = \tfrac{30}{40} = 0.75, \qquad P(\text{free} \mid \text{ham}) = \tfrac{6}{60} = 0.10$$

$$P(\text{meeting} \mid \text{spam}) = \tfrac{4}{40} = 0.10, \qquad P(\text{meeting} \mid \text{ham}) = \tfrac{36}{60} = 0.60$$

with priors $P(\text{spam}) = 0.4$ and $P(\text{ham}) = 0.6$.
Question: A new email contains “free” but not “meeting.” Is it spam?
Step 1: The Naive Bayes assumption says word occurrences are independent given the class. So:

$$P(\text{free}, \neg\text{meeting} \mid \text{spam}) = 0.75 \times (1 - 0.10) = 0.675$$

$$P(\text{free}, \neg\text{meeting} \mid \text{ham}) = 0.10 \times (1 - 0.60) = 0.04$$

Step 2: Compute the evidence.

$$P(E) = 0.675 \times 0.4 + 0.04 \times 0.6 = 0.27 + 0.024 = 0.294$$

Step 3: Apply Bayes.

$$P(\text{spam} \mid E) = \frac{0.27}{0.294} \approx 0.918$$
About 92% chance it is spam. The word “free” is a strong spam signal, and the absence of “meeting” reinforces that.
What if the email contained "meeting" but not "free"?

$$P(\text{spam} \mid E) = \frac{(1 - 0.75) \times 0.10 \times 0.4}{(1 - 0.75) \times 0.10 \times 0.4 + (1 - 0.10) \times 0.60 \times 0.6} = \frac{0.01}{0.334} \approx 0.03$$

Only a 3% chance of spam. The model correctly flips its prediction based on the words.
## Example 3: Updating beliefs with multiple observations
Bayes’ theorem is naturally iterative. Today’s posterior becomes tomorrow’s prior.
Setup: A coin might be fair ($P(\text{heads}) = 0.5$) or biased toward heads ($P(\text{heads}) = 0.8$). You start with equal belief: $P(\text{fair}) = P(\text{biased}) = 0.5$.

Observation 1: Heads.

$$P(\text{biased} \mid H) = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.5 \times 0.5} = \frac{0.4}{0.65} \approx 0.615$$

After one heads, the biased hypothesis is slightly more likely.

Observation 2: Heads again. Now use the posterior from step 1 as the new prior.

$$P(\text{biased} \mid HH) = \frac{0.8 \times 0.615}{0.8 \times 0.615 + 0.5 \times 0.385} = \frac{0.492}{0.685} \approx 0.719$$

Observation 3: Tails. New prior: $P(\text{biased}) = 0.719$, $P(\text{fair}) = 0.281$. The likelihood of tails is $0.2$ for the biased coin and $0.5$ for the fair coin.

$$P(\text{biased} \mid HHT) = \frac{0.2 \times 0.719}{0.2 \times 0.719 + 0.5 \times 0.281} = \frac{0.1438}{0.2843} \approx 0.506$$

The tails observation pulls belief back toward fair. Each new data point adjusts the posterior. With enough data, the posterior concentrates on the true hypothesis.
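This updating loop is only a few lines of code. A minimal sketch with a hypothetical `update` helper, assuming (as a concrete choice) that the biased coin lands heads 80% of the time:

```python
def update(prior_biased, p_heads_biased, p_heads_fair, observation):
    """One Bayesian update; observation is 'H' or 'T'."""
    lik_biased = p_heads_biased if observation == "H" else 1 - p_heads_biased
    lik_fair = p_heads_fair if observation == "H" else 1 - p_heads_fair
    # Evidence: total probability of the observation under both hypotheses.
    evidence = lik_biased * prior_biased + lik_fair * (1 - prior_biased)
    return lik_biased * prior_biased / evidence

p_biased = 0.5  # start with equal belief in fair vs. biased
for obs in ["H", "H", "T"]:
    # Yesterday's posterior is today's prior.
    p_biased = update(p_biased, p_heads_biased=0.8, p_heads_fair=0.5,
                      observation=obs)
    print(f"After {obs}: P(biased) = {p_biased:.3f}")
```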
Bayesian updating cycle: the posterior becomes the next prior:
```mermaid
graph LR
    A["Prior P(H)"] --> B["Observe data"]
    B --> C["Compute likelihood P(D|H)"]
    C --> D["Apply Bayes' theorem"]
    D --> E["Posterior P(H|D)"]
    E -->|"Posterior becomes new prior"| A
```
## The Naive Bayes assumption

The "naive" in Naive Bayes refers to the assumption that features are conditionally independent given the class:

$$P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$
This is almost never true in practice. Words in an email are not independent. Pixel values in an image are not independent. But the assumption makes computation tractable and, surprisingly, Naive Bayes classifiers often perform well despite the wrong assumption.
Why does it work? Because classification only needs the posterior to be highest for the correct class. Even if the probability values are wrong, the ranking can still be correct.
## Prior, likelihood, posterior: a visual intuition
Think of it as a tug-of-war:
```mermaid
graph LR
    A["Prior P(H)"] -->|"× Likelihood P(E|H)"| B["Unnormalized posterior"]
    B -->|"÷ Evidence P(E)"| C["Posterior P(H|E)"]
```
- A strong prior requires strong evidence to overcome it.
- A weak prior gets easily swayed by data.
- More data means the likelihood dominates and the prior becomes irrelevant.
This is why, with enough training data, Bayesian and frequentist approaches converge to the same answer.
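A quick sketch of that convergence, using conjugate Beta priors for convenience: two analysts with very different priors end up with nearly the same posterior mean once the data piles up. The specific counts here are made up for illustration.

```python
# Observed data: 620 heads in 1000 flips (hypothetical).
heads, flips = 620, 1000

# Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails) posterior.
def posterior_mean(a, b):
    return (a + heads) / (a + b + flips)

skeptic = posterior_mean(50, 50)  # strong prior belief the coin is fair
agnostic = posterior_mean(1, 1)   # uniform prior
print(f"skeptic:  {skeptic:.3f}")   # 0.609
print(f"agnostic: {agnostic:.3f}")  # 0.620
```

With only 10 flips the two estimates would differ substantially; at 1000 flips the likelihood has all but washed out the prior.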
## Connection to regularization

In the ML parameter setting:

$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$

Taking the negative log:

$$-\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \text{const}$$

- $-\log P(D \mid \theta)$ is the loss function (e.g., cross-entropy for classification)
- $-\log P(\theta)$ is the regularization term

If $P(\theta)$ is Gaussian, $\theta \sim \mathcal{N}(0, \sigma^2 I)$, then $-\log P(\theta) \propto \|\theta\|_2^2$, which is L2 regularization. If $P(\theta)$ is Laplace distributed, you get L1 regularization.
So every time you add a regularization term to your loss function, you are implicitly choosing a Bayesian prior on the parameters.
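A small numerical check of the Gaussian case: the negative log-density of a zero-mean Gaussian prior is a quadratic L2 penalty plus a constant, and an additive constant never changes the minimizer.

```python
import numpy as np

sigma = 1.0
theta = np.linspace(-3, 3, 7)  # a few example weight values

# Negative log of a zero-mean Gaussian density at each theta.
neg_log_prior = 0.5 * (theta / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

# L2 penalty with lambda = 1 / (2 * sigma^2).
l2_penalty = theta**2 / (2 * sigma**2)

# The difference is the same constant everywhere.
print(np.allclose(neg_log_prior - l2_penalty,
                  0.5 * np.log(2 * np.pi * sigma**2)))  # True
```

Note the correspondence: a smaller $\sigma$ (a tighter prior) means a larger L2 coefficient, i.e., stronger regularization.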
## Python: Naive Bayes from scratch
```python
import numpy as np

# Training data: word counts per class
spam_counts = {"free": 30, "meeting": 4}
spam_total = 40
ham_counts = {"free": 6, "meeting": 36}
ham_total = 60
p_spam = 0.4
p_ham = 0.6

def classify(words_present, words_absent):
    """Return P(spam | evidence) for the given word pattern."""
    log_spam = np.log(p_spam)
    log_ham = np.log(p_ham)
    for w in words_present:
        log_spam += np.log(spam_counts[w] / spam_total)
        log_ham += np.log(ham_counts[w] / ham_total)
    for w in words_absent:
        log_spam += np.log(1 - spam_counts[w] / spam_total)
        log_ham += np.log(1 - ham_counts[w] / ham_total)
    # Convert back from log space, normalizing over both classes
    log_total = np.logaddexp(log_spam, log_ham)
    return np.exp(log_spam - log_total)

print(f"P(spam | 'free', no 'meeting') = {classify(['free'], ['meeting']):.3f}")
print(f"P(spam | 'meeting', no 'free') = {classify(['meeting'], ['free']):.3f}")
```
Notice we work in log space. This avoids numerical underflow when multiplying many small probabilities, which is critical in real Naive Bayes implementations.
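To see why this matters, compare the direct product of many small probabilities with the sum of their logs:

```python
import numpy as np

# 1000 factors of 0.01: the true product is 1e-2000, far below the
# smallest positive float64, so the direct product underflows to 0.0.
probs = np.full(1000, 0.01)
print(np.prod(probs))        # 0.0

# The log-space sum is a perfectly ordinary number.
print(np.log(probs).sum())   # about -4605.17
```

Real vocabularies have tens of thousands of words, so a document's likelihood is a product of that many small factors; without log space the posterior would be 0/0.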
## Common mistakes

Ignoring the base rate (prior): The medical test example shows this. A 99% accurate test still produces mostly false positives when the disease is rare. Always account for $P(H)$.

Confusing likelihood and posterior: $P(E \mid H)$ is not $P(H \mid E)$. The likelihood of the evidence given cancer is not the probability of cancer given the evidence.
Treating Naive Bayes probabilities as calibrated: The independence assumption means the raw probability outputs are usually wrong. Use Naive Bayes for ranking/classification, but do not trust its probability values without calibration.
## Summary

| Concept | Key formula | ML connection |
|---|---|---|
| Bayes' theorem | $P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$ | Foundation of probabilistic ML |
| Prior | $P(\theta)$ | Regularization |
| Likelihood | $P(D \mid \theta)$ | Loss function (negative log) |
| Posterior | $P(\theta \mid D)$ | What we want to compute |
| Naive Bayes | $P(x_1, \ldots, x_n \mid C) = \prod_i P(x_i \mid C)$ | Simple, effective classifier |
## What comes next
The next article covers information theory, where we connect probability to the concepts of entropy, KL divergence, and cross-entropy loss, the loss function used in nearly every classification model.