# Probability fundamentals

## In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Machine learning is fundamentally about making predictions under uncertainty. Probability gives you the formal tools to quantify that uncertainty, update beliefs when new data arrives, and reason about what your model “knows” vs. what it is guessing.
## Prerequisites

This article assumes basic mathematical maturity. If you are new to the series, start with *Why Maths Matters for ML: A Practical Overview*.
## Sample spaces and events

A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is a subset of $\Omega$, representing outcomes you care about.

Example: Roll a six-sided die.

- Sample space: $\Omega = \{1, 2, 3, 4, 5, 6\}$
- Event $A$ = "roll an even number" = $\{2, 4, 6\}$
- Event $B$ = "roll greater than 4" = $\{5, 6\}$

Events can be combined:

- Union $A \cup B$ (A or B) = $\{2, 4, 5, 6\}$
- Intersection $A \cap B$ (A and B) = $\{6\}$
- Complement $A^c$ (not A) = $\{1, 3, 5\}$
Set operations on events (Venn diagram concepts):

```mermaid
graph TD
    U["Sample Space: 1,2,3,4,5,6"] --> AuB["A ∪ B = 2,4,5,6"]
    U --> AnB["A ∩ B = 6"]
    U --> Ac["A complement = 1,3,5"]
    AuB --> Rule["P(A ∪ B) = P(A) + P(B) - P(A ∩ B)"]
```
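The set operations above map directly onto Python's built-in `set` type. A quick sketch using the die example:

```python
# Sample space for a six-sided die, and the two events from the example
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "roll an even number"
B = {5, 6}      # "roll greater than 4"

print(A | B)      # union A ∪ B -> {2, 4, 5, 6}
print(A & B)      # intersection A ∩ B -> {6}
print(omega - A)  # complement of A -> {1, 3, 5}

# Under equally likely outcomes, P(E) = |E| / |Ω|
def p(event):
    return len(event) / len(omega)

print(p(A | B))   # 4/6
```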
## The three axioms of probability

Every probability function $P$ must satisfy:

- Non-negativity: $P(A) \ge 0$ for any event $A$
- Normalization: $P(\Omega) = 1$
- Additivity: If $A$ and $B$ are mutually exclusive (cannot both happen), then $P(A \cup B) = P(A) + P(B)$

Everything else in probability theory follows from these three rules.

A useful consequence is the addition rule for events that are not mutually exclusive:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

You subtract $P(A \cap B)$ because it gets counted twice otherwise.
Event probabilities using the addition rule, with the die events above: $P(A \cup B) = \tfrac{3}{6} + \tfrac{2}{6} - \tfrac{1}{6} = \tfrac{4}{6}$, which matches counting $\{2, 4, 5, 6\}$ directly.
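The addition rule can be checked with exact arithmetic. A minimal sketch using the die events (even number, and greater than 4):

```python
from fractions import Fraction

p_A = Fraction(3, 6)        # P(A): three even faces
p_B = Fraction(2, 6)        # P(B): faces 5 and 6
p_A_and_B = Fraction(1, 6)  # P(A ∩ B): only the face 6 is in both

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_A_or_B = p_A + p_B - p_A_and_B
print(p_A_or_B)  # 2/3, matching |{2, 4, 5, 6}| / 6
```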
## Conditional probability

Conditional probability answers: "What is the probability of $A$ given that $B$ already happened?" The definition is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

provided $P(B) > 0$. You are restricting your attention to the world where $B$ is true, then asking how much of that world also contains $A$.

Conditional probability narrows the sample space:

```mermaid
graph LR
    Full["Full sample space Ω"] -->|"Condition on B"| Narrow["Restrict to event B"]
    Narrow -->|"Find A within B"| Result["P(A|B) = P(A ∩ B) / P(B)"]
```

This single formula is the foundation of Bayes' theorem and basically all of probabilistic ML.
## Example 1: Conditional probability with a contingency table

A company surveys 200 employees about their commute:

| | Drives | Takes transit | Total |
|---|---|---|---|
| Under 30 | 40 | 30 | 70 |
| 30 or older | 80 | 50 | 130 |
| Total | 120 | 80 | 200 |

Question 1: What is $P(\text{Drives} \mid \text{Under 30})$?

We need the probability of driving, given the person is under 30:

$$P(\text{Drives} \mid \text{Under 30}) = \frac{40}{70} \approx 0.571$$

Question 2: What is $P(\text{Under 30} \mid \text{Drives})$?

$$P(\text{Under 30} \mid \text{Drives}) = \frac{40}{120} \approx 0.333$$

Note that this differs from Question 1: the direction of conditioning matters.

Question 3: What is $P(\text{Takes transit} \mid \text{30 or older})$?

$$P(\text{Takes transit} \mid \text{30 or older}) = \frac{50}{130} \approx 0.385$$
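Conditional probabilities from a contingency table are just ratios of cell counts to row or column totals. A sketch with NumPy, using the counts above:

```python
import numpy as np

# Rows: Under 30, 30 or older; columns: Drives, Takes transit
counts = np.array([[40, 30],
                   [80, 50]])

# P(Drives | Under 30): drivers among the under-30 row
p_drives_given_under30 = counts[0, 0] / counts[0].sum()    # 40 / 70

# P(Under 30 | Drives): under-30s among the Drives column
p_under30_given_drives = counts[0, 0] / counts[:, 0].sum() # 40 / 120

print(round(p_drives_given_under30, 3))  # 0.571
print(round(p_under30_given_drives, 3))  # 0.333
```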
## Independence

Two events $A$ and $B$ are independent if knowing one happened gives you zero information about the other. Formally:

$$P(A \cap B) = P(A)\,P(B)$$

Equivalently, $P(A \mid B) = P(A)$. The conditional probability equals the unconditional probability, meaning $B$ is irrelevant to $A$.

Independence is not the same as "mutually exclusive." Mutually exclusive events are actually maximally dependent: if one happens, the other definitely did not.
## Example 2: Testing for independence

Using the same employee table, check whether "Drives" and "Under 30" are independent.

If independent, we need:

$$P(\text{Drives} \cap \text{Under 30}) = P(\text{Drives})\,P(\text{Under 30})$$

Compute each side:

$$P(\text{Drives} \cap \text{Under 30}) = \frac{40}{200} = 0.20, \qquad P(\text{Drives})\,P(\text{Under 30}) = \frac{120}{200} \cdot \frac{70}{200} = 0.6 \times 0.35 = 0.21$$

They are not independent, but they are close ($0.20$ vs. $0.21$). In practice, you often test whether the departure from independence is statistically significant (that is what the chi-squared test does).
Let’s construct an example that is independent. Suppose 200 students are surveyed:

| | Likes coffee | Does not like coffee | Total |
|---|---|---|---|
| Left-handed | 18 | 12 | 30 |
| Right-handed | 102 | 68 | 170 |
| Total | 120 | 80 | 200 |

Check independence of "Likes coffee" and "Left-handed":

$$P(\text{Coffee} \cap \text{Left}) = \frac{18}{200} = 0.09, \qquad P(\text{Coffee})\,P(\text{Left}) = \frac{120}{200} \cdot \frac{30}{200} = 0.6 \times 0.15 = 0.09$$

These events are independent. Knowing someone is left-handed tells you nothing about their coffee preference.
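The independence check is easy to automate: compare $P(A \cap B)$ against $P(A)\,P(B)$. A sketch covering both tables (the helper function `independence_gap` is ours, not a standard API):

```python
import numpy as np

def independence_gap(counts):
    """Return P(A ∩ B) - P(A) P(B) for the top-left cell of a 2x2 count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_joint = counts[0, 0] / n          # P(A ∩ B)
    p_row = counts[0].sum() / n         # P(A): first row's marginal
    p_col = counts[:, 0].sum() / n      # P(B): first column's marginal
    return p_joint - p_row * p_col

commute = [[40, 30], [80, 50]]    # Drives / Under 30
coffee = [[18, 12], [102, 68]]    # Likes coffee / Left-handed
print(independence_gap(commute))  # ≈ -0.01 (0.20 vs 0.21: not independent)
print(independence_gap(coffee))   # ≈ 0.0 (0.09 vs 0.09: independent)
```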
## The law of total probability

If events $B_1, B_2, \dots, B_n$ partition the sample space (they cover everything and do not overlap), then for any event $A$:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$

This is like computing a weighted average of the conditional probabilities. It is essential for deriving Bayes' theorem and for reasoning about mixture models in ML.

Probability tree: branching over a partition of the sample space:

```mermaid
graph TD
    S["Start"] --> B1["Machine 1: 50%"]
    S --> B2["Machine 2: 30%"]
    S --> B3["Machine 3: 20%"]
    B1 -->|"3% defect"| D1["Defective"]
    B1 -->|"97% OK"| OK1["Good"]
    B2 -->|"5% defect"| D2["Defective"]
    B2 -->|"95% OK"| OK2["Good"]
    B3 -->|"2% defect"| D3["Defective"]
    B3 -->|"98% OK"| OK3["Good"]
```
## Example 3: Law of total probability
A factory has three machines producing bolts:
- Machine 1 produces 50% of bolts, with 3% defect rate
- Machine 2 produces 30% of bolts, with 5% defect rate
- Machine 3 produces 20% of bolts, with 2% defect rate
What is the overall defect rate?
Let $D$ = "bolt is defective." Apply total probability:

$$P(D) = 0.03 \times 0.5 + 0.05 \times 0.3 + 0.02 \times 0.2 = 0.015 + 0.015 + 0.004 = 0.034$$
The overall defect rate is 3.4%. Notice that Machine 2, despite producing only 30% of bolts, contributes as much to defects as Machine 1 (which produces 50%) because Machine 2’s defect rate is higher.
Follow-up: Given a bolt is defective, what is the probability it came from Machine 2? This requires Bayes' theorem:

$$P(M_2 \mid D) = \frac{P(D \mid M_2)\,P(M_2)}{P(D)} = \frac{0.05 \times 0.3}{0.034} = \frac{0.015}{0.034} \approx 0.441$$
Machine 2 is responsible for about 44% of defective bolts, even though it produces only 30% of total output.
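The whole factory calculation fits in a few lines. A sketch using the numbers from the example:

```python
# Machine shares and per-machine defect rates from the example
priors = {"M1": 0.50, "M2": 0.30, "M3": 0.20}
defect_rates = {"M1": 0.03, "M2": 0.05, "M3": 0.02}

# Law of total probability: P(D) = sum over machines of P(D|M) P(M)
p_defect = sum(defect_rates[m] * priors[m] for m in priors)
print(round(p_defect, 3))  # 0.034

# Bayes' theorem: P(M2 | D) = P(D|M2) P(M2) / P(D)
p_m2_given_defect = defect_rates["M2"] * priors["M2"] / p_defect
print(round(p_m2_given_defect, 3))  # 0.441
```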
## Joint and marginal probability

The joint probability $P(A, B)$ is another notation for $P(A \cap B)$, the probability both events occur.

If you have a joint distribution over two variables, you get the marginal distribution by summing out the other variable:

$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

This "sums out" or "marginalizes over" $Y$. It is the same idea as the law of total probability.

For ML, joint and marginal distributions show up constantly. When you build a classifier, you model the joint distribution of features and labels $P(\mathbf{x}, y)$, then use it to compute $P(y \mid \mathbf{x})$.
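Marginalization is just a row or column sum over a joint table. A sketch with a made-up joint distribution over two binary variables (the numbers here are purely illustrative):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): X in {0,1} indexes rows, Y in {0,1} columns
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])
assert np.isclose(joint.sum(), 1.0)  # a valid joint distribution sums to 1

# Marginals: sum out the other variable
p_x = joint.sum(axis=1)  # P(X) -> [0.4, 0.6]
p_y = joint.sum(axis=0)  # P(Y) -> [0.3, 0.7]
print(p_x, p_y)

# Conditional from joint and marginal: P(Y | X=0) = P(X=0, Y) / P(X=0)
p_y_given_x0 = joint[0] / p_x[0]
print(p_y_given_x0)  # [0.25, 0.75]
```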
## The product rule (chain rule of probability)

From the definition of conditional probability:

$$P(A \cap B) = P(A \mid B)\,P(B)$$

This extends to more variables:

$$P(A \cap B \cap C) = P(A \mid B \cap C)\,P(B \mid C)\,P(C)$$

In general, for $n$ events:

$$P(A_1 \cap A_2 \cap \dots \cap A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1})$$

This "chain rule" of probability (not to be confused with the calculus chain rule) is how autoregressive language models generate text: each token's probability is conditioned on all previous tokens.
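The autoregressive decomposition can be sketched for a toy three-token sequence. The conditional probabilities below are made up for illustration, not taken from any real model:

```python
# Hypothetical per-token conditionals for the sequence ["the", "cat", "sat"]
p_t1 = 0.20              # P(t1 = "the")
p_t2_given_t1 = 0.05     # P(t2 = "cat" | t1)
p_t3_given_t1_t2 = 0.10  # P(t3 = "sat" | t1, t2)

# Chain rule: P(t1, t2, t3) = P(t1) P(t2 | t1) P(t3 | t1, t2)
p_sequence = p_t1 * p_t2_given_t1 * p_t3_given_t1_t2
print(f"{p_sequence:.4f}")  # 0.0010
```

This is exactly how a language model scores a sentence: multiply (or, in practice, sum the logs of) the per-token conditional probabilities.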
## A visual mental model

Think of probability as area. The sample space is a rectangle with area 1. Events are regions inside it. Conditional probability $P(A \mid B)$ is the fraction of $B$'s area that overlaps with $A$.

```mermaid
graph LR
    subgraph "Sample Space Ω (area = 1)"
        A["Event A"]
        B["Event B"]
        AB["A ∩ B"]
    end
```
This “area” intuition also works for continuous distributions, where probabilities become integrals over density functions. We cover that in the next article.
## Python: simulating conditional probability

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Roll two dice
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)
total = die1 + die2

# P(total >= 10 | die1 = 6): condition by restricting to trials where die1 == 6
mask_die1_is_6 = die1 == 6
p_given = np.mean(total[mask_die1_is_6] >= 10)
print(f"P(total >= 10 | die1 = 6) ≈ {p_given:.3f}")

# Exact: given die1 = 6, total >= 10 requires die2 in {4, 5, 6} → 3/6 = 0.5
print(f"Exact: {3/6:.3f}")
```
## Common pitfalls

Confusing $P(A \mid B)$ with $P(B \mid A)$: These are generally not equal. $P(\text{fever} \mid \text{flu})$ is high, but $P(\text{flu} \mid \text{fever})$ is much lower because many things cause fevers. This confusion is called the "prosecutor's fallacy."

Assuming independence without checking: Many ML models (like Naive Bayes) assume features are independent. That assumption is almost always wrong, but it often works surprisingly well in practice.

Forgetting that conditional probability changes the sample space: When you condition on $B$, you are working in a smaller universe. All probabilities are recalculated relative to $P(B)$.
## Summary

| Concept | Formula | Why it matters |
|---|---|---|
| Conditional probability | $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ | Foundation of Bayesian reasoning |
| Independence | $P(A \cap B) = P(A)\,P(B)$ | Simplifies computation, key ML assumption |
| Total probability | $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ | Combines information across partitions |
| Product rule | $P(A \cap B) = P(A \mid B)\,P(B)$ | Decomposes joint distributions |
## What comes next
The next article covers random variables and distributions, where we connect probability to the mathematical objects that ML actually computes with: expectations, variances, and named distributions like the Gaussian.