Maths for ML · Part 12

Probability fundamentals

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Machine learning is fundamentally about making predictions under uncertainty. Probability gives you the formal tools to quantify that uncertainty, update beliefs when new data arrives, and reason about what your model “knows” vs. what it is guessing.

Prerequisites

This article assumes basic mathematical maturity. If you are new to the series, start with Part 1, Why Maths Matters for ML.

Sample spaces and events

A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is a subset of $\Omega$, representing outcomes you care about.

Example: Roll a six-sided die.

  • Sample space: $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • Event $A$ = “roll an even number” = $\{2, 4, 6\}$
  • Event $B$ = “roll greater than 4” = $\{5, 6\}$

Events can be combined:

  • $A \cup B$ (A or B) = $\{2, 4, 5, 6\}$
  • $A \cap B$ (A and B) = $\{6\}$
  • $A^c$ (not A) = $\{1, 3, 5\}$

Set operations on events (Venn diagram concepts):

```mermaid
graph TD
  U["Sample Space: 1,2,3,4,5,6"] --> AuB["A ∪ B = 2,4,5,6"]
  U --> AnB["A ∩ B = 6"]
  U --> Ac["A complement = 1,3,5"]
  AuB --> Rule["P(A ∪ B) = P(A) + P(B) - P(A ∩ B)"]
```

The three axioms of probability

Every probability function $P$ must satisfy:

  1. Non-negativity: $P(A) \geq 0$ for any event $A$
  2. Normalization: $P(\Omega) = 1$
  3. Additivity: If $A$ and $B$ are mutually exclusive (cannot both happen), then $P(A \cup B) = P(A) + P(B)$

Everything else in probability theory follows from these three rules.

A useful consequence is the addition rule for events that are not mutually exclusive:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

You subtract $P(A \cap B)$ because it gets counted twice otherwise.
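
As a quick sanity check, here is a small sketch using Python's `fractions` module and the die events $A$ and $B$ from above (the `P` helper assumes a fair die, so each event's probability is its size divided by 6):

```python
# Verify the addition rule P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
# on the fair-die example: A = even rolls, B = rolls greater than 4.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}

# Uniform probability: size of the event over size of the sample space
P = lambda event: Fraction(len(event), len(omega))

lhs = P(A | B)                 # direct: |{2,4,5,6}| / 6 = 2/3
rhs = P(A) + P(B) - P(A & B)   # addition rule: 3/6 + 2/6 - 1/6 = 2/3
assert lhs == rhs
```

Using `Fraction` keeps the arithmetic exact, so the two sides match with no floating-point fuzz.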

Conditional probability

Conditional probability answers: “What is the probability of $A$ given that $B$ already happened?” The definition is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

provided $P(B) > 0$. You are restricting your attention to the world where $B$ is true, then asking how much of that world also contains $A$.

Conditional probability narrows the sample space:

```mermaid
graph LR
  Full["Full sample space Ω"] -->|"Condition on B"| Narrow["Restrict to event B"]
  Narrow -->|"Find A within B"| Result["P(A|B) = P(A ∩ B) / P(B)"]
```

This single formula is the foundation of Bayes’ theorem and basically all of probabilistic ML.

Example 1: Conditional probability with a contingency table

A company surveys 200 employees about their commute:

|             | Drives | Takes transit | Total |
|-------------|--------|---------------|-------|
| Under 30    | 40     | 30            | 70    |
| 30 or older | 80     | 50            | 130   |
| Total       | 120    | 80            | 200   |

Question 1: What is $P(\text{Drives} \mid \text{Under 30})$?

We need the probability of driving, given the person is under 30.

$$P(\text{Drives} \mid \text{Under 30}) = \frac{P(\text{Drives} \cap \text{Under 30})}{P(\text{Under 30})} = \frac{40/200}{70/200} = \frac{40}{70} = \frac{4}{7} \approx 0.571$$

Question 2: What is $P(\text{Under 30} \mid \text{Takes transit})$?

$$P(\text{Under 30} \mid \text{Takes transit}) = \frac{P(\text{Under 30} \cap \text{Takes transit})}{P(\text{Takes transit})} = \frac{30/200}{80/200} = \frac{30}{80} = \frac{3}{8} = 0.375$$

Question 3: What is $P(\text{Drives} \cup \text{Under 30})$?

$$P(\text{Drives} \cup \text{Under 30}) = P(\text{Drives}) + P(\text{Under 30}) - P(\text{Drives} \cap \text{Under 30}) = \frac{120}{200} + \frac{70}{200} - \frac{40}{200} = \frac{150}{200} = 0.75$$
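
The three answers can be reproduced directly from the table counts; the sketch below hard-codes the four cell counts from the survey and works with plain floats:

```python
# Conditional and union probabilities from the 200-employee table.
counts = {
    ("under30", "drives"): 40,
    ("under30", "transit"): 30,
    ("over30", "drives"): 80,
    ("over30", "transit"): 50,
}
n = sum(counts.values())  # 200

p_drives_and_under30 = counts[("under30", "drives")] / n
p_under30 = (counts[("under30", "drives")] + counts[("under30", "transit")]) / n
p_drives = (counts[("under30", "drives")] + counts[("over30", "drives")]) / n

# Q1: P(Drives | Under 30) = 0.20 / 0.35 ≈ 0.571
q1 = p_drives_and_under30 / p_under30
# Q3: P(Drives ∪ Under 30) via the addition rule ≈ 0.75
q3 = p_drives + p_under30 - p_drives_and_under30
print(f"Q1 ≈ {q1:.3f}, Q3 ≈ {q3:.3f}")
```

Note that every probability here is just a count divided by 200, which is why conditioning reduces to dividing one cell count by a row total.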

Independence

Two events $A$ and $B$ are independent if knowing one happened gives you zero information about the other. Formally:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently, $P(A \mid B) = P(A)$. The conditional probability equals the unconditional probability, meaning $B$ is irrelevant to $A$.

Independence is not the same as “mutually exclusive.” Mutually exclusive events are actually maximally dependent: if one happens, the other definitely did not.

Example 2: Testing for independence

Using the same employee table, check whether “Drives” and “Under 30” are independent.

If independent, we need:

$$P(\text{Drives} \cap \text{Under 30}) = P(\text{Drives}) \cdot P(\text{Under 30})$$

Compute each side:

$$P(\text{Drives} \cap \text{Under 30}) = \frac{40}{200} = 0.20$$
$$P(\text{Drives}) \cdot P(\text{Under 30}) = \frac{120}{200} \cdot \frac{70}{200} = 0.60 \times 0.35 = 0.21$$
$$0.20 \neq 0.21$$

They are not independent. But they are close. In practice, you often test whether the departure from independence is statistically significant (that is what the chi-squared test does).

Let’s construct an example that is independent. Suppose 200 students are surveyed:

|              | Likes coffee | Does not like coffee | Total |
|--------------|--------------|----------------------|-------|
| Left-handed  | 18           | 12                   | 30    |
| Right-handed | 102          | 68                   | 170   |
| Total        | 120          | 80                   | 200   |

Check independence of “Likes coffee” and “Left-handed”:

$$P(\text{Coffee} \cap \text{Left}) = \frac{18}{200} = 0.09$$
$$P(\text{Coffee}) \cdot P(\text{Left}) = \frac{120}{200} \cdot \frac{30}{200} = 0.60 \times 0.15 = 0.09$$
$$0.09 = 0.09 \quad \checkmark$$

These events are independent. Knowing someone is left-handed tells you nothing about their coffee preference.
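Both checks can be sketched as one small function; `independent` and its tolerance `tol` are illustrative names, and the tolerance only absorbs floating-point rounding, not a statistical test:

```python
# Check the independence condition P(A ∩ B) = P(A) · P(B)
# for both tables above, using probabilities read off the tables.
def independent(p_joint, p_a, p_b, tol=1e-12):
    """True if the joint probability matches the product of marginals."""
    return abs(p_joint - p_a * p_b) < tol

# Employees: Drives vs. Under 30 → 0.20 vs 0.21, not independent
assert not independent(40/200, 120/200, 70/200)

# Students: Likes coffee vs. Left-handed → 0.09 vs 0.09, independent
assert independent(18/200, 120/200, 30/200)
```

For real data you would replace the exact comparison with a chi-squared test, since sampled frequencies almost never match the product of marginals exactly.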

The law of total probability

If events $B_1, B_2, \ldots, B_n$ partition the sample space (they cover everything and do not overlap), then for any event $A$:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$$

This is like computing a weighted average of the conditional probabilities. It is essential for deriving Bayes’ theorem and for reasoning about mixture models in ML.

Probability tree: branching over a partition of the sample space:

```mermaid
graph TD
  S["Start"] --> B1["Machine 1: 50%"]
  S --> B2["Machine 2: 30%"]
  S --> B3["Machine 3: 20%"]
  B1 -->|"3% defect"| D1["Defective"]
  B1 -->|"97% OK"| OK1["Good"]
  B2 -->|"5% defect"| D2["Defective"]
  B2 -->|"95% OK"| OK2["Good"]
  B3 -->|"2% defect"| D3["Defective"]
  B3 -->|"98% OK"| OK3["Good"]
```

Example 3: Law of total probability

A factory has three machines producing bolts:

  • Machine 1 produces 50% of bolts, with 3% defect rate
  • Machine 2 produces 30% of bolts, with 5% defect rate
  • Machine 3 produces 20% of bolts, with 2% defect rate

What is the overall defect rate?

Let $D$ = “bolt is defective.” Apply total probability:

$$P(D) = P(D \mid M_1) \cdot P(M_1) + P(D \mid M_2) \cdot P(M_2) + P(D \mid M_3) \cdot P(M_3)$$
$$= 0.03 \times 0.50 + 0.05 \times 0.30 + 0.02 \times 0.20 = 0.015 + 0.015 + 0.004 = 0.034$$

The overall defect rate is 3.4%. Notice that Machine 2, despite producing only 30% of bolts, contributes as much to defects as Machine 1 (which produces 50%) because Machine 2’s defect rate is higher.

Follow-up: Given a bolt is defective, what is the probability it came from Machine 2? This requires Bayes’ theorem:

$$P(M_2 \mid D) = \frac{P(D \mid M_2) \cdot P(M_2)}{P(D)} = \frac{0.05 \times 0.30}{0.034} = \frac{0.015}{0.034} \approx 0.441$$

Machine 2 is responsible for about 44% of defective bolts, even though it produces only 30% of total output.
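
The whole factory example fits in a few lines; the dictionary below simply encodes the production shares and defect rates stated above:

```python
# Total probability and the Bayes follow-up for the factory example.
machines = {          # machine: (production share, defect rate)
    "M1": (0.50, 0.03),
    "M2": (0.30, 0.05),
    "M3": (0.20, 0.02),
}

# Law of total probability: P(D) = Σ P(D | Mi) · P(Mi)
p_defect = sum(share * rate for share, rate in machines.values())
print(f"P(D) = {p_defect:.3f}")          # 0.034

# Bayes: P(M2 | D) = P(D | M2) · P(M2) / P(D)
share2, rate2 = machines["M2"]
p_m2_given_d = rate2 * share2 / p_defect
print(f"P(M2 | D) = {p_m2_given_d:.3f}")  # ≈ 0.441
```

Adding a fourth machine is just another dictionary entry, which is the practical appeal of writing the sum over a partition rather than expanding it by hand.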

Joint and marginal probability

The joint probability $P(A, B)$ is another notation for $P(A \cap B)$, the probability both events occur.

If you have a joint distribution over two variables, you get the marginal distribution by summing out the other variable:

$$P(A) = \sum_{b} P(A, B = b)$$

This “sums out” or “marginalizes over” $B$. It is the same idea as the law of total probability.

For ML, joint and marginal distributions show up constantly. When you build a generative classifier, you model the joint distribution $P(X, Y)$ of features $X$ and labels $Y$, then use it to compute $P(Y \mid X)$.
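
Marginalization is a one-liner once the joint distribution is a table. This sketch reuses the employee survey as a joint distribution over age group and commute mode:

```python
# Marginalize a small joint table P(X, Y) built from the
# employee survey (rows: under 30, 30 or older; cols: drives, transit).
import numpy as np

joint = np.array([[40, 30],
                  [80, 50]]) / 200.0   # counts → joint probabilities

p_age = joint.sum(axis=1)    # sum out commute mode → P(age)
p_mode = joint.sum(axis=0)   # sum out age → P(commute mode)

# Marginals: P(under 30) = 0.35, P(drives) = 0.60
print(p_age, p_mode)
```

Summing along an axis of the table is exactly the $\sum_b P(A, B=b)$ operation, just vectorized.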

The product rule (chain rule of probability)

From the definition of conditional probability:

$$P(A, B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

This extends to more variables:

$$P(A, B, C) = P(A \mid B, C) \cdot P(B \mid C) \cdot P(C)$$

In general, for $n$ events:

$$P(A_1, A_2, \ldots, A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$

This “chain rule” of probability (not to be confused with the calculus chain rule) is how autoregressive language models generate text: each token’s probability is conditioned on all previous tokens.
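
As a toy illustration of that factorization, here is a made-up bigram model, a first-order simplification where each token conditions only on the previous one rather than on the whole prefix; all probabilities are invented for the example:

```python
# Chain rule with a toy bigram "language model":
# P(t1, t2, ..., tn) = P(t1) · Π P(ti | t(i-1)).
p_first = {"the": 0.6, "a": 0.4}
p_next = {                     # P(next token | previous token)
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_prob(tokens):
    """Multiply the conditional probabilities along the sequence."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= p_next[prev][cur]
    return p

# 0.6 · 0.7 · 1.0 = 0.42
print(f"{sequence_prob(['the', 'cat', 'sat']):.2f}")
```

A real autoregressive model conditions each token on the full history, but the product structure is identical.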

A visual mental model

Think of probability as area. The sample space is a rectangle with area 1. Events are regions inside it. Conditional probability $P(A \mid B)$ is the fraction of $B$’s area that overlaps with $A$.

```mermaid
graph LR
  subgraph "Sample Space Ω (area = 1)"
      A["Event A"]
      B["Event B"]
      AB["A ∩ B"]
  end
```

This “area” intuition also works for continuous distributions, where probabilities become integrals over density functions. We cover that in the next article.

Python: simulating conditional probability

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Roll two dice
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)
total = die1 + die2

# P(total >= 10 | die1 = 6)
mask_die1_is_6 = die1 == 6
p_given = np.mean(total[mask_die1_is_6] >= 10)
print(f"P(total >= 10 | die1 = 6) ≈ {p_given:.3f}")

# Exact: outcomes where die1=6 and total>=10 are die2 in {4,5,6} → 3/6 = 0.5
print(f"Exact: {3/6:.3f}")
```

Common pitfalls

Confusing $P(A \mid B)$ with $P(B \mid A)$: These are generally not equal. $P(\text{fever} \mid \text{flu})$ is high, but $P(\text{flu} \mid \text{fever})$ is much lower because many things cause fevers. This confusion is called the “prosecutor’s fallacy.”

Assuming independence without checking: Many ML models (like Naive Bayes) assume features are independent. That assumption is almost always wrong, but it often works surprisingly well in practice.

Forgetting that conditional probability changes the sample space: When you condition on $B$, you are working in a smaller universe. All probabilities are recalculated relative to $B$.

Summary

| Concept | Formula | Why it matters |
|---------|---------|----------------|
| Conditional probability | $P(A \mid B) = P(A \cap B) / P(B)$ | Foundation of Bayesian reasoning |
| Independence | $P(A \cap B) = P(A)\,P(B)$ | Simplifies computation, key ML assumption |
| Total probability | $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ | Combines information across partitions |
| Product rule | $P(A, B) = P(A \mid B)\,P(B)$ | Decomposes joint distributions |

What comes next

The next article covers random variables and distributions, where we connect probability to the mathematical objects that ML actually computes with: expectations, variances, and named distributions like the Gaussian.
