Maths for ML · Part 12

Probability fundamentals

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Machine learning is fundamentally about making predictions under uncertainty. Probability gives you the formal tools to quantify that uncertainty, update beliefs when new data arrives, and reason about what your model “knows” vs. what it is guessing.

Prerequisites

This article assumes basic mathematical maturity. If you are new to the series, start with Part 1, Why Maths Matters for ML.

Sample spaces and events

A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is a subset of $\Omega$, representing outcomes you care about.

Example: Roll a six-sided die.

  • Sample space: $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • Event $A$ = “roll an even number” = $\{2, 4, 6\}$
  • Event $B$ = “roll greater than 4” = $\{5, 6\}$

Events can be combined:

  • $A \cup B$ (A or B) = $\{2, 4, 5, 6\}$
  • $A \cap B$ (A and B) = $\{6\}$
  • $A^c$ (not A) = $\{1, 3, 5\}$

Set operations on events (Venn diagram concepts):

```mermaid
graph TD
  U["Sample Space: 1,2,3,4,5,6"] --> AuB["A ∪ B = 2,4,5,6"]
  U --> AnB["A ∩ B = 6"]
  U --> Ac["A complement = 1,3,5"]
  AuB --> Rule["P(A ∪ B) = P(A) + P(B) - P(A ∩ B)"]
```

The three axioms of probability

Every probability function $P$ must satisfy:

  1. Non-negativity: $P(A) \geq 0$ for any event $A$
  2. Normalization: $P(\Omega) = 1$
  3. Additivity: If $A$ and $B$ are mutually exclusive (cannot both happen), then $P(A \cup B) = P(A) + P(B)$

Everything else in probability theory follows from these three rules.

A useful consequence is the addition rule for events that are not mutually exclusive:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

You subtract $P(A \cap B)$ because it gets counted twice otherwise.
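
As a quick sanity check, here is a small sketch using Python's `fractions` module and the die events $A$ and $B$ from above (the `P` helper assumes a fair die, so each event's probability is its size divided by 6):

```python
# Verify the addition rule P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
# on the fair-die example: A = even rolls, B = rolls greater than 4.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}

# Uniform probability: size of the event over size of the sample space
P = lambda event: Fraction(len(event), len(omega))

lhs = P(A | B)                 # direct: |{2,4,5,6}| / 6 = 2/3
rhs = P(A) + P(B) - P(A & B)   # addition rule: 3/6 + 2/6 - 1/6 = 2/3
assert lhs == rhs
```

Using `Fraction` keeps the arithmetic exact, so the two sides match with no floating-point fuzz.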

Conditional probability

Conditional probability answers: “What is the probability of $A$ given that $B$ already happened?” The definition is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

provided $P(B) > 0$. You are restricting your attention to the world where $B$ is true, then asking how much of that world also contains $A$.

Conditional probability narrows the sample space:

```mermaid
graph LR
  Full["Full sample space Ω"] -->|"Condition on B"| Narrow["Restrict to event B"]
  Narrow -->|"Find A within B"| Result["P(A|B) = P(A ∩ B) / P(B)"]
```

This single formula is the foundation of Bayes’ theorem and basically all of probabilistic ML.

Example 1: Conditional probability with a contingency table

A company surveys 200 employees about their commute:

|             | Drives | Takes transit | Total |
|-------------|--------|---------------|-------|
| Under 30    | 40     | 30            | 70    |
| 30 or older | 80     | 50            | 130   |
| Total       | 120    | 80            | 200   |

Question 1: What is $P(\text{Drives} \mid \text{Under 30})$?

We need the probability of driving, given the person is under 30.

$$P(\text{Drives} \mid \text{Under 30}) = \frac{P(\text{Drives} \cap \text{Under 30})}{P(\text{Under 30})} = \frac{40/200}{70/200} = \frac{40}{70} = \frac{4}{7} \approx 0.571$$

Question 2: What is $P(\text{Under 30} \mid \text{Takes transit})$?

$$P(\text{Under 30} \mid \text{Takes transit}) = \frac{P(\text{Under 30} \cap \text{Takes transit})}{P(\text{Takes transit})} = \frac{30/200}{80/200} = \frac{30}{80} = \frac{3}{8} = 0.375$$

Question 3: What is $P(\text{Drives} \cup \text{Under 30})$?

$$P(\text{Drives} \cup \text{Under 30}) = P(\text{Drives}) + P(\text{Under 30}) - P(\text{Drives} \cap \text{Under 30}) = \frac{120}{200} + \frac{70}{200} - \frac{40}{200} = \frac{150}{200} = 0.75$$
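
The three answers can be reproduced directly from the table counts; the sketch below hard-codes the four cell counts from the survey and works with plain floats:

```python
# Conditional and union probabilities from the 200-employee table.
counts = {
    ("under30", "drives"): 40,
    ("under30", "transit"): 30,
    ("over30", "drives"): 80,
    ("over30", "transit"): 50,
}
n = sum(counts.values())  # 200

p_drives_and_under30 = counts[("under30", "drives")] / n
p_under30 = (counts[("under30", "drives")] + counts[("under30", "transit")]) / n
p_drives = (counts[("under30", "drives")] + counts[("over30", "drives")]) / n

# Q1: P(Drives | Under 30) = 0.20 / 0.35 ≈ 0.571
q1 = p_drives_and_under30 / p_under30
# Q3: P(Drives ∪ Under 30) via the addition rule ≈ 0.75
q3 = p_drives + p_under30 - p_drives_and_under30
print(f"Q1 ≈ {q1:.3f}, Q3 ≈ {q3:.3f}")
```

Note that every probability here is just a count divided by 200, which is why conditioning reduces to dividing one cell count by a row total.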

Independence

Two events $A$ and $B$ are independent if knowing one happened gives you zero information about the other. Formally:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently, $P(A \mid B) = P(A)$. The conditional probability equals the unconditional probability, meaning $B$ is irrelevant to $A$.

Independence is not the same as “mutually exclusive.” Mutually exclusive events are actually maximally dependent: if one happens, the other definitely did not.

Example 2: Testing for independence

Using the same employee table, check whether “Drives” and “Under 30” are independent.

If independent, we need:

$$P(\text{Drives} \cap \text{Under 30}) = P(\text{Drives}) \cdot P(\text{Under 30})$$

Compute each side:

$$P(\text{Drives} \cap \text{Under 30}) = \frac{40}{200} = 0.20$$
$$P(\text{Drives}) \cdot P(\text{Under 30}) = \frac{120}{200} \cdot \frac{70}{200} = 0.60 \times 0.35 = 0.21$$
$$0.20 \neq 0.21$$

They are not independent. But they are close. In practice, you often test whether the departure from independence is statistically significant (that is what the chi-squared test does).

Let’s construct an example that is independent. Suppose 200 students are surveyed:

|              | Likes coffee | Does not like coffee | Total |
|--------------|--------------|----------------------|-------|
| Left-handed  | 18           | 12                   | 30    |
| Right-handed | 102          | 68                   | 170   |
| Total        | 120          | 80                   | 200   |

Check independence of “Likes coffee” and “Left-handed”:

$$P(\text{Coffee} \cap \text{Left}) = \frac{18}{200} = 0.09$$
$$P(\text{Coffee}) \cdot P(\text{Left}) = \frac{120}{200} \cdot \frac{30}{200} = 0.60 \times 0.15 = 0.09$$
$$0.09 = 0.09 \quad \checkmark$$

These events are independent. Knowing someone is left-handed tells you nothing about their coffee preference.
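Both checks can be sketched as one small function; `independent` and its tolerance `tol` are illustrative names, and the tolerance only absorbs floating-point rounding, not a statistical test:

```python
# Check the independence condition P(A ∩ B) = P(A) · P(B)
# for both tables above, using probabilities read off the tables.
def independent(p_joint, p_a, p_b, tol=1e-12):
    """True if the joint probability matches the product of marginals."""
    return abs(p_joint - p_a * p_b) < tol

# Employees: Drives vs. Under 30 → 0.20 vs 0.21, not independent
assert not independent(40/200, 120/200, 70/200)

# Students: Likes coffee vs. Left-handed → 0.09 vs 0.09, independent
assert independent(18/200, 120/200, 30/200)
```

For real data you would replace the exact comparison with a chi-squared test, since sampled frequencies almost never match the product of marginals exactly.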

The law of total probability

If events $B_1, B_2, \ldots, B_n$ partition the sample space (they cover everything and do not overlap), then for any event $A$:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$$

This is like computing a weighted average of the conditional probabilities. It is essential for deriving Bayes’ theorem and for reasoning about mixture models in ML.

Probability tree: branching over a partition of the sample space:

```mermaid
graph TD
  S["Start"] --> B1["Machine 1: 50%"]
  S --> B2["Machine 2: 30%"]
  S --> B3["Machine 3: 20%"]
  B1 -->|"3% defect"| D1["Defective"]
  B1 -->|"97% OK"| OK1["Good"]
  B2 -->|"5% defect"| D2["Defective"]
  B2 -->|"95% OK"| OK2["Good"]
  B3 -->|"2% defect"| D3["Defective"]
  B3 -->|"98% OK"| OK3["Good"]
```

Example 3: Law of total probability

A factory has three machines producing bolts:

  • Machine 1 produces 50% of bolts, with 3% defect rate
  • Machine 2 produces 30% of bolts, with 5% defect rate
  • Machine 3 produces 20% of bolts, with 2% defect rate

What is the overall defect rate?

Let $D$ = “bolt is defective.” Apply total probability:

$$P(D) = P(D \mid M_1) \cdot P(M_1) + P(D \mid M_2) \cdot P(M_2) + P(D \mid M_3) \cdot P(M_3)$$
$$= 0.03 \times 0.50 + 0.05 \times 0.30 + 0.02 \times 0.20 = 0.015 + 0.015 + 0.004 = 0.034$$

The overall defect rate is 3.4%. Notice that Machine 2, despite producing only 30% of bolts, contributes as much to defects as Machine 1 (which produces 50%) because Machine 2’s defect rate is higher.

Follow-up: Given a bolt is defective, what is the probability it came from Machine 2? This requires Bayes’ theorem:

$$P(M_2 \mid D) = \frac{P(D \mid M_2) \cdot P(M_2)}{P(D)} = \frac{0.05 \times 0.30}{0.034} = \frac{0.015}{0.034} \approx 0.441$$

Machine 2 is responsible for about 44% of defective bolts, even though it produces only 30% of total output.
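
The whole factory example fits in a few lines; the dictionary below simply encodes the production shares and defect rates stated above:

```python
# Total probability and the Bayes follow-up for the factory example.
machines = {          # machine: (production share, defect rate)
    "M1": (0.50, 0.03),
    "M2": (0.30, 0.05),
    "M3": (0.20, 0.02),
}

# Law of total probability: P(D) = Σ P(D | Mi) · P(Mi)
p_defect = sum(share * rate for share, rate in machines.values())
print(f"P(D) = {p_defect:.3f}")          # 0.034

# Bayes: P(M2 | D) = P(D | M2) · P(M2) / P(D)
share2, rate2 = machines["M2"]
p_m2_given_d = rate2 * share2 / p_defect
print(f"P(M2 | D) = {p_m2_given_d:.3f}")  # ≈ 0.441
```

Adding a fourth machine is just another dictionary entry, which is the practical appeal of writing the sum over a partition rather than expanding it by hand.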

Joint and marginal probability

The joint probability $P(A, B)$ is another notation for $P(A \cap B)$, the probability both events occur.

If you have a joint distribution over two variables, you get the marginal distribution by summing out the other variable:

$$P(A) = \sum_{b} P(A, B = b)$$

This “sums out” or “marginalizes over” $B$. It is the same idea as the law of total probability.

For ML, joint and marginal distributions show up constantly. When you build a generative classifier, you model the joint distribution $P(X, Y)$ of features $X$ and labels $Y$, then use it to compute $P(Y \mid X)$.
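
Marginalization is a one-liner once the joint distribution is a table. This sketch reuses the employee survey as a joint distribution over age group and commute mode:

```python
# Marginalize a small joint table P(X, Y) built from the
# employee survey (rows: under 30, 30 or older; cols: drives, transit).
import numpy as np

joint = np.array([[40, 30],
                  [80, 50]]) / 200.0   # counts → joint probabilities

p_age = joint.sum(axis=1)    # sum out commute mode → P(age)
p_mode = joint.sum(axis=0)   # sum out age → P(commute mode)

# Marginals: P(under 30) = 0.35, P(drives) = 0.60
print(p_age, p_mode)
```

Summing along an axis of the table is exactly the $\sum_b P(A, B=b)$ operation, just vectorized.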

The product rule (chain rule of probability)

From the definition of conditional probability:

$$P(A, B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

This extends to more variables:

$$P(A, B, C) = P(A \mid B, C) \cdot P(B \mid C) \cdot P(C)$$

In general, for $n$ events:

$$P(A_1, A_2, \ldots, A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$

This “chain rule” of probability (not to be confused with the calculus chain rule) is how autoregressive language models generate text: each token’s probability is conditioned on all previous tokens.
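
As a toy illustration of that factorization, here is a made-up bigram model, a first-order simplification where each token conditions only on the previous one rather than on the whole prefix; all probabilities are invented for the example:

```python
# Chain rule with a toy bigram "language model":
# P(t1, t2, ..., tn) = P(t1) · Π P(ti | t(i-1)).
p_first = {"the": 0.6, "a": 0.4}
p_next = {                     # P(next token | previous token)
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_prob(tokens):
    """Multiply the conditional probabilities along the sequence."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= p_next[prev][cur]
    return p

# 0.6 · 0.7 · 1.0 = 0.42
print(f"{sequence_prob(['the', 'cat', 'sat']):.2f}")
```

A real autoregressive model conditions each token on the full history, but the product structure is identical.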

A visual mental model

Think of probability as area. The sample space is a rectangle with area 1. Events are regions inside it. Conditional probability $P(A \mid B)$ is the fraction of $B$’s area that overlaps with $A$.

```mermaid
graph LR
  subgraph "Sample Space Ω (area = 1)"
      A["Event A"]
      B["Event B"]
      AB["A ∩ B"]
  end
```

This “area” intuition also works for continuous distributions, where probabilities become integrals over density functions. We cover that in the next article.

Python: simulating conditional probability

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Roll two dice
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)
total = die1 + die2

# P(total >= 10 | die1 = 6)
mask_die1_is_6 = die1 == 6
p_given = np.mean(total[mask_die1_is_6] >= 10)
print(f"P(total >= 10 | die1 = 6) ≈ {p_given:.3f}")

# Exact: outcomes where die1=6 and total>=10 are die2 in {4,5,6} → 3/6 = 0.5
print(f"Exact: {3/6:.3f}")
```

Common pitfalls

Confusing $P(A \mid B)$ with $P(B \mid A)$: These are generally not equal. $P(\text{fever} \mid \text{flu})$ is high, but $P(\text{flu} \mid \text{fever})$ is much lower because many things cause fevers. This confusion is called the “prosecutor’s fallacy.”

Assuming independence without checking: Many ML models (like Naive Bayes) assume features are independent. That assumption is almost always wrong, but it often works surprisingly well in practice.

Forgetting that conditional probability changes the sample space: When you condition on $B$, you are working in a smaller universe. All probabilities are recalculated relative to $B$.

Summary

| Concept | Formula | Why it matters |
|---------|---------|----------------|
| Conditional probability | $P(A \mid B) = P(A \cap B) / P(B)$ | Foundation of Bayesian reasoning |
| Independence | $P(A \cap B) = P(A)\,P(B)$ | Simplifies computation, key ML assumption |
| Total probability | $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ | Combines information across partitions |
| Product rule | $P(A, B) = P(A \mid B)\,P(B)$ | Decomposes joint distributions |

What comes next

The next article covers random variables and distributions, where we connect probability to the mathematical objects that ML actually computes with: expectations, variances, and named distributions like the Gaussian.
