Maths for ML · Part 13

Random variables and distributions

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes’ theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

A random variable is a function that assigns a number to each outcome in a sample space. That sounds abstract, so here is the concrete version: if you roll a die, the random variable X is just “the number that shows up.” If you flip a coin, you might define X = 1 for heads and X = 0 for tails. Random variables let you do arithmetic with randomness.

Prerequisites

You should be comfortable with probability fundamentals before reading this article.

Discrete vs. continuous

A discrete random variable takes countable values (1, 2, 3, … or “red,” “blue,” “green” mapped to numbers). A continuous random variable can take any value in an interval (like temperature or height).

The tools differ slightly:

  • Discrete: probability mass function (PMF)
  • Continuous: probability density function (PDF)
  • Both: cumulative distribution function (CDF)

Discrete vs continuous distributions at a glance:

graph TD
  RV["Random Variable X"] --> D["Discrete"]
  RV --> C["Continuous"]
  D --> PMF["PMF: P(X = x) as probability bars"]
  D --> DCDF["CDF: step function"]
  C --> PDF["PDF: smooth density curve"]
  C --> CCDF["CDF: smooth curve from 0 to 1"]
  PMF --> SumRule["All probabilities sum to 1"]
  PDF --> IntRule["Density integrates to 1"]

Probability mass function (PMF)

For a discrete random variable X, the PMF gives the probability of each value:

p(x) = P(X = x)

The PMF must satisfy:

  1. p(x) \geq 0 for all x
  2. \sum_x p(x) = 1

Example: A loaded die with PMF:

  x:    1    2    3    4    5    6
  p(x): 0.1  0.1  0.1  0.1  0.1  0.5

The die lands on 6 half the time. The probabilities sum to 0.1 \times 5 + 0.5 = 1. ✓
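To make this concrete, here is a quick NumPy sketch (the same library used in the code section later in this article) that samples from the loaded die and checks that the empirical frequencies match the PMF. The sample size and seed are arbitrary choices:

```python
import numpy as np

# Loaded die from the example: p(6) = 0.5, everything else 0.1
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# A valid PMF sums to 1
assert np.isclose(pmf.sum(), 1.0)

# Draw many rolls and compare the empirical frequency of a 6 to the PMF
rng = np.random.default_rng(0)
rolls = rng.choice(values, size=100_000, p=pmf)
freq_6 = np.mean(rolls == 6)
print(f"empirical P(X=6) = {freq_6:.3f}")  # close to 0.5
```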

Probability density function (PDF)

For a continuous random variable, you cannot assign probability to a single point (the probability of hitting exactly 3.14159… is zero). Instead, the PDF f(x) gives the density, and probabilities come from integrals:

P(a \leq X \leq b) = \int_a^b f(x) \, dx

The PDF must satisfy:

  1. f(x) \geq 0 for all x
  2. \int_{-\infty}^{\infty} f(x) \, dx = 1

Note: f(x) can be greater than 1 at some points. It is a density, not a probability.

Cumulative distribution function (CDF)

The CDF works for both discrete and continuous variables:

F(x) = P(X \leq x)

For discrete variables, the CDF is a step function. For continuous variables, it is a smooth increasing curve from 0 to 1. The CDF is useful because P(a < X \leq b) = F(b) - F(a).

CDF accumulates probability from left to right:

graph LR
  A["Start: F = 0"] -->|"accumulate probability"| B["F(x) = P(X ≤ x)"]
  B -->|"keeps growing"| C["End: F = 1"]
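A minimal check of the identity P(a < X \leq b) = F(b) - F(a), using scipy.stats (referenced again later in the article) with a standard normal as the example distribution:

```python
from scipy import stats

# Probability of landing in an interval, from CDF differences
a, b = -1.0, 1.0
prob = stats.norm.cdf(b) - stats.norm.cdf(a)
print(f"P(-1 < Z <= 1) = {prob:.4f}")  # about 0.6827, the familiar 68% rule
```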

Expectation (mean)

The expected value E[X] is the weighted average of all possible values:

Discrete:

E[X] = \sum_x x \cdot p(x)

Continuous:

E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

The expectation tells you the “center” of the distribution. It is the long-run average if you drew from the distribution many times.

Key properties:

  • Linearity: E[aX + b] = aE[X] + b
  • Sum: E[X + Y] = E[X] + E[Y] (always true, even if X and Y are dependent)

Variance and standard deviation

Variance measures how spread out the distribution is:

\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2

The second form is often easier to compute. The standard deviation is \sigma = \sqrt{\text{Var}(X)}, which has the same units as X.

Key properties:

  • \text{Var}(aX + b) = a^2 \text{Var}(X)
  • If X and Y are independent: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)
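These properties of expectation and variance are easy to sanity-check by simulation. The sketch below uses an exponential distribution purely as an arbitrary example; any distribution would do:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=200_000)   # arbitrary example distribution
X2 = rng.exponential(scale=2.0, size=200_000)  # independent second sample
a, b = 3.0, 5.0

# E[aX + b] = aE[X] + b and Var(aX + b) = a^2 Var(X)
Y = a * X + b
print(Y.mean(), a * X.mean() + b)   # equal
print(Y.var(), a**2 * X.var())      # equal

# For independent X and X2: Var(X + X2) = Var(X) + Var(X2)
S = X + X2
print(S.var(), X.var() + X2.var())  # approximately equal
```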

Example 1: Computing E[X] and \text{Var}(X) from a PMF

A random variable X has the following PMF:

  x:    0    1    2    3
  p(x): 0.1  0.3  0.4  0.2

Step 1: Compute E[X]

E[X] = 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 0 + 0.3 + 0.8 + 0.6 = 1.7

Step 2: Compute E[X^2]

E[X^2] = 0^2(0.1) + 1^2(0.3) + 2^2(0.4) + 3^2(0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7

Step 3: Compute \text{Var}(X)

\text{Var}(X) = E[X^2] - (E[X])^2 = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81

Step 4: Standard deviation

\sigma = \sqrt{0.81} = 0.9

So the distribution is centered at 1.7 with a spread of about 0.9 in each direction.
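The same four steps take a few lines of NumPy; the arrays below mirror the PMF from this example:

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.3, 0.4, 0.2])

mean = np.sum(x * p)       # E[X]: weighted average of values
ex2 = np.sum(x**2 * p)     # E[X^2]
var = ex2 - mean**2        # Var(X) = E[X^2] - (E[X])^2
std = np.sqrt(var)

print(mean, var, std)  # 1.7, 0.81, 0.9
```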

Example 2: Bernoulli and Binomial

The Bernoulli distribution models a single yes/no trial with success probability p:

X \sim \text{Bernoulli}(p): \quad P(X = 1) = p, \quad P(X = 0) = 1 - p

E[X] = p, \quad \text{Var}(X) = p(1 - p)

Let’s verify for p = 0.3:

E[X] = 0 \cdot 0.7 + 1 \cdot 0.3 = 0.3 \quad \checkmark

E[X^2] = 0^2 \cdot 0.7 + 1^2 \cdot 0.3 = 0.3

\text{Var}(X) = 0.3 - (0.3)^2 = 0.3 - 0.09 = 0.21

Check the formula: p(1 - p) = 0.3 \times 0.7 = 0.21 ✓

The Binomial distribution counts successes in n independent Bernoulli trials:

X \sim \text{Binomial}(n, p): \quad P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}

E[X] = np, \quad \text{Var}(X) = np(1 - p)

Example: Flip a coin with success probability p = 0.4, n = 10 times. What is P(X = 3)?

P(X = 3) = \binom{10}{3} (0.4)^3 (0.6)^7 = 120 \times 0.064 \times 0.0280 = 0.2150

E[X] = 10 \times 0.4 = 4, \quad \text{Var}(X) = 10 \times 0.4 \times 0.6 = 2.4
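These hand computations can be cross-checked with scipy.stats:

```python
from scipy import stats

# P(X = 3) for X ~ Binomial(n=10, p=0.4)
p3 = stats.binom.pmf(3, 10, 0.4)
print(f"P(X=3) = {p3:.4f}")  # about 0.2150, matching the hand computation

# Theoretical mean and variance
mean, var = stats.binom.stats(10, 0.4)
print(float(mean), float(var))  # 4.0 2.4
```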

The Gaussian (Normal) distribution

The most important distribution in ML. Its PDF is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

where \mu is the mean and \sigma^2 is the variance. Written as X \sim \mathcal{N}(\mu, \sigma^2).

Why it matters:

  • The Central Limit Theorem says that averages of many random variables converge to a Gaussian, regardless of the original distribution.
  • Many ML algorithms assume Gaussian noise.
  • The Gaussian has maximum entropy among all distributions with a given mean and variance (connecting to information theory).

The standard normal has \mu = 0 and \sigma = 1, written Z \sim \mathcal{N}(0, 1).

(Figure: Gaussian PDFs with different means and variances.)

Example 3: Gaussian standardization

Exam scores follow X \sim \mathcal{N}(72, 64), so \mu = 72 and \sigma = 8.

Question: What fraction of students score above 84?

Step 1: Standardize. Convert to the standard normal:

Z = \frac{X - \mu}{\sigma} = \frac{84 - 72}{8} = \frac{12}{8} = 1.5

Step 2: Look up or compute P(Z > 1.5).

From the standard normal table (or scipy.stats.norm):

P(Z \leq 1.5) = 0.9332

P(Z > 1.5) = 1 - 0.9332 = 0.0668

About 6.7% of students score above 84.

Question: What score is at the 90th percentile?

Step 1: Find z where P(Z \leq z) = 0.90. From the table: z \approx 1.282.

Step 2: Convert back:

x = \mu + z\sigma = 72 + 1.282 \times 8 = 72 + 10.26 = 82.26

A student at the 90th percentile scored about 82.3.
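Both questions are one-liners with scipy.stats.norm: the survival function sf gives the upper tail, and ppf (the inverse CDF, or percent point function) gives percentiles:

```python
from scipy import stats

mu, sigma = 72, 8

# Fraction scoring above 84: survival function sf(x) = 1 - cdf(x)
above = stats.norm.sf(84, loc=mu, scale=sigma)
print(f"P(X > 84) = {above:.4f}")  # about 0.0668

# 90th percentile: inverse CDF
p90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)
print(f"90th percentile = {p90:.2f}")  # about 82.25
```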

The Poisson distribution

The Poisson models the count of rare events in a fixed interval:

X \sim \text{Poisson}(\lambda): \quad P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

E[X] = \lambda, \quad \text{Var}(X) = \lambda

The mean and variance are both equal to \lambda, which is a distinctive property.

Example: A website gets an average of 3 errors per hour (\lambda = 3). What is P(X = 5)?

P(X = 5) = \frac{3^5 e^{-3}}{5!} = \frac{243 \times 0.0498}{120} = \frac{12.10}{120} = 0.1008

About a 10% chance of exactly 5 errors in an hour.
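The same check with scipy.stats, including the distinctive mean-equals-variance property:

```python
from scipy import stats

# P(X = 5) for X ~ Poisson(3)
p5 = stats.poisson.pmf(5, 3)
print(f"P(X=5) = {p5:.4f}")  # about 0.1008

# Mean and variance are both lambda
mean, var = stats.poisson.stats(3)
print(float(mean), float(var))  # 3.0 3.0
```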

(Figure: Poisson PMF with λ = 3.)

Example 4: Transforming random variables

If X \sim \mathcal{N}(\mu, \sigma^2) and you apply a linear transformation Y = aX + b, then:

Y \sim \mathcal{N}(a\mu + b, \, a^2\sigma^2)

Concrete example: Temperature in Celsius is C \sim \mathcal{N}(20, 9), so \mu_C = 20 and \sigma_C = 3.

Convert to Fahrenheit: F = 1.8C + 32.

\mu_F = 1.8 \times 20 + 32 = 36 + 32 = 68

\sigma_F^2 = (1.8)^2 \times 9 = 3.24 \times 9 = 29.16

\sigma_F = \sqrt{29.16} = 5.4

So F \sim \mathcal{N}(68, 29.16), which means temperatures in Fahrenheit are centered at 68 with a standard deviation of 5.4.
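A quick simulation check of the transformation rule (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Celsius temperatures: N(20, 9), so sigma = 3
C = rng.normal(loc=20, scale=3, size=100_000)
F = 1.8 * C + 32  # linear transformation to Fahrenheit

print(f"F: mean ≈ {F.mean():.1f}, std ≈ {F.std():.2f}")  # about 68 and 5.4
```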

Covariance and correlation

When you have two random variables, their covariance measures how they move together:

\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

If \text{Cov}(X, Y) > 0, they tend to increase together. If negative, one tends to decrease when the other increases. If zero, they are uncorrelated (but not necessarily independent).

The correlation normalizes covariance to the range [-1, 1]:

\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

Covariance appears everywhere in ML. The covariance matrix of a multivariate Gaussian determines the shape and orientation of the distribution. PCA finds directions of maximum variance, which is fundamentally about the eigenstructure of the covariance matrix.
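A small sketch computing sample covariance and correlation with NumPy. The data here is synthetic, constructed so that Y is X plus noise; with this construction Cov(X, Y) = Var(X) = 1 and ρ = 1/√1.25 ≈ 0.894:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables: Y = X + noise
X = rng.normal(size=50_000)
Y = X + 0.5 * rng.normal(size=50_000)

cov = np.cov(X, Y)[0, 1]       # sample covariance
rho = np.corrcoef(X, Y)[0, 1]  # correlation, normalized to [-1, 1]
print(f"Cov(X, Y) ≈ {cov:.3f}, rho ≈ {rho:.3f}")
```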

Named distributions cheat sheet

Distribution      | Parameters | E[X]      | Var(X)    | Use case
Bernoulli(p)      | p          | p         | p(1-p)    | Binary outcomes
Binomial(n, p)    | n, p       | np        | np(1-p)   | Count of successes
Poisson(λ)        | λ          | λ         | λ         | Rare event counts
Gaussian(μ, σ²)   | μ, σ²      | μ         | σ²        | Continuous measurements
Uniform(a, b)     | a, b       | (a+b)/2   | (b-a)²/12 | Equally likely outcomes
Exponential(λ)    | λ          | 1/λ       | 1/λ²      | Time between events

Choosing a distribution based on the problem:

graph TD
  A["What type of outcome?"] --> B{"Binary yes/no?"}
  B -->|"Single trial"| C["Bernoulli"]
  B -->|"Count of n trials"| D["Binomial"]
  A --> E{"Count of rare events?"}
  E -->|Yes| F["Poisson"]
  A --> G{"Continuous measurement?"}
  G -->|"Bell-shaped"| H["Gaussian"]
  G -->|"Equally likely range"| I["Uniform"]
  G -->|"Time to next event"| J["Exponential"]

Python: exploring distributions

import numpy as np
from scipy import stats

# Bernoulli
p = 0.3
samples = np.random.binomial(1, p, size=10000)
print(f"Bernoulli(0.3): mean={samples.mean():.3f}, var={samples.var():.3f}")
print(f"Theory:         mean={p:.3f}, var={p*(1-p):.3f}")

# Gaussian
mu, sigma = 72, 8
X = np.random.normal(mu, sigma, size=10000)
print(f"\nGaussian(72, 64): mean={X.mean():.1f}, std={X.std():.1f}")
print(f"P(X > 84) ≈ {np.mean(X > 84):.3f}")
print(f"Theory:    {1 - stats.norm.cdf(84, mu, sigma):.3f}")

# Poisson
lam = 3
Y = np.random.poisson(lam, size=10000)
print(f"\nPoisson(3): mean={Y.mean():.2f}, var={Y.var():.2f}")
print(f"Theory:     mean={lam:.2f}, var={lam:.2f}")

How this connects to ML

Random variables and distributions are not just theory. Here is where they show up in practice:

  • Loss functions: When you minimize cross-entropy loss, you are comparing two probability distributions.
  • Generative models: A generative model learns the distribution P(X) or P(X \mid Y) of the data.
  • Regularization: L2 regularization corresponds to placing a Gaussian prior on the weights (a Bayesian interpretation).
  • Noise assumptions: Linear regression assumes Gaussian noise, which leads to the normal equations for the optimal weights.

Summary

Concept               | Discrete               | Continuous
Probability function  | PMF: p(x) = P(X = x)   | PDF: f(x), use integrals
Must sum/integrate to | 1                      | 1
Expectation           | ∑ x · p(x)             | ∫ x · f(x) dx
Variance              | E[X²] − (E[X])²        | Same formula
CDF                   | Step function          | Smooth curve

What comes next

The next article covers Bayes’ theorem, which flips conditional probabilities around and forms the foundation of probabilistic machine learning.
