Maths for ML · Part 13

Random variables and distributions

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes’ theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

A random variable is a function that assigns a number to each outcome in a sample space. That sounds abstract, so here is the concrete version: if you roll a die, the random variable X is just “the number that shows up.” If you flip a coin, you might define X = 1 for heads and X = 0 for tails. Random variables let you do arithmetic with randomness.

Prerequisites

You should be comfortable with probability fundamentals before reading this article.

Discrete vs. continuous

A discrete random variable takes countable values (1, 2, 3, … or “red,” “blue,” “green” mapped to numbers). A continuous random variable can take any value in an interval (like temperature or height).

The tools differ slightly:

  • Discrete: probability mass function (PMF)
  • Continuous: probability density function (PDF)
  • Both: cumulative distribution function (CDF)

Discrete vs continuous distributions at a glance:

graph TD
  RV["Random Variable X"] --> D["Discrete"]
  RV --> C["Continuous"]
  D --> PMF["PMF: P(X = x) as probability bars"]
  D --> DCDF["CDF: step function"]
  C --> PDF["PDF: smooth density curve"]
  C --> CCDF["CDF: smooth curve from 0 to 1"]
  PMF --> SumRule["All probabilities sum to 1"]
  PDF --> IntRule["Density integrates to 1"]

Probability mass function (PMF)

For a discrete random variable X, the PMF gives the probability of each value:

p(x) = P(X = x)

The PMF must satisfy:

  1. p(x) \geq 0 for all x
  2. \sum_x p(x) = 1

Example: A loaded die with PMF:

  x:    1    2    3    4    5    6
  p(x): 0.1  0.1  0.1  0.1  0.1  0.5

The die lands on 6 half the time. The probabilities sum to 0.1 \times 5 + 0.5 = 1. ✓
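To make this concrete, here is a quick NumPy sketch (the same library used in the code section later in this article) that samples from the loaded die and checks that the empirical frequencies match the PMF. The sample size and seed are arbitrary choices:

```python
import numpy as np

# Loaded die from the example: p(6) = 0.5, everything else 0.1
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# A valid PMF sums to 1
assert np.isclose(pmf.sum(), 1.0)

# Draw many rolls and compare the empirical frequency of a 6 to the PMF
rng = np.random.default_rng(0)
rolls = rng.choice(values, size=100_000, p=pmf)
freq_6 = np.mean(rolls == 6)
print(f"empirical P(X=6) = {freq_6:.3f}")  # close to 0.5
```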

Probability density function (PDF)

For a continuous random variable, you cannot assign probability to a single point (the probability of hitting exactly 3.14159… is zero). Instead, the PDF f(x) gives the density, and probabilities come from integrals:

P(a \leq X \leq b) = \int_a^b f(x) \, dx

The PDF must satisfy:

  1. f(x) \geq 0 for all x
  2. \int_{-\infty}^{\infty} f(x) \, dx = 1

Note: f(x) can be greater than 1 at some points. It is a density, not a probability.

Cumulative distribution function (CDF)

The CDF works for both discrete and continuous variables:

F(x) = P(X \leq x)

For discrete variables, the CDF is a step function. For continuous variables, it is a smooth increasing curve from 0 to 1. The CDF is useful because P(a < X \leq b) = F(b) - F(a).

CDF accumulates probability from left to right:

graph LR
  A["Start: F = 0"] -->|"accumulate probability"| B["F(x) = P(X ≤ x)"]
  B -->|"keeps growing"| C["End: F = 1"]
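A minimal check of the identity P(a < X \leq b) = F(b) - F(a), using scipy.stats (referenced again later in the article) with a standard normal as the example distribution:

```python
from scipy import stats

# Probability of landing in an interval, from CDF differences
a, b = -1.0, 1.0
prob = stats.norm.cdf(b) - stats.norm.cdf(a)
print(f"P(-1 < Z <= 1) = {prob:.4f}")  # about 0.6827, the familiar 68% rule
```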

Expectation (mean)

The expected value E[X] is the weighted average of all possible values:

Discrete:

E[X] = \sum_x x \cdot p(x)

Continuous:

E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

The expectation tells you the “center” of the distribution. It is the long-run average if you drew from the distribution many times.

Key properties:

  • Linearity: E[aX + b] = aE[X] + b
  • Sum: E[X + Y] = E[X] + E[Y] (always true, even if X and Y are dependent)

Variance and standard deviation

Variance measures how spread out the distribution is:

\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2

The second form is often easier to compute. The standard deviation is \sigma = \sqrt{\text{Var}(X)}, which has the same units as X.

Key properties:

  • \text{Var}(aX + b) = a^2 \text{Var}(X)
  • If X and Y are independent: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)
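These properties of expectation and variance are easy to sanity-check by simulation. The sketch below uses an exponential distribution purely as an arbitrary example; any distribution would do:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=200_000)   # arbitrary example distribution
X2 = rng.exponential(scale=2.0, size=200_000)  # independent second sample
a, b = 3.0, 5.0

# E[aX + b] = aE[X] + b and Var(aX + b) = a^2 Var(X)
Y = a * X + b
print(Y.mean(), a * X.mean() + b)   # equal
print(Y.var(), a**2 * X.var())      # equal

# For independent X and X2: Var(X + X2) = Var(X) + Var(X2)
S = X + X2
print(S.var(), X.var() + X2.var())  # approximately equal
```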

Example 1: Computing E[X] and \text{Var}(X) from a PMF

A random variable X has the following PMF:

  x:    0    1    2    3
  p(x): 0.1  0.3  0.4  0.2

Step 1: Compute E[X]

E[X] = 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 0 + 0.3 + 0.8 + 0.6 = 1.7

Step 2: Compute E[X^2]

E[X^2] = 0^2(0.1) + 1^2(0.3) + 2^2(0.4) + 3^2(0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7

Step 3: Compute \text{Var}(X)

\text{Var}(X) = E[X^2] - (E[X])^2 = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81

Step 4: Standard deviation

\sigma = \sqrt{0.81} = 0.9

So the distribution is centered at 1.7 with a spread of about 0.9 in each direction.
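The same four steps take a few lines of NumPy; the arrays below mirror the PMF from this example:

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.3, 0.4, 0.2])

mean = np.sum(x * p)       # E[X]: weighted average of values
ex2 = np.sum(x**2 * p)     # E[X^2]
var = ex2 - mean**2        # Var(X) = E[X^2] - (E[X])^2
std = np.sqrt(var)

print(mean, var, std)  # 1.7, 0.81, 0.9
```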

Example 2: Bernoulli and Binomial

The Bernoulli distribution models a single yes/no trial with success probability p:

X \sim \text{Bernoulli}(p): \quad P(X = 1) = p, \quad P(X = 0) = 1 - p

E[X] = p, \quad \text{Var}(X) = p(1 - p)

Let’s verify for p = 0.3:

E[X] = 0 \cdot 0.7 + 1 \cdot 0.3 = 0.3 \quad \checkmark

E[X^2] = 0^2 \cdot 0.7 + 1^2 \cdot 0.3 = 0.3

\text{Var}(X) = 0.3 - (0.3)^2 = 0.3 - 0.09 = 0.21

Check the formula: p(1 - p) = 0.3 \times 0.7 = 0.21 ✓

The Binomial distribution counts successes in n independent Bernoulli trials:

X \sim \text{Binomial}(n, p): \quad P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}

E[X] = np, \quad \text{Var}(X) = np(1 - p)

Example: Flip a coin with success probability p = 0.4, n = 10 times. What is P(X = 3)?

P(X = 3) = \binom{10}{3} (0.4)^3 (0.6)^7 = 120 \times 0.064 \times 0.0280 = 0.2150

E[X] = 10 \times 0.4 = 4, \quad \text{Var}(X) = 10 \times 0.4 \times 0.6 = 2.4
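These hand computations can be cross-checked with scipy.stats:

```python
from scipy import stats

# P(X = 3) for X ~ Binomial(n=10, p=0.4)
p3 = stats.binom.pmf(3, 10, 0.4)
print(f"P(X=3) = {p3:.4f}")  # about 0.2150, matching the hand computation

# Theoretical mean and variance
mean, var = stats.binom.stats(10, 0.4)
print(float(mean), float(var))  # 4.0 2.4
```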

The Gaussian (Normal) distribution

The most important distribution in ML. Its PDF is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

where \mu is the mean and \sigma^2 is the variance. Written as X \sim \mathcal{N}(\mu, \sigma^2).

Why it matters:

  • The Central Limit Theorem says that averages of many random variables converge to a Gaussian, regardless of the original distribution.
  • Many ML algorithms assume Gaussian noise.
  • The Gaussian has maximum entropy among all distributions with a given mean and variance (connecting to information theory).

The standard normal has \mu = 0 and \sigma = 1, written Z \sim \mathcal{N}(0, 1).

(Figure: Gaussian PDFs with different means and variances.)

Example 3: Gaussian standardization

Exam scores follow X \sim \mathcal{N}(72, 64), so \mu = 72 and \sigma = 8.

Question: What fraction of students score above 84?

Step 1: Standardize. Convert to the standard normal:

Z = \frac{X - \mu}{\sigma} = \frac{84 - 72}{8} = \frac{12}{8} = 1.5

Step 2: Look up or compute P(Z > 1.5).

From the standard normal table (or scipy.stats.norm):

P(Z \leq 1.5) = 0.9332

P(Z > 1.5) = 1 - 0.9332 = 0.0668

About 6.7% of students score above 84.

Question: What score is at the 90th percentile?

Step 1: Find z where P(Z \leq z) = 0.90. From the table: z \approx 1.282.

Step 2: Convert back:

x = \mu + z\sigma = 72 + 1.282 \times 8 = 72 + 10.26 = 82.26

A student at the 90th percentile scored about 82.3.
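Both questions are one-liners with scipy.stats.norm: the survival function sf gives the upper tail, and ppf (the inverse CDF, or percent point function) gives percentiles:

```python
from scipy import stats

mu, sigma = 72, 8

# Fraction scoring above 84: survival function sf(x) = 1 - cdf(x)
above = stats.norm.sf(84, loc=mu, scale=sigma)
print(f"P(X > 84) = {above:.4f}")  # about 0.0668

# 90th percentile: inverse CDF
p90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)
print(f"90th percentile = {p90:.2f}")  # about 82.25
```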

The Poisson distribution

The Poisson models the count of rare events in a fixed interval:

X \sim \text{Poisson}(\lambda): \quad P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

E[X] = \lambda, \quad \text{Var}(X) = \lambda

The mean and variance are both equal to \lambda, which is a distinctive property.

Example: A website gets an average of 3 errors per hour (\lambda = 3). What is P(X = 5)?

P(X = 5) = \frac{3^5 e^{-3}}{5!} = \frac{243 \times 0.0498}{120} = \frac{12.10}{120} = 0.1008

About a 10% chance of exactly 5 errors in an hour.
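The same check with scipy.stats, including the distinctive mean-equals-variance property:

```python
from scipy import stats

# P(X = 5) for X ~ Poisson(3)
p5 = stats.poisson.pmf(5, 3)
print(f"P(X=5) = {p5:.4f}")  # about 0.1008

# Mean and variance are both lambda
mean, var = stats.poisson.stats(3)
print(float(mean), float(var))  # 3.0 3.0
```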

(Figure: Poisson PMF with λ = 3.)

Example 4: Transforming random variables

If X \sim \mathcal{N}(\mu, \sigma^2) and you apply a linear transformation Y = aX + b, then:

Y \sim \mathcal{N}(a\mu + b, \, a^2\sigma^2)

Concrete example: Temperature in Celsius is C \sim \mathcal{N}(20, 9), so \mu_C = 20 and \sigma_C = 3.

Convert to Fahrenheit: F = 1.8C + 32.

\mu_F = 1.8 \times 20 + 32 = 36 + 32 = 68

\sigma_F^2 = (1.8)^2 \times 9 = 3.24 \times 9 = 29.16

\sigma_F = \sqrt{29.16} = 5.4

So F \sim \mathcal{N}(68, 29.16), which means temperatures in Fahrenheit are centered at 68 with a standard deviation of 5.4.
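A quick simulation check of the transformation rule (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Celsius temperatures: N(20, 9), so sigma = 3
C = rng.normal(loc=20, scale=3, size=100_000)
F = 1.8 * C + 32  # linear transformation to Fahrenheit

print(f"F: mean ≈ {F.mean():.1f}, std ≈ {F.std():.2f}")  # about 68 and 5.4
```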

Covariance and correlation

When you have two random variables, their covariance measures how they move together:

\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

If \text{Cov}(X, Y) > 0, they tend to increase together. If negative, one tends to decrease when the other increases. If zero, they are uncorrelated (but not necessarily independent).

The correlation normalizes covariance to the range [-1, 1]:

\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

Covariance appears everywhere in ML. The covariance matrix of a multivariate Gaussian determines the shape and orientation of the distribution. PCA finds directions of maximum variance, which is fundamentally about the eigenstructure of the covariance matrix.
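A small sketch computing sample covariance and correlation with NumPy. The data here is synthetic, constructed so that Y is X plus noise; with this construction Cov(X, Y) = Var(X) = 1 and ρ = 1/√1.25 ≈ 0.894:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables: Y = X + noise
X = rng.normal(size=50_000)
Y = X + 0.5 * rng.normal(size=50_000)

cov = np.cov(X, Y)[0, 1]       # sample covariance
rho = np.corrcoef(X, Y)[0, 1]  # correlation, normalized to [-1, 1]
print(f"Cov(X, Y) ≈ {cov:.3f}, rho ≈ {rho:.3f}")
```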

Named distributions cheat sheet

Distribution      | Parameters | E[X]      | Var(X)    | Use case
Bernoulli(p)      | p          | p         | p(1-p)    | Binary outcomes
Binomial(n, p)    | n, p       | np        | np(1-p)   | Count of successes
Poisson(λ)        | λ          | λ         | λ         | Rare event counts
Gaussian(μ, σ²)   | μ, σ²      | μ         | σ²        | Continuous measurements
Uniform(a, b)     | a, b       | (a+b)/2   | (b-a)²/12 | Equally likely outcomes
Exponential(λ)    | λ          | 1/λ       | 1/λ²      | Time between events

Choosing a distribution based on the problem:

graph TD
  A["What type of outcome?"] --> B{"Binary yes/no?"}
  B -->|"Single trial"| C["Bernoulli"]
  B -->|"Count of n trials"| D["Binomial"]
  A --> E{"Count of rare events?"}
  E -->|Yes| F["Poisson"]
  A --> G{"Continuous measurement?"}
  G -->|"Bell-shaped"| H["Gaussian"]
  G -->|"Equally likely range"| I["Uniform"]
  G -->|"Time to next event"| J["Exponential"]

Python: exploring distributions

import numpy as np
from scipy import stats

# Bernoulli
p = 0.3
samples = np.random.binomial(1, p, size=10000)
print(f"Bernoulli(0.3): mean={samples.mean():.3f}, var={samples.var():.3f}")
print(f"Theory:         mean={p:.3f}, var={p*(1-p):.3f}")

# Gaussian
mu, sigma = 72, 8
X = np.random.normal(mu, sigma, size=10000)
print(f"\nGaussian(72, 64): mean={X.mean():.1f}, std={X.std():.1f}")
print(f"P(X > 84) ≈ {np.mean(X > 84):.3f}")
print(f"Theory:    {1 - stats.norm.cdf(84, mu, sigma):.3f}")

# Poisson
lam = 3
Y = np.random.poisson(lam, size=10000)
print(f"\nPoisson(3): mean={Y.mean():.2f}, var={Y.var():.2f}")
print(f"Theory:     mean={lam:.2f}, var={lam:.2f}")

How this connects to ML

Random variables and distributions are not just theory. Here is where they show up in practice:

  • Loss functions: When you minimize cross-entropy loss, you are comparing two probability distributions.
  • Generative models: A generative model learns the distribution P(X) or P(X \mid Y) of the data.
  • Regularization: L2 regularization corresponds to placing a Gaussian prior on the weights (a Bayesian interpretation).
  • Noise assumptions: Linear regression assumes Gaussian noise, which leads to the normal equations for the optimal weights.

Summary

Concept               | Discrete               | Continuous
Probability function  | PMF: p(x) = P(X = x)   | PDF: f(x), use integrals
Must sum/integrate to | 1                      | 1
Expectation           | ∑ x · p(x)             | ∫ x · f(x) dx
Variance              | E[X²] − (E[X])²        | Same formula
CDF                   | Step function          | Smooth curve

What comes next

The next article covers Bayes’ theorem, which flips conditional probabilities around and forms the foundation of probabilistic machine learning.
