Random variables and distributions
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
A random variable is a function that assigns a number to each outcome in a sample space. That sounds abstract, so here is the concrete version: if you roll a die, the random variable is just “the number that shows up.” If you flip a coin, you might define $X = 1$ for heads and $X = 0$ for tails. Random variables let you do arithmetic with randomness.
Prerequisites
You should be comfortable with probability fundamentals before reading this article.
Discrete vs. continuous
A discrete random variable takes countable values (1, 2, 3, … or “red,” “blue,” “green” mapped to numbers). A continuous random variable can take any value in an interval (like temperature or height).
The tools differ slightly:
- Discrete: probability mass function (PMF)
- Continuous: probability density function (PDF)
- Both: cumulative distribution function (CDF)
Discrete vs continuous distributions at a glance:
graph TD
RV["Random Variable X"] --> D["Discrete"]
RV --> C["Continuous"]
D --> PMF["PMF: P(X = x) as probability bars"]
D --> DCDF["CDF: step function"]
C --> PDF["PDF: smooth density curve"]
C --> CCDF["CDF: smooth curve from 0 to 1"]
PMF --> SumRule["All probabilities sum to 1"]
PDF --> IntRule["Density integrates to 1"]
Probability mass function (PMF)
For a discrete random variable $X$, the PMF gives the probability of each value:

$$p(x) = P(X = x)$$

The PMF must satisfy:
- $p(x) \ge 0$ for all $x$
- $\sum_x p(x) = 1$
Example: A loaded die with PMF:

| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $p(x)$ | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 |

The die lands on 6 half the time. The probabilities sum to $5 \times 0.1 + 0.5 = 1$. ✓
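A quick numerical check of the loaded-die PMF (a sketch using NumPy; the seed and sample size are arbitrary):

```python
import numpy as np

# Loaded die PMF from the table above
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# A valid PMF is non-negative and sums to 1
assert np.all(pmf >= 0)
assert np.isclose(pmf.sum(), 1.0)

# Sample from the loaded die and compare the empirical frequency of a 6
# to the theoretical probability 0.5
rng = np.random.default_rng(0)
rolls = rng.choice(values, size=100_000, p=pmf)
freq_six = np.mean(rolls == 6)
print(f"Empirical P(X = 6): {freq_six:.3f}  (theory: 0.5)")
```

With 100,000 rolls the empirical frequency lands very close to 0.5.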
Probability density function (PDF)
For a continuous random variable, you cannot assign probability to a single point (the probability of hitting exactly 3.14159… is zero). Instead, the PDF gives the density, and probabilities come from integrals:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

The PDF must satisfy:
- $f(x) \ge 0$ for all $x$
- $\int_{-\infty}^{\infty} f(x)\,dx = 1$

Note: $f(x)$ can be greater than 1 at some points. It is a density, not a probability.
Cumulative distribution function (CDF)
The CDF works for both discrete and continuous variables:

$$F(x) = P(X \le x)$$

For discrete variables, the CDF is a step function. For continuous variables, it is a smooth increasing curve from 0 to 1. The CDF is useful because $P(a < X \le b) = F(b) - F(a)$.
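The CDF-difference rule can be checked numerically; a minimal sketch with SciPy, using the standard normal as an example interval:

```python
from scipy import stats

# Probability over an interval from CDF differences:
# P(a < X <= b) = F(b) - F(a), here for a standard normal
a, b = -1.0, 1.0
prob = stats.norm.cdf(b) - stats.norm.cdf(a)
print(f"P(-1 < X <= 1) = {prob:.4f}")  # the familiar ~68% within one std
```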
CDF accumulates probability from left to right:
graph LR
A["Start: F = 0"] -->|"accumulate probability"| B["F(x) = P(X ≤ x)"]
B -->|"keeps growing"| C["End: F = 1"]
Expectation (mean)
The expected value is the weighted average of all possible values:

Discrete: $E[X] = \sum_x x\, p(x)$

Continuous: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\,dx$
The expectation tells you the “center” of the distribution. It is the long-run average if you drew from the distribution many times.
Key properties:
- Linearity: $E[aX + b] = aE[X] + b$
- Sum: $E[X + Y] = E[X] + E[Y]$ (always true, even if $X$ and $Y$ are dependent)
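Both properties are easy to confirm empirically. A sketch with NumPy, where $Y$ is deliberately built from $X$ so the two are dependent (the constants and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Y is constructed from X, so X and Y are NOT independent
X = rng.normal(2.0, 1.0, size=200_000)
Y = 0.5 * X + rng.normal(0.0, 1.0, size=200_000)

# Linearity: E[aX + b] = a E[X] + b
a, b = 3.0, 4.0
print(np.mean(a * X + b), a * X.mean() + b)   # the two values agree

# Sum rule holds despite the dependence: E[X + Y] = E[X] + E[Y]
print(np.mean(X + Y), X.mean() + Y.mean())    # these agree too
```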
Variance and standard deviation
Variance measures how spread out the distribution is:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$

The second form is often easier to compute. The standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$, which has the same units as $X$.
Key properties:
- $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$ (shifting does not change the spread; scaling changes it quadratically)
- If $X$ and $Y$ are independent: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
Example 1: Computing $E[X]$ and $\mathrm{Var}(X)$ from a PMF

A random variable $X$ has the following PMF:

| $x$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| $p(x)$ | 0.1 | 0.3 | 0.4 | 0.2 |

Step 1: Compute $E[X] = 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 0.3 + 0.8 + 0.6 = 1.7$

Step 2: Compute $E[X^2] = 0(0.1) + 1(0.3) + 4(0.4) + 9(0.2) = 0.3 + 1.6 + 1.8 = 3.7$

Step 3: Compute $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = 3.7 - 2.89 = 0.81$

Step 4: Standard deviation $\sigma = \sqrt{0.81} = 0.9$

So the distribution is centered at 1.7 with a spread of about 0.9 in each direction.
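The same four steps in a few lines of NumPy:

```python
import numpy as np

# PMF from Example 1
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.3, 0.4, 0.2])

mean = np.sum(x * p)         # E[X]
second = np.sum(x**2 * p)    # E[X^2]
var = second - mean**2       # Var(X) = E[X^2] - (E[X])^2
std = np.sqrt(var)

print(f"E[X] = {mean:.2f}, Var(X) = {var:.2f}, std = {std:.2f}")
# E[X] = 1.70, Var(X) = 0.81, std = 0.90
```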
Example 2: Bernoulli and Binomial
The Bernoulli distribution models a single yes/no trial with success probability $p$:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

Its mean is $E[X] = p$ and its variance is $\mathrm{Var}(X) = p(1-p)$.

Let’s verify the variance for $p = 0.3$:

$$E[X] = 0.3, \quad E[X^2] = 0.3, \quad \mathrm{Var}(X) = 0.3 - 0.3^2 = 0.21$$

Check the formula: $p(1-p) = 0.3 \times 0.7 = 0.21$ ✓
The Binomial distribution counts the number of successes in $n$ independent Bernoulli trials:

$$P(X = k) = \binom{n}{k}\, p^k (1-p)^{n-k}$$

Example: Flip a fair coin ($p = 0.5$) $n = 10$ times. What is $P(X = 6)$?

$$P(X = 6) = \binom{10}{6} (0.5)^6 (0.5)^4 = 210 \times \frac{1}{1024} \approx 0.205$$
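The same Binomial calculation with SciPy, using $n = 10$ fair-coin flips as an illustrative choice of parameters:

```python
from scipy import stats

# P(X = 6) for a Binomial(n=10, p=0.5)
n, p, k = 10, 0.5, 6
print(f"P(X = 6) = {stats.binom.pmf(k, n, p):.4f}")   # 210/1024, about 0.205

# Mean and variance match the formulas np and np(1-p)
print(stats.binom.mean(n, p), stats.binom.var(n, p))  # 5.0 2.5
```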
The Gaussian (Normal) distribution
The most important distribution in ML. Its PDF is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean and $\sigma^2$ is the variance. Written as $X \sim \mathcal{N}(\mu, \sigma^2)$.
Why it matters:
- The Central Limit Theorem says that averages of many random variables converge to a Gaussian, regardless of the original distribution.
- Many ML algorithms assume Gaussian noise.
- The Gaussian has maximum entropy among all distributions with a given mean and variance (connecting to information theory).
The standard normal has $\mu = 0$ and $\sigma^2 = 1$, written $\mathcal{N}(0, 1)$.
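The Central Limit Theorem mentioned above can be seen directly by simulation. A sketch that averages uniform draws, which are nothing like a Gaussian individually (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
# 50,000 experiments, each averaging 100 Uniform(0, 1) draws
averages = rng.uniform(0, 1, size=(50_000, 100)).mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the averages should be
# approximately N(0.5, 1/(12*100)), i.e. std about 0.0289
print(f"mean = {averages.mean():.3f}  (theory 0.500)")
print(f"std  = {averages.std():.4f} (theory {np.sqrt(1/1200):.4f})")
```

A histogram of `averages` would show the familiar bell shape, even though each individual draw was flat.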
Gaussian PDF with different means and variances:
Example 3: Gaussian standardization
Exam scores follow $X \sim \mathcal{N}(72, 64)$, so $\mu = 72$ and $\sigma = 8$.

Question: What fraction of students score above 84?

Step 1: Standardize. Convert to the standard normal $Z = (X - \mu)/\sigma$:

$$z = \frac{84 - 72}{8} = 1.5$$

Step 2: Look up or compute $P(Z > 1.5)$.

From the standard normal table (or scipy.stats.norm):

$$P(Z > 1.5) = 1 - \Phi(1.5) = 1 - 0.9332 = 0.0668$$

About 6.7% of students score above 84.

Question: What score is at the 90th percentile?

Step 1: Find $z$ where $\Phi(z) = 0.90$. From the table: $z \approx 1.2816$.

Step 2: Convert back:

$$x = \mu + z\sigma = 72 + 1.2816 \times 8 \approx 82.3$$

A student at the 90th percentile scored about 82.3.
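Both directions of this calculation in SciPy: `cdf` for tail probabilities, `ppf` (the inverse CDF) for percentiles:

```python
from scipy import stats

mu, sigma = 72, 8  # exam scores ~ N(72, 64)

# P(X > 84) via standardization: z = (84 - 72)/8 = 1.5
z = (84 - mu) / sigma
print(f"P(X > 84) = {1 - stats.norm.cdf(z):.4f}")         # about 0.0668

# 90th percentile: invert the CDF with ppf, then convert back to scores
z90 = stats.norm.ppf(0.90)                                # about 1.2816
print(f"90th percentile score = {mu + z90 * sigma:.1f}")  # about 82.3
```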
The Poisson distribution
The Poisson models the count of rare events in a fixed interval:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

The mean and variance are both equal to $\lambda$, which is a distinctive property.

Example: A website gets an average of 3 errors per hour ($\lambda = 3$). What is $P(X = 5)$?

$$P(X = 5) = \frac{3^5 e^{-3}}{5!} = \frac{243 \times 0.0498}{120} \approx 0.101$$

About a 10% chance of exactly 5 errors in an hour.
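The same number, computed both from the formula directly and with SciPy:

```python
import math
from scipy import stats

lam, k = 3, 5
# PMF straight from the formula: lambda^k * e^{-lambda} / k!
by_hand = lam**k * math.exp(-lam) / math.factorial(k)
print(f"by hand: P(X = 5) = {by_hand:.4f}")            # about 0.1008
print(f"scipy:   P(X = 5) = {stats.poisson.pmf(k, lam):.4f}")  # same value
```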
Poisson PMF with lambda = 3:
Example 4: Transforming random variables
If $X \sim \mathcal{N}(\mu, \sigma^2)$ and you apply a linear transformation $Y = aX + b$, then:

$$Y \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2)$$

Concrete example: Temperature in Celsius is $C \sim \mathcal{N}(20, 9)$, so $\mu = 20$ and $\sigma = 3$.

Convert to Fahrenheit: $F = 1.8\,C + 32$.

So $F \sim \mathcal{N}(1.8 \times 20 + 32,\; 1.8^2 \times 9) = \mathcal{N}(68, 29.16)$, which means temperatures in Fahrenheit are centered at 68 with a standard deviation of $\sqrt{29.16} = 5.4$.
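A sampling check of the Celsius-to-Fahrenheit transformation (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
# Celsius temperatures: N(20, 9), i.e. mean 20, std 3
C = rng.normal(20, 3, size=200_000)

# Linear transform to Fahrenheit: F = 1.8*C + 32
F = 1.8 * C + 32

# Theory: mean 1.8*20 + 32 = 68, std 1.8*3 = 5.4
print(f"mean = {F.mean():.1f} (theory 68.0)")
print(f"std  = {F.std():.2f} (theory 5.40)")
```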
Covariance and correlation
When you have two random variables, their covariance measures how they move together:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$

If $\mathrm{Cov}(X, Y) > 0$, they tend to increase together. If negative, one tends to decrease when the other increases. If zero, they are uncorrelated (but not necessarily independent).

The correlation normalizes covariance to the range $[-1, 1]$:

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$
Covariance appears everywhere in ML. The covariance matrix of a multivariate Gaussian determines the shape and orientation of the distribution. PCA finds directions of maximum variance, which is fundamentally about the eigenstructure of the covariance matrix.
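A small NumPy sketch of both quantities, with $Y$ constructed so that the true covariance with $X$ is 0.8 (the coefficients and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=100_000)
Y = 0.8 * X + rng.normal(0, 0.6, size=100_000)  # positively correlated with X

# np.cov returns the 2x2 covariance matrix; the off-diagonal is Cov(X, Y)
cov_xy = np.cov(X, Y)[0, 1]
# Correlation normalizes by the two standard deviations
rho = cov_xy / (X.std(ddof=1) * Y.std(ddof=1))
print(f"Cov(X, Y) = {cov_xy:.3f}")   # near the true value 0.8
print(f"corr      = {rho:.3f}")      # agrees with np.corrcoef(X, Y)[0, 1]
```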
Named distributions cheat sheet
| Distribution | Parameters | Mean | Variance | Use case |
|---|---|---|---|---|
| Bernoulli($p$) | $p$ | $p$ | $p(1-p)$ | Binary outcomes |
| Binomial($n, p$) | $n, p$ | $np$ | $np(1-p)$ | Count of successes |
| Poisson($\lambda$) | $\lambda$ | $\lambda$ | $\lambda$ | Rare event counts |
| Gaussian($\mu, \sigma^2$) | $\mu, \sigma^2$ | $\mu$ | $\sigma^2$ | Continuous measurements |
| Uniform($a, b$) | $a, b$ | $(a+b)/2$ | $(b-a)^2/12$ | Equally likely outcomes |
| Exponential($\lambda$) | $\lambda$ | $1/\lambda$ | $1/\lambda^2$ | Time between events |
Choosing a distribution based on the problem:
graph TD
A["What type of outcome?"] --> B{"Binary yes/no?"}
B -->|"Single trial"| C["Bernoulli"]
B -->|"Count of n trials"| D["Binomial"]
A --> E{"Count of rare events?"}
E -->|Yes| F["Poisson"]
A --> G{"Continuous measurement?"}
G -->|"Bell-shaped"| H["Gaussian"]
G -->|"Equally likely range"| I["Uniform"]
G -->|"Time to next event"| J["Exponential"]
Python: exploring distributions
```python
import numpy as np
from scipy import stats

# Bernoulli
p = 0.3
samples = np.random.binomial(1, p, size=10000)
print(f"Bernoulli(0.3): mean={samples.mean():.3f}, var={samples.var():.3f}")
print(f"Theory: mean={p:.3f}, var={p*(1-p):.3f}")

# Gaussian
mu, sigma = 72, 8
X = np.random.normal(mu, sigma, size=10000)
print(f"\nGaussian(72, 64): mean={X.mean():.1f}, std={X.std():.1f}")
print(f"P(X > 84) ≈ {np.mean(X > 84):.3f}")
print(f"Theory: {1 - stats.norm.cdf(84, mu, sigma):.3f}")

# Poisson
lam = 3
Y = np.random.poisson(lam, size=10000)
print(f"\nPoisson(3): mean={Y.mean():.2f}, var={Y.var():.2f}")
```
How this connects to ML
Random variables and distributions are not just theory. Here is where they show up in practice:
- Loss functions: When you minimize cross-entropy loss, you are comparing two probability distributions.
- Generative models: A generative model learns the distribution $p(x)$ (or the joint $p(x, y)$) of the data.
- Regularization: L2 regularization corresponds to placing a Gaussian prior on the weights (a Bayesian interpretation).
- Noise assumptions: Linear regression assumes Gaussian noise, which leads to the normal equations for the optimal weights.
Summary
| Concept | Discrete | Continuous |
|---|---|---|
| Probability function | PMF: $p(x) = P(X = x)$ | PDF: $f(x)$, use integrals |
| Must sum/integrate to | 1 | 1 |
| Expectation | $\sum_x x\, p(x)$ | $\int x\, f(x)\,dx$ |
| Variance | $E[X^2] - (E[X])^2$ | Same formula |
| CDF | Step function | Smooth curve |
What comes next
The next article covers Bayes’ theorem, which flips conditional probabilities around and forms the foundation of probabilistic machine learning.