Maths for ML · Part 1

Why Maths Matters for ML: A Practical Overview

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Machine learning is applied mathematics. Every model you train, every prediction you make, and every loss function you minimize relies on a handful of core math concepts. You do not need a PhD to learn them, but you do need to take them seriously.

This article lays out the three pillars of mathematics that support ML and shows you, with real numbers, how they show up in practice.

The three pillars

ML sits at the intersection of three branches of mathematics:

  1. Linear algebra handles data representation and transformations.
  2. Calculus drives the optimization that makes models learn.
  3. Probability and statistics quantifies uncertainty and measures performance.

Strip away the frameworks and APIs, and every ML algorithm is a composition of operations from these three areas. Let’s look at each one.

The three pillars of mathematics for ML:

```mermaid
graph TD
  LA["Linear Algebra<br/>Vectors, matrices, decompositions"] --> DR["Dimensionality Reduction<br/>PCA, SVD"]
  LA --> NN["Neural Network Layers<br/>y = Wx + b"]
  LA --> DATA["Data Representation<br/>Features, embeddings"]
  CALC["Calculus<br/>Derivatives, gradients, chain rule"] --> GD["Gradient Descent<br/>Parameter optimization"]
  CALC --> BP["Backpropagation<br/>Training neural networks"]
  CALC --> LR["Learning Rate Tuning<br/>Hessian, curvature"]
  PROB["Probability and Statistics<br/>Distributions, Bayes, entropy"] --> LOSS["Loss Functions<br/>Cross-entropy, likelihood"]
  PROB --> CLASS["Classification<br/>Softmax, logistic regression"]
  PROB --> EVAL["Evaluation<br/>Precision, recall, AUC"]
  LA -.->|"Matrix calculus"| CALC
  CALC -.->|"Probabilistic loss"| PROB
  PROB -.->|"Data as vectors"| LA
```

Pillar 1: Linear algebra

Your training data is a matrix. Each row is a sample, each column is a feature. A dataset of 1000 images, each with 784 pixels, becomes a $1000 \times 784$ matrix. That alone makes linear algebra unavoidable.

Where it shows up:

  • Data representation. Features are vectors. Datasets are matrices. Batches of images are tensors.
  • Neural networks. A single layer computes $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, which is a matrix multiplication followed by a vector addition.
  • Dimensionality reduction. PCA finds the directions of maximum variance using eigenvalues and eigenvectors.
  • Recommender systems. SVD factorizes the user-item matrix into latent factors.

Even something as basic as computing similarity between two documents uses the dot product of their feature vectors.
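For instance, a dot-product similarity between two hypothetical bag-of-words count vectors takes only a few lines:

```python
# Two hypothetical bag-of-words count vectors over a shared vocabulary.
doc_a = [2, 0, 1, 3]
doc_b = [1, 1, 0, 2]

# The dot product sums the element-wise products: documents that share
# frequently occurring terms score higher.
similarity = sum(a * b for a, b in zip(doc_a, doc_b))
print(similarity)  # (2*1) + (0*1) + (1*0) + (3*2) = 8
```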

Pillar 2: Calculus

Training a model means finding parameters that minimize a loss function. Calculus tells you which direction to move those parameters and by how much.

Where it shows up:

  • Gradient descent. Compute the gradient of the loss with respect to each parameter, then update in the opposite direction.
  • Backpropagation. The chain rule lets you propagate error signals backward through layers of a neural network.
  • Learning rate schedules. Understanding the curvature of the loss surface (via the Hessian) helps you pick better step sizes.
  • Regularization. Adding penalty terms to the loss function changes the optimization landscape, and calculus describes exactly how.

Without calculus, you have no way to train anything. You would be stuck guessing parameters at random.

Pillar 3: Probability and statistics

ML models deal with uncertain data and make probabilistic predictions. Probability gives you the language to reason about this.

Where it shows up:

  • Classification. A logistic regression model outputs a probability: $P(y=1 \mid \mathbf{x})$. Softmax extends this to multiple classes.
  • Bayes’ theorem. Naive Bayes classifiers use it directly. Bayesian neural networks use it as a foundation.
  • Loss functions. Cross-entropy loss measures how far your predicted probability distribution is from the true one.
  • Evaluation. Metrics like precision, recall, and AUC are all grounded in probability.
  • Generative models. VAEs and diffusion models explicitly learn probability distributions over the data.

Probability is also how you reason about overfitting, bias-variance tradeoffs, and confidence intervals.

Worked example 1: predicting with a dot product

A simple linear model predicts house prices using three features: area (in hundreds of sq ft), number of bedrooms, and age of the house (in decades).

Suppose we have learned weights $\mathbf{w} = [3.5, \; 1.2, \; -0.8]$ and bias $b = 50$. These weights mean: each 100 sq ft adds $3.5k, each bedroom adds $1.2k, and each decade of age subtracts $0.8k.

For a house with features $\mathbf{x} = [12, \; 3, \; 2]$ (1200 sq ft, 3 bedrooms, 2 decades old):

$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$

Compute the dot product step by step:

$\mathbf{w} \cdot \mathbf{x} = (3.5 \times 12) + (1.2 \times 3) + (-0.8 \times 2) = 42 + 3.6 + (-1.6) = 44.0$

Add the bias:

$\hat{y} = 44.0 + 50 = 94.0$

The predicted price is $94,000.

That is linear algebra in action: a dot product between a weight vector and a feature vector, plus a bias term. Every linear model, and every neuron in a neural network, does exactly this.
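The same computation as a short Python sketch, with the weights and features from the example above:

```python
w = [3.5, 1.2, -0.8]  # learned weights: area, bedrooms, age
b = 50                # bias term
x = [12, 3, 2]        # 1200 sq ft, 3 bedrooms, 2 decades old

# Dot product of weights and features, plus the bias.
y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
print(f"predicted price: ${y_hat:.1f}k")  # predicted price: $94.0k
```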

Worked example 2: one step of gradient descent

Let’s minimize a simple function $f(x) = (x - 3)^2$. The minimum is obviously at $x = 3$, but let’s see how gradient descent finds it.

Step 1: compute the derivative.

$f'(x) = 2(x - 3)$

Step 2: pick a starting point and learning rate.

Start at $x_0 = 7$ with learning rate $\alpha = 0.1$.

Step 3: update.

$x_1 = x_0 - \alpha \cdot f'(x_0) = 7 - 0.1 \times 2(7 - 3) = 7 - 0.1 \times 8 = 7 - 0.8 = 6.2$

Step 4: repeat.

$x_2 = 6.2 - 0.1 \times 2(6.2 - 3) = 6.2 - 0.1 \times 6.4 = 6.2 - 0.64 = 5.56$

Each step moves $x$ closer to 3. After many iterations, $x$ converges to the minimum. This is exactly what happens during model training, just with thousands of parameters instead of one.

Here is a quick Python snippet to see the convergence:

```python
x = 7.0
alpha = 0.1

for i in range(20):
    grad = 2 * (x - 3)
    x = x - alpha * grad
    print(f"Step {i+1}: x = {x:.4f}, f(x) = {(x-3)**2:.4f}")
```

Worked example 3: computing a class probability

Probability shows up the moment you do classification. Suppose a logistic regression model outputs a raw score (logit) of $z = 2.0$ for a spam email. To turn this into a probability, we apply the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Step 1: compute $e^{-z}$.

$e^{-2.0} \approx 0.1353$

Step 2: add 1.

$1 + 0.1353 = 1.1353$

Step 3: take the reciprocal.

$\sigma(2.0) = \frac{1}{1.1353} \approx 0.8808$

The model says there is an 88.1% chance this email is spam. If our decision threshold is 0.5, we classify it as spam.

The sigmoid function: mapping any real number to a probability between 0 and 1

Now try a negative logit, $z = -1.5$:

$e^{-(-1.5)} = e^{1.5} \approx 4.4817$, so $\sigma(-1.5) = \frac{1}{1 + 4.4817} = \frac{1}{5.4817} \approx 0.1824$

An 18.2% probability, so we classify this as not spam. The sigmoid squashes any real number into the range $(0, 1)$, giving you a valid probability. This is pure math: an exponential function turning a raw number into something meaningful.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(f"sigmoid(2.0) = {sigmoid(2.0):.4f}")    # 0.8808
print(f"sigmoid(-1.5) = {sigmoid(-1.5):.4f}")  # 0.1824
print(f"sigmoid(0) = {sigmoid(0):.4f}")        # 0.5000
```

Notice that $\sigma(0) = 0.5$ exactly. Positive logits give probabilities above 0.5, negative logits give probabilities below 0.5. The larger the magnitude, the more confident the prediction.

How the three pillars work together

These areas do not exist in isolation. A single training step of a neural network involves all three:

  1. Linear algebra: multiply the input by the weight matrix to get the layer output.
  2. Calculus: compute gradients of the loss with respect to every weight using the chain rule.
  3. Probability: the loss function itself (e.g., cross-entropy) is defined in terms of predicted probabilities.
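As a sketch of where the probability piece meets the loss, here is softmax plus cross-entropy for a single sample. The logits are made-up numbers for illustration:

```python
import math

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
true_class = 0            # hypothetical correct label

# Softmax: exponentiate each logit and normalize so the outputs sum to 1.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

# Cross-entropy for one sample: negative log-probability of the true
# class. Confident correct predictions give a small loss.
loss = -math.log(probs[true_class])

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(loss, 3))                # 0.417
```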

When you call model.fit() in your favourite framework, hundreds of matrix multiplications, derivative computations, and probability calculations happen under the hood. Understanding these gives you the ability to debug models, design new architectures, and reason about why training succeeds or fails.

A concrete walkthrough

Consider one training step of a simple neural network classifying an image:

  1. The image pixels are flattened into a vector $\mathbf{x} \in \mathbb{R}^{784}$ (linear algebra).
  2. The first layer computes $\mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1)$, a matrix multiplication followed by a nonlinear activation (linear algebra + calculus).
  3. The output layer computes class probabilities using softmax (probability).
  4. The cross-entropy loss measures how wrong the predictions are (probability + calculus).
  5. Backpropagation uses the chain rule to compute gradients of the loss with respect to every weight (calculus).
  6. Gradient descent updates the weights: $W \leftarrow W - \alpha \nabla_W L$ (calculus + linear algebra).

Neural network computational graph for one training step:

```mermaid
graph LR
  X["Input x<br/>pixel vector"] --> H["Hidden Layer<br/>h = sigmoid of Wx + b"]
  H --> O["Output Layer<br/>class probabilities"]
  O --> L["Loss<br/>cross-entropy"]
  L --> BP["Backpropagation<br/>chain rule gradients"]
  BP --> U["Update Weights<br/>W = W - alpha * grad"]
  U -.->|"next step"| X
```

Every step in this pipeline requires math. There is no getting around it.
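The pipeline above can be sketched for the smallest possible case: a single "neuron" on one sample with a binary label. All numbers here are made up, and a real network repeats this over matrices of weights, but each line maps onto one of the six steps:

```python
import math

# One training step for a single neuron on one sample (all numbers
# hypothetical).
w, b = [0.5, -0.3], 0.1  # weights and bias
x, y = [1.0, 2.0], 1.0   # input features and true label
alpha = 0.1              # learning rate

# Steps 1-2: linear algebra. Dot product plus bias, then sigmoid.
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = 1 / (1 + math.exp(-z))

# Steps 3-4: probability. Binary cross-entropy loss.
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Step 5: calculus. For sigmoid + cross-entropy, dL/dz simplifies to p - y.
dz = p - y
grad_w = [dz * xi for xi in x]
grad_b = dz

# Step 6: gradient descent update.
w = [wi - alpha * gi for wi, gi in zip(w, grad_w)]
b = b - alpha * grad_b

print(round(loss, 3))                           # 0.693
print([round(wi, 2) for wi in w], round(b, 2))  # [0.55, -0.2] 0.15
```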

What you do not need

You do not need to memorize proofs or compute 50-dimensional integrals by hand. What you need is:

  • Fluency with the basics. You should be comfortable with vectors, matrices, derivatives, and probability distributions.
  • Intuition for what operations mean. When someone says “take the gradient,” you should know that means “find the direction of steepest increase.”
  • Ability to follow derivations. Research papers assume you can follow matrix calculus and probabilistic arguments.

Think of math as the language ML is written in. You do not need to be a poet, but you need to read and write comfortably.

What comes next

This series starts from the ground up. The next article covers scalars, vectors, and vector spaces, the building blocks of everything that follows. From there, we build up through matrices, linear systems, eigenvalues, calculus, probability, and optimization, all the way to the math behind neural networks.

Each article includes worked examples with real numbers so you can follow every step. No hand-waving, no “proof left as an exercise.” Just clear explanations you can actually use.
