Maths for ML · Part 7

Norms, Distances, and Similarity

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

A norm is a function that takes a vector and returns a single non-negative number telling you how “big” that vector is. It sounds simple, but this idea underpins regularization, loss functions, distance metrics, and similarity measures across all of machine learning.

Prerequisites: You should be comfortable with vectors and dot products.


What is a norm?

A norm $\|\mathbf{x}\|$ on a vector space must satisfy three properties:

  1. Non-negativity: $\|\mathbf{x}\| \geq 0$, and $\|\mathbf{x}\| = 0$ if and only if $\mathbf{x} = \mathbf{0}$
  2. Scaling: $\|\alpha \mathbf{x}\| = |\alpha| \cdot \|\mathbf{x}\|$ for any scalar $\alpha$
  3. Triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$

These properties guarantee that norms behave like our intuition for “length.”
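These properties are easy to check numerically. A quick sanity check with NumPy (a sketch using the $L_2$ norm; the vectors here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
alpha = -2.5

# 1. Non-negativity: the norm is never negative
assert np.linalg.norm(x) >= 0

# 2. Scaling: ||alpha * x|| == |alpha| * ||x||
assert np.isclose(np.linalg.norm(alpha * x), abs(alpha) * np.linalg.norm(x))

# 3. Triangle inequality: ||x + y|| <= ||x|| + ||y||
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

print("all three properties hold for these vectors")
```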


The $L_p$ norm family

The general $L_p$ norm of a vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ is:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Different values of $p$ give different norms, and each has distinct geometric and practical properties.

$L_1$ norm (Manhattan norm)

$$\|\mathbf{x}\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$

Sum of absolute values. Called the Manhattan norm because it measures distance like walking on a grid of city blocks. In ML, $L_1$ regularization (Lasso) pushes weights toward exactly zero, producing sparse models.

$L_2$ norm (Euclidean norm)

$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

This is the "straight line" distance you learned in school. It is the most common norm in ML. $L_2$ regularization (Ridge) penalizes large weights but does not force them to zero. The $L_2$ norm is also directly related to the dot product: $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x} \cdot \mathbf{x}}$.

$L_\infty$ norm (max norm)

$$\|\mathbf{x}\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$$

Just the largest absolute component. Useful when you care about worst-case behavior.


Example 1: Computing norms of a vector

Let $\mathbf{x} = [3, -4, 2]^T$.

$L_1$ norm:

$$\|\mathbf{x}\|_1 = |3| + |-4| + |2| = 3 + 4 + 2 = 9$$

$L_2$ norm:

$$\|\mathbf{x}\|_2 = \sqrt{3^2 + (-4)^2 + 2^2} = \sqrt{9 + 16 + 4} = \sqrt{29} \approx 5.39$$

$L_\infty$ norm:

$$\|\mathbf{x}\|_\infty = \max(|3|, |-4|, |2|) = 4$$

Notice the ordering: $\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1$. This always holds. The $L_1$ norm is the most "generous" measure of size (it returns the largest value), and $L_\infty$ is the most conservative (it returns the smallest).

import numpy as np

x = np.array([3, -4, 2])

l1 = np.linalg.norm(x, ord=1)     # 9.0
l2 = np.linalg.norm(x, ord=2)     # 5.385...
linf = np.linalg.norm(x, ord=np.inf)  # 4.0

print(f"L1 = {l1}, L2 = {l2:.3f}, Linf = {linf}")

Example 2: L1 vs L2 on the same vectors

This example shows how $L_1$ and $L_2$ respond differently to "spread" vs "concentration" of values.

Vector A: $\mathbf{a} = [1, 1, 1, 1]$ (values spread evenly)

$$\|\mathbf{a}\|_1 = 4, \quad \|\mathbf{a}\|_2 = \sqrt{4} = 2$$

Vector B: $\mathbf{b} = [4, 0, 0, 0]$ (all mass in one component)

$$\|\mathbf{b}\|_1 = 4, \quad \|\mathbf{b}\|_2 = \sqrt{16} = 4$$

Both vectors have the same $L_1$ norm (4), but their $L_2$ norms are very different (2 vs 4). The $L_2$ norm penalizes concentration: one large component costs more than many small ones.

This is exactly why $L_2$ regularization discourages any single weight from getting too large, while $L_1$ regularization does not care about concentration. It just penalizes total absolute size.

Ratio: $\|\mathbf{a}\|_2 / \|\mathbf{a}\|_1 = 2/4 = 0.5$, but $\|\mathbf{b}\|_2 / \|\mathbf{b}\|_1 = 4/4 = 1.0$. The ratio is higher when the vector is "spikier."
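You can reproduce these numbers directly with NumPy:

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0, 1.0])  # mass spread evenly
b = np.array([4.0, 0.0, 0.0, 0.0])  # mass concentrated in one component

l1_a, l2_a = np.linalg.norm(a, ord=1), np.linalg.norm(a, ord=2)
l1_b, l2_b = np.linalg.norm(b, ord=1), np.linalg.norm(b, ord=2)

print(l1_a, l2_a, l2_a / l1_a)  # 4.0 2.0 0.5
print(l1_b, l2_b, l2_b / l1_b)  # 4.0 4.0 1.0
```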


Distances

A norm naturally defines a distance. The distance between vectors $\mathbf{x}$ and $\mathbf{y}$ under the $L_p$ norm is:

$$d_p(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_p$$

The most common distance in ML is the Euclidean distance ($L_2$):

$$d_2(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$

This is what k-nearest neighbors, k-means clustering, and many other algorithms use by default.

Quick example

Distance between $\mathbf{x} = [1, 2]$ and $\mathbf{y} = [4, 6]$:

$$d_2 = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = \sqrt{25} = 5$$

$$d_1 = |4-1| + |6-2| = 3 + 4 = 7$$

Cosine similarity

Sometimes you care about the direction of two vectors, not their magnitudes. Cosine similarity measures the cosine of the angle between them:

$$\text{cos\_sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}$$

The result is between $-1$ and $1$:

  • $1$ means the vectors point in the same direction
  • $0$ means they are perpendicular (no directional similarity)
  • $-1$ means they point in opposite directions

Cosine similarity is the default metric for comparing text embeddings, word vectors, and document vectors. Two documents can have very different word counts ($L_2$ norms) but still be about the same topic (similar direction).

Cosine similarity: angle determines relationship:

graph TD
  subgraph "cos = 1"
      S1["Same direction<br/>Angle = 0 degrees<br/>Maximally similar"]
  end
  subgraph "cos = 0"
      S2["Perpendicular<br/>Angle = 90 degrees<br/>No similarity"]
  end
  subgraph "cos = -1"
      S3["Opposite direction<br/>Angle = 180 degrees<br/>Maximally dissimilar"]
  end

Example 3: Cosine similarity between two word vectors

Suppose a word embedding model gives us:

$$\mathbf{v}_{\text{king}} = [2, 5, 1], \quad \mathbf{v}_{\text{queen}} = [2, 4, 1]$$

Step 1: Compute the dot product.

$$\mathbf{v}_{\text{king}} \cdot \mathbf{v}_{\text{queen}} = (2)(2) + (5)(4) + (1)(1) = 4 + 20 + 1 = 25$$

Step 2: Compute the $L_2$ norms.

$$\|\mathbf{v}_{\text{king}}\|_2 = \sqrt{4 + 25 + 1} = \sqrt{30} \approx 5.477$$

$$\|\mathbf{v}_{\text{queen}}\|_2 = \sqrt{4 + 16 + 1} = \sqrt{21} \approx 4.583$$

Step 3: Compute cosine similarity.

$$\text{cos\_sim} = \frac{25}{\sqrt{30} \cdot \sqrt{21}} = \frac{25}{\sqrt{630}} \approx \frac{25}{25.10} \approx 0.996$$

A cosine similarity of $0.996$ is very high, as expected. "King" and "queen" are semantically close.

Now compare with a less related word:

$$\mathbf{v}_{\text{car}} = [4, 0, 3]$$

$$\mathbf{v}_{\text{king}} \cdot \mathbf{v}_{\text{car}} = (2)(4) + (5)(0) + (1)(3) = 8 + 0 + 3 = 11$$

$$\|\mathbf{v}_{\text{car}}\|_2 = \sqrt{16 + 0 + 9} = 5$$

$$\text{cos\_sim} = \frac{11}{5.477 \times 5} = \frac{11}{27.39} \approx 0.402$$

Much lower, as expected. The direction of "car" diverges significantly from "king."

import numpy as np

king = np.array([2, 5, 1])
queen = np.array([2, 4, 1])
car = np.array([4, 0, 3])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen: {cosine_sim(king, queen):.4f}")  # ~0.996
print(f"king-car:   {cosine_sim(king, car):.4f}")     # ~0.402

Frobenius norm

Norms are not just for vectors. The Frobenius norm extends the $L_2$ idea to matrices. For a matrix $A$ with entries $a_{ij}$:

$$\|A\|_F = \sqrt{\sum_{i}\sum_{j} a_{ij}^2}$$

Treat the matrix as one long vector and take the $L_2$ norm. Simple as that.

The Frobenius norm connects to SVD: if $\sigma_1, \sigma_2, \ldots, \sigma_r$ are the singular values of $A$, then:

$$\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}$$

Example 4: Frobenius norm of a matrix

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

$$\|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.48$$

import numpy as np
A = np.array([[1, 2], [3, 4]])
print(np.linalg.norm(A, 'fro'))  # 5.477...
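The SVD identity above is just as easy to check numerically, using `np.linalg.svd` to get the singular values:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A
frob_from_svd = np.sqrt(np.sum(sigma ** 2))  # sqrt of sum of squared singular values

print(frob_from_svd)                 # ~5.4772
print(np.linalg.norm(A, 'fro'))      # same value
```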

Norms in machine learning

Here is why you should care about all of this.

Loss functions are norms in disguise. Mean squared error is the squared $L_2$ norm of the residual vector, divided by the number of samples. Mean absolute error uses the $L_1$ norm the same way.

Regularization adds a norm penalty to the loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \|\mathbf{w}\|_p^p$$

  • $p = 1$: Lasso. Produces sparse weights (feature selection).
  • $p = 2$: Ridge. Keeps all weights small but nonzero.
  • Elastic Net: combines both.
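As a sketch, here is how those penalties attach to a plain MSE loss. The data, weights, and $\lambda$ below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy data and weights -- for illustration only
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
lam = 0.1

data_loss = np.mean((X @ w - y) ** 2)              # plain MSE
lasso_loss = data_loss + lam * np.sum(np.abs(w))   # + lambda * ||w||_1
ridge_loss = data_loss + lam * np.sum(w ** 2)      # + lambda * ||w||_2^2
print(data_loss, lasso_loss, ridge_loss)
```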

Gradient clipping uses the $L_2$ norm. If $\|\nabla \mathcal{L}\|_2 > \tau$, scale the gradient down to have norm $\tau$.
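A minimal sketch of that clipping rule:

```python
import numpy as np

def clip_by_norm(grad, tau):
    """If ||grad||_2 > tau, rescale grad to have L2 norm exactly tau."""
    norm = np.linalg.norm(grad)
    if norm > tau:
        return grad * (tau / norm)
    return grad

g = np.array([3.0, -4.0])        # L2 norm 5
print(clip_by_norm(g, 1.0))      # rescaled to [0.6, -0.8], norm 1
print(clip_by_norm(g, 10.0))     # unchanged: norm already below threshold
```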

Batch normalization standardizes activations by subtracting the batch mean and dividing by the batch standard deviation, a close cousin of dividing by a norm. This stabilizes training.

L1 vs L2 regularization effect on weights:

graph LR
  subgraph "L1 Regularization - Lasso"
      L1W["Pushes weights to exactly zero<br/>Produces sparse models<br/>Automatic feature selection<br/>Diamond constraint region"]
  end
  subgraph "L2 Regularization - Ridge"
      L2W["Shrinks all weights toward zero<br/>Keeps all features active<br/>Weights stay small but nonzero<br/>Circular constraint region"]
  end

Unit vectors and normalization

Dividing a vector by its norm gives a unit vector: a vector with norm 1 that points in the same direction.

$$\hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$$

This is called normalization. It strips away magnitude and keeps only direction. Cosine similarity between two vectors is the same as the dot product of their unit vectors:

$$\text{cos\_sim}(\mathbf{x}, \mathbf{y}) = \hat{\mathbf{x}} \cdot \hat{\mathbf{y}}$$

Example: Normalize $\mathbf{x} = [3, -4, 2]^T$:

$$\hat{\mathbf{x}} = \frac{1}{\sqrt{29}}[3, -4, 2]^T \approx [0.557, -0.743, 0.371]^T$$

Verify: $\|\hat{\mathbf{x}}\|_2 = \sqrt{0.557^2 + 0.743^2 + 0.371^2} = \sqrt{0.310 + 0.552 + 0.138} = \sqrt{1.0} = 1$
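The same normalization in NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 2.0])
x_hat = x / np.linalg.norm(x)    # divide by the L2 norm

print(x_hat)                     # approx [0.557, -0.743, 0.371]
print(np.linalg.norm(x_hat))     # unit length
```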

Normalization shows up everywhere in ML. Batch normalization, layer normalization, and weight normalization all use some form of dividing by a norm to stabilize training.


Norm balls: geometric intuition

The set of all vectors with $\|\mathbf{x}\|_p \leq 1$ is called the $L_p$ unit ball. Its shape depends on $p$:

  • $L_1$ ball: A diamond (rotated square in 2D). Points on the axes are included because they have $L_1$ norm exactly 1.
  • $L_2$ ball: A circle (sphere in higher dimensions). The familiar round shape.
  • $L_\infty$ ball: A square (cube in higher dimensions). All components can be as large as 1.

This geometry matters for regularization. $L_1$ regularization constrains weights to lie in a diamond. The corners of the diamond sit on the axes, which is why $L_1$ tends to produce sparse solutions (weights exactly equal to zero). $L_2$ regularization uses a circle, which has no corners, so weights shrink evenly but rarely hit exactly zero.
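One way to see the corner effect numerically is to compare the soft-thresholding update used by Lasso solvers (the proximal step for the $L_1$ penalty) with the uniform shrinkage a ridge step applies. The weights and $\lambda$ below are illustrative:

```python
import numpy as np

w = np.array([0.05, -0.8, 0.02, 1.5])   # illustrative weights
lam = 0.1

# L1 proximal step (soft-thresholding): weights smaller than lam snap to zero
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# L2 shrinkage: every weight scales toward zero but stays nonzero
w_l2 = w / (1.0 + 2.0 * lam)

print(w_l1)   # two entries are now exactly zero
print(w_l2)   # all four entries still nonzero
```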

Norm ball shapes and their properties:

graph TD
  subgraph "L1 Ball - Diamond"
      L1P["Shape: Diamond in 2D<br/>Corners lie on axes<br/>Produces sparse solutions<br/>Used in Lasso regression"]
  end
  subgraph "L2 Ball - Circle"
      L2P["Shape: Circle in 2D<br/>Smooth, no corners<br/>Shrinks all weights evenly<br/>Used in Ridge regression"]
  end
  subgraph "L-inf Ball - Square"
      LIP["Shape: Square in 2D<br/>Flat sides along axes<br/>Bounds max component<br/>Used in adversarial robustness"]
  end

L1, L2, and L-infinity unit balls in 2D


Summary

| Concept | Formula | Use case |
| --- | --- | --- |
| $L_1$ norm | $\sum_i \lvert x_i \rvert$ | Lasso regularization, MAE loss |
| $L_2$ norm | $\sqrt{\sum_i x_i^2}$ | Ridge regularization, MSE loss, Euclidean distance |
| $L_\infty$ norm | $\max_i \lvert x_i \rvert$ | Worst-case bounds, adversarial robustness |
| Cosine similarity | $\frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$ | Text similarity, embedding comparison |
| Frobenius norm | $\sqrt{\sum_{i,j} a_{ij}^2}$ | Matrix approximation error, weight decay |

What comes next

You now have the tools to measure vectors and matrices. The next building block is calculus: how functions change. Head to Calculus Review: Derivatives and the Chain Rule to see how derivatives connect to optimization in ML.
