Maths for ML · Part 7

Norms, Distances, and Similarity

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

A norm is a function that takes a vector and returns a single non-negative number telling you how “big” that vector is. It sounds simple, but this idea underpins regularization, loss functions, distance metrics, and similarity measures across all of machine learning.

Prerequisites: You should be comfortable with vectors and dot products.


What is a norm?

A norm $\|\mathbf{x}\|$ on a vector space must satisfy three properties:

  1. Non-negativity: $\|\mathbf{x}\| \geq 0$, and $\|\mathbf{x}\| = 0$ if and only if $\mathbf{x} = \mathbf{0}$
  2. Scaling: $\|\alpha \mathbf{x}\| = |\alpha| \cdot \|\mathbf{x}\|$ for any scalar $\alpha$
  3. Triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$

These properties guarantee that norms behave like our intuition for “length.”
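These properties are easy to check numerically. A quick sanity check with NumPy (a sketch using the $L_2$ norm; the vectors here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
alpha = -2.5

# 1. Non-negativity: the norm is never negative
assert np.linalg.norm(x) >= 0

# 2. Scaling: ||alpha * x|| == |alpha| * ||x||
assert np.isclose(np.linalg.norm(alpha * x), abs(alpha) * np.linalg.norm(x))

# 3. Triangle inequality: ||x + y|| <= ||x|| + ||y||
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

print("all three properties hold for these vectors")
```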


The $L_p$ norm family

The general $L_p$ norm of a vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ is:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Different values of $p$ give different norms, and each has distinct geometric and practical properties.

$L_1$ norm (Manhattan norm)

$$\|\mathbf{x}\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$

Sum of absolute values. Called the Manhattan norm because it measures distance like walking on a grid of city blocks. In ML, $L_1$ regularization (Lasso) pushes weights toward exactly zero, producing sparse models.

$L_2$ norm (Euclidean norm)

$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

This is the "straight line" distance you learned in school. It is the most common norm in ML. $L_2$ regularization (Ridge) penalizes large weights but does not force them to zero. The $L_2$ norm is also directly related to the dot product: $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x} \cdot \mathbf{x}}$.

$L_\infty$ norm (max norm)

$$\|\mathbf{x}\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$$

Just the largest absolute component. Useful when you care about worst-case behavior.


Example 1: Computing norms of a vector

Let $\mathbf{x} = [3, -4, 2]^T$.

$L_1$ norm:

$$\|\mathbf{x}\|_1 = |3| + |-4| + |2| = 3 + 4 + 2 = 9$$

$L_2$ norm:

$$\|\mathbf{x}\|_2 = \sqrt{3^2 + (-4)^2 + 2^2} = \sqrt{9 + 16 + 4} = \sqrt{29} \approx 5.39$$

$L_\infty$ norm:

$$\|\mathbf{x}\|_\infty = \max(|3|, |-4|, |2|) = 4$$

Notice the ordering: $\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1$. This always holds. The $L_1$ norm is the most "generous" measure of size (it returns the largest value), and $L_\infty$ is the most conservative (it returns the smallest).

import numpy as np

x = np.array([3, -4, 2])

l1 = np.linalg.norm(x, ord=1)     # 9.0
l2 = np.linalg.norm(x, ord=2)     # 5.385...
linf = np.linalg.norm(x, ord=np.inf)  # 4.0

print(f"L1 = {l1}, L2 = {l2:.3f}, Linf = {linf}")

Example 2: L1 vs L2 on the same vectors

This example shows how $L_1$ and $L_2$ respond differently to "spread" vs "concentration" of values.

Vector A: $\mathbf{a} = [1, 1, 1, 1]$ (values spread evenly)

$$\|\mathbf{a}\|_1 = 4, \quad \|\mathbf{a}\|_2 = \sqrt{4} = 2$$

Vector B: $\mathbf{b} = [4, 0, 0, 0]$ (all mass in one component)

$$\|\mathbf{b}\|_1 = 4, \quad \|\mathbf{b}\|_2 = \sqrt{16} = 4$$

Both vectors have the same $L_1$ norm (4), but their $L_2$ norms are very different (2 vs 4). The $L_2$ norm penalizes concentration: one large component costs more than many small ones.

This is exactly why $L_2$ regularization discourages any single weight from getting too large, while $L_1$ regularization does not care about concentration. It just penalizes total absolute size.

Ratio: $\|\mathbf{a}\|_2 / \|\mathbf{a}\|_1 = 2/4 = 0.5$, but $\|\mathbf{b}\|_2 / \|\mathbf{b}\|_1 = 4/4 = 1.0$. The ratio is higher when the vector is "spikier."
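You can reproduce these numbers directly with NumPy:

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0, 1.0])  # mass spread evenly
b = np.array([4.0, 0.0, 0.0, 0.0])  # mass concentrated in one component

l1_a, l2_a = np.linalg.norm(a, ord=1), np.linalg.norm(a, ord=2)
l1_b, l2_b = np.linalg.norm(b, ord=1), np.linalg.norm(b, ord=2)

print(l1_a, l2_a, l2_a / l1_a)  # 4.0 2.0 0.5
print(l1_b, l2_b, l2_b / l1_b)  # 4.0 4.0 1.0
```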


Distances

A norm naturally defines a distance. The distance between vectors $\mathbf{x}$ and $\mathbf{y}$ under the $L_p$ norm is:

$$d_p(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_p$$

The most common distance in ML is the Euclidean distance ($L_2$):

$$d_2(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$

This is what k-nearest neighbors, k-means clustering, and many other algorithms use by default.

Quick example

Distance between $\mathbf{x} = [1, 2]$ and $\mathbf{y} = [4, 6]$:

$$d_2 = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = \sqrt{25} = 5$$

$$d_1 = |4-1| + |6-2| = 3 + 4 = 7$$

Cosine similarity

Sometimes you care about the direction of two vectors, not their magnitudes. Cosine similarity measures the cosine of the angle between them:

$$\text{cos\_sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}$$

The result is between $-1$ and $1$:

  • $1$ means the vectors point in the same direction
  • $0$ means they are perpendicular (no directional similarity)
  • $-1$ means they point in opposite directions

Cosine similarity is the default metric for comparing text embeddings, word vectors, and document vectors. Two documents can have very different word counts ($L_2$ norms) but still be about the same topic (similar direction).

Cosine similarity: angle determines relationship:

graph TD
  subgraph "cos = 1"
      S1["Same direction<br/>Angle = 0 degrees<br/>Maximally similar"]
  end
  subgraph "cos = 0"
      S2["Perpendicular<br/>Angle = 90 degrees<br/>No similarity"]
  end
  subgraph "cos = -1"
      S3["Opposite direction<br/>Angle = 180 degrees<br/>Maximally dissimilar"]
  end

Example 3: Cosine similarity between two word vectors

Suppose a word embedding model gives us:

$$\mathbf{v}_{\text{king}} = [2, 5, 1], \quad \mathbf{v}_{\text{queen}} = [2, 4, 1]$$

Step 1: Compute the dot product.

$$\mathbf{v}_{\text{king}} \cdot \mathbf{v}_{\text{queen}} = (2)(2) + (5)(4) + (1)(1) = 4 + 20 + 1 = 25$$

Step 2: Compute the $L_2$ norms.

$$\|\mathbf{v}_{\text{king}}\|_2 = \sqrt{4 + 25 + 1} = \sqrt{30} \approx 5.477$$

$$\|\mathbf{v}_{\text{queen}}\|_2 = \sqrt{4 + 16 + 1} = \sqrt{21} \approx 4.583$$

Step 3: Compute cosine similarity.

$$\text{cos\_sim} = \frac{25}{\sqrt{30} \cdot \sqrt{21}} = \frac{25}{\sqrt{630}} \approx \frac{25}{25.10} \approx 0.996$$

A cosine similarity of $0.996$ is very high, as expected. "King" and "queen" are semantically close.

Now compare with a less related word:

$$\mathbf{v}_{\text{car}} = [4, 0, 3]$$

$$\mathbf{v}_{\text{king}} \cdot \mathbf{v}_{\text{car}} = (2)(4) + (5)(0) + (1)(3) = 8 + 0 + 3 = 11$$

$$\|\mathbf{v}_{\text{car}}\|_2 = \sqrt{16 + 0 + 9} = 5$$

$$\text{cos\_sim} = \frac{11}{5.477 \times 5} = \frac{11}{27.39} \approx 0.402$$

Much lower, as expected. The direction of "car" diverges significantly from "king."

import numpy as np

king = np.array([2, 5, 1])
queen = np.array([2, 4, 1])
car = np.array([4, 0, 3])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen: {cosine_sim(king, queen):.4f}")  # ~0.996
print(f"king-car:   {cosine_sim(king, car):.4f}")     # ~0.402

Frobenius norm

Norms are not just for vectors. The Frobenius norm extends the $L_2$ idea to matrices. For a matrix $A$ with entries $a_{ij}$:

$$\|A\|_F = \sqrt{\sum_{i}\sum_{j} a_{ij}^2}$$

Treat the matrix as one long vector and take the $L_2$ norm. Simple as that.

The Frobenius norm connects to SVD: if $\sigma_1, \sigma_2, \ldots, \sigma_r$ are the singular values of $A$, then:

$$\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}$$

Example 4: Frobenius norm of a matrix

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

$$\|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.48$$

import numpy as np
A = np.array([[1, 2], [3, 4]])
print(np.linalg.norm(A, 'fro'))  # 5.477...
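The SVD identity above is just as easy to check numerically, using `np.linalg.svd` to get the singular values:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A
frob_from_svd = np.sqrt(np.sum(sigma ** 2))  # sqrt of sum of squared singular values

print(frob_from_svd)                 # ~5.4772
print(np.linalg.norm(A, 'fro'))      # same value
```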

Norms in machine learning

Here is why you should care about all of this.

Loss functions are norms in disguise. Mean squared error is the squared $L_2$ norm of the residual vector, divided by the number of samples. Mean absolute error uses the $L_1$ norm the same way.

Regularization adds a norm penalty to the loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \|\mathbf{w}\|_p^p$$

  • $p = 1$: Lasso. Produces sparse weights (feature selection).
  • $p = 2$: Ridge. Keeps all weights small but nonzero.
  • Elastic Net: combines both.
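As a sketch, here is how those penalties attach to a plain MSE loss. The data, weights, and $\lambda$ below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy data and weights -- for illustration only
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
lam = 0.1

data_loss = np.mean((X @ w - y) ** 2)              # plain MSE
lasso_loss = data_loss + lam * np.sum(np.abs(w))   # + lambda * ||w||_1
ridge_loss = data_loss + lam * np.sum(w ** 2)      # + lambda * ||w||_2^2
print(data_loss, lasso_loss, ridge_loss)
```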

Gradient clipping uses the $L_2$ norm. If $\|\nabla \mathcal{L}\|_2 > \tau$, scale the gradient down to have norm $\tau$.
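A minimal sketch of that clipping rule:

```python
import numpy as np

def clip_by_norm(grad, tau):
    """If ||grad||_2 > tau, rescale grad to have L2 norm exactly tau."""
    norm = np.linalg.norm(grad)
    if norm > tau:
        return grad * (tau / norm)
    return grad

g = np.array([3.0, -4.0])        # L2 norm 5
print(clip_by_norm(g, 1.0))      # rescaled to [0.6, -0.8], norm 1
print(clip_by_norm(g, 10.0))     # unchanged: norm already below threshold
```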

Batch normalization standardizes activations by subtracting the batch mean and dividing by the batch standard deviation, a close cousin of dividing by a norm. This stabilizes training.

L1 vs L2 regularization effect on weights:

graph LR
  subgraph "L1 Regularization - Lasso"
      L1W["Pushes weights to exactly zero<br/>Produces sparse models<br/>Automatic feature selection<br/>Diamond constraint region"]
  end
  subgraph "L2 Regularization - Ridge"
      L2W["Shrinks all weights toward zero<br/>Keeps all features active<br/>Weights stay small but nonzero<br/>Circular constraint region"]
  end

Unit vectors and normalization

Dividing a vector by its norm gives a unit vector: a vector with norm 1 that points in the same direction.

$$\hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$$

This is called normalization. It strips away magnitude and keeps only direction. Cosine similarity between two vectors is the same as the dot product of their unit vectors:

$$\text{cos\_sim}(\mathbf{x}, \mathbf{y}) = \hat{\mathbf{x}} \cdot \hat{\mathbf{y}}$$

Example: Normalize $\mathbf{x} = [3, -4, 2]^T$:

$$\hat{\mathbf{x}} = \frac{1}{\sqrt{29}}[3, -4, 2]^T \approx [0.557, -0.743, 0.371]^T$$

Verify: $\|\hat{\mathbf{x}}\|_2 = \sqrt{0.557^2 + 0.743^2 + 0.371^2} = \sqrt{0.310 + 0.552 + 0.138} = \sqrt{1.0} = 1$
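The same normalization in NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 2.0])
x_hat = x / np.linalg.norm(x)    # divide by the L2 norm

print(x_hat)                     # approx [0.557, -0.743, 0.371]
print(np.linalg.norm(x_hat))     # unit length
```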

Normalization shows up everywhere in ML. Batch normalization, layer normalization, and weight normalization all use some form of dividing by a norm to stabilize training.


Norm balls: geometric intuition

The set of all vectors with $\|\mathbf{x}\|_p \leq 1$ is called the $L_p$ unit ball. Its shape depends on $p$:

  • $L_1$ ball: A diamond (rotated square in 2D). Points on the axes are included because they have $L_1$ norm exactly 1.
  • $L_2$ ball: A circle (sphere in higher dimensions). The familiar round shape.
  • $L_\infty$ ball: A square (cube in higher dimensions). All components can be as large as 1.

This geometry matters for regularization. $L_1$ regularization constrains weights to lie in a diamond. The corners of the diamond sit on the axes, which is why $L_1$ tends to produce sparse solutions (weights exactly equal to zero). $L_2$ regularization uses a circle, which has no corners, so weights shrink evenly but rarely hit exactly zero.
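One way to see the corner effect numerically is to compare the soft-thresholding update used by Lasso solvers (the proximal step for the $L_1$ penalty) with the uniform shrinkage a ridge step applies. The weights and $\lambda$ below are illustrative:

```python
import numpy as np

w = np.array([0.05, -0.8, 0.02, 1.5])   # illustrative weights
lam = 0.1

# L1 proximal step (soft-thresholding): weights smaller than lam snap to zero
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# L2 shrinkage: every weight scales toward zero but stays nonzero
w_l2 = w / (1.0 + 2.0 * lam)

print(w_l1)   # two entries are now exactly zero
print(w_l2)   # all four entries still nonzero
```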

Norm ball shapes and their properties:

graph TD
  subgraph "L1 Ball - Diamond"
      L1P["Shape: Diamond in 2D<br/>Corners lie on axes<br/>Produces sparse solutions<br/>Used in Lasso regression"]
  end
  subgraph "L2 Ball - Circle"
      L2P["Shape: Circle in 2D<br/>Smooth, no corners<br/>Shrinks all weights evenly<br/>Used in Ridge regression"]
  end
  subgraph "L-inf Ball - Square"
      LIP["Shape: Square in 2D<br/>Flat sides along axes<br/>Bounds max component<br/>Used in adversarial robustness"]
  end

L1, L2, and L-infinity unit balls in 2D


Summary

| Concept | Formula | Use case |
| --- | --- | --- |
| $L_1$ norm | $\sum_i \lvert x_i \rvert$ | Lasso regularization, MAE loss |
| $L_2$ norm | $\sqrt{\sum_i x_i^2}$ | Ridge regularization, MSE loss, Euclidean distance |
| $L_\infty$ norm | $\max_i \lvert x_i \rvert$ | Worst-case bounds, adversarial robustness |
| Cosine similarity | $\frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$ | Text similarity, embedding comparison |
| Frobenius norm | $\sqrt{\sum_{i,j} a_{ij}^2}$ | Matrix approximation error, weight decay |

What comes next

You now have the tools to measure vectors and matrices. The next building block is calculus: how functions change. Head to Calculus Review: Derivatives and the Chain Rule to see how derivatives connect to optimization in ML.
