Norms, Distances, and Similarity
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
A norm is a function that takes a vector and returns a single non-negative number telling you how “big” that vector is. It sounds simple, but this idea underpins regularization, loss functions, distance metrics, and similarity measures across all of machine learning.
Prerequisites: You should be comfortable with vectors and dot products.
What is a norm?
A norm on a vector space must satisfy three properties:
- Non-negativity: $\|x\| \geq 0$, and $\|x\| = 0$ only if $x = 0$
- Scaling: $\|\alpha x\| = |\alpha| \, \|x\|$ for any scalar $\alpha$
- Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$
These properties guarantee that norms behave like our intuition for “length.”
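These properties are easy to check numerically. A minimal sketch with NumPy, using the Euclidean norm and arbitrary example vectors:

```python
import numpy as np

# Numerical spot-check of the three norm properties for the Euclidean norm.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
alpha = -3.7

# Non-negativity: ||x|| >= 0, and the zero vector has norm 0
assert np.linalg.norm(x) >= 0
assert np.linalg.norm(np.zeros(5)) == 0

# Scaling: ||alpha * x|| == |alpha| * ||x||
assert np.isclose(np.linalg.norm(alpha * x), abs(alpha) * np.linalg.norm(x))

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

print("All three norm properties hold for this example.")
```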
The $\ell_p$ norm family
The general $\ell_p$ norm of a vector $x \in \mathbb{R}^n$ is:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Different values of $p$ give different norms, and each has distinct geometric and practical properties.
$\ell_1$ norm (Manhattan norm)

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

Sum of absolute values. Called the Manhattan norm because it measures distance like walking on a grid of city blocks. In ML, $\ell_1$ regularization (Lasso) pushes weights toward exactly zero, producing sparse models.
$\ell_2$ norm (Euclidean norm)

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

This is the “straight line” distance you learned in school. It is the most common norm in ML. $\ell_2$ regularization (Ridge) penalizes large weights but does not force them to zero. The $\ell_2$ norm is also directly related to the dot product: $\|x\|_2^2 = x \cdot x$.
$\ell_\infty$ norm (max norm)

$$\|x\|_\infty = \max_i |x_i|$$

Just the largest absolute component. Useful when you care about worst-case behavior.
Example 1: Computing norms of a vector
Let $x = (3, -4, 2)$.
$\ell_1$ norm: $\|x\|_1 = |3| + |-4| + |2| = 9$
$\ell_2$ norm: $\|x\|_2 = \sqrt{3^2 + (-4)^2 + 2^2} = \sqrt{29} \approx 5.385$
$\ell_\infty$ norm: $\|x\|_\infty = \max(3, 4, 2) = 4$
Notice the ordering: $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$. This always holds. The $\ell_1$ norm is the most “generous” measure of size, and $\ell_\infty$ is the most conservative.
```python
import numpy as np

x = np.array([3, -4, 2])
l1 = np.linalg.norm(x, ord=1)        # 9.0
l2 = np.linalg.norm(x, ord=2)        # 5.385...
linf = np.linalg.norm(x, ord=np.inf) # 4.0
print(f"L1 = {l1}, L2 = {l2:.3f}, Linf = {linf}")
```
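The ordering between the three norms is not specific to this vector. A quick sketch checking it on a batch of random vectors:

```python
import numpy as np

# Check ||v||_inf <= ||v||_2 <= ||v||_1 on many random vectors.
rng = np.random.default_rng(42)
for _ in range(1000):
    v = rng.standard_normal(8)
    l1 = np.linalg.norm(v, ord=1)
    l2 = np.linalg.norm(v, ord=2)
    linf = np.linalg.norm(v, ord=np.inf)
    assert linf <= l2 <= l1

print("Ordering holds for all 1000 random vectors.")
```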
Example 2: L1 vs L2 on the same vectors
This example shows how $\ell_1$ and $\ell_2$ respond differently to “spread” vs “concentration” of values.
Vector A: $(1, 1, 1, 1)$ (values spread evenly)
Vector B: $(4, 0, 0, 0)$ (all mass in one component)
Both vectors have the same $\ell_1$ norm (4), but their $\ell_2$ norms are very different (2 vs 4). The $\ell_2$ norm penalizes concentration: one large component costs more than many small ones.
This is exactly why $\ell_2$ regularization discourages any single weight from getting too large, while $\ell_1$ regularization does not care about concentration. It just penalizes total absolute size.
Ratio: $\|A\|_2 / \|A\|_1 = 1/2$, but $\|B\|_2 / \|B\|_1 = 1$. The $\ell_2/\ell_1$ ratio is higher when the vector is “spikier.”
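The comparison is quick to reproduce in NumPy. A sketch using one evenly spread vector and one concentrated vector of equal $\ell_1$ mass:

```python
import numpy as np

# Vector A spreads its mass evenly; vector B concentrates it in one component.
A = np.array([1.0, 1.0, 1.0, 1.0])
B = np.array([4.0, 0.0, 0.0, 0.0])

for name, v in [("A", A), ("B", B)]:
    l1 = np.linalg.norm(v, ord=1)
    l2 = np.linalg.norm(v, ord=2)
    print(f"{name}: L1 = {l1}, L2 = {l2}, L2/L1 = {l2 / l1}")
# A: L1 = 4.0, L2 = 2.0, L2/L1 = 0.5
# B: L1 = 4.0, L2 = 4.0, L2/L1 = 1.0
```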
Distances
A norm naturally defines a distance. The distance between vectors $x$ and $y$ under the norm $\|\cdot\|$ is:

$$d(x, y) = \|x - y\|$$

The most common distance in ML is the Euclidean distance ($p = 2$):

$$d(x, y) = \|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

This is what k-nearest neighbors, k-means clustering, and many other algorithms use by default.
Quick example
Distance between $x = (1, 2)$ and $y = (4, 6)$:

$$d(x, y) = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5$$
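The same computation in NumPy, using a pair of small illustrative vectors:

```python
import numpy as np

# Euclidean distance is the L2 norm of the difference vector.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
dist = np.linalg.norm(x - y)  # sqrt(3^2 + 4^2)
print(dist)  # 5.0
```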
Cosine similarity
Sometimes you care about the direction of two vectors, not their magnitudes. Cosine similarity measures the angle between them:

$$\cos\theta = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$$

The result is between $-1$ and $1$:
- $\cos\theta = 1$ means the vectors point in the same direction
- $\cos\theta = 0$ means they are perpendicular (no relationship)
- $\cos\theta = -1$ means they point in opposite directions
Cosine similarity is the default metric for comparing text embeddings, word vectors, and document vectors. Two documents can have very different word counts ($\ell_2$ norms) but still be about the same topic (similar direction).
Cosine similarity: angle determines relationship:
```mermaid
graph TD
    subgraph "cos = 1"
        S1["Same direction<br/>Angle = 0 degrees<br/>Maximally similar"]
    end
    subgraph "cos = 0"
        S2["Perpendicular<br/>Angle = 90 degrees<br/>No similarity"]
    end
    subgraph "cos = -1"
        S3["Opposite direction<br/>Angle = 180 degrees<br/>Maximally dissimilar"]
    end
```
Example 3: Cosine similarity between two word vectors
Suppose a word embedding model gives us:

$$\text{king} = (2, 5, 1), \qquad \text{queen} = (2, 4, 1)$$

Step 1: Compute the dot product.

$$\text{king} \cdot \text{queen} = 2 \cdot 2 + 5 \cdot 4 + 1 \cdot 1 = 25$$

Step 2: Compute the norms.

$$\|\text{king}\|_2 = \sqrt{4 + 25 + 1} = \sqrt{30} \approx 5.477, \qquad \|\text{queen}\|_2 = \sqrt{4 + 16 + 1} = \sqrt{21} \approx 4.583$$

Step 3: Compute cosine similarity.

$$\cos\theta = \frac{25}{5.477 \times 4.583} \approx 0.996$$

A cosine similarity of $0.996$ is very high, as expected. “King” and “queen” are semantically close.
Now compare with a less related word: $\text{car} = (4, 0, 3)$.

$$\cos\theta = \frac{\text{king} \cdot \text{car}}{\|\text{king}\|_2 \, \|\text{car}\|_2} = \frac{8 + 0 + 3}{5.477 \times 5} \approx 0.402$$

Much lower, as expected. The direction of “car” diverges significantly from “king.”
```python
import numpy as np

king = np.array([2, 5, 1])
queen = np.array([2, 4, 1])
car = np.array([4, 0, 3])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen: {cosine_sim(king, queen):.4f}")  # ~0.996
print(f"king-car:   {cosine_sim(king, car):.4f}")    # ~0.402
```
Frobenius norm
Norms are not just for vectors. The Frobenius norm extends the idea to matrices. For an $m \times n$ matrix $A$ with entries $a_{ij}$:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$$

Treat the matrix as one long vector and take the $\ell_2$ norm. Simple as that.
The Frobenius norm connects to SVD: if $\sigma_1, \ldots, \sigma_r$ are the singular values of $A$, then:

$$\|A\|_F = \sqrt{\sum_{i=1}^{r} \sigma_i^2}$$
Example 4: Frobenius norm of a matrix
```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
print(np.linalg.norm(A, 'fro'))  # sqrt(1 + 4 + 9 + 16) = 5.477...
```
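The SVD connection can be verified directly. A short sketch comparing the two computations on the same matrix:

```python
import numpy as np

# The Frobenius norm equals the square root of the sum of squared singular values.
A = np.array([[1.0, 2.0], [3.0, 4.0]])

sigma = np.linalg.svd(A, compute_uv=False)  # singular values of A
fro_from_svd = np.sqrt(np.sum(sigma**2))
fro_direct = np.linalg.norm(A, 'fro')       # sqrt(30)

print(fro_direct, fro_from_svd)  # both ~5.477
assert np.isclose(fro_direct, fro_from_svd)
```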
Norms in machine learning
Here is why you should care about all of this.
Loss functions are norms in disguise. Mean squared error is just the squared $\ell_2$ norm of the residual vector, divided by $n$. Mean absolute error uses the $\ell_1$ norm.
Regularization adds a norm penalty to the loss:

$$L_{\text{reg}}(w) = L(w) + \lambda \|w\|$$

- $\ell_1$: Lasso. Produces sparse weights (feature selection).
- $\ell_2$: Ridge. Keeps all weights small but nonzero.
- Elastic Net: combines both.
Gradient clipping uses the $\ell_2$ norm. If $\|g\|_2 > c$, scale the gradient down to have norm $c$.
Batch normalization divides activations by their standard deviation, which behaves like a scaled $\ell_2$ norm of the centered activations. This stabilizes training.
L1 vs L2 regularization effect on weights:
```mermaid
graph LR
    subgraph "L1 Regularization - Lasso"
        L1W["Pushes weights to exactly zero<br/>Produces sparse models<br/>Automatic feature selection<br/>Diamond constraint region"]
    end
    subgraph "L2 Regularization - Ridge"
        L2W["Shrinks all weights toward zero<br/>Keeps all features active<br/>Weights stay small but nonzero<br/>Circular constraint region"]
    end
```
Unit vectors and normalization
Dividing a vector by its norm gives a unit vector: a vector with norm 1 that points in the same direction.

$$\hat{x} = \frac{x}{\|x\|_2}$$

This is called normalization. It strips away magnitude and keeps only direction. Cosine similarity between two vectors is the same as the dot product of their unit vectors:

$$\cos\theta = \hat{x} \cdot \hat{y}$$

Example: Normalize $x = (3, 4)$: since $\|x\|_2 = \sqrt{9 + 16} = 5$, we get $\hat{x} = (3/5, 4/5) = (0.6, 0.8)$.
Verify: $\|\hat{x}\|_2 = \sqrt{0.36 + 0.64} = 1$ ✓
Normalization shows up everywhere in ML. Batch normalization, layer normalization, and weight normalization all use some form of dividing by a norm to stabilize training.
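A short sketch showing that cosine similarity really is the dot product of unit vectors, with a couple of illustrative vectors:

```python
import numpy as np

def normalize(v):
    """Return the unit vector pointing in the same direction as v."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])
b = np.array([5.0, 12.0])

# Cosine similarity computed two ways gives the same answer.
cos_direct = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_units = np.dot(normalize(a), normalize(b))

print(cos_direct, cos_units)
assert np.isclose(cos_direct, cos_units)
assert np.isclose(np.linalg.norm(normalize(a)), 1.0)  # unit vector has norm 1
```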
Norm balls: geometric intuition
The set of all vectors with $\|x\|_p \leq 1$ is called the unit ball. Its shape depends on $p$:
- $\ell_1$ ball: A diamond (rotated square in 2D). Points on the axes are included because they have norm exactly 1.
- $\ell_2$ ball: A circle (sphere in higher dimensions). The familiar round shape.
- $\ell_\infty$ ball: A square (cube in higher dimensions). All components can be as large as 1.
This geometry matters for regularization. $\ell_1$ regularization constrains weights to lie in a diamond. The corners of the diamond sit on the axes, which is why $\ell_1$ tends to produce sparse solutions (weights exactly equal to zero). $\ell_2$ regularization uses a circle, which has no corners, so weights shrink evenly but never hit zero.
Norm ball shapes and their properties:
```mermaid
graph TD
    subgraph "L1 Ball - Diamond"
        L1P["Shape: Diamond in 2D<br/>Corners lie on axes<br/>Produces sparse solutions<br/>Used in Lasso regression"]
    end
    subgraph "L2 Ball - Circle"
        L2P["Shape: Circle in 2D<br/>Smooth, no corners<br/>Shrinks all weights evenly<br/>Used in Ridge regression"]
    end
    subgraph "L-inf Ball - Square"
        LIP["Shape: Square in 2D<br/>Flat sides along axes<br/>Bounds max component<br/>Used in adversarial robustness"]
    end
```
L1, L2, and L-infinity unit balls in 2D
Summary
| Concept | Formula | Use case |
|---|---|---|
| $\ell_1$ norm | $\lVert x \rVert_1 = \sum_i \lvert x_i \rvert$ | Lasso regularization, MAE loss |
| $\ell_2$ norm | $\lVert x \rVert_2 = \sqrt{\sum_i x_i^2}$ | Ridge regularization, MSE loss, Euclidean distance |
| $\ell_\infty$ norm | $\lVert x \rVert_\infty = \max_i \lvert x_i \rvert$ | Worst-case bounds, adversarial robustness |
| Cosine similarity | $\cos\theta = \frac{x \cdot y}{\lVert x \rVert_2 \lVert y \rVert_2}$ | Text similarity, embedding comparison |
| Frobenius norm | $\lVert A \rVert_F = \sqrt{\sum_{i,j} a_{ij}^2}$ | Matrix approximation error, weight decay |
What comes next
You now have the tools to measure vectors and matrices. The next building block is calculus: how functions change. Head to Calculus Review: Derivatives and the Chain Rule to see how derivatives connect to optimization in ML.