Maths for ML · Part 3

Matrices and Matrix Operations

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

A matrix is a rectangular grid of numbers. If vectors are the atoms of ML data, matrices are the molecules. Your dataset is a matrix. The weights of a neural network layer are a matrix. Every linear transformation can be written as a matrix multiplication. You will work with matrices constantly, so let’s get the operations down cold.

Prerequisites

This article builds on scalars, vectors, and vector spaces. You should be comfortable with vectors, the dot product, and basic vector operations.

What is a matrix?

A matrix is a 2D array of numbers arranged in rows and columns. A matrix with $m$ rows and $n$ columns is called an $m \times n$ matrix (read "$m$ by $n$").

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$$

This is a $2 \times 3$ matrix. The entry $a_{ij}$ sits in row $i$, column $j$. We typically use uppercase bold letters ($A$, $B$, $W$) for matrices.

A vector is just a special case of a matrix: a column vector is an $n \times 1$ matrix, and a row vector is a $1 \times n$ matrix.
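In NumPy, this distinction shows up in a matrix's shape; a minimal sketch:

```python
import numpy as np

col = np.array([[1], [2], [3]])   # column vector: a 3 x 1 matrix
row = np.array([[1, 2, 3]])       # row vector: a 1 x 3 matrix

print(col.shape)  # (3, 1)
print(row.shape)  # (1, 3)
```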

Matrices in ML

  • Datasets: $m$ samples with $n$ features form an $m \times n$ matrix.
  • Weight matrices: a neural network layer with 128 inputs and 64 outputs has a $64 \times 128$ weight matrix.
  • Images: a grayscale image with 28 rows and 28 columns is a $28 \times 28$ matrix.
  • Attention scores: in a transformer, the attention matrix is $n \times n$ where $n$ is the sequence length.

Matrix addition

Add two matrices of the same size by adding corresponding entries:

$$A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix}$$

For example:

$$\begin{bmatrix} 1 & 3 \\ 2 & -1 \end{bmatrix} + \begin{bmatrix} 4 & 0 \\ -2 & 5 \end{bmatrix} = \begin{bmatrix} 5 & 3 \\ 0 & 4 \end{bmatrix}$$

You cannot add matrices of different sizes. A $2 \times 3$ matrix plus a $3 \times 2$ matrix is undefined.

Matrix addition is commutative ($A + B = B + A$) and associative ($(A + B) + C = A + (B + C)$).
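A quick NumPy check of the addition example and the commutativity property (a sketch):

```python
import numpy as np

A = np.array([[1, 3], [2, -1]])
B = np.array([[4, 0], [-2, 5]])

print(A + B)
# [[5 3]
#  [0 4]]

# Commutativity: A + B equals B + A entry by entry
print(np.array_equal(A + B, B + A))  # True
```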

Scalar multiplication

Multiply every entry of the matrix by the scalar:

$$cA = \begin{bmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{bmatrix}$$

For example:

$$2 \cdot \begin{bmatrix} 3 & -1 \\ 0 & 4 \end{bmatrix} = \begin{bmatrix} 6 & -2 \\ 0 & 8 \end{bmatrix}$$
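The same example in NumPy, where `*` with a scalar multiplies every entry (a sketch):

```python
import numpy as np

A = np.array([[3, -1], [0, 4]])
print(2 * A)
# [[ 6 -2]
#  [ 0  8]]
```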

The transpose

The transpose of a matrix $A$, written $A^T$, flips rows and columns. The entry at row $i$, column $j$ of $A$ becomes the entry at row $j$, column $i$ of $A^T$.

If $A$ is $m \times n$, then $A^T$ is $n \times m$.

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \quad \Rightarrow \quad A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

Useful properties:

  • $(A^T)^T = A$
  • $(A + B)^T = A^T + B^T$
  • $(AB)^T = B^T A^T$ (note the order reversal)

A matrix is called symmetric if $A = A^T$. Symmetric matrices come up frequently in ML; for instance, covariance matrices are always symmetric.
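In NumPy the transpose is the .T attribute. The sketch below also checks symmetry on the covariance-style product $AA^T$, which is always symmetric (values are illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
print(A.T)
# [[1 4]
#  [2 5]
#  [3 6]]

# A @ A.T is square and symmetric, like a covariance matrix
S = A @ A.T
print(np.array_equal(S, S.T))  # True
```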

Transpose operation: rows become columns:

graph LR
  A["Original A<br/>m x n<br/>Row i, Col j = a_ij"] -->|"Transpose"| AT["Transposed A^T<br/>n x m<br/>Row j, Col i = a_ij"]

The identity matrix

The identity matrix $I_n$ is the $n \times n$ square matrix with ones on the diagonal and zeros everywhere else:

$$I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

It is the matrix equivalent of the number 1. For any matrix $A$ of compatible size:

$$AI = A \quad \text{and} \quad IA = A$$
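A quick NumPy confirmation, using np.eye to build the identity (a sketch):

```python
import numpy as np

A = np.array([[2, 3], [1, 4]])
I = np.eye(2, dtype=int)  # 2 x 2 identity matrix

print(np.array_equal(A @ I, A))  # True
print(np.array_equal(I @ A, A))  # True
```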

Matrix multiplication

This is the most important operation in this article. Matrix multiplication is not done element by element. Instead, each entry of the product is a dot product of a row from the left matrix with a column from the right matrix.

For $A$ of size $m \times p$ and $B$ of size $p \times n$, the product $C = AB$ is $m \times n$, and:

$$c_{ij} = \sum_{k=1}^{p} a_{ik} \, b_{kj}$$

The inner dimensions must match: the number of columns in $A$ must equal the number of rows in $B$.

How matrix multiplication dimensions work:

graph LR
  A["Matrix A<br/>m x p"] -->|"inner dimension p must match"| B["Matrix B<br/>p x n"]
  B --> C["Result C<br/>m x n"]
  style A fill:#e1f5fe
  style B fill:#e1f5fe
  style C fill:#c8e6c9

$$\underbrace{A}_{m \times p} \cdot \underbrace{B}_{p \times n} = \underbrace{C}_{m \times n}$$

Dimension matching in a matrix multiplication chain

How to compute it by hand

For each entry $c_{ij}$:

  1. Take row $i$ of matrix $A$.
  2. Take column $j$ of matrix $B$.
  3. Compute their dot product. That is $c_{ij}$.

Repeat for every combination of row and column.
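The recipe above translates directly into code. Below is a naive triple-loop sketch; NumPy's @ operator computes the same result far more efficiently:

```python
import numpy as np

def matmul_by_hand(A, B):
    m, p = A.shape
    p2, n = B.shape
    assert p == p2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):          # row i of A
        for j in range(n):      # column j of B
            for k in range(p):  # dot product over the inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[2, 3], [1, 4]])
B = np.array([[5, -1], [2, 3]])
print(matmul_by_hand(A, B))
# [[16.  7.]
#  [13. 11.]]
```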

Worked example 1: multiplying two 2x2 matrices

Compute $AB$ where:

$$A = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & -1 \\ 2 & 3 \end{bmatrix}$$

Solution:

Both are $2 \times 2$, so the product is $2 \times 2$.

Entry $c_{11}$: row 1 of $A$ dotted with column 1 of $B$:

$$c_{11} = (2)(5) + (3)(2) = 10 + 6 = 16$$

Entry $c_{12}$: row 1 of $A$ dotted with column 2 of $B$:

$$c_{12} = (2)(-1) + (3)(3) = -2 + 9 = 7$$

Entry $c_{21}$: row 2 of $A$ dotted with column 1 of $B$:

$$c_{21} = (1)(5) + (4)(2) = 5 + 8 = 13$$

Entry $c_{22}$: row 2 of $A$ dotted with column 2 of $B$:

$$c_{22} = (1)(-1) + (4)(3) = -1 + 12 = 11$$

$$AB = \begin{bmatrix} 16 & 7 \\ 13 & 11 \end{bmatrix}$$
```python
import numpy as np

A = np.array([[2, 3], [1, 4]])
B = np.array([[5, -1], [2, 3]])
print(A @ B)
# [[16  7]
#  [13 11]]
```

Worked example 2: multiplying non-square matrices

Compute $CD$ where:

$$C = \begin{bmatrix} 1 & 0 & 2 \\ 3 & 1 & -1 \end{bmatrix}, \quad D = \begin{bmatrix} 4 & 1 \\ -2 & 0 \\ 3 & 5 \end{bmatrix}$$

$C$ is $2 \times 3$ and $D$ is $3 \times 2$. The inner dimensions match (both 3), so the result is $2 \times 2$.

Solution:

Entry $c_{11}$: row 1 of $C$ dotted with column 1 of $D$:

$$(1)(4) + (0)(-2) + (2)(3) = 4 + 0 + 6 = 10$$

Entry $c_{12}$: row 1 of $C$ dotted with column 2 of $D$:

$$(1)(1) + (0)(0) + (2)(5) = 1 + 0 + 10 = 11$$

Entry $c_{21}$: row 2 of $C$ dotted with column 1 of $D$:

$$(3)(4) + (1)(-2) + (-1)(3) = 12 + (-2) + (-3) = 7$$

Entry $c_{22}$: row 2 of $C$ dotted with column 2 of $D$:

$$(3)(1) + (1)(0) + (-1)(5) = 3 + 0 + (-5) = -2$$

$$CD = \begin{bmatrix} 10 & 11 \\ 7 & -2 \end{bmatrix}$$
```python
import numpy as np

C = np.array([[1, 0, 2], [3, 1, -1]])
D = np.array([[4, 1], [-2, 0], [3, 5]])
print(C @ D)
# [[10 11]
#  [ 7 -2]]
```

Worked example 3: matrix multiplication is not commutative

A common mistake is to assume $AB = BA$. Let's show this is false with a concrete example.

Using the same $A$ and $B$ from example 1:

$$A = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & -1 \\ 2 & 3 \end{bmatrix}$$

We already computed:

$$AB = \begin{bmatrix} 16 & 7 \\ 13 & 11 \end{bmatrix}$$

Now compute $BA$:

Entry $(1,1)$: $(5)(2) + (-1)(1) = 10 - 1 = 9$

Entry $(1,2)$: $(5)(3) + (-1)(4) = 15 - 4 = 11$

Entry $(2,1)$: $(2)(2) + (3)(1) = 4 + 3 = 7$

Entry $(2,2)$: $(2)(3) + (3)(4) = 6 + 12 = 18$

$$BA = \begin{bmatrix} 9 & 11 \\ 7 & 18 \end{bmatrix}$$

Comparing:

$$AB = \begin{bmatrix} 16 & 7 \\ 13 & 11 \end{bmatrix} \neq \begin{bmatrix} 9 & 11 \\ 7 & 18 \end{bmatrix} = BA$$

$AB \neq BA$ in general. Matrix multiplication is not commutative.

This matters in ML. The order of operations in neural network computations is significant. $W_2 W_1 \mathbf{x}$ applies $W_1$ first, then $W_2$. Reversing them gives a different result.

```python
import numpy as np

A = np.array([[2, 3], [1, 4]])
B = np.array([[5, -1], [2, 3]])

print("AB =")
print(A @ B)
print("\nBA =")
print(B @ A)
print("\nAB == BA?", np.array_equal(A @ B, B @ A))  # False
```

Note: with non-square matrices, $BA$ may not even be defined, and when it is defined, it can have a different shape from $AB$. If $A$ is $2 \times 3$ and $B$ is $3 \times 2$, then $AB$ is $2 \times 2$ but $BA$ is $3 \times 3$. They do not even have the same shape.
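A quick NumPy check of those shapes, using random matrices purely as placeholders:

```python
import numpy as np

A = np.random.rand(2, 3)  # 2 x 3
B = np.random.rand(3, 2)  # 3 x 2

print((A @ B).shape)  # (2, 2)
print((B @ A).shape)  # (3, 3)
```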

Element-wise vs matrix multiplication

This distinction trips up many beginners, especially in code.

Matrix multiplication (what we just covered) involves dot products of rows and columns. In NumPy, use @ or np.matmul().

Element-wise multiplication (also called the Hadamard product, written $A \odot B$) multiplies corresponding entries. Both matrices must be the same size. In NumPy, use *.

$$\begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix} \odot \begin{bmatrix} 5 & -1 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 10 & -3 \\ 2 & 12 \end{bmatrix}$$

Compare with our earlier result for matrix multiplication:

$$\begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} 5 & -1 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 16 & 7 \\ 13 & 11 \end{bmatrix}$$

Completely different results.

```python
import numpy as np

A = np.array([[2, 3], [1, 4]])
B = np.array([[5, -1], [2, 3]])

print("Matrix multiply (A @ B):")
print(A @ B)

print("\nElement-wise multiply (A * B):")
print(A * B)
```

Where each appears in ML

  • Matrix multiplication: forward pass of a neural network ($W\mathbf{x}$), attention score computation, any linear transformation.
  • Element-wise multiplication: applying masks (setting certain values to zero), gating mechanisms in LSTMs and GRUs, feature scaling.
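To illustrate the masking use case, a minimal sketch with made-up scores (the values here are purely illustrative):

```python
import numpy as np

scores = np.array([[0.9, 0.1], [0.4, 0.6]])
mask = np.array([[1, 0], [1, 1]])  # 0 marks positions to zero out

# Element-wise multiply: masked entries become 0, others are unchanged
print(scores * mask)
```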

Properties of matrix multiplication

While not commutative, matrix multiplication does satisfy some useful properties:

| Property | Formula |
| --- | --- |
| Associative | $(AB)C = A(BC)$ |
| Distributive | $A(B + C) = AB + AC$ |
| Scalar factor | $c(AB) = (cA)B = A(cB)$ |
| Transpose | $(AB)^T = B^T A^T$ |
| Identity | $AI = IA = A$ |

The transpose rule is worth memorizing: the transpose of a product reverses the order. This comes up constantly in deriving gradients of matrix expressions.
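A quick numerical sanity check of the order-reversal rule, using random matrices (values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
B = rng.random((4, 2))

# (AB)^T equals B^T A^T, up to floating-point rounding
print(np.allclose((A @ B).T, B.T @ A.T))  # True
```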

Matrices as linear transformations

Here is the deeper insight: every matrix represents a linear transformation. When you multiply a vector by a matrix, you are transforming that vector.

For a $2 \times 2$ matrix and a 2D vector:

$$A\mathbf{x} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} ax_1 + bx_2 \\ cx_1 + dx_2 \end{bmatrix}$$

Different matrices encode different transformations:

  • Rotation: rotates vectors by an angle.
  • Scaling: stretches or compresses along axes.
  • Reflection: flips across an axis.
  • Projection: collapses onto a lower-dimensional space.
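For example, the standard 2D rotation matrix built from sine and cosine rotates any vector by a fixed angle; a sketch:

```python
import numpy as np

theta = np.pi / 2  # rotate by 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 0.0])
# The x-axis unit vector lands on the y-axis
print(np.round(R @ x, 6))  # [0. 1.]
```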

A neural network layer computes $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. The weight matrix $W$ is a linear transformation that maps the input space to the output space. The nonlinear activation function applied afterward is what lets neural networks learn complex patterns.
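Sketching that layer computation in NumPy, with the $64 \times 128$ weight matrix from earlier and random illustrative values:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.standard_normal((64, 128))  # 128 inputs -> 64 outputs
b = rng.standard_normal(64)         # bias vector
x = rng.standard_normal(128)        # one input sample

y = W @ x + b   # linear transformation plus bias
print(y.shape)  # (64,)
```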

Understanding what eigenvalues tell you about a transformation, and how to decompose matrices using SVD, builds directly on the foundation we are laying here.

Matrix as a linear transformation:

graph LR
  X["Input vector x<br/>in R^n"] --> M["Multiply by matrix A<br/>Linear transformation"]
  M --> Y["Output vector y = Ax<br/>in R^m"]
  style X fill:#e1f5fe
  style M fill:#fff9c4
  style Y fill:#c8e6c9

Summary

| Operation | Notation | Result size | Key rule |
| --- | --- | --- | --- |
| Addition | $A + B$ | Same as inputs | Same dimensions required |
| Scalar multiply | $cA$ | Same as $A$ | Every entry times $c$ |
| Transpose | $A^T$ | Rows and columns flipped | $(AB)^T = B^T A^T$ |
| Matrix multiply | $AB$ | $m \times n$ from $m \times p$ and $p \times n$ | Inner dimensions must match |
| Element-wise multiply | $A \odot B$ | Same as inputs | Same dimensions required |

What comes next

With vectors and matrices in hand, the next logical question is: can we undo a matrix operation? The next article covers matrix inverses and systems of linear equations, where you will learn to solve $A\mathbf{x} = \mathbf{b}$ and understand when a solution exists.
