Maths for ML · Part 10

The Jacobian and Hessian Matrices

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

The gradient tells you how a scalar function changes with respect to each input. But what happens when the function outputs a vector, not a scalar? And what if you want to know not just the slope, but the curvature? The Jacobian and Hessian answer these two questions.

Prerequisites: You should be comfortable with partial derivatives, gradients, and the chain rule.


The Jacobian matrix

Motivation

Many functions in ML map a vector to another vector. A neural network layer takes a vector of activations and outputs a vector of new activations. A coordinate transformation takes (r, \theta) and outputs (x, y). To differentiate such functions, you need the Jacobian.

Definition

Given a function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m that maps an n-dimensional input to an m-dimensional output:

\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}

The Jacobian is the m \times n matrix of all first-order partial derivatives:

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output component. The Jacobian stacks all those gradients together.

Jacobian structure: m outputs and n inputs produce an m x n matrix:

graph TD
  I["n inputs: x1, x2, ..., xn"] --> F["Vector function f: R^n to R^m"]
  F --> O["m outputs: f1, f2, ..., fm"]
  O --> J["Jacobian J: m x n matrix"]
  J --> R1["Row 1 = gradient of f1"]
  J --> R2["Row 2 = gradient of f2"]
  J --> Rm["Row m = gradient of fm"]

Special cases

  • If m = 1 (scalar output), the Jacobian is a 1 \times n row vector: the transpose of the gradient.
  • If n = 1 (scalar input), the Jacobian is an m \times 1 column vector of ordinary derivatives.

Example 1: Jacobian of a 2D transformation

Consider the polar-to-Cartesian transformation:

\mathbf{f}(r, \theta) = \begin{bmatrix} r\cos\theta \\ r\sin\theta \end{bmatrix}

The Jacobian is a 2 \times 2 matrix:

J = \begin{bmatrix} \frac{\partial f_1}{\partial r} & \frac{\partial f_1}{\partial \theta} \\ \frac{\partial f_2}{\partial r} & \frac{\partial f_2}{\partial \theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}

Evaluate at r = 2, \theta = \pi/6:

We know \cos(\pi/6) = \sqrt{3}/2 \approx 0.866 and \sin(\pi/6) = 1/2 = 0.5.

J = \begin{bmatrix} 0.866 & -2(0.5) \\ 0.5 & 2(0.866) \end{bmatrix} = \begin{bmatrix} 0.866 & -1 \\ 0.5 & 1.732 \end{bmatrix}

The determinant of the Jacobian is:

\det(J) = (0.866)(1.732) - (-1)(0.5) = 1.5 + 0.5 = 2.0

The Jacobian determinant equals r = 2, which is why the r factor appears in polar integration: dx\,dy = r\,dr\,d\theta.
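You can sanity-check this Jacobian numerically with central finite differences. The sketch below (plain Python, no libraries; function and helper names are our own) approximates each column of J by nudging one input at a time:

```python
import math

def f(r, theta):
    # Polar-to-Cartesian: (r, theta) -> (x, y)
    return (r * math.cos(theta), r * math.sin(theta))

def numerical_jacobian(f, r, theta, eps=1e-6):
    # Central differences: column j of J is (f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps)
    fr_p, fr_m = f(r + eps, theta), f(r - eps, theta)
    ft_p, ft_m = f(r, theta + eps), f(r, theta - eps)
    return [
        [(fr_p[0] - fr_m[0]) / (2 * eps), (ft_p[0] - ft_m[0]) / (2 * eps)],
        [(fr_p[1] - fr_m[1]) / (2 * eps), (ft_p[1] - ft_m[1]) / (2 * eps)],
    ]

J = numerical_jacobian(f, 2.0, math.pi / 6)
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
print(J)    # approximately [[0.866, -1], [0.5, 1.732]]
print(det)  # approximately 2.0, matching det(J) = r
```

The numerical determinant matches the analytic value r = 2 to several decimal places, which is a quick way to catch algebra mistakes in hand-derived Jacobians.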


Example 2: Jacobian of a neural network layer

Consider a simple layer with two inputs and two outputs:

\mathbf{f}(x_1, x_2) = \begin{bmatrix} x_1^2 + x_2 \\ x_1 x_2 + 3x_2^2 \end{bmatrix}

Compute each partial derivative:

\frac{\partial f_1}{\partial x_1} = 2x_1, \quad \frac{\partial f_1}{\partial x_2} = 1

\frac{\partial f_2}{\partial x_1} = x_2, \quad \frac{\partial f_2}{\partial x_2} = x_1 + 6x_2

The Jacobian:

J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 + 6x_2 \end{bmatrix}

Evaluate at (x_1, x_2) = (1, 2):

J(1, 2) = \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix}

What this tells you: If you perturb the input slightly by \Delta\mathbf{x}, the output changes by approximately J \cdot \Delta\mathbf{x}.

For \Delta\mathbf{x} = [0.1, 0]^T (nudge x_1 only):

\Delta\mathbf{f} \approx \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix}\begin{bmatrix} 0.1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}

Both outputs increase by about 0.2 when we increase x_1 by 0.1.
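The linear prediction can be checked against the actual function values. A minimal sketch in plain Python (using the Jacobian at (1, 2) from the worked example):

```python
def f(x1, x2):
    # The layer from Example 2
    return (x1**2 + x2, x1 * x2 + 3 * x2**2)

# Jacobian evaluated at (1, 2), taken from the worked example
J = [[2.0, 1.0], [2.0, 13.0]]
dx = [0.1, 0.0]

# Predicted change: J @ dx
pred = [J[0][0] * dx[0] + J[0][1] * dx[1],
        J[1][0] * dx[0] + J[1][1] * dx[1]]

# Actual change from evaluating f directly
f0 = f(1.0, 2.0)
f1 = f(1.0 + dx[0], 2.0 + dx[1])
actual = [f1[0] - f0[0], f1[1] - f0[1]]

print(pred)    # [0.2, 0.2]
print(actual)  # [0.21, 0.2] up to float rounding
```

The first output's actual change is 0.21 rather than 0.2: the gap is the second-order term (the x_1^2 curvature) that the linear Jacobian approximation ignores.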

The Jacobian in backpropagation

In a neural network, backpropagation propagates gradients backward through layers using the chain rule. For a layer \mathbf{f}, if you know \frac{\partial L}{\partial \mathbf{f}} (the gradient of the loss with respect to the output), then:

\frac{\partial L}{\partial \mathbf{x}} = J^T \frac{\partial L}{\partial \mathbf{f}}

This is a vector-Jacobian product: the quantity J^T \mathbf{v} that reverse-mode autodiff computes. Deep learning frameworks compute these efficiently without ever forming the full Jacobian matrix, which would be far too large for realistic networks.
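In PyTorch, `torch.autograd.grad` with a `grad_outputs` argument computes exactly this J^T v product for the layer from Example 2 (the upstream gradient [1, 1] here is an arbitrary stand-in for \partial L / \partial \mathbf{f}):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
# The layer from Example 2
f = torch.stack([x[0]**2 + x[1], x[0] * x[1] + 3 * x[1]**2])

# Upstream gradient dL/df (arbitrary example values)
v = torch.tensor([1.0, 1.0])

# Reverse mode returns J^T v directly -- no full Jacobian is ever built
(grad_x,) = torch.autograd.grad(f, x, grad_outputs=v)
print(grad_x)  # tensor([ 4., 14.]), since J^T = [[2, 2], [1, 13]]
```

Note that the full Jacobian at (1, 2) is [[2, 1], [2, 13]]; backprop only ever needs its transpose applied to a vector, which costs about as much as one forward pass.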


The Hessian matrix

Motivation

The gradient is a first-order approximation: it tells you the slope. But slopes change. A function might have a steep slope but be curving downward (about to flatten out). The Hessian captures this curvature through second derivatives.

Definition

For a scalar function f: \mathbb{R}^n \to \mathbb{R}, the Hessian is the n \times n matrix of second-order partial derivatives:

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

For “nice” functions (twice continuously differentiable, which covers almost everything in ML), the Hessian is symmetric: \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.

What the Hessian tells you

The Hessian describes the curvature of f in every direction. Combined with the gradient, it gives a second-order Taylor approximation:

f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T H(\mathbf{x}) \Delta\mathbf{x}

The Hessian term is the correction that accounts for curvature.
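To see how much the curvature term buys you, here is a plain-Python sketch evaluating this approximation for the function of Example 3, f(x, y) = x^3 + 2x^2 y + y^2, at (1, 1) with a step of (0.1, 0.1):

```python
def f(x, y):
    return x**3 + 2 * x**2 * y + y**2

# Gradient and Hessian at (1, 1), from the formulas in Example 3
x0, y0 = 1.0, 1.0
grad = [3 * x0**2 + 4 * x0 * y0, 2 * x0**2 + 2 * y0]   # [7, 4]
H = [[6 * x0 + 4 * y0, 4 * x0], [4 * x0, 2.0]]         # [[10, 4], [4, 2]]

dx, dy = 0.1, 0.1
first = grad[0] * dx + grad[1] * dy                    # gradient term: 1.1
second = 0.5 * (dx * (H[0][0] * dx + H[0][1] * dy)     # curvature term: 0.1
                + dy * (H[1][0] * dx + H[1][1] * dy))

approx = f(x0, y0) + first + second
exact = f(x0 + dx, y0 + dy)
print(approx, exact)  # 5.2 vs 5.203
```

The first-order approximation alone gives 5.1; adding the Hessian term closes most of the gap to the exact value 5.203, leaving only the third-order remainder.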


Example 3: Computing the Hessian

Let f(x, y) = x^3 + 2x^2 y + y^2.

Step 1: First partial derivatives.

\frac{\partial f}{\partial x} = 3x^2 + 4xy

\frac{\partial f}{\partial y} = 2x^2 + 2y

Step 2: Second partial derivatives.

\frac{\partial^2 f}{\partial x^2} = 6x + 4y

\frac{\partial^2 f}{\partial x \partial y} = 4x

\frac{\partial^2 f}{\partial y \partial x} = 4x \quad \text{(same, by symmetry)}

\frac{\partial^2 f}{\partial y^2} = 2

Step 3: Assemble the Hessian.

H = \begin{bmatrix} 6x + 4y & 4x \\ 4x & 2 \end{bmatrix}

Evaluate at (1, 1):

H(1, 1) = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}

Example 4: Classifying a critical point using the Hessian

Continuing from Example 3, let us classify the critical points of f(x, y) = x^3 + 2x^2 y + y^2.

Step 1: Find critical points by setting \nabla f = \mathbf{0}.

3x^2 + 4xy = 0 \implies x(3x + 4y) = 0

2x^2 + 2y = 0 \implies y = -x^2

Case 1: x = 0. Then y = 0. Critical point: (0, 0).

Case 2: 3x + 4y = 0, so y = -3x/4. Substituting into y = -x^2:

-3x/4 = -x^2 \implies x^2 = 3x/4 \implies x = 3/4

Then y = -(3/4)^2 = -9/16. Critical point: (3/4, -9/16).

Step 2: Evaluate the Hessian at each critical point.

At (0, 0):

H(0, 0) = \begin{bmatrix} 0 & 0 \\ 0 & 2 \end{bmatrix}

Eigenvalues: \lambda_1 = 0, \lambda_2 = 2. One eigenvalue is zero, so the second derivative test is inconclusive. This is a degenerate critical point.

At (3/4, -9/16):

H(3/4, -9/16) = \begin{bmatrix} 6(3/4) + 4(-9/16) & 4(3/4) \\ 4(3/4) & 2 \end{bmatrix} = \begin{bmatrix} 72/16 - 36/16 & 3 \\ 3 & 2 \end{bmatrix} = \begin{bmatrix} 2.25 & 3 \\ 3 & 2 \end{bmatrix}

Eigenvalues: Solve \det(H - \lambda I) = 0:

(2.25 - \lambda)(2 - \lambda) - 9 = 0

\lambda^2 - 4.25\lambda + 4.5 - 9 = 0

\lambda^2 - 4.25\lambda - 4.5 = 0

Using the quadratic formula:

\lambda = \frac{4.25 \pm \sqrt{18.0625 + 18}}{2} = \frac{4.25 \pm \sqrt{36.0625}}{2} = \frac{4.25 \pm 6.005}{2}

\lambda_1 = \frac{4.25 + 6.005}{2} \approx 5.13, \quad \lambda_2 = \frac{4.25 - 6.005}{2} \approx -0.88

One positive, one negative. This means (3/4, -9/16) is a saddle point. The function curves upward in one direction and downward in another.
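For 2 \times 2 symmetric matrices, the eigenvalues have a closed form, so the whole classification can be scripted. A plain-Python sketch (the helper names are our own):

```python
import math

def eigs_sym2(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]]:
    # mean of the diagonal +/- sqrt(((a - c)/2)^2 + b^2)
    mean = (a + c) / 2
    r = math.sqrt(((a - c) / 2) ** 2 + b**2)
    return mean - r, mean + r

def classify(a, b, c, tol=1e-9):
    lo, hi = eigs_sym2(a, b, c)
    if lo > tol:
        return "local minimum"    # all eigenvalues positive
    if hi < -tol:
        return "local maximum"    # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"     # mixed signs
    return "inconclusive"         # some eigenvalue is (near) zero

print(classify(2.25, 3.0, 2.0))  # saddle point, eigenvalues ~5.13 and ~-0.88
print(classify(0.0, 0.0, 2.0))   # inconclusive, as at (0, 0)
```

Running it on H(3/4, -9/16) = [[2.25, 3], [3, 2]] reproduces the eigenvalues computed by hand above and the saddle-point verdict.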


Positive definiteness and curvature

The eigenvalues of the Hessian tell you about the curvature at a critical point:

| Eigenvalue pattern | Hessian classification | Critical point type |
| --- | --- | --- |
| All \lambda_i > 0 | Positive definite | Local minimum |
| All \lambda_i < 0 | Negative definite | Local maximum |
| Mixed signs | Indefinite | Saddle point |
| Some \lambda_i = 0 | Semi-definite | Inconclusive |

In optimization, we want to reach a point where \nabla f = \mathbf{0} and the Hessian is positive definite. That guarantees a local minimum. For convex functions, the Hessian is positive semi-definite everywhere, which guarantees that any local minimum is also a global minimum.

Classifying critical points using Hessian eigenvalues:

graph TD
  A["Critical point: gradient = 0"] --> B["Compute Hessian eigenvalues"]
  B --> C{"All positive?"}
  C -->|Yes| D["Local minimum"]
  C -->|No| E{"All negative?"}
  E -->|Yes| F["Local maximum"]
  E -->|No| G{"Mixed signs?"}
  G -->|Yes| H["Saddle point"]
  G -->|"Some zero"| I["Inconclusive"]



Example 5: A clean positive definite case

Let f(x, y) = x^2 + 4y^2 + 2x - 8y + 10.

Gradient:

\nabla f = \begin{bmatrix} 2x + 2 \\ 8y - 8 \end{bmatrix}

Set \nabla f = \mathbf{0}: x = -1, y = 1. Critical point: (-1, 1).

Hessian:

H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}

The Hessian is constant (does not depend on x, y). The eigenvalues are simply the diagonal entries: \lambda_1 = 2, \lambda_2 = 8.

Both positive, so H is positive definite. The critical point (-1, 1) is a local minimum (and since f is convex, it is the global minimum).

Geometric interpretation: The curvature is 2 in the x-direction and 8 in the y-direction, so the surface bends more sharply in y. For gradient descent, the step size must be small enough to stay stable along the steep y-direction (too large a step causes oscillation in y), which makes progress along the shallow x-direction slow. The condition number \kappa = \lambda_{\max}/\lambda_{\min} = 8/2 = 4 quantifies this imbalance. Higher condition numbers mean slower convergence for gradient descent.
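You can watch this play out by running plain gradient descent on Example 5's function. A minimal sketch in plain Python, with a learning rate chosen below the stability threshold 2/\lambda_{\max} = 0.25:

```python
# Gradient descent on f(x, y) = x^2 + 4y^2 + 2x - 8y + 10 (Example 5).
# Curvature is 2 along x and 8 along y, so the step size is capped by
# the steep y-direction and the shallow x-direction converges slowly.
def grad(x, y):
    return 2 * x + 2, 8 * y - 8

x, y = 0.0, 0.0
lr = 0.1  # stable: lr < 2 / lambda_max = 0.25
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

print(round(x, 4), round(y, 4))  # -1.0 1.0, the minimum found above
```

Per step, the error along y shrinks by a factor of |1 - 0.1 * 8| = 0.2 while the error along x shrinks only by |1 - 0.1 * 2| = 0.8: the shallow direction is the bottleneck, exactly as the condition number predicts.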


The Hessian in optimization

Newton’s method

While gradient descent uses only first-order information, Newton’s method uses both the gradient and the Hessian:

\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)

By accounting for curvature, Newton's method can converge in far fewer steps. The trade-off: computing and inverting the Hessian is expensive (O(n^2) storage, O(n^3) inversion). For a neural network with millions of parameters, this is impractical.

Newton’s method iteration loop:

graph LR
  A["Current point x_t"] --> B["Compute gradient"]
  B --> C["Compute Hessian H"]
  C --> D["Solve for step: H^-1 * gradient"]
  D --> E["Update: x_t+1 = x_t - step"]
  E --> F{"Converged?"}
  F -->|No| A
  F -->|Yes| G["Minimum found"]
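On a quadratic like Example 5, the Hessian is constant and the second-order Taylor model is exact, so a single Newton step lands directly on the minimum. A minimal sketch in plain Python:

```python
# Newton's method on f(x, y) = x^2 + 4y^2 + 2x - 8y + 10 (Example 5).
# H = [[2, 0], [0, 8]] is constant, so one step solves the problem.
def newton_step(x, y):
    gx, gy = 2 * x + 2, 8 * y - 8
    # H is diagonal here, so H^{-1} @ grad is elementwise division
    return x - gx / 2, y - gy / 8

print(newton_step(0.0, 0.0))  # (-1.0, 1.0): the minimum in one step
print(newton_step(5.0, -3.0))  # also (-1.0, 1.0), from any start
```

Compare this with the gradient descent run above, which needed on the order of a hundred iterations to reach the same point; that gap is what the curvature information buys, at the cost of forming and solving with H.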

Practical approximations

Because the full Hessian is too expensive for large models, ML practitioners use approximations:

  • BFGS / L-BFGS: Builds an approximate Hessian inverse incrementally
  • Adam optimizer: Adapts learning rates per parameter using moving averages of first and second moments (a diagonal Hessian approximation)
  • Hessian-free methods: Compute Hessian-vector products without forming the full matrix
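The Hessian-vector product trick in the last bullet is simple enough to sketch directly: Hv is the directional derivative of the gradient along v, so two gradient evaluations approximate it without ever storing an n \times n matrix. A plain-Python sketch using Example 3's function (helper names are our own):

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x^3 + 2x^2 y + y^2 (Example 3)
    return [3 * x**2 + 4 * x * y, 2 * x**2 + 2 * y]

def hvp(x, y, v, eps=1e-5):
    # H @ v ~ (grad(p + eps*v) - grad(p - eps*v)) / (2*eps):
    # two extra gradient evaluations, no n x n matrix ever formed
    gp = grad_f(x + eps * v[0], y + eps * v[1])
    gm = grad_f(x - eps * v[0], y - eps * v[1])
    return [(gp[i] - gm[i]) / (2 * eps) for i in range(2)]

print(hvp(1.0, 1.0, [1.0, 0.0]))  # ~[10, 4], the first column of H(1, 1)
```

In practice, frameworks compute exact Hessian-vector products with a second pass of automatic differentiation rather than finite differences, but the cost profile is the same: a few gradient-sized computations per product.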

Jacobian vs Hessian: summary

| Property | Jacobian | Hessian |
| --- | --- | --- |
| Applies to | Vector-valued \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m | Scalar-valued f: \mathbb{R}^n \to \mathbb{R} |
| Matrix size | m \times n | n \times n |
| Contains | First partial derivatives | Second partial derivatives |
| Tells you | How outputs change with inputs | Curvature of the function |
| Role in ML | Backpropagation, change of variables | Optimization, critical point analysis |
| Symmetric? | Not in general | Yes (for twice continuously differentiable functions) |

Computing in Python

import torch
from torch.autograd.functional import jacobian, hessian

# Jacobian example
def f_vec(x):
    return torch.stack([x[0]**2 + x[1], x[0]*x[1] + 3*x[1]**2])

x = torch.tensor([1.0, 2.0])
J = jacobian(f_vec, x)
print("Jacobian:\n", J)
# [[2, 1], [2, 13]]

# Hessian example
def f_scalar(x):
    return x[0]**3 + 2*x[0]**2 * x[1] + x[1]**2

x = torch.tensor([1.0, 1.0])
H = hessian(f_scalar, x)
print("Hessian:\n", H)
# [[10, 4], [4, 2]]

What comes next

The Jacobian and Hessian give you first- and second-order information about a function at a point. To build a local approximation that captures even more detail, you need the Taylor series expansion. Head to Taylor Series and Approximations to see how these ideas connect.
