Maths for ML · Part 10

The Jacobian and Hessian Matrices

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

The gradient tells you how a scalar function changes with respect to each input. But what happens when the function outputs a vector, not a scalar? And what if you want to know not just the slope, but the curvature? The Jacobian and Hessian answer these two questions.

Prerequisites: You should be comfortable with partial derivatives, gradients, and the chain rule.


The Jacobian matrix

Motivation

Many functions in ML map a vector to another vector. A neural network layer takes a vector of activations and outputs a vector of new activations. A coordinate transformation takes (r, \theta) and outputs (x, y). To differentiate such functions, you need the Jacobian.

Definition

Given a function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m that maps an n-dimensional input to an m-dimensional output:

\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}

The Jacobian is the m \times n matrix of all first-order partial derivatives:

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output component. The Jacobian stacks all those gradients together.

Jacobian structure: m outputs and n inputs produce an m x n matrix:

graph TD
  I["n inputs: x1, x2, ..., xn"] --> F["Vector function f: R^n to R^m"]
  F --> O["m outputs: f1, f2, ..., fm"]
  O --> J["Jacobian J: m x n matrix"]
  J --> R1["Row 1 = gradient of f1"]
  J --> R2["Row 2 = gradient of f2"]
  J --> Rm["Row m = gradient of fm"]

Special cases

  • If m = 1 (scalar output), the Jacobian is a 1 \times n row vector: the transpose of the gradient.
  • If n = 1 (scalar input), the Jacobian is an m \times 1 column vector of ordinary derivatives.

Example 1: Jacobian of a 2D transformation

Consider the polar-to-Cartesian transformation:

\mathbf{f}(r, \theta) = \begin{bmatrix} r\cos\theta \\ r\sin\theta \end{bmatrix}

The Jacobian is a 2 \times 2 matrix:

J = \begin{bmatrix} \frac{\partial f_1}{\partial r} & \frac{\partial f_1}{\partial \theta} \\ \frac{\partial f_2}{\partial r} & \frac{\partial f_2}{\partial \theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}

Evaluate at r = 2, \theta = \pi/6:

We know \cos(\pi/6) = \sqrt{3}/2 \approx 0.866 and \sin(\pi/6) = 1/2 = 0.5.

J = \begin{bmatrix} 0.866 & -2(0.5) \\ 0.5 & 2(0.866) \end{bmatrix} = \begin{bmatrix} 0.866 & -1 \\ 0.5 & 1.732 \end{bmatrix}

The determinant of the Jacobian is:

\det(J) = (0.866)(1.732) - (-1)(0.5) = 1.5 + 0.5 = 2.0

The Jacobian determinant equals r = 2, which is why the r factor appears in polar integration: dx\,dy = r\,dr\,d\theta.
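You can sanity-check this Jacobian numerically with central finite differences. The sketch below (plain Python, no libraries; function and helper names are our own) approximates each column of J by nudging one input at a time:

```python
import math

def f(r, theta):
    # Polar-to-Cartesian: (r, theta) -> (x, y)
    return (r * math.cos(theta), r * math.sin(theta))

def numerical_jacobian(f, r, theta, eps=1e-6):
    # Central differences: column j of J is (f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps)
    fr_p, fr_m = f(r + eps, theta), f(r - eps, theta)
    ft_p, ft_m = f(r, theta + eps), f(r, theta - eps)
    return [
        [(fr_p[0] - fr_m[0]) / (2 * eps), (ft_p[0] - ft_m[0]) / (2 * eps)],
        [(fr_p[1] - fr_m[1]) / (2 * eps), (ft_p[1] - ft_m[1]) / (2 * eps)],
    ]

J = numerical_jacobian(f, 2.0, math.pi / 6)
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
print(J)    # approximately [[0.866, -1], [0.5, 1.732]]
print(det)  # approximately 2.0, matching det(J) = r
```

The numerical determinant matches the analytic value r = 2 to several decimal places, which is a quick way to catch algebra mistakes in hand-derived Jacobians.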


Example 2: Jacobian of a neural network layer

Consider a simple layer with two inputs and two outputs:

\mathbf{f}(x_1, x_2) = \begin{bmatrix} x_1^2 + x_2 \\ x_1 x_2 + 3x_2^2 \end{bmatrix}

Compute each partial derivative:

\frac{\partial f_1}{\partial x_1} = 2x_1, \quad \frac{\partial f_1}{\partial x_2} = 1

\frac{\partial f_2}{\partial x_1} = x_2, \quad \frac{\partial f_2}{\partial x_2} = x_1 + 6x_2

The Jacobian:

J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 + 6x_2 \end{bmatrix}

Evaluate at (x_1, x_2) = (1, 2):

J(1, 2) = \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix}

What this tells you: If you perturb the input slightly by \Delta\mathbf{x}, the output changes by approximately J \cdot \Delta\mathbf{x}.

For \Delta\mathbf{x} = [0.1, 0]^T (nudge x_1 only):

\Delta\mathbf{f} \approx \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix}\begin{bmatrix} 0.1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}

Both outputs increase by about 0.2 when we increase x_1 by 0.1.
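The linear prediction can be checked against the actual function values. A minimal sketch in plain Python (using the Jacobian at (1, 2) from the worked example):

```python
def f(x1, x2):
    # The layer from Example 2
    return (x1**2 + x2, x1 * x2 + 3 * x2**2)

# Jacobian evaluated at (1, 2), taken from the worked example
J = [[2.0, 1.0], [2.0, 13.0]]
dx = [0.1, 0.0]

# Predicted change: J @ dx
pred = [J[0][0] * dx[0] + J[0][1] * dx[1],
        J[1][0] * dx[0] + J[1][1] * dx[1]]

# Actual change from evaluating f directly
f0 = f(1.0, 2.0)
f1 = f(1.0 + dx[0], 2.0 + dx[1])
actual = [f1[0] - f0[0], f1[1] - f0[1]]

print(pred)    # [0.2, 0.2]
print(actual)  # [0.21, 0.2] up to float rounding
```

The first output's actual change is 0.21 rather than 0.2: the gap is the second-order term (the x_1^2 curvature) that the linear Jacobian approximation ignores.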

The Jacobian in backpropagation

In a neural network, backpropagation propagates gradients backward through layers using the chain rule. For a layer \mathbf{f}, if you know \frac{\partial L}{\partial \mathbf{f}} (the gradient of the loss with respect to the output), then:

\frac{\partial L}{\partial \mathbf{x}} = J^T \frac{\partial L}{\partial \mathbf{f}}

This is a vector-Jacobian product: the quantity J^T \mathbf{v} that reverse-mode autodiff computes. Deep learning frameworks compute these efficiently without ever forming the full Jacobian matrix, which would be far too large for realistic networks.
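In PyTorch, `torch.autograd.grad` with a `grad_outputs` argument computes exactly this J^T v product for the layer from Example 2 (the upstream gradient [1, 1] here is an arbitrary stand-in for \partial L / \partial \mathbf{f}):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
# The layer from Example 2
f = torch.stack([x[0]**2 + x[1], x[0] * x[1] + 3 * x[1]**2])

# Upstream gradient dL/df (arbitrary example values)
v = torch.tensor([1.0, 1.0])

# Reverse mode returns J^T v directly -- no full Jacobian is ever built
(grad_x,) = torch.autograd.grad(f, x, grad_outputs=v)
print(grad_x)  # tensor([ 4., 14.]), since J^T = [[2, 2], [1, 13]]
```

Note that the full Jacobian at (1, 2) is [[2, 1], [2, 13]]; backprop only ever needs its transpose applied to a vector, which costs about as much as one forward pass.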


The Hessian matrix

Motivation

The gradient is a first-order approximation: it tells you the slope. But slopes change. A function might have a steep slope but be curving downward (about to flatten out). The Hessian captures this curvature through second derivatives.

Definition

For a scalar function f: \mathbb{R}^n \to \mathbb{R}, the Hessian is the n \times n matrix of second-order partial derivatives:

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

For “nice” functions (twice continuously differentiable, which covers almost everything in ML), the Hessian is symmetric: \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.

What the Hessian tells you

The Hessian describes the curvature of f in every direction. Combined with the gradient, it gives a second-order Taylor approximation:

f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T H(\mathbf{x}) \Delta\mathbf{x}

The Hessian term is the correction that accounts for curvature.
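To see how much the curvature term buys you, here is a plain-Python sketch evaluating this approximation for the function of Example 3, f(x, y) = x^3 + 2x^2 y + y^2, at (1, 1) with a step of (0.1, 0.1):

```python
def f(x, y):
    return x**3 + 2 * x**2 * y + y**2

# Gradient and Hessian at (1, 1), from the formulas in Example 3
x0, y0 = 1.0, 1.0
grad = [3 * x0**2 + 4 * x0 * y0, 2 * x0**2 + 2 * y0]   # [7, 4]
H = [[6 * x0 + 4 * y0, 4 * x0], [4 * x0, 2.0]]         # [[10, 4], [4, 2]]

dx, dy = 0.1, 0.1
first = grad[0] * dx + grad[1] * dy                    # gradient term: 1.1
second = 0.5 * (dx * (H[0][0] * dx + H[0][1] * dy)     # curvature term: 0.1
                + dy * (H[1][0] * dx + H[1][1] * dy))

approx = f(x0, y0) + first + second
exact = f(x0 + dx, y0 + dy)
print(approx, exact)  # 5.2 vs 5.203
```

The first-order approximation alone gives 5.1; adding the Hessian term closes most of the gap to the exact value 5.203, leaving only the third-order remainder.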


Example 3: Computing the Hessian

Let f(x, y) = x^3 + 2x^2 y + y^2.

Step 1: First partial derivatives.

\frac{\partial f}{\partial x} = 3x^2 + 4xy

\frac{\partial f}{\partial y} = 2x^2 + 2y

Step 2: Second partial derivatives.

\frac{\partial^2 f}{\partial x^2} = 6x + 4y

\frac{\partial^2 f}{\partial x \partial y} = 4x

\frac{\partial^2 f}{\partial y \partial x} = 4x \quad \text{(same, by symmetry)}

\frac{\partial^2 f}{\partial y^2} = 2

Step 3: Assemble the Hessian.

H = \begin{bmatrix} 6x + 4y & 4x \\ 4x & 2 \end{bmatrix}

Evaluate at (1, 1):

H(1, 1) = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}

Example 4: Classifying a critical point using the Hessian

Continuing from Example 3, let us classify the critical points of f(x, y) = x^3 + 2x^2 y + y^2.

Step 1: Find critical points by setting \nabla f = \mathbf{0}.

3x^2 + 4xy = 0 \implies x(3x + 4y) = 0

2x^2 + 2y = 0 \implies y = -x^2

Case 1: x = 0. Then y = 0. Critical point: (0, 0).

Case 2: 3x + 4y = 0, so y = -3x/4. Substituting into y = -x^2:

-3x/4 = -x^2 \implies x^2 = 3x/4 \implies x = 3/4

Then y = -(3/4)^2 = -9/16. Critical point: (3/4, -9/16).

Step 2: Evaluate the Hessian at each critical point.

At (0, 0):

H(0, 0) = \begin{bmatrix} 0 & 0 \\ 0 & 2 \end{bmatrix}

Eigenvalues: \lambda_1 = 0, \lambda_2 = 2. One eigenvalue is zero, so the second derivative test is inconclusive. This is a degenerate critical point.

At (3/4, -9/16):

H(3/4, -9/16) = \begin{bmatrix} 6(3/4) + 4(-9/16) & 4(3/4) \\ 4(3/4) & 2 \end{bmatrix} = \begin{bmatrix} 72/16 - 36/16 & 3 \\ 3 & 2 \end{bmatrix} = \begin{bmatrix} 2.25 & 3 \\ 3 & 2 \end{bmatrix}

Eigenvalues: Solve \det(H - \lambda I) = 0:

(2.25 - \lambda)(2 - \lambda) - 9 = 0

\lambda^2 - 4.25\lambda + 4.5 - 9 = 0

\lambda^2 - 4.25\lambda - 4.5 = 0

Using the quadratic formula:

\lambda = \frac{4.25 \pm \sqrt{18.0625 + 18}}{2} = \frac{4.25 \pm \sqrt{36.0625}}{2} = \frac{4.25 \pm 6.005}{2}

\lambda_1 = \frac{4.25 + 6.005}{2} \approx 5.13, \quad \lambda_2 = \frac{4.25 - 6.005}{2} \approx -0.88

One positive, one negative. This means (3/4, -9/16) is a saddle point. The function curves upward in one direction and downward in another.
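For 2 \times 2 symmetric matrices, the eigenvalues have a closed form, so the whole classification can be scripted. A plain-Python sketch (the helper names are our own):

```python
import math

def eigs_sym2(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]]:
    # mean of the diagonal +/- sqrt(((a - c)/2)^2 + b^2)
    mean = (a + c) / 2
    r = math.sqrt(((a - c) / 2) ** 2 + b**2)
    return mean - r, mean + r

def classify(a, b, c, tol=1e-9):
    lo, hi = eigs_sym2(a, b, c)
    if lo > tol:
        return "local minimum"    # all eigenvalues positive
    if hi < -tol:
        return "local maximum"    # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"     # mixed signs
    return "inconclusive"         # some eigenvalue is (near) zero

print(classify(2.25, 3.0, 2.0))  # saddle point, eigenvalues ~5.13 and ~-0.88
print(classify(0.0, 0.0, 2.0))   # inconclusive, as at (0, 0)
```

Running it on H(3/4, -9/16) = [[2.25, 3], [3, 2]] reproduces the eigenvalues computed by hand above and the saddle-point verdict.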


Positive definiteness and curvature

The eigenvalues of the Hessian tell you about the curvature at a critical point:

| Eigenvalue pattern | Hessian classification | Critical point type |
| --- | --- | --- |
| All \lambda_i > 0 | Positive definite | Local minimum |
| All \lambda_i < 0 | Negative definite | Local maximum |
| Mixed signs | Indefinite | Saddle point |
| Some \lambda_i = 0 | Semi-definite | Inconclusive |

In optimization, we want to reach a point where \nabla f = \mathbf{0} and the Hessian is positive definite. That guarantees a local minimum. For convex functions, the Hessian is positive semi-definite everywhere, which guarantees that any local minimum is also a global minimum.

Classifying critical points using Hessian eigenvalues:

graph TD
  A["Critical point: gradient = 0"] --> B["Compute Hessian eigenvalues"]
  B --> C{"All positive?"}
  C -->|Yes| D["Local minimum"]
  C -->|No| E{"All negative?"}
  E -->|Yes| F["Local maximum"]
  E -->|No| G{"Mixed signs?"}
  G -->|Yes| H["Saddle point"]
  G -->|"Some zero"| I["Inconclusive"]



Example 5: A clean positive definite case

Let f(x, y) = x^2 + 4y^2 + 2x - 8y + 10.

Gradient:

\nabla f = \begin{bmatrix} 2x + 2 \\ 8y - 8 \end{bmatrix}

Set \nabla f = \mathbf{0}: x = -1, y = 1. Critical point: (-1, 1).

Hessian:

H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}

The Hessian is constant (does not depend on x, y). The eigenvalues are simply the diagonal entries: \lambda_1 = 2, \lambda_2 = 8.

Both positive, so H is positive definite. The critical point (-1, 1) is a local minimum (and since f is convex, it is the global minimum).

Geometric interpretation: The curvature is 2 in the x-direction and 8 in the y-direction, so the surface bends more sharply in y. For gradient descent, the step size must be small enough to stay stable along the steep y-direction (too large a step causes oscillation in y), which makes progress along the shallow x-direction slow. The condition number \kappa = \lambda_{\max}/\lambda_{\min} = 8/2 = 4 quantifies this imbalance. Higher condition numbers mean slower convergence for gradient descent.
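You can watch this play out by running plain gradient descent on Example 5's function. A minimal sketch in plain Python, with a learning rate chosen below the stability threshold 2/\lambda_{\max} = 0.25:

```python
# Gradient descent on f(x, y) = x^2 + 4y^2 + 2x - 8y + 10 (Example 5).
# Curvature is 2 along x and 8 along y, so the step size is capped by
# the steep y-direction and the shallow x-direction converges slowly.
def grad(x, y):
    return 2 * x + 2, 8 * y - 8

x, y = 0.0, 0.0
lr = 0.1  # stable: lr < 2 / lambda_max = 0.25
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

print(round(x, 4), round(y, 4))  # -1.0 1.0, the minimum found above
```

Per step, the error along y shrinks by a factor of |1 - 0.1 * 8| = 0.2 while the error along x shrinks only by |1 - 0.1 * 2| = 0.8: the shallow direction is the bottleneck, exactly as the condition number predicts.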


The Hessian in optimization

Newton’s method

While gradient descent uses only first-order information, Newton’s method uses both the gradient and the Hessian:

\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)

By accounting for curvature, Newton's method can converge in far fewer steps. The trade-off: computing and inverting the Hessian is expensive (O(n^2) storage, O(n^3) inversion). For a neural network with millions of parameters, this is impractical.

Newton’s method iteration loop:

graph LR
  A["Current point x_t"] --> B["Compute gradient"]
  B --> C["Compute Hessian H"]
  C --> D["Solve for step: H^-1 * gradient"]
  D --> E["Update: x_t+1 = x_t - step"]
  E --> F{"Converged?"}
  F -->|No| A
  F -->|Yes| G["Minimum found"]
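On a quadratic like Example 5, the Hessian is constant and the second-order Taylor model is exact, so a single Newton step lands directly on the minimum. A minimal sketch in plain Python:

```python
# Newton's method on f(x, y) = x^2 + 4y^2 + 2x - 8y + 10 (Example 5).
# H = [[2, 0], [0, 8]] is constant, so one step solves the problem.
def newton_step(x, y):
    gx, gy = 2 * x + 2, 8 * y - 8
    # H is diagonal here, so H^{-1} @ grad is elementwise division
    return x - gx / 2, y - gy / 8

print(newton_step(0.0, 0.0))  # (-1.0, 1.0): the minimum in one step
print(newton_step(5.0, -3.0))  # also (-1.0, 1.0), from any start
```

Compare this with the gradient descent run above, which needed on the order of a hundred iterations to reach the same point; that gap is what the curvature information buys, at the cost of forming and solving with H.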

Practical approximations

Because the full Hessian is too expensive for large models, ML practitioners use approximations:

  • BFGS / L-BFGS: Builds an approximate Hessian inverse incrementally
  • Adam optimizer: Adapts learning rates per parameter using moving averages of first and second moments (a diagonal Hessian approximation)
  • Hessian-free methods: Compute Hessian-vector products without forming the full matrix
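The Hessian-vector product trick in the last bullet is simple enough to sketch directly: Hv is the directional derivative of the gradient along v, so two gradient evaluations approximate it without ever storing an n \times n matrix. A plain-Python sketch using Example 3's function (helper names are our own):

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x^3 + 2x^2 y + y^2 (Example 3)
    return [3 * x**2 + 4 * x * y, 2 * x**2 + 2 * y]

def hvp(x, y, v, eps=1e-5):
    # H @ v ~ (grad(p + eps*v) - grad(p - eps*v)) / (2*eps):
    # two extra gradient evaluations, no n x n matrix ever formed
    gp = grad_f(x + eps * v[0], y + eps * v[1])
    gm = grad_f(x - eps * v[0], y - eps * v[1])
    return [(gp[i] - gm[i]) / (2 * eps) for i in range(2)]

print(hvp(1.0, 1.0, [1.0, 0.0]))  # ~[10, 4], the first column of H(1, 1)
```

In practice, frameworks compute exact Hessian-vector products with a second pass of automatic differentiation rather than finite differences, but the cost profile is the same: a few gradient-sized computations per product.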

Jacobian vs Hessian: summary

| Property | Jacobian | Hessian |
| --- | --- | --- |
| Applies to | Vector-valued \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m | Scalar-valued f: \mathbb{R}^n \to \mathbb{R} |
| Matrix size | m \times n | n \times n |
| Contains | First partial derivatives | Second partial derivatives |
| Tells you | How outputs change with inputs | Curvature of the function |
| Role in ML | Backpropagation, change of variables | Optimization, critical point analysis |
| Symmetric? | Not in general | Yes (for twice continuously differentiable functions) |

Computing in Python

import torch
from torch.autograd.functional import jacobian, hessian

# Jacobian example
def f_vec(x):
    return torch.stack([x[0]**2 + x[1], x[0]*x[1] + 3*x[1]**2])

x = torch.tensor([1.0, 2.0])
J = jacobian(f_vec, x)
print("Jacobian:\n", J)
# [[2, 1], [2, 13]]

# Hessian example
def f_scalar(x):
    return x[0]**3 + 2*x[0]**2 * x[1] + x[1]**2

x = torch.tensor([1.0, 1.0])
H = hessian(f_scalar, x)
print("Hessian:\n", H)
# [[10, 4], [4, 2]]

What comes next

The Jacobian and Hessian give you first- and second-order information about a function at a point. To build a local approximation that captures even more detail, you need the Taylor series expansion. Head to Taylor Series and Approximations to see how these ideas connect.
