The Jacobian and Hessian Matrices
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor Series and Local Approximations
- Probability Fundamentals
- Random Variables and Distributions
- Bayes' Theorem and Its Role in ML
- Information Theory: Entropy, KL Divergence, Cross-Entropy
The gradient tells you how a scalar function changes with respect to each input. But what happens when the function outputs a vector, not a scalar? And what if you want to know not just the slope, but the curvature? The Jacobian and Hessian answer these two questions.
Prerequisites: You should be comfortable with partial derivatives, gradients, and the chain rule.
The Jacobian matrix
Motivation
Many functions in ML map a vector to another vector. A neural network layer takes a vector of activations and outputs a vector of new activations. A coordinate transformation takes $(r, \theta)$ and outputs $(x, y)$. To differentiate such functions, you need the Jacobian.
Definition
Given a function $f: \mathbb{R}^n \to \mathbb{R}^m$ that maps an $n$-dimensional input to an $m$-dimensional output:

$$f(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \dots, x_n) \\ \vdots \\ f_m(x_1, \dots, x_n) \end{bmatrix}$$

The Jacobian is the $m \times n$ matrix of all first-order partial derivatives:

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Each row is the gradient of one output component. The Jacobian stacks all those gradients together.
Jacobian structure: $m$ outputs and $n$ inputs produce an $m \times n$ matrix:

```mermaid
graph TD
I["n inputs: x1, x2, ..., xn"] --> F["Vector function f: R^n to R^m"]
F --> O["m outputs: f1, f2, ..., fm"]
O --> J["Jacobian J: m x n matrix"]
J --> R1["Row 1 = gradient of f1"]
J --> R2["Row 2 = gradient of f2"]
J --> Rm["Row m = gradient of fm"]
```
Special cases
- If $m = 1$ (scalar output), the Jacobian is a $1 \times n$ row vector, which is just the gradient transposed: $J = \nabla f^\top$.
- If $n = 1$ (scalar input), the Jacobian is an $m \times 1$ column vector of ordinary derivatives.
Example 1: Jacobian of a 2D transformation
Consider the polar-to-Cartesian transformation:

$$x = r\cos\theta, \qquad y = r\sin\theta$$

The Jacobian is a $2 \times 2$ matrix:

$$J = \begin{bmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}$$

Evaluate at $(r, \theta) = (2, \pi/2)$. We know $\cos(\pi/2) = 0$ and $\sin(\pi/2) = 1$:

$$J = \begin{bmatrix} 0 & -2 \\ 1 & 0 \end{bmatrix}$$

The determinant of the Jacobian is:

$$\det J = (\cos\theta)(r\cos\theta) - (-r\sin\theta)(\sin\theta) = r\cos^2\theta + r\sin^2\theta = r$$

The Jacobian determinant equals $r$, which is why we have the $r$ factor in polar integration: $dx\,dy = r\,dr\,d\theta$.
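As a quick numeric check, here is a sketch using `torch.autograd.functional.jacobian`, the same helper used in the Computing in Python section below:

```python
import torch
from torch.autograd.functional import jacobian

def polar_to_cartesian(p):
    """Map (r, theta) to (x, y) = (r cos theta, r sin theta)."""
    r, theta = p[0], p[1]
    return torch.stack([r * torch.cos(theta), r * torch.sin(theta)])

# Evaluate the Jacobian at r = 2, theta = pi/2
p = torch.tensor([2.0, torch.pi / 2])
J = jacobian(polar_to_cartesian, p)
print(J)              # approximately [[0, -2], [1, 0]]
print(torch.det(J))   # approximately 2, i.e. det J = r
```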
Example 2: Jacobian of a neural network layer
Consider a simple layer with two inputs and two outputs:

$$f(\mathbf{x}) = \begin{bmatrix} x_1^2 + x_2 \\ x_1 x_2 + 3x_2^2 \end{bmatrix}$$

Compute each partial derivative:

$$\frac{\partial f_1}{\partial x_1} = 2x_1, \quad \frac{\partial f_1}{\partial x_2} = 1, \quad \frac{\partial f_2}{\partial x_1} = x_2, \quad \frac{\partial f_2}{\partial x_2} = x_1 + 6x_2$$

The Jacobian:

$$J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 + 6x_2 \end{bmatrix}$$

Evaluate at $\mathbf{x} = (1, 2)$:

$$J = \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix}$$

What this tells you: If you perturb the input slightly by $\Delta\mathbf{x}$, the output changes by approximately $\Delta\mathbf{f} \approx J\,\Delta\mathbf{x}$.

For $\Delta\mathbf{x} = (0.1, 0)$ (nudge $x_1$ only):

$$\Delta\mathbf{f} \approx \begin{bmatrix} 2 & 1 \\ 2 & 13 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}$$

Both outputs increase by about 0.2 when we increase $x_1$ by 0.1.
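To check the linearization numerically, here is a sketch using the same two-output function that appears as `f_vec` in the Computing in Python section:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack([x[0]**2 + x[1], x[0]*x[1] + 3*x[1]**2])

x = torch.tensor([1.0, 2.0])
dx = torch.tensor([0.1, 0.0])       # nudge the first input only

J = jacobian(f, x)
linear_estimate = J @ dx            # J times dx
actual_change = f(x + dx) - f(x)    # true change in the outputs
print(linear_estimate, actual_change)
```

The small gap between the two is the higher-order error that the linear approximation ignores.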
The Jacobian in backpropagation
In a neural network, backpropagation propagates gradients backward through layers using the chain rule. For a layer $\mathbf{y} = f(\mathbf{x})$, if you know $\frac{\partial L}{\partial \mathbf{y}}$ (the gradient of the loss with respect to the output), then:

$$\frac{\partial L}{\partial \mathbf{x}} = J^\top \frac{\partial L}{\partial \mathbf{y}}$$
This is a Jacobian-vector product. Deep learning frameworks compute these efficiently without ever forming the full Jacobian matrix, which would be too large for realistic networks.
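A minimal sketch of that product with `torch.autograd.grad`; the vector `v` is a stand-in for the upstream gradient of the loss with respect to the layer output:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.stack([x[0]**2 + x[1], x[0]*x[1] + 3*x[1]**2])

# Stand-in value for dL/dy arriving from the layer above
v = torch.tensor([1.0, 1.0])

# Vector-Jacobian product J^T v, computed without ever forming J
(grad_x,) = torch.autograd.grad(y, x, grad_outputs=v)
print(grad_x)   # tensor([ 4., 14.])
```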
The Hessian matrix
Motivation
The gradient is a first-order approximation: it tells you the slope. But slopes change. A function might have a steep slope but be curving downward (about to flatten out). The Hessian captures this curvature through second derivatives.
Definition
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second-order partial derivatives:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

For “nice” functions (twice continuously differentiable, which covers almost everything in ML), the Hessian is symmetric: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.
What the Hessian tells you
The Hessian describes the curvature of $f$ in every direction. Combined with the gradient, it gives a second-order Taylor approximation:

$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x} + \tfrac{1}{2}\,\Delta\mathbf{x}^\top H\,\Delta\mathbf{x}$$

The Hessian term $\tfrac{1}{2}\,\Delta\mathbf{x}^\top H\,\Delta\mathbf{x}$ is the correction that accounts for curvature.
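A numeric illustration, using the same scalar function that appears as `f_scalar` in the Computing in Python section:

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    return x[0]**3 + 2*x[0]**2*x[1] + x[1]**2

x = torch.tensor([1.0, 1.0])
dx = torch.tensor([0.1, -0.05])

g = jacobian(f, x)   # for a scalar function, this is the gradient
H = hessian(f, x)

first_order = f(x) + g @ dx
second_order = first_order + 0.5 * dx @ H @ dx
actual = f(x + dx)
print(first_order, second_order, actual)
```

The second-order estimate tracks the true value much more closely than the gradient alone.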
Example 3: Computing the Hessian
Let $f(x, y) = x^3 + 2x^2 y + y^2$.

Step 1: First partial derivatives.

$$\frac{\partial f}{\partial x} = 3x^2 + 4xy, \qquad \frac{\partial f}{\partial y} = 2x^2 + 2y$$

Step 2: Second partial derivatives.

$$\frac{\partial^2 f}{\partial x^2} = 6x + 4y, \qquad \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 4x, \qquad \frac{\partial^2 f}{\partial y^2} = 2$$

Step 3: Assemble the Hessian.

$$H = \begin{bmatrix} 6x + 4y & 4x \\ 4x & 2 \end{bmatrix}$$

Evaluate at $(x, y) = (1, 1)$:

$$H(1, 1) = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}$$
Example 4: Classifying a critical point using the Hessian
Continuing from Example 3, let us classify the critical points of $f(x, y) = x^3 + 2x^2 y + y^2$.

Step 1: Find critical points by setting $\nabla f = \mathbf{0}$.

$$3x^2 + 4xy = x(3x + 4y) = 0, \qquad 2x^2 + 2y = 0 \;\Rightarrow\; y = -x^2$$

Case 1: $x = 0$. Then $y = -0^2 = 0$. Critical point: $(0, 0)$.

Case 2: $3x + 4y = 0$, so $y = -\tfrac{3}{4}x$. Substituting into $y = -x^2$:

$$-\tfrac{3}{4}x = -x^2 \;\Rightarrow\; x^2 = \tfrac{3}{4}x \;\Rightarrow\; x = \tfrac{3}{4} \;\;(x \neq 0)$$

Then $y = -x^2 = -\tfrac{9}{16}$. Critical point: $\left(\tfrac{3}{4}, -\tfrac{9}{16}\right)$.
Step 2: Evaluate the Hessian at each critical point.
At $(0, 0)$:

$$H(0, 0) = \begin{bmatrix} 0 & 0 \\ 0 & 2 \end{bmatrix}$$

Eigenvalues: $\lambda_1 = 0$, $\lambda_2 = 2$. One eigenvalue is zero, so the second derivative test is inconclusive. This is a degenerate critical point.
At $\left(\tfrac{3}{4}, -\tfrac{9}{16}\right)$:

$$H = \begin{bmatrix} 6 \cdot \tfrac{3}{4} + 4 \cdot \left(-\tfrac{9}{16}\right) & 4 \cdot \tfrac{3}{4} \\ 4 \cdot \tfrac{3}{4} & 2 \end{bmatrix} = \begin{bmatrix} \tfrac{9}{4} & 3 \\ 3 & 2 \end{bmatrix}$$

Eigenvalues: Solve $\det(H - \lambda I) = 0$:

$$\left(\tfrac{9}{4} - \lambda\right)(2 - \lambda) - 9 = 0 \;\Rightarrow\; \lambda^2 - \tfrac{17}{4}\lambda - \tfrac{9}{2} = 0$$

Using the quadratic formula:

$$\lambda = \frac{\tfrac{17}{4} \pm \sqrt{\left(\tfrac{17}{4}\right)^2 + 18}}{2} = \frac{17 \pm \sqrt{577}}{8} \approx 5.13 \;\text{or}\; -0.88$$

One positive, one negative. This means $\left(\tfrac{3}{4}, -\tfrac{9}{16}\right)$ is a saddle point. The function curves upward in one direction and downward in another.
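These eigenvalues can be confirmed numerically; `torch.linalg.eigvalsh` returns the eigenvalues of a symmetric matrix in ascending order:

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    return x[0]**3 + 2*x[0]**2*x[1] + x[1]**2

# The non-degenerate critical point found above
critical = torch.tensor([3/4, -9/16])
H = hessian(f, critical)
eig = torch.linalg.eigvalsh(H)
print(eig)   # approximately [-0.88, 5.13]
```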
Positive definiteness and curvature
The eigenvalues of the Hessian tell you about the curvature at a critical point:
| Eigenvalue pattern | Hessian classification | Critical point type |
|---|---|---|
| All $\lambda_i > 0$ | Positive definite | Local minimum |
| All $\lambda_i < 0$ | Negative definite | Local maximum |
| Mixed signs | Indefinite | Saddle point |
| Some $\lambda_i = 0$ | Semi-definite | Inconclusive |
In optimization, we want to reach a point where $\nabla f = \mathbf{0}$ and the Hessian is positive definite. That guarantees a local minimum. For convex functions, the Hessian is positive semi-definite everywhere, which guarantees that any local minimum is also a global minimum.
Classifying critical points using Hessian eigenvalues:
```mermaid
graph TD
A["Critical point: gradient = 0"] --> B["Compute Hessian eigenvalues"]
B --> C{"All positive?"}
C -->|Yes| D["Local minimum"]
C -->|No| E{"All negative?"}
E -->|Yes| F["Local maximum"]
E -->|No| G{"Mixed signs?"}
G -->|Yes| H["Saddle point"]
G -->|"Some zero"| I["Inconclusive"]
```
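The decision procedure in the flowchart can be sketched as a small helper (a hypothetical `classify_critical_point`; it assumes the gradient is already zero at the point):

```python
import torch

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eig = torch.linalg.eigvalsh(H)
    if (eig > tol).all():
        return "local minimum"
    if (eig < -tol).all():
        return "local maximum"
    if (eig > tol).any() and (eig < -tol).any():
        return "saddle point"
    return "inconclusive"   # some eigenvalue is (numerically) zero

print(classify_critical_point(torch.tensor([[2.0, 0.0], [0.0, 8.0]])))   # local minimum
print(classify_critical_point(torch.tensor([[2.25, 3.0], [3.0, 2.0]])))  # saddle point
print(classify_critical_point(torch.tensor([[0.0, 0.0], [0.0, 2.0]])))   # inconclusive
```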
Figure: function shapes and their Hessian classification.
Example 5: A clean positive definite case
Let $f(x, y) = x^2 + 4y^2$.

Gradient:

$$\nabla f = \begin{bmatrix} 2x \\ 8y \end{bmatrix}$$

Set $\nabla f = \mathbf{0}$: $2x = 0$ and $8y = 0$. Critical point: $(0, 0)$.

Hessian:

$$H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}$$

The Hessian is constant (does not depend on $x$ or $y$). The eigenvalues are simply the diagonal entries: $\lambda_1 = 2$, $\lambda_2 = 8$.

Both positive, so $H$ is positive definite. The critical point $(0, 0)$ is a local minimum (and since $f$ is convex, it is the global minimum).
Geometric interpretation: The curvature is 2 in the $x$-direction and 8 in the $y$-direction. The surface bends more sharply in $y$. This means gradient descent converges faster along $y$ than along $x$, and raising the learning rate to speed up $x$ causes oscillation in $y$. The condition number $\kappa = \lambda_{\max} / \lambda_{\min} = 8/2 = 4$ quantifies this imbalance. Higher condition numbers mean slower convergence for gradient descent.
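A tiny experiment showing the effect: plain gradient descent on a quadratic with curvatures 2 and 8. The step size here is an assumption, chosen just under the stability limit of the steep direction:

```python
import torch

curvatures = torch.tensor([2.0, 8.0])   # Hessian diagonal of f(x, y) = x^2 + 4y^2
p = torch.tensor([1.0, 1.0])
lr = 0.24                               # just under 2/8 = 0.25, the stable limit for y

for _ in range(20):
    p = p - lr * curvatures * p         # gradient of the quadratic is (2x, 8y)

print(p)   # x has essentially converged; y is still oscillating toward zero
```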
The Hessian in optimization
Newton’s method
While gradient descent uses only first-order information, Newton's method uses both the gradient and the Hessian:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)$$

By accounting for curvature, Newton's method can converge in far fewer steps. The trade-off: computing and inverting the Hessian is expensive ($O(n^2)$ storage, $O(n^3)$ inversion). For a neural network with millions of parameters, this is impractical.
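On a convex quadratic like Example 5, a single Newton step lands exactly on the minimum. A sketch (solving the linear system $H\,\text{step} = \nabla f$ rather than explicitly inverting $H$):

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    return x[0]**2 + 4*x[1]**2

x = torch.tensor([3.0, -2.0], requires_grad=True)
(g,) = torch.autograd.grad(f(x), x)
H = hessian(f, x.detach())

step = torch.linalg.solve(H, g)   # solve H * step = g instead of forming H^-1
x_new = x.detach() - step
print(x_new)   # tensor([0., 0.]) -- the minimum, reached in one step
```

For a quadratic, the second-order Taylor model is exact, which is why one step suffices.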
Newton’s method iteration loop:
```mermaid
graph LR
A["Current point x_t"] --> B["Compute gradient"]
B --> C["Compute Hessian H"]
C --> D["Solve for step: H^-1 * gradient"]
D --> E["Update: x_t+1 = x_t - step"]
E --> F{"Converged?"}
F -->|No| A
F -->|Yes| G["Minimum found"]
```
Practical approximations
Because the full Hessian is too expensive for large models, ML practitioners use approximations:
- BFGS / L-BFGS: Builds an approximate Hessian inverse incrementally
- Adam optimizer: Adapts learning rates per parameter using moving averages of first and second moments (a diagonal Hessian approximation)
- Hessian-free methods: Compute Hessian-vector products without forming the full matrix
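The last item can be sketched in a few lines: a Hessian-vector product via double backward, never materializing $H$:

```python
import torch

def f(x):
    return x[0]**3 + 2*x[0]**2*x[1] + x[1]**2

x = torch.tensor([1.0, 1.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])

# First backward builds the gradient graph; second backward contracts it with v
(g,) = torch.autograd.grad(f(x), x, create_graph=True)
(hvp,) = torch.autograd.grad(g, x, grad_outputs=v)
print(hvp)   # tensor([10.,  4.]) -- the first column of H = [[10, 4], [4, 2]]
```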
Jacobian vs Hessian: summary
| Property | Jacobian | Hessian |
|---|---|---|
| Input function | Vector-valued | Scalar-valued |
| Matrix size | $m \times n$ | $n \times n$ |
| Contains | First partial derivatives | Second partial derivatives |
| Tells you | How outputs change with inputs | Curvature of the function |
| Role in ML | Backpropagation, change of variables | Optimization, critical point analysis |
| Symmetric? | Not in general | Yes (for smooth functions) |
Computing in Python
```python
import torch
from torch.autograd.functional import jacobian, hessian

# Jacobian example
def f_vec(x):
    return torch.stack([x[0]**2 + x[1], x[0]*x[1] + 3*x[1]**2])

x = torch.tensor([1.0, 2.0])
J = jacobian(f_vec, x)
print("Jacobian:\n", J)
# [[2, 1], [2, 13]]

# Hessian example
def f_scalar(x):
    return x[0]**3 + 2*x[0]**2*x[1] + x[1]**2

x = torch.tensor([1.0, 1.0])
H = hessian(f_scalar, x)
print("Hessian:\n", H)
# [[10, 4], [4, 2]]
```
What comes next
The Jacobian and Hessian give you first- and second-order information about a function at a point. To build a local approximation of a function that captures even more detail, you need the Taylor series expansion. Head to Taylor Series and Approximations to see how these ideas connect.