Optimality conditions: second order
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
First-order conditions tell you where the critical points are. Second-order conditions tell you what kind of critical point you are looking at. The tool is the Hessian matrix, and the key property is positive definiteness.
The intuition: what kind of critical point is this?
First-order conditions find flat spots where the gradient is zero. But a flat spot could be a valley floor, a hilltop, or a mountain pass. Second-order conditions tell you which one.
Consider three simple functions and their critical points at $x = 0$:

| Function | $f'(0)$ | $f''(0)$ | Shape at origin | Classification |
|---|---|---|---|---|
| $f(x) = x^2$ | 0 | 2 (positive) | Curves up like a bowl | Minimum |
| $f(x) = -x^2$ | 0 | -2 (negative) | Curves down like a hill | Maximum |
| $f(x) = x^3$ | 0 | 0 | Flat inflection | Neither (saddle in higher dims) |
Three types of critical points
```mermaid
graph LR
  A["Bowl (minimum): curves up in all directions"] --- B["Hill (maximum): curves down in all directions"]
  B --- C["Saddle point: curves up in some directions, down in others"]
```
The Hessian matrix captures curvature in every direction simultaneously.
Three types of critical points at x = 0: minimum (x^2 curves up), maximum (-x^2 curves down), and inflection (x^3 is flat)
Positive curvature everywhere means you are at a valley floor. Negative curvature everywhere means a hilltop. Mixed curvature means a saddle, like the middle of a horse saddle that curves up front-to-back and down side-to-side.
Now let’s make this precise.
Prerequisites
You should be comfortable with:
- First-order optimality conditions
- Hessians and Jacobians, including how to compute second-order partial derivatives
The Hessian matrix
For a twice-differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second partial derivatives:

$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

If $f$ has continuous second derivatives (which is almost always the case in ML), the Hessian is symmetric: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.
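As a sanity check, a Hessian can be approximated numerically with central differences and its symmetry verified. A minimal sketch, assuming a simple quadratic test function and an illustrative step size `h` (neither is from the text):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-5):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            # Standard central-difference formula for d^2 f / dx_i dx_j
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

# Illustrative test function with known Hessian diag(4, 6)
f = lambda x: 2*x[0]**2 + 3*x[1]**2
H = numerical_hessian(f, np.array([1.0, -1.0]))
print(H)                      # approximately [[4, 0], [0, 6]]
print(np.allclose(H, H.T))    # True: the Hessian is symmetric
```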
Positive definiteness
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is:
- Positive definite ($A \succ 0$) if $x^\top A x > 0$ for all $x \neq 0$
- Positive semidefinite ($A \succeq 0$) if $x^\top A x \geq 0$ for all $x$
- Negative definite ($A \prec 0$) if $x^\top A x < 0$ for all $x \neq 0$
- Indefinite if $x^\top A x$ takes both positive and negative values
An equivalent check uses eigenvalues:
- Positive definite: all eigenvalues are positive
- Positive semidefinite: all eigenvalues are non-negative
- Negative definite: all eigenvalues are negative
- Indefinite: eigenvalues have mixed signs
For $2 \times 2$ matrices, there is a shortcut. A symmetric matrix $A = \begin{bmatrix} a & b \\ b & c \end{bmatrix}$ is:
- Positive definite if $a > 0$ and $ac - b^2 > 0$
- Negative definite if $a < 0$ and $ac - b^2 > 0$
- Indefinite if $ac - b^2 < 0$

The quantity $ac - b^2$ is the determinant of the matrix.
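The determinant shortcut is easy to code up and cross-check against the eigenvalue test. A small sketch (the matrix values are illustrative):

```python
import numpy as np

def classify_2x2(a, b, c):
    """Classify the symmetric matrix [[a, b], [b, c]] via the determinant shortcut."""
    det = a * c - b**2
    if det < 0:
        return "indefinite"
    if det > 0 and a > 0:
        return "positive definite"
    if det > 0 and a < 0:
        return "negative definite"
    return "semidefinite (inconclusive)"

print(classify_2x2(4, 0, 6))   # positive definite
print(classify_2x2(2, -4, 2))  # indefinite (det = 4 - 16 = -12)

# Cross-check with eigenvalues: mixed signs confirm "indefinite"
print(np.linalg.eigvalsh(np.array([[2, -4], [-4, 2]])))  # [-2.  6.]
```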
Second-order optimality conditions
Second-order necessary condition. If $x^*$ is a local minimum of $f$, then:
- $\nabla f(x^*) = 0$ (first-order condition)
- $\nabla^2 f(x^*) \succeq 0$ (Hessian is positive semidefinite)

Second-order sufficient condition. If at a point $x^*$:
- $\nabla f(x^*) = 0$ (first-order condition)
- $\nabla^2 f(x^*) \succ 0$ (Hessian is positive definite)

then $x^*$ is a strict local minimum.
The gap between “necessary” and “sufficient” is the semidefinite case. When the Hessian is positive semidefinite but not positive definite (some eigenvalue is exactly zero), the second-order test is inconclusive. You need higher-order information.
Classification summary
At a critical point $x^*$ where $\nabla f(x^*) = 0$:
| Hessian at $x^*$ | Conclusion |
|---|---|
| Positive definite | Strict local minimum ✓ |
| Negative definite | Strict local maximum |
| Indefinite | Saddle point |
| Semidefinite (singular) | Inconclusive, need more analysis |
Classifying a critical point using the Hessian eigenvalues
```mermaid
graph TD
  A["Find critical point: gradient = 0"] --> B["Compute Hessian at that point"]
  B --> C["Find eigenvalues of Hessian"]
  C --> D["All positive?"]
  D -- Yes --> E["Local minimum"]
  D -- No --> F["All negative?"]
  F -- Yes --> G["Local maximum"]
  F -- No --> H["Mixed signs?"]
  H -- Yes --> I["Saddle point"]
  H -- No --> J["Some zero eigenvalues: inconclusive"]
```
Eigenvalue sign to classification
```mermaid
graph LR
  A["All eigenvalues > 0"] --> B["Minimum: bowl shape"]
  C["All eigenvalues < 0"] --> D["Maximum: hilltop"]
  E["Mixed positive and negative"] --> F["Saddle point: horse saddle"]
```
Example 1: Classify the critical point of a quadratic
Consider:

$$f(x, y) = 2x^2 + 3y^2 - 4x + 6y + 10$$

Step 1. Find the gradient and set it to zero:

$$\nabla f = \begin{bmatrix} 4x - 4 \\ 6y + 6 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Critical point: $(x^*, y^*) = (1, -1)$.

Step 2. Compute the Hessian:

$$\nabla^2 f = \begin{bmatrix} 4 & 0 \\ 0 & 6 \end{bmatrix}$$

Step 3. Check positive definiteness using the shortcut: $a = 4 > 0$ and $ac - b^2 = 4 \cdot 6 - 0^2 = 24 > 0$. The Hessian is positive definite.

Step 4. Confirm with eigenvalues. Since $\nabla^2 f$ is diagonal, the eigenvalues are $\lambda_1 = 4$ and $\lambda_2 = 6$, both positive.

Conclusion: $(1, -1)$ is a strict local minimum with $f(1, -1) = 5$. ✓
Since this is a quadratic with positive definite Hessian, it is also convex, so this is the global minimum.
```python
import numpy as np

H = np.array([[4, 0], [0, 6]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {eigenvalues}")                 # [4. 6.]
print(f"Positive definite: {all(eigenvalues > 0)}")  # True

f = lambda x, y: 2*x**2 + 3*y**2 - 4*x + 6*y + 10
print(f"f(1, -1) = {f(1, -1)}")                      # 5
```
Example 2: A saddle point
Consider:

$$f(x, y) = x^2 - 4xy + y^2$$

Step 1. Gradient:

$$\nabla f = \begin{bmatrix} 2x - 4y \\ -4x + 2y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

From the first equation: $x = 2y$. Substituting into the second: $-4(2y) + 2y = -6y = 0$, so $y = 0$ and $x = 0$.

Critical point: $(0, 0)$ with $f(0, 0) = 0$.

Step 2. Hessian:

$$\nabla^2 f = \begin{bmatrix} 2 & -4 \\ -4 & 2 \end{bmatrix}$$

Step 3. Check definiteness: $ac - b^2 = 2 \cdot 2 - (-4)^2 = 4 - 16 = -12 < 0$. The determinant is negative, so the Hessian is indefinite.

Step 4. Verify with eigenvalues: $\lambda_1 = -2$ and $\lambda_2 = 6$. One positive, one negative: indefinite. ✓

Conclusion: $(0, 0)$ is a saddle point. The function curves upward in the direction of the eigenvector for $\lambda_2 = 6$ and downward in the direction of the eigenvector for $\lambda_1 = -2$.

Verification with specific directions:

Along $(1, 1)$, the eigenvector for $\lambda_1 = -2$: $f(t, t) = t^2 - 4t^2 + t^2 = -2t^2$. This goes down, so the function decreases.

Along $(1, -1)$, the eigenvector for $\lambda_2 = 6$: $f(t, -t) = t^2 + 4t^2 + t^2 = 6t^2$. This goes up, so the function increases.
```python
import numpy as np

H = np.array([[2, -4], [-4, 2]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {eigenvalues}")                                        # [-2. 6.]
print(f"Indefinite (mixed signs): {eigenvalues[0] * eigenvalues[-1] < 0}")  # True

f = lambda x, y: x**2 - 4*x*y + y**2
print(f"f(0.1, 0.1) = {f(0.1, 0.1)}")    # -0.02 (below 0)
print(f"f(0.1, -0.1) = {f(0.1, -0.1)}")  # 0.06 (above 0)
```
Contour plot of the saddle surface f(x,y) = x^2 - y^2. The hyperbolic contours show curvature is positive along x and negative along y.
Example 3: A function with two critical points of different types
Consider:

$$f(x, y) = x^3 + y^3 - 3x - 3y$$

Step 1. Gradient:

$$\nabla f = \begin{bmatrix} 3x^2 - 3 \\ 3y^2 - 3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies x = \pm 1, \; y = \pm 1$$

This gives four critical points: $(1, 1)$, $(-1, -1)$, $(1, -1)$, $(-1, 1)$.

Step 2. Hessian:

$$\nabla^2 f = \begin{bmatrix} 6x & 0 \\ 0 & 6y \end{bmatrix}$$

Step 3. Evaluate the Hessian at each critical point:

At $(1, 1)$: $\nabla^2 f = \mathrm{diag}(6, 6)$. Eigenvalues: $6, 6$. Both positive. Strict local minimum. ✓

At $(-1, -1)$: $\nabla^2 f = \mathrm{diag}(-6, -6)$. Eigenvalues: $-6, -6$. Both negative. Strict local maximum.

At $(1, -1)$: $\nabla^2 f = \mathrm{diag}(6, -6)$. Eigenvalues: $6, -6$. Mixed signs. Saddle point.

At $(-1, 1)$: $\nabla^2 f = \mathrm{diag}(-6, 6)$. Eigenvalues: $-6, 6$. Mixed signs. Saddle point.

Summary of results:

| Critical point | $f$ value | Eigenvalues | Type |
|---|---|---|---|
| $(1, 1)$ | $-4$ | $6, 6$ | Local minimum ✓ |
| $(-1, -1)$ | $4$ | $-6, -6$ | Local maximum |
| $(1, -1)$ | $0$ | $6, -6$ | Saddle point |
| $(-1, 1)$ | $0$ | $-6, 6$ | Saddle point |
```python
import numpy as np

f = lambda x, y: x**3 + y**3 - 3*x - 3*y
critical_points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]

for (x, y) in critical_points:
    H = np.array([[6*x, 0], [0, 6*y]])  # Hessian is diagonal for this f
    eigs = np.linalg.eigvalsh(H)
    val = f(x, y)
    if all(eigs > 0):
        kind = "local min"
    elif all(eigs < 0):
        kind = "local max"
    else:
        kind = "saddle"
    print(f"({x:+d},{y:+d}): f={val:+.0f}, eigs={eigs}, {kind}")
```
The inconclusive case
When the Hessian is positive semidefinite but not positive definite (at least one zero eigenvalue), the second-order test tells you nothing. You need to look at higher-order derivatives or use other methods.
Classic example: $f(x) = x^4$. At $x = 0$:
- $f'(0) = 0$ ✓ (critical point)
- $f''(0) = 0$ (inconclusive)

But $x = 0$ is clearly a minimum because $x^4 > 0$ for all $x \neq 0$. The second-order test just cannot confirm it.

Another example: $f(x) = x^3$. At $x = 0$:
- $f'(0) = 0$ ✓ (critical point)
- $f''(0) = 0$ (inconclusive)

Here $x = 0$ is neither a minimum nor a maximum; it is an inflection point.
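When the second derivative vanishes, one practical fallback is to inspect the function's values on both sides of the critical point. A quick numeric check (the sample points are illustrative):

```python
# Second-order test is inconclusive for both; the function values settle it.
f4 = lambda x: x**4  # minimum at 0
f3 = lambda x: x**3  # inflection at 0

for t in [-0.1, 0.1]:
    print(f"x^4 at {t:+.1f}: {f4(t):+.6f}")  # positive on both sides -> minimum
    print(f"x^3 at {t:+.1f}: {f3(t):+.6f}")  # changes sign -> neither min nor max
```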
Connection to curvature and ML
Why saddle points matter more than local minima
In high-dimensional optimization (like training neural networks), saddle points are far more common than local minima. A critical point is a local minimum only if the Hessian has all positive eigenvalues. In a space with millions of dimensions, the probability that all eigenvalues happen to be positive is very small. Most critical points in neural network loss surfaces are saddle points.
This is actually good news. Gradient descent and its variants (SGD, Adam) can escape saddle points because the noise in stochastic gradients pushes the parameters along directions of negative curvature. Pure gradient descent (without noise) can get stuck at saddle points because the gradient is zero there.
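A tiny experiment illustrates the point: on the saddle $f(x, y) = x^2 - y^2$, gradient descent started exactly at the origin never moves, while adding small gradient noise lets the iterate slide away along the negative-curvature direction. This is a minimal sketch; the function, step size, noise scale, and iteration count are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda p: np.array([2 * p[0], -2 * p[1]])  # gradient of x^2 - y^2

def descend(noise_scale, steps=100, lr=0.1):
    p = np.zeros(2)  # start exactly at the saddle point (0, 0)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise_scale * rng.standard_normal(2))
    return p

stuck = descend(noise_scale=0.0)
escaped = descend(noise_scale=0.01)
print(stuck)    # [0. 0.]: the gradient is exactly zero, so nothing moves
print(escaped)  # |y| has grown: noise found the negative-curvature direction
```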
Condition number
The ratio of the largest to smallest eigenvalue of the Hessian, $\kappa = \lambda_{\max} / \lambda_{\min}$, is the condition number. A large condition number means the function is steep in some directions and flat in others. This makes optimization harder because a single step size cannot work well in all directions. Newton’s method handles this by using the Hessian to scale the step, which is why it converges faster than gradient descent on ill-conditioned problems.
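To see conditioning in action, compare how many gradient descent iterations a well-conditioned versus an ill-conditioned quadratic takes to converge. A sketch under illustrative assumptions (the matrices, step sizes, and tolerance are my choices):

```python
import numpy as np

def gd_iterations(H, lr, tol=1e-6, max_iter=100000):
    """Iterations of gradient descent on f(x) = 0.5 x^T H x until the gradient is small."""
    x = np.array([1.0, 1.0])
    for k in range(max_iter):
        g = H @ x  # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return k
        x = x - lr * g
    return max_iter

well = np.diag([1.0, 2.0])    # condition number 2
ill = np.diag([1.0, 100.0])   # condition number 100

# Step size must respect the largest eigenvalue: lr < 2 / lambda_max
n_well = gd_iterations(well, lr=0.5)
n_ill = gd_iterations(ill, lr=0.015)
print(n_well, n_ill)  # the ill-conditioned problem needs far more iterations
```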
Summary
| Condition | Meaning |
|---|---|
| $\nabla^2 f \succ 0$ (positive definite) | Local minimum ✓ |
| $\nabla^2 f \prec 0$ (negative definite) | Local maximum |
| $\nabla^2 f$ indefinite | Saddle point |
| $\nabla^2 f$ semidefinite (singular) | Inconclusive |
| High condition number | Hard for gradient descent |
The Hessian test gives you a definitive answer at most critical points. When it is inconclusive, you need more sophisticated tools. For practical ML, the main takeaway is that saddle points dominate high-dimensional landscapes, and stochastic methods handle them naturally.
What comes next
Now that you can find and classify critical points, the next question is: how do you get there? Line search methods are the building block of iterative optimization algorithms, controlling how far you step in a chosen direction.