
Optimality conditions: second order

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

First-order conditions tell you where the critical points are. Second-order conditions tell you what kind of critical point you are looking at. The tool is the Hessian matrix, and the key property is positive definiteness.

The intuition: what kind of critical point is this?

First-order conditions find flat spots where the gradient is zero. But a flat spot could be a valley floor, a hilltop, or a mountain pass. Second-order conditions tell you which one.

Consider three simple functions and their critical points at $x = 0$:

| Function | $f(0)$ | $f''(0)$ | Shape at origin | Classification |
|---|---|---|---|---|
| $f(x) = x^2$ | 0 | 2 (positive) | Curves up like a bowl | Minimum |
| $f(x) = -x^2$ | 0 | $-2$ (negative) | Curves down like a hill | Maximum |
| $f(x) = x^3$ | 0 | 0 | Flat inflection | Neither (saddle in higher dims) |

Three types of critical points

graph LR
  A["Bowl (minimum): curves up in all directions"] --- B["Hill (maximum): curves down in all directions"]
  B --- C["Saddle point: curves up in some directions, down in others"]

The Hessian matrix captures curvature in every direction simultaneously.

Three types of critical points at x = 0: minimum (x^2 curves up), maximum (-x^2 curves down), and inflection (x^3 is flat)

Positive curvature everywhere means you are at a valley floor. Negative curvature everywhere means a hilltop. Mixed curvature means a saddle, like the middle of a horse saddle that curves up front-to-back and down side-to-side.

Now let’s make this precise.

Prerequisites

You should be comfortable with:

The Hessian matrix

For a twice-differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second partial derivatives:

$$\nabla^2 f(\mathbf{x}) = H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

If $f$ has continuous second derivatives (which is almost always the case in ML), the Hessian is symmetric: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.
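
A quick way to see this symmetry in practice is to approximate the Hessian numerically. The sketch below is illustrative (the helper name `numerical_hessian` and the test point are my choices, not from a particular library); it builds the matrix entry by entry with central differences:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

# f(x, y) = x^2 - 4xy + y^2 has constant Hessian [[2, -4], [-4, 2]]
f = lambda x: x[0]**2 - 4*x[0]*x[1] + x[1]**2
H = numerical_hessian(f, np.array([1.0, 2.0]))
print(H)                    # close to [[2, -4], [-4, 2]]
print(np.allclose(H, H.T))  # True: the mixed partials agree
```

The off-diagonal entries come out (numerically) equal, exactly as the symmetry result promises.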

Positive definiteness

A symmetric matrix $H$ is:

  • Positive definite ($H \succ 0$) if $\mathbf{v}^T H \mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$
  • Positive semidefinite ($H \succeq 0$) if $\mathbf{v}^T H \mathbf{v} \geq 0$ for all $\mathbf{v}$
  • Negative definite ($H \prec 0$) if $\mathbf{v}^T H \mathbf{v} < 0$ for all $\mathbf{v} \neq \mathbf{0}$
  • Indefinite if $\mathbf{v}^T H \mathbf{v}$ takes both positive and negative values

An equivalent check uses eigenvalues:

  • Positive definite: all eigenvalues are positive
  • Positive semidefinite: all eigenvalues are non-negative
  • Negative definite: all eigenvalues are negative
  • Indefinite: eigenvalues have mixed signs

For $2 \times 2$ matrices, there is a shortcut. A symmetric matrix $\begin{bmatrix} a & b \\ b & c \end{bmatrix}$ is:

  • Positive definite if $a > 0$ and $ac - b^2 > 0$
  • Negative definite if $a < 0$ and $ac - b^2 > 0$
  • Indefinite if $ac - b^2 < 0$

The quantity $ac - b^2$ is the determinant of the matrix.
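
The shortcut is easy to mechanize. A minimal sketch (the function name `classify_2x2` is an illustrative choice) that applies it and cross-checks one case against the eigenvalue test:

```python
import numpy as np

def classify_2x2(a, b, c):
    """Apply the determinant shortcut to the symmetric matrix [[a, b], [b, c]]."""
    det = a * c - b**2
    if det > 0:
        return "positive definite" if a > 0 else "negative definite"
    if det < 0:
        return "indefinite"
    return "semidefinite (shortcut inconclusive)"

print(classify_2x2(4, 0, 6))   # positive definite
print(classify_2x2(2, -4, 2))  # indefinite

# Cross-check the second case against the eigenvalue criterion:
print(np.linalg.eigvalsh(np.array([[2.0, -4.0], [-4.0, 2.0]])))  # one negative, one positive
```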

Second-order optimality conditions

Second-order necessary condition. If $\mathbf{x}^*$ is a local minimum of $f$, then:

  1. $\nabla f(\mathbf{x}^*) = \mathbf{0}$ (first-order condition)
  2. $\nabla^2 f(\mathbf{x}^*) \succeq 0$ (Hessian is positive semidefinite)

Second-order sufficient condition. If at a point $\mathbf{x}^*$:

  1. $\nabla f(\mathbf{x}^*) = \mathbf{0}$
  2. $\nabla^2 f(\mathbf{x}^*) \succ 0$ (Hessian is positive definite)

then $\mathbf{x}^*$ is a strict local minimum.

The gap between “necessary” and “sufficient” is the semidefinite case. When the Hessian is positive semidefinite but not positive definite (some eigenvalue is exactly zero), the second-order test is inconclusive. You need higher-order information.

Classification summary

At a critical point $\mathbf{x}^*$ where $\nabla f = \mathbf{0}$:

| Hessian at $\mathbf{x}^*$ | Conclusion |
|---|---|
| Positive definite | Strict local minimum ✓ |
| Negative definite | Strict local maximum |
| Indefinite | Saddle point |
| Semidefinite (singular) | Inconclusive, need more analysis |
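
This decision table translates directly into code. A small sketch (the function name and the zero-eigenvalue tolerance `tol` are illustrative choices):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalue signs of its symmetric Hessian."""
    eigs = np.linalg.eigvalsh(np.asarray(H, dtype=float))
    if np.any(np.abs(eigs) < tol):
        return "inconclusive (zero eigenvalue)"
    if np.all(eigs > 0):
        return "strict local minimum"
    if np.all(eigs < 0):
        return "strict local maximum"
    return "saddle point"

print(classify_critical_point([[4, 0], [0, 6]]))    # strict local minimum
print(classify_critical_point([[2, -4], [-4, 2]]))  # saddle point
print(classify_critical_point([[1, 0], [0, 0]]))    # inconclusive (zero eigenvalue)
```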

Classifying a critical point using the Hessian eigenvalues

graph TD
  A["Find critical point: gradient = 0"] --> B["Compute Hessian at that point"]
  B --> C["Find eigenvalues of Hessian"]
  C --> D["All positive?"]
  D -- Yes --> E["Local minimum"]
  D -- No --> F["All negative?"]
  F -- Yes --> G["Local maximum"]
  F -- No --> H["Mixed signs?"]
  H -- Yes --> I["Saddle point"]
  H -- No --> J["Some zero eigenvalues: inconclusive"]

Eigenvalue sign to classification

graph LR
  A["All eigenvalues > 0"] --> B["Minimum: bowl shape"]
  C["All eigenvalues < 0"] --> D["Maximum: hilltop"]
  E["Mixed positive and negative"] --> F["Saddle point: horse saddle"]

Example 1: Classify the critical point of a quadratic

Consider:

$$f(x, y) = 2x^2 + 3y^2 - 4x + 6y + 10$$

Step 1. Find the gradient and set it to zero:

$$\frac{\partial f}{\partial x} = 4x - 4 = 0 \quad \Rightarrow \quad x = 1$$

$$\frac{\partial f}{\partial y} = 6y + 6 = 0 \quad \Rightarrow \quad y = -1$$

Critical point: $(1, -1)$.

Step 2. Compute the Hessian:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 6 \end{bmatrix}$$

Step 3. Check positive definiteness using the $2 \times 2$ shortcut:

$$a = 4 > 0 \quad \checkmark$$

$$\det(H) = 4 \cdot 6 - 0^2 = 24 > 0 \quad \checkmark$$

The Hessian is positive definite.

Step 4. Confirm with eigenvalues. Since $H$ is diagonal, the eigenvalues are $4$ and $6$, both positive.

Conclusion: $(1, -1)$ is a strict local minimum with $f(1, -1) = 2 + 3 - 4 - 6 + 10 = 5$. ✓

Since this is a quadratic with positive definite Hessian, it is also convex, so this is the global minimum.

import numpy as np

H = np.array([[4, 0], [0, 6]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {eigenvalues}")   # [4. 6.]
print(f"Positive definite: {all(eigenvalues > 0)}")  # True

f = lambda x, y: 2*x**2 + 3*y**2 - 4*x + 6*y + 10
print(f"f(1, -1) = {f(1, -1)}")  # 5

Example 2: A saddle point

Consider:

$$f(x, y) = x^2 - 4xy + y^2$$

Step 1. Gradient:

$$\frac{\partial f}{\partial x} = 2x - 4y = 0$$

$$\frac{\partial f}{\partial y} = -4x + 2y = 0$$

From the first equation: $x = 2y$. Substituting into the second: $-8y + 2y = -6y = 0$, so $y = 0$ and $x = 0$.

Critical point: $(0, 0)$ with $f(0, 0) = 0$.

Step 2. Hessian:

$$H = \begin{bmatrix} 2 & -4 \\ -4 & 2 \end{bmatrix}$$

Step 3. Check definiteness:

$$a = 2 > 0$$

$$\det(H) = 2 \cdot 2 - (-4)^2 = 4 - 16 = -12 < 0$$

The determinant is negative, so the Hessian is indefinite.

Step 4. Verify with eigenvalues:

$$\det(H - \lambda I) = (2 - \lambda)^2 - 16 = 0$$

$$2 - \lambda = \pm 4$$

$$\lambda_1 = 6, \quad \lambda_2 = -2$$

One positive, one negative: indefinite. ✓

Conclusion: $(0, 0)$ is a saddle point. The function curves upward in the direction of the eigenvector for $\lambda = 6$ and downward in the direction of the eigenvector for $\lambda = -2$.

Verification with specific directions:

Along $\mathbf{d} = (1, 1)$: $f(t, t) = t^2 - 4t^2 + t^2 = -2t^2$. This goes down, so the function decreases.

Along $\mathbf{d} = (1, -1)$: $f(t, -t) = t^2 + 4t^2 + t^2 = 6t^2$. This goes up, so the function increases.

import numpy as np

H = np.array([[2, -4], [-4, 2]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {eigenvalues}")  # [-2.  6.]
print(f"Indefinite (mixed signs): {eigenvalues[0] * eigenvalues[-1] < 0}")  # True

f = lambda x, y: x**2 - 4*x*y + y**2
print(f"f(0.1, 0.1) = {f(0.1, 0.1)}")    # -0.02 (below 0)
print(f"f(0.1, -0.1) = {f(0.1, -0.1)}")   # 0.06 (above 0)

Contour plot of the saddle surface $f(x, y) = x^2 - 4xy + y^2$. The hyperbolic contours show curvature is positive along the $(1, -1)$ direction and negative along the $(1, 1)$ direction.

Example 3: A function with two critical points of different types

Consider:

$$f(x, y) = x^3 + y^3 - 3x - 3y$$

Step 1. Gradient:

$$\frac{\partial f}{\partial x} = 3x^2 - 3 = 0 \quad \Rightarrow \quad x = \pm 1$$

$$\frac{\partial f}{\partial y} = 3y^2 - 3 = 0 \quad \Rightarrow \quad y = \pm 1$$

This gives four critical points: $(1, 1)$, $(1, -1)$, $(-1, 1)$, $(-1, -1)$.

Step 2. Hessian:

$$H(x, y) = \begin{bmatrix} 6x & 0 \\ 0 & 6y \end{bmatrix}$$

Step 3. Evaluate the Hessian at each critical point:

At $(1, 1)$:

$$H = \begin{bmatrix} 6 & 0 \\ 0 & 6 \end{bmatrix}$$

Eigenvalues: $6, 6$. Both positive. Strict local minimum.

$$f(1, 1) = 1 + 1 - 3 - 3 = -4$$

At $(-1, -1)$:

$$H = \begin{bmatrix} -6 & 0 \\ 0 & -6 \end{bmatrix}$$

Eigenvalues: $-6, -6$. Both negative. Strict local maximum.

$$f(-1, -1) = -1 - 1 + 3 + 3 = 4$$

At $(1, -1)$:

$$H = \begin{bmatrix} 6 & 0 \\ 0 & -6 \end{bmatrix}$$

Eigenvalues: $6, -6$. Mixed signs. Saddle point.

$$f(1, -1) = 1 - 1 - 3 + 3 = 0$$

At $(-1, 1)$:

$$H = \begin{bmatrix} -6 & 0 \\ 0 & 6 \end{bmatrix}$$

Eigenvalues: $-6, 6$. Mixed signs. Saddle point.

$$f(-1, 1) = -1 + 1 + 3 - 3 = 0$$

Summary of results:

| Critical point | $f$ value | Eigenvalues | Type |
|---|---|---|---|
| $(1, 1)$ | $-4$ | $6, 6$ | Local minimum ✓ |
| $(-1, -1)$ | $4$ | $-6, -6$ | Local maximum |
| $(1, -1)$ | $0$ | $6, -6$ | Saddle point |
| $(-1, 1)$ | $0$ | $-6, 6$ | Saddle point |

import numpy as np

f = lambda x, y: x**3 + y**3 - 3*x - 3*y

critical_points = [(1,1), (-1,-1), (1,-1), (-1,1)]
for (x, y) in critical_points:
    H = np.array([[6*x, 0], [0, 6*y]])
    eigs = np.linalg.eigvalsh(H)
    val = f(x, y)
    if all(eigs > 0):
        kind = "local min"
    elif all(eigs < 0):
        kind = "local max"
    else:
        kind = "saddle"
    print(f"({x:+d},{y:+d}): f={val:+.0f}, eigs={eigs}, {kind}")

The inconclusive case

When the Hessian is positive semidefinite but not positive definite (at least one zero eigenvalue), the second-order test tells you nothing. You need to look at higher-order derivatives or use other methods.

Classic example: $f(x) = x^4$. At $x = 0$:

  • $f'(0) = 0$ ✓ (critical point)
  • $f''(0) = 0$ (inconclusive)

But $x = 0$ is clearly a minimum because $x^4 \geq 0$ for all $x$. The second-order test just cannot confirm it.

Another example: $f(x) = x^3$. At $x = 0$:

  • $f'(0) = 0$ ✓ (critical point)
  • $f''(0) = 0$ (inconclusive)

Here $x = 0$ is neither a minimum nor a maximum; it is an inflection point.
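
One informal way to tell these cases apart numerically is to sample the function on either side of the critical point. This is a heuristic sketch, not a proof: it assumes the critical value is $f(0) = 0$ (true for both examples), and the probe width `eps` is an arbitrary illustrative choice:

```python
def probe(f, eps=1e-3):
    """Heuristically classify a critical point at x = 0 with f(0) = 0 by sampling."""
    left, right = f(-eps), f(eps)
    if left > 0 and right > 0:
        return "looks like a local minimum"
    if left < 0 and right < 0:
        return "looks like a local maximum"
    return "neither (function changes sign)"

print(probe(lambda x: x**4))  # looks like a local minimum
print(probe(lambda x: x**3))  # neither (function changes sign)
```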

Connection to curvature and ML

Why saddle points matter more than local minima

In high-dimensional optimization (like training neural networks), saddle points are far more common than local minima. A critical point is a local minimum only if the Hessian has all positive eigenvalues. In a space with millions of dimensions, the probability that all eigenvalues happen to be positive is very small. Most critical points in neural network loss surfaces are saddle points.

This is actually good news. Gradient descent and its variants (SGD, Adam) can escape saddle points because the noise in stochastic gradients pushes the parameters along directions of negative curvature. Pure gradient descent (without noise) can get stuck at saddle points because the gradient is zero there.
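
A tiny sketch of this effect, using the textbook saddle $f(x, y) = x^2 - y^2$ with an illustrative step size and noise scale (not any particular training setup):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin; its gradient is (2x, -2y).
grad = lambda p: np.array([2 * p[0], -2 * p[1]])

lr = 0.1
rng = np.random.default_rng(0)
p_plain = np.zeros(2)  # start exactly at the saddle, where the gradient is zero
p_noisy = np.zeros(2)
for _ in range(50):
    p_plain = p_plain - lr * grad(p_plain)  # zero gradient: never moves
    p_noisy = p_noisy - lr * (grad(p_noisy) + rng.normal(scale=0.01, size=2))

print(p_plain)  # [0. 0.] -- stuck at the saddle
print(p_noisy)  # nonzero: noise pushed it out along the negative-curvature (y) direction
```

The noisy iterate drifts along $y$, where the curvature is negative, because any small perturbation in that direction is amplified by the subsequent gradient steps.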

Condition number

The ratio of the largest to smallest eigenvalue of the Hessian, $\kappa = \lambda_{\max} / \lambda_{\min}$, is the condition number. A large condition number means the function is steep in some directions and flat in others. This makes optimization harder because a single step size cannot work well in all directions. Newton’s method handles this by using the Hessian to scale the step, which is why it converges faster than gradient descent on ill-conditioned problems.
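
A short sketch computing $\kappa$ for the Example 1 Hessian and for an assumed ill-conditioned variant (the second matrix is an illustrative choice):

```python
import numpy as np

for H in (np.array([[4.0, 0.0], [0.0, 6.0]]),      # Example 1: well conditioned
          np.array([[100.0, 0.0], [0.0, 1.0]])):   # steep in x, flat in y
    eigs = np.linalg.eigvalsh(H)  # returned in ascending order
    kappa = eigs[-1] / eigs[0]
    print(f"eigenvalues {eigs}, condition number {kappa}")

# For the quadratic f(x) = (1/2) x^T H x, gradient descent converges only
# for step sizes below 2 / lambda_max, while progress along the flattest
# direction scales with lambda_min -- a large ratio forces many iterations.
```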

Summary

| Condition | Meaning |
|---|---|
| $H \succ 0$ (positive definite) | Local minimum ✓ |
| $H \prec 0$ (negative definite) | Local maximum |
| $H$ indefinite | Saddle point |
| $H$ semidefinite | Inconclusive |
| High condition number | Hard for gradient descent |

The Hessian test gives you a definitive answer at most critical points. When it is inconclusive, you need more sophisticated tools. For practical ML, the main takeaway is that saddle points dominate high-dimensional landscapes, and stochastic methods handle them naturally.

What comes next

Now that you can find and classify critical points, the next question is: how do you get there? Line search methods are the building block of iterative optimization algorithms, controlling how far you step in a chosen direction.
