
Optimality conditions: first order

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

At a minimum of a smooth function, the gradient is zero. That is the first-order necessary condition. It gives you a way to find candidate solutions: set \nabla f = \mathbf{0} and solve. But a zero gradient does not guarantee a minimum. You could be at a maximum or a saddle point. Understanding exactly what first-order conditions do and do not tell you is essential for optimization.

The hilltop intuition

Stand at the top of a hill. The ground is flat in every direction. No matter which way you step, you neither go up nor down, at least for that first small step. That flatness is the gradient being zero.

Now imagine standing at the bottom of a valley. Same thing: the ground is flat in every direction. You cannot go lower without climbing first. The gradient is zero here too.

And at a mountain pass, the saddle between two peaks? Flat. The gradient is zero, yet you are neither at a top nor a bottom.

The gradient tells you the slope. When the slope is zero everywhere around a point, you have found a critical point. But which kind?

Three functions and their critical points

| Function | Gradient | Where gradient = 0 | Type of point |
| --- | --- | --- | --- |
| f(x) = x^2 | 2x | x = 0 | Minimum (valley bottom) |
| f(x) = -x^2 | -2x | x = 0 | Maximum (hilltop) |
| f(x, y) = x^2 - y^2 | (2x, -2y) | (0, 0) | Saddle point (mountain pass) |

All three have gradient zero at the critical point. First-order conditions alone cannot distinguish between them.
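The table can be checked numerically. A minimal sketch (my own sanity check, not from the article): the gradients all vanish at the critical point, while nearby function values reveal the differences.

```python
import numpy as np

# Gradients of the three functions from the table
grad_bowl   = lambda x: 2 * x                          # f(x) = x^2
grad_hill   = lambda x: -2 * x                         # f(x) = -x^2
grad_saddle = lambda x, y: np.array([2 * x, -2 * y])   # f(x, y) = x^2 - y^2

# All three gradients vanish at the critical point
print(grad_bowl(0), grad_hill(0), grad_saddle(0, 0))

# ...but nearby values differ: the saddle rises along x and falls along y
f_saddle = lambda x, y: x**2 - y**2
print(f_saddle(0.1, 0.0) > 0, f_saddle(0.0, 0.1) < 0)  # True True
```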

The gradient as a slope indicator

graph LR
  A["Far from critical point:<br/>gradient is large,<br/>steep slope"] --> B["Approaching critical point:<br/>gradient shrinks"]
  B --> C["At critical point:<br/>gradient = 0,<br/>flat in every direction"]
  style A fill:#ffcccc
  style B fill:#fff3cd
  style C fill:#ccffcc

The gradient arrow shrinks as you approach the critical point. At the critical point itself, the arrow vanishes. Finding where it vanishes is the whole game.

The function f(x) = x^3 - 3x has a local maximum at x = -1 and a local minimum at x = 1, both where the gradient is zero

Prerequisites

You should be comfortable with:

The gradient and critical points

For a function f: \mathbb{R}^n \to \mathbb{R}, the gradient at a point \mathbf{x} is the vector of partial derivatives:

\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient points in the direction of steepest ascent. Its negative, -\nabla f, points in the direction of steepest descent.

A point \mathbf{x}^* is a critical point (or stationary point) if \nabla f(\mathbf{x}^*) = \mathbf{0}.

At a critical point, there is no direction that gives a first-order increase or decrease. The function is “flat” to first order.
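One way to make “flat to first order” concrete is a finite-difference check. A hedged sketch (central differences and the step size h are my own illustrative choices):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda v: v[0]**2 + v[1]**2   # a simple bowl with minimum at the origin

print(numerical_gradient(f, [1.0, 2.0]))  # roughly [2. 4.]: steep slope
print(numerical_gradient(f, [0.0, 0.0]))  # roughly [0. 0.]: critical point
```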

Directional derivatives

The directional derivative of f at \mathbf{x} in direction \mathbf{d} (where \|\mathbf{d}\| = 1) is:

D_\mathbf{d} f(\mathbf{x}) = \nabla f(\mathbf{x})^T \mathbf{d}

This is the dot product of the gradient with the direction vector. It tells you the rate of change of f as you move from \mathbf{x} in direction \mathbf{d}.

Key facts:

  • The directional derivative is maximized when \mathbf{d} points in the same direction as \nabla f (steepest ascent).
  • The directional derivative is minimized when \mathbf{d} points opposite to \nabla f (steepest descent).
  • When \nabla f(\mathbf{x}) = \mathbf{0}, the directional derivative is zero in every direction.

This last point is why critical points matter: at a critical point, no direction gives a first-order decrease, so you cannot improve the function value by taking an infinitesimally small step.
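These facts are easy to verify numerically. A small sketch (the function and the sample point are my own choices): sweep unit directions around a point and compute \nabla f^T \mathbf{d} for each.

```python
import numpy as np

grad_f = lambda x, y: np.array([2 * x, 2 * y])   # gradient of f(x, y) = x^2 + y^2

g = grad_f(1.0, 1.0)   # gradient (2, 2) points along theta = pi/4
for theta in np.linspace(0, 2 * np.pi, 8, endpoint=False):
    d = np.array([np.cos(theta), np.sin(theta)])   # unit direction
    print(f"theta = {theta:4.2f}   D_d f = {g @ d:+.3f}")
# Largest at theta = pi/4 (along the gradient: steepest ascent),
# most negative at theta = 5*pi/4 (opposite: steepest descent).
```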

Necessary vs sufficient conditions

First-order necessary condition for a local minimum. If \mathbf{x}^* is a local minimum of a differentiable function f, then \nabla f(\mathbf{x}^*) = \mathbf{0}.

This is necessary but not sufficient: a zero gradient is required at a minimum, but a zero gradient does not by itself mean you have found one. There are three possibilities at a critical point:

  1. Local minimum: f increases in all directions from \mathbf{x}^*.
  2. Local maximum: f decreases in all directions from \mathbf{x}^*.
  3. Saddle point: f increases in some directions and decreases in others.

You need second-order information (the Hessian) to distinguish these cases, which we cover in the next article.

How to find critical points

flowchart TD
  A["Start with f of x"] --> B["Compute gradient nabla f"]
  B --> C["Set nabla f = 0"]
  C --> D["Solve the system of equations"]
  D --> E["Critical points found"]
  E --> F{"Classify each point"}
  F -->|"Need Hessian"| G["Minimum, maximum, or saddle?"]

Necessary vs sufficient: what first-order conditions guarantee

graph TD
  A["nabla f = 0 at x*"] -->|"Necessary"| B["Required for any local min"]
  A -->|"Not sufficient in general"| C["Could be max or saddle"]
  D["f is convex AND nabla f = 0"] -->|"Sufficient"| E["Guaranteed global minimum"]
  style B fill:#fff3cd
  style C fill:#ffcccc
  style E fill:#ccffcc

Exception: convex functions. For convex functions, the first-order condition is both necessary and sufficient. If f is convex and \nabla f(\mathbf{x}^*) = \mathbf{0}, then \mathbf{x}^* is a global minimum. This is one of the most powerful consequences of convexity.
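To see the convex case in action, here is a minimal sketch (the quadratic and the sample count are arbitrary illustrative choices): the single critical point of a convex quadratic beats every randomly sampled point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex quadratic f(v) = ||v - c||^2 with gradient 2(v - c),
# so the unique critical point is v = c.
c = np.array([1.0, 2.0])
f = lambda v: float(np.sum((v - c) ** 2))

# For a convex function, gradient zero is sufficient: every sampled
# point has a value at least as large as f(c).
samples = rng.normal(scale=5.0, size=(1000, 2))
print(f(c))                                # 0.0
print(all(f(v) >= f(c) for v in samples))  # True
```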

Example 1: A bowl-shaped function

Consider:

f(x, y) = x^2 + y^2 - 2x - 4y + 8

Step 1. Compute the gradient:

\frac{\partial f}{\partial x} = 2x - 2

\frac{\partial f}{\partial y} = 2y - 4

\nabla f = \begin{bmatrix} 2x - 2 \\ 2y - 4 \end{bmatrix}

Step 2. Set the gradient to zero and solve:

2x - 2 = 0 \quad \Rightarrow \quad x = 1

2y - 4 = 0 \quad \Rightarrow \quad y = 2

The only critical point is (1, 2).

Step 3. Evaluate the function:

f(1, 2) = 1 + 4 - 2 - 8 + 8 = 3

Step 4. Verify this is a minimum by rewriting:

f(x,y) = (x-1)^2 + (y-2)^2 + 3

This is a sum of squares plus a constant. The minimum value is 3, achieved at (1, 2). Since f is a sum of squared terms (convex), this critical point is the global minimum. ✓

Directional derivative check. At (1, 2), \nabla f = (0, 0), so the directional derivative is zero in every direction:

D_\mathbf{d} f(1, 2) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}^T \mathbf{d} = 0

No matter which direction you choose, the function is flat to first order.

import numpy as np

grad_f = lambda x, y: np.array([2*x - 2, 2*y - 4])
f = lambda x, y: x**2 + y**2 - 2*x - 4*y + 8

# Critical point
print(f"Gradient at (1,2): {grad_f(1, 2)}")  # [0 0]
print(f"f(1,2) = {f(1, 2)}")                  # 3

# Nearby points are all larger
for dx, dy in [(0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1)]:
    print(f"f({1+dx:.1f}, {2+dy:.1f}) = {f(1+dx, 2+dy):.2f}")
# All values > 3

Example 2: A function with a saddle point

Consider:

f(x, y) = x^2 - y^2

Step 1. Compute the gradient:

\nabla f = \begin{bmatrix} 2x \\ -2y \end{bmatrix}

Step 2. Set to zero:

2x = 0 \quad \Rightarrow \quad x = 0

-2y = 0 \quad \Rightarrow \quad y = 0

The only critical point is (0, 0) with f(0, 0) = 0.

Step 3. Check nearby points to see if it is a minimum:

f(0.1, 0) = 0.01 > 0 \quad \text{(function increases along } x\text{-axis)}

f(0, 0.1) = -0.01 < 0 \quad \text{(function decreases along } y\text{-axis)}

The origin is not a minimum because you can decrease f by moving along the y-axis. It is also not a maximum because you can increase f by moving along the x-axis. It is a saddle point.

Step 4. Check the directional derivatives at the origin. Since the gradient is zero, all directional derivatives are zero. First-order information alone cannot distinguish this saddle point from a minimum. This is precisely why we need second-order conditions.

Directional derivative near the origin. At (0.1, 0), the gradient is (0.2, 0). The directional derivative in the x-direction is 0.2 (ascending), confirming the function is increasing. In the y-direction it is 0 (the function curves downward but the first derivative is zero there).

import numpy as np

f = lambda x, y: x**2 - y**2
grad_f = lambda x, y: np.array([2*x, -2*y])

print(f"Gradient at origin: {grad_f(0, 0)}")  # [0 0]
print(f"f(0.1, 0) = {f(0.1, 0)}")             # 0.01
print(f"f(0, 0.1) = {f(0, 0.1)}")             # -0.01
# Saddle point: increases in x, decreases in y

Example 3: Multiple critical points of a cubic surface

Consider:

f(x, y) = x^3 - 3xy + y^3

Step 1. Compute partial derivatives:

\frac{\partial f}{\partial x} = 3x^2 - 3y

\frac{\partial f}{\partial y} = -3x + 3y^2

Step 2. Set both to zero:

From the first equation: y = x^2.

Substitute into the second: -3x + 3(x^2)^2 = 0, so -3x + 3x^4 = 0.

3x(x^3 - 1) = 0

This gives x = 0 or x = 1.

  • x = 0: y = 0^2 = 0, so critical point at (0, 0).
  • x = 1: y = 1^2 = 1, so critical point at (1, 1).

Step 3. Evaluate ff:

f(0, 0) = 0

f(1, 1) = 1 - 3 + 1 = -1

Step 4. Check if these are minima. Test nearby points around (0, 0):

f(0.1, 0) = 0.001 > 0

f(-0.1, 0) = -0.001 < 0

The origin is not a minimum (the function takes both positive and negative values nearby). It is a saddle point.

Test nearby points around (1, 1):

f(1.1, 1.1) = 1.331 - 3.63 + 1.331 = -0.968

f(0.9, 0.9) = 0.729 - 2.43 + 0.729 = -0.972

f(1, 1) = -1

Since f(1,1) = -1 is less than both neighboring values, this looks like a local minimum. But to confirm, we would need the second-order test with the Hessian.

import numpy as np

f = lambda x, y: x**3 - 3*x*y + y**3
grad = lambda x, y: np.array([3*x**2 - 3*y, -3*x + 3*y**2])

for point in [(0, 0), (1, 1)]:
    x, y = point
    print(f"Point {point}: f={f(x,y):.3f}, grad={grad(x,y)}")
# Point (0, 0): f=0.000, grad=[0 0]
# Point (1, 1): f=-1.000, grad=[0 0]

First-order conditions for constrained problems

When you have constraints, the story changes. At a constrained minimum, the gradient of f does not have to be zero. It just has to be zero in directions that keep you inside the feasible set.

For equality constraints h(\mathbf{x}) = 0, the first-order condition becomes:

\nabla f(\mathbf{x}^*) = \lambda \nabla h(\mathbf{x}^*)

This says the gradient of f must be parallel to the gradient of the constraint at the optimal point. The scalar \lambda is the Lagrange multiplier, and this condition comes from the Lagrangian framework.
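As a concrete illustration (a toy problem of my own, not from the article): minimize f(x, y) = x^2 + y^2 subject to x + y = 1. By symmetry the minimizer is (1/2, 1/2), where the two gradients are parallel with \lambda = 1.

```python
import numpy as np

# Toy problem: minimize f(x, y) = x^2 + y^2  subject to  h(x, y) = x + y - 1 = 0
grad_f = lambda x, y: np.array([2 * x, 2 * y])
grad_h = lambda x, y: np.array([1.0, 1.0])

x_star = (0.5, 0.5)   # constrained minimizer (by symmetry)
gf, gh = grad_f(*x_star), grad_h(*x_star)

lam = gf[0] / gh[0]   # Lagrange multiplier
print(gf, gh)                      # [1. 1.] [1. 1.]
print(np.allclose(gf, lam * gh))   # True: gradients are parallel
print(np.allclose(gf, 0))          # False: grad f itself is NOT zero here
```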

For inequality constraints, the conditions become the KKT conditions, which we cover later in this series.

Practical implications for ML

Gradient descent uses first-order conditions

Gradient descent works by repeatedly moving in the direction of -\nabla f. It is trying to reach a point where \nabla f = \mathbf{0}. The algorithm stops (or should stop) when the gradient is close enough to zero.
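A minimal sketch of that stopping rule (the step size, tolerance, and convex bowl f(x, y) = (x-1)^2 + (y-2)^2 are arbitrary illustrative choices):

```python
import numpy as np

f = lambda v: (v[0] - 1) ** 2 + (v[1] - 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 2 * (v[1] - 2)])

v = np.array([5.0, -3.0])   # arbitrary starting point
step = 0.1
steps = 0
while np.linalg.norm(grad(v)) > 1e-8:   # first-order stopping test
    v = v - step * grad(v)
    steps += 1

print(v, steps)   # converges close to [1. 2.]
```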

Vanishing gradients in deep learning

In deep networks, the gradient can become very small (close to zero) even when you are not near a minimum. This “vanishing gradient” problem is not a first-order optimality issue; it is a numerical issue where the gradient signal gets lost through many layers of the chain rule. Techniques like residual connections and careful initialization help address this.
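A rough way to see the effect (my own toy illustration): the sigmoid's derivative is at most 1/4, so the chain-rule product through many sigmoid layers shrinks geometrically even when no critical point is nearby.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # peaks at 0.25 when z = 0

# Best-case chain-rule factor through n sigmoid layers (ignoring weights):
for n in [1, 5, 10, 20]:
    print(n, dsigmoid(0.0) ** n)
# The gradient signal decays geometrically with depth: by 20 layers the
# factor is below 1e-12, even though no critical point is nearby.
```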

Checking convergence

In practice, you monitor \|\nabla f(\mathbf{w})\| during training. When it drops below some threshold, you are close to a critical point. For convex problems, this means you are close to the global minimum. For non-convex problems like neural networks, you might be at a local minimum, a saddle point, or just in a flat region.

Summary

| Concept | Key idea |
| --- | --- |
| Critical point | \nabla f = \mathbf{0} |
| Directional derivative | \nabla f^T \mathbf{d}, rate of change in direction \mathbf{d} |
| Necessary condition | Local min implies \nabla f = \mathbf{0} |
| Sufficient (convex only) | Convex f and \nabla f = \mathbf{0} implies global min |
| Saddle point trap | \nabla f = \mathbf{0} but not a min or max |

First-order conditions give you candidates. They narrow down the search from all possible points to just the critical points. But among those candidates, you still need to figure out which are minima, which are maxima, and which are saddle points.

What comes next

To classify critical points, you need the second-order optimality conditions. The Hessian matrix tells you whether a critical point is a minimum, maximum, or saddle point.
