Optimality conditions: first order
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
At a minimum of a smooth function, the gradient is zero. That is the first-order necessary condition. It gives you a way to find candidate solutions: set $\nabla f(x) = 0$ and solve. But a zero gradient does not guarantee a minimum. You could be at a maximum or a saddle point. Understanding exactly what first-order conditions do and do not tell you is essential for optimization.
The hilltop intuition
Stand at the top of a hill. The ground is flat in every direction. No matter which way you step, you neither go up nor down, at least for that first small step. That flatness is the gradient being zero.
Now imagine standing at the bottom of a valley. Same thing: the ground is flat in every direction. You cannot go lower without climbing first. The gradient is zero here too.
And at a mountain pass, the saddle between two peaks? Flat. The gradient is zero, yet you are neither at a top nor a bottom.
The gradient tells you the slope. When the slope is zero everywhere around a point, you have found a critical point. But which kind?
Three functions and their critical points
| Function | Gradient | Where gradient = 0 | Type of point |
|---|---|---|---|
| $f(x, y) = x^2 + y^2$ | $(2x, \; 2y)$ | $(0, 0)$ | Minimum (valley bottom) |
| $f(x, y) = -x^2 - y^2$ | $(-2x, \; -2y)$ | $(0, 0)$ | Maximum (hilltop) |
| $f(x, y) = x^2 - y^2$ | $(2x, \; -2y)$ | $(0, 0)$ | Saddle point (mountain pass) |
All three have gradient zero at the critical point. First-order conditions alone cannot distinguish between them.
The gradient as a slope indicator
graph LR
A["Far from critical point:<br/>gradient is large,<br/>steep slope"] --> B["Approaching critical point:<br/>gradient shrinks"]
B --> C["At critical point:<br/>gradient = 0,<br/>flat in every direction"]
style A fill:#ffcccc
style B fill:#fff3cd
style C fill:#ccffcc
The gradient arrow shrinks as you approach the critical point. At the critical point itself, the arrow vanishes. Finding where it vanishes is the whole game.
The function f(x) = x^3 - 3x has a local maximum at x = -1 and a local minimum at x = 1, both where the gradient is zero
Prerequisites
You should be comfortable with:
- Partial derivatives and the gradient of a multivariable function
- Dot products and vector norms
- Convex functions (covered earlier in this series)
The gradient and critical points
For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient at a point $x$ is the vector of partial derivatives:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

The gradient points in the direction of steepest ascent. Its negative, $-\nabla f(x)$, points in the direction of steepest descent.
A point $x^*$ is a critical point (or stationary point) if $\nabla f(x^*) = 0$.
At a critical point, there is no direction that gives a first-order increase or decrease. The function is "flat" to first order.
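As a quick sanity check, an analytic gradient can be compared against a central-difference approximation. The function below is a made-up example, not one from this article:

```python
import numpy as np

# Hypothetical example: f(x, y) = x^2 + 3*y^2
f = lambda v: v[0]**2 + 3 * v[1]**2
grad_f = lambda v: np.array([2 * v[0], 6 * v[1]])  # analytic gradient

def numeric_grad(f, v, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        e = np.zeros_like(v, dtype=float)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

v = np.array([1.0, -2.0])
print(grad_f(v))           # analytic: approximately [2, -12]
print(numeric_grad(f, v))  # numeric: approximately [2, -12]
```

Checks like this catch sign errors in hand-derived gradients before they quietly break an optimizer.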
Directional derivatives
The directional derivative of $f$ at $x$ in direction $d$ (where $\|d\| = 1$) is:

$$D_d f(x) = \nabla f(x)^\top d$$

This is the dot product of the gradient with the direction vector. It tells you the rate of change of $f$ as you move from $x$ in direction $d$.
Key facts:
- The directional derivative is maximized when $d$ points in the same direction as $\nabla f(x)$ (steepest ascent).
- The directional derivative is minimized when $d$ points opposite to $\nabla f(x)$ (steepest descent).
- When $\nabla f(x) = 0$, the directional derivative is zero in every direction.
This last point is why critical points matter: at a critical point, no direction gives a first-order decrease, so you cannot improve the function value by taking an infinitesimally small step.
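A small numerical sketch of these facts, using the gradient $(2, 4)$ of $f(x, y) = x^2 + y^2$ at the point $(1, 2)$ as an assumed example:

```python
import numpy as np

grad = np.array([2.0, 4.0])  # gradient of x^2 + y^2 at (1, 2)

ascent = grad / np.linalg.norm(grad)                             # steepest ascent direction
descent = -ascent                                                # steepest descent direction
sideways = np.array([-4.0, 2.0]) / np.linalg.norm([-4.0, 2.0])   # orthogonal to the gradient

print(np.dot(grad, ascent))    # ||grad|| ~ 4.47: the largest possible rate of change
print(np.dot(grad, descent))   # -||grad|| ~ -4.47: the smallest
print(np.dot(grad, sideways))  # 0: no first-order change in this direction
```

Moving perpendicular to the gradient leaves the function unchanged to first order, which is why contour lines are perpendicular to gradients.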
Necessary vs sufficient conditions
First-order necessary condition for a local minimum. If $x^*$ is a local minimum of a differentiable function $f$, then $\nabla f(x^*) = 0$.
This is necessary but not sufficient. Gradient equals zero is required for a minimum, but having gradient equal zero does not mean you have a minimum. There are three possibilities at a critical point:
- Local minimum: $f$ increases in all directions from $x^*$.
- Local maximum: $f$ decreases in all directions from $x^*$.
- Saddle point: $f$ increases in some directions and decreases in others.
You need second-order information (the Hessian) to distinguish these cases, which we cover in the next article.
How to find critical points
flowchart TD
A["Start with f of x"] --> B["Compute gradient nabla f"]
B --> C["Set nabla f = 0"]
C --> D["Solve the system of equations"]
D --> E["Critical points found"]
E --> F{"Classify each point"}
F -->|"Need Hessian"| G["Minimum, maximum, or saddle?"]
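The flowchart above can be carried out symbolically. A sketch using SymPy (assuming it is available), applied to the cubic from Example 3 below:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 3*x*y + y**3

# Compute the gradient and solve the system grad f = 0
grad = [sp.diff(f, v) for v in (x, y)]
critical_points = sp.solve(grad, [x, y], dict=True)

# sp.solve also returns complex roots; keep only the real critical points
real_points = [p for p in critical_points if all(s.is_real for s in p.values())]
print(real_points)  # the real critical points: (0, 0) and (1, 1)
```

For small problems symbolic solving is exact; for anything large you fall back to numerical root finding or iterative methods like gradient descent.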
Necessary vs sufficient: what first-order conditions guarantee
graph TD
A["nabla f = 0 at x*"] -->|"Necessary"| B["Required for any local min"]
A -->|"Not sufficient in general"| C["Could be max or saddle"]
D["f is convex AND nabla f = 0"] -->|"Sufficient"| E["Guaranteed global minimum"]
style B fill:#fff3cd
style C fill:#ffcccc
style E fill:#ccffcc
Exception: convex functions. For convex functions, the first-order condition is both necessary and sufficient. If $f$ is convex and $\nabla f(x^*) = 0$, then $x^*$ is a global minimum. This follows from the first-order characterization of convexity: $f(y) \geq f(x^*) + \nabla f(x^*)^\top (y - x^*) = f(x^*)$ for every $y$. This is one of the most powerful consequences of convexity.
Example 1: A bowl-shaped function
Consider:

$$f(x, y) = x^2 + y^2 - 2x - 4y + 8$$

Step 1. Compute the gradient:

$$\nabla f(x, y) = (2x - 2, \; 2y - 4)$$

Step 2. Set the gradient to zero and solve:

$$2x - 2 = 0 \;\Rightarrow\; x = 1, \qquad 2y - 4 = 0 \;\Rightarrow\; y = 2$$

The only critical point is $(1, 2)$.
Step 3. Evaluate the function:

$$f(1, 2) = 1 + 4 - 2 - 8 + 8 = 3$$

Step 4. Verify this is a minimum by rewriting:

$$f(x, y) = (x - 1)^2 + (y - 2)^2 + 3$$

This is a sum of squares plus a constant. The minimum value is $3$, achieved at $(1, 2)$. Since $f$ is a sum of squared terms plus a constant (convex), this critical point is the global minimum. ✓
Directional derivative check. At $(1, 2)$, $\nabla f(1, 2) = (0, 0)$, so the directional derivative is zero in every direction:

$$D_d f(1, 2) = \nabla f(1, 2)^\top d = 0 \quad \text{for every unit vector } d$$

No matter which direction you choose, the function is flat to first order.
```python
import numpy as np

f = lambda x, y: x**2 + y**2 - 2*x - 4*y + 8
grad_f = lambda x, y: np.array([2*x - 2, 2*y - 4])

# Critical point
print(f"Gradient at (1,2): {grad_f(1, 2)}")  # [0 0]
print(f"f(1,2) = {f(1, 2)}")                 # 3

# Nearby points are all larger
for dx, dy in [(0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1)]:
    print(f"f({1+dx:.1f}, {2+dy:.1f}) = {f(1+dx, 2+dy):.2f}")
# All values > 3
```
Example 2: A function with a saddle point
Consider:

$$f(x, y) = x^2 - y^2$$

Step 1. Compute the gradient:

$$\nabla f(x, y) = (2x, \; -2y)$$

Step 2. Set to zero:

$$2x = 0, \quad -2y = 0 \;\Rightarrow\; (x, y) = (0, 0)$$

The only critical point is $(0, 0)$ with $f(0, 0) = 0$.
Step 3. Check nearby points to see if it is a minimum:

$$f(\epsilon, 0) = \epsilon^2 > 0, \qquad f(0, \epsilon) = -\epsilon^2 < 0$$

The origin is not a minimum because you can decrease $f$ by moving along the $y$-axis. It is also not a maximum because you can increase $f$ by moving along the $x$-axis. It is a saddle point.
Step 4. Check the directional derivatives at the origin. Since the gradient is zero, all directional derivatives are zero. First-order information alone cannot distinguish this saddle point from a minimum. This is precisely why we need second-order conditions.
Directional derivative near the origin. At $(0.1, 0)$, the gradient is $(0.2, 0)$. The directional derivative in the $x$-direction is $0.2 > 0$ (ascending), confirming the function is increasing along $x$. In the $y$-direction it is $0$ (the function curves downward in $y$, but the first derivative is zero there).
```python
import numpy as np

f = lambda x, y: x**2 - y**2
grad_f = lambda x, y: np.array([2*x, -2*y])

print(f"Gradient at origin: {grad_f(0, 0)}")  # [0 0]
print(f"f(0.1, 0) = {f(0.1, 0)}")   # 0.01
print(f"f(0, 0.1) = {f(0, 0.1)}")   # -0.01
# Saddle point: increases in x, decreases in y
```
Example 3: Multiple critical points of a cubic surface
Consider:

$$f(x, y) = x^3 - 3xy + y^3$$

Step 1. Compute partial derivatives:

$$\frac{\partial f}{\partial x} = 3x^2 - 3y, \qquad \frac{\partial f}{\partial y} = -3x + 3y^2$$

Step 2. Set both to zero:

$$3x^2 - 3y = 0, \qquad -3x + 3y^2 = 0$$

From the first equation: $y = x^2$.
Substitute into the second: $-3x + 3x^4 = 0$, so $3x(x^3 - 1) = 0$.
This gives $x = 0$ or $x = 1$.
- $x = 0$: $y = 0$, so critical point at $(0, 0)$.
- $x = 1$: $y = 1$, so critical point at $(1, 1)$.
Step 3. Evaluate $f$:

$$f(0, 0) = 0, \qquad f(1, 1) = 1 - 3 + 1 = -1$$

Step 4. Check if these are minima. Test nearby points around $(0, 0)$:

$$f(\epsilon, 0) = \epsilon^3$$

This is positive for $\epsilon > 0$ and negative for $\epsilon < 0$. The origin is not a minimum (the function takes both positive and negative values nearby). It is a saddle point.
Test nearby points around $(1, 1)$:

$$f(0.9, 0.9) \approx -0.972, \qquad f(1.1, 1.1) \approx -0.968$$

Since $f(1, 1) = -1$ is less than both neighboring values, this looks like a local minimum. But to confirm, we would need the second-order test with the Hessian.
```python
import numpy as np

f = lambda x, y: x**3 - 3*x*y + y**3
grad = lambda x, y: np.array([3*x**2 - 3*y, -3*x + 3*y**2])

for point in [(0, 0), (1, 1)]:
    x, y = point
    print(f"Point {point}: f={f(x,y):.3f}, grad={grad(x,y)}")
# Point (0, 0): f=0.000, grad=[0 0]
# Point (1, 1): f=-1.000, grad=[0 0]
```
First-order conditions for constrained problems
When you have constraints, the story changes. At a constrained minimum, the gradient of does not have to be zero. It just has to be zero in directions that keep you inside the feasible set.
For equality constraints $g(x) = 0$, the first-order condition becomes:

$$\nabla f(x^*) = \lambda \nabla g(x^*)$$

This says the gradient of $f$ must be parallel to the gradient of the constraint at the optimal point. The scalar $\lambda$ is the Lagrange multiplier, and this condition comes from the Lagrangian framework.
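A numerical illustration of the parallel-gradients condition, on a made-up problem (not from the text): minimize $f(x, y) = x + y$ on the unit circle $g(x, y) = x^2 + y^2 - 1 = 0$, whose minimizer is $(-1/\sqrt{2}, -1/\sqrt{2})$:

```python
import numpy as np

# Minimizer of f(x, y) = x + y on the unit circle
x_star = np.array([-1.0, -1.0]) / np.sqrt(2)

grad_f = np.array([1.0, 1.0])  # gradient of f (constant)
grad_g = 2 * x_star            # gradient of g at x*

# Recover the Lagrange multiplier from the first component
lam = grad_f[0] / grad_g[0]
print(lam)                                # about -0.707
print(np.allclose(grad_f, lam * grad_g))  # True: the gradients are parallel
```

Note that $\nabla f \neq 0$ at the constrained minimum; only its component tangent to the constraint surface vanishes.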
For inequality constraints, the conditions become the KKT conditions, which we cover later in this series.
Practical implications for ML
Gradient descent uses first-order conditions
Gradient descent works by repeatedly moving in the direction of $-\nabla f(x)$. It is trying to reach a point where $\nabla f(x) = 0$. The algorithm stops (or should stop) when the gradient is close enough to zero.
Vanishing gradients in deep learning
In deep networks, the gradient can become very small (close to zero) even when you are not near a minimum. This “vanishing gradient” problem is not a first-order optimality issue; it is a numerical issue where the gradient signal gets lost through many layers of the chain rule. Techniques like residual connections and careful initialization help address this.
Checking convergence
In practice, you monitor $\|\nabla f\|$ during training. When it drops below some threshold, you are close to a critical point. For convex problems, this means you are close to the global minimum. For non-convex problems like neural networks, you might be at a local minimum, a saddle point, or just in a flat region.
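A minimal sketch of this stopping rule, run on the bowl-shaped function from Example 1 (the step size and tolerance are arbitrary choices):

```python
import numpy as np

# Gradient of f(x, y) = x^2 + y^2 - 2x - 4y + 8 from Example 1
grad_f = lambda v: np.array([2 * v[0] - 2, 2 * v[1] - 4])

v = np.array([5.0, -3.0])  # arbitrary starting point
step, tol = 0.1, 1e-8

for i in range(10_000):
    g = grad_f(v)
    if np.linalg.norm(g) < tol:  # stop near a critical point
        break
    v = v - step * g

print(v)  # converges close to the critical point (1, 2)
```

Because this $f$ is convex, reaching a small gradient norm really does mean reaching the global minimum; on a non-convex surface the same stopping rule could halt at a saddle.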
Summary
| Concept | Key idea |
|---|---|
| Critical point | $\nabla f(x^*) = 0$ |
| Directional derivative | $D_d f(x) = \nabla f(x)^\top d$, rate of change in direction $d$ |
| Necessary condition | Local min implies $\nabla f(x^*) = 0$ |
| Sufficient (convex only) | Convex $f$ and $\nabla f(x^*) = 0$ implies global min |
| Saddle point trap | $\nabla f = 0$ but neither a min nor a max |
First-order conditions give you candidates. They narrow down the search from all possible points to just the critical points. But among those candidates, you still need to figure out which are minima, which are maxima, and which are saddle points.
What comes next
To classify critical points, you need the second-order optimality conditions. The Hessian matrix tells you whether a critical point is a minimum, maximum, or saddle point.