Steepest descent (gradient descent)
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites: This article assumes you are comfortable with first-order optimality conditions, partial derivatives, and gradients.
The ball rolling downhill
Picture a ball on a hilly landscape. Release it, and it rolls downhill. It does not plan a route. It does not look ahead. At every instant, it simply moves in the steepest downward direction. That is gradient descent.
The ball moves fast on steep slopes and slow on gentle ones. If it lands in a narrow valley, it bounces from wall to wall, losing energy with each bounce. If the valley is wide and round, it rolls smoothly to the bottom.
This analogy captures the essential behavior. Gradient descent follows the steepest local slope, one step at a time, hoping to reach the bottom.
Three iterations of gradient descent on $f(x, y) = x^2 + 4y^2$, starting at $(4, 2)$ with step size $\alpha = 0.1$:

| Iteration | Position | Gradient | Step size | New position |
|---|---|---|---|---|
| 0 | $(4, 2)$ | $(8, 16)$ | $0.1$ | $(3.2, 0.4)$ |
| 1 | $(3.2, 0.4)$ | $(6.4, 3.2)$ | $0.1$ | $(2.56, 0.08)$ |
| 2 | $(2.56, 0.08)$ | $(5.12, 0.64)$ | $0.1$ | $(2.048, 0.016)$ |
Each step moves opposite to the gradient. The $y$-component shrinks fast (steep curvature), while $x$ shrinks slowly (gentle curvature). The path dives toward the valley floor, then creeps along it.
Descent trajectory on contour lines
```mermaid
graph TD
    S["Start: 4, 2 <br/> f = 32"] --> A["Step 1: 3.2, 0.4 <br/> f = 10.9"]
    A --> B["Step 2: 2.56, 0.08 <br/> f = 6.6"]
    B --> C["Step 3: 2.05, 0.02 <br/> f = 4.2"]
    C --> D["... more steps ..."]
    D --> MIN["Minimum: 0, 0 <br/> f = 0"]
    style S fill:#ffcccc
    style MIN fill:#ccffcc
```
The path zigzags: large jumps in $y$ early on, then slow progress in $x$. This is the signature of gradient descent on elongated contours.
Gradient descent on $f(x, y) = x^2 + 4y^2$ starting from $(4, 2)$. The zigzag pattern is caused by the elongated contours (condition number $\kappa = 4$).
Now let’s formalize this intuition.
The update rule
You have a differentiable function $f$ and you want to find its minimum. Gradient descent does this by starting at some point and repeatedly moving in the direction that decreases $f$ the fastest.

Why is that direction the negative gradient? Consider a small step of size $\epsilon$ from your current point $x$ in a direction $d$ with unit length. The first-order Taylor expansion gives:

$$f(x + \epsilon d) \approx f(x) + \epsilon \, \nabla f(x)^\top d$$

To decrease $f$ as much as possible, you want to minimize $\nabla f(x)^\top d$ subject to $\|d\| = 1$. By the Cauchy-Schwarz inequality:

$$\nabla f(x)^\top d \ge -\|\nabla f(x)\|$$

Equality holds when $d = -\nabla f(x) / \|\nabla f(x)\|$. So the direction of steepest decrease is exactly the negative gradient direction.
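As a quick numerical sanity check (a sketch, not part of the derivation), we can compare the first-order change $\nabla f(x)^\top d$ along many random unit directions against the Cauchy-Schwarz bound $-\|\nabla f(x)\|$, which only the normalized negative gradient attains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient of f(x, y) = x^2 + 4y^2 at the starting point (4, 2)
g = np.array([8.0, 16.0])

# First-order change of f along a unit direction d is g . d;
# Cauchy-Schwarz says this is at least -||g||, with equality only at d = -g/||g||
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
first_order_change = directions @ g

steepest_change = -np.linalg.norm(g)  # achieved by d = -g/||g||
print(first_order_change.min(), ">", steepest_change)
```

No random direction ever beats the bound; the normalized negative gradient attains it exactly.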
This gives us the gradient descent update rule:

$$x_{k+1} = x_k - \alpha \, \nabla f(x_k)$$

where $\alpha > 0$ is the step size (also called the learning rate). Start at some initial guess $x_0$ and repeat until convergence.
```mermaid
flowchart TD
    A["Pick starting point x₀ and step size α"] --> B["Compute gradient ∇f at xₖ"]
    B --> C{"Is the gradient small enough?"}
    C -- Yes --> D["Stop: xₖ is approximately optimal"]
    C -- No --> E["Update: xₖ₊₁ = xₖ − α ∇f"]
    E --> B
```

The algorithm is simple. The step size makes or breaks it.
A test function
We will use $f(x, y) = x^2 + 4y^2$ throughout this article. It is a quadratic with elliptical contours, simple enough to compute by hand but rich enough to show all the important behaviors.
The gradient is:

$$\nabla f(x, y) = \begin{pmatrix} 2x \\ 8y \end{pmatrix}$$

The minimum is at $(0, 0)$ with $f(0, 0) = 0$. The Hessian is:

$$H = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$$

The eigenvalues are $\lambda_{\min} = 2$ and $\lambda_{\max} = 8$. Their ratio $\kappa = \lambda_{\max} / \lambda_{\min} = 4$ is the condition number. It measures how elongated the contours are. Higher condition number means more elongated contours and harder optimization. Keep this number in mind.
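These numbers are quick to verify with NumPy (a small sketch using the Hessian above):

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 8.0]])  # Hessian of f(x, y) = x^2 + 4y^2

eigenvalues = np.linalg.eigvalsh(H)  # sorted ascending for symmetric matrices
kappa = eigenvalues[-1] / eigenvalues[0]

print(eigenvalues)  # [2. 8.]
print(kappa)        # 4.0
```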
Example 1: fixed step size
Start at $x_0 = (4, 2)$ with step size $\alpha = 0.1$. At each iteration we compute the gradient, take a step, and evaluate the function.

Iteration 0 (starting point): $(x, y) = (4, 2)$, $f = 32$

Iteration 1: $\nabla f = (8, 16)$, so $(x, y) = (4 - 0.8, \, 2 - 1.6) = (3.2, 0.4)$, $f = 10.88$

Iteration 2: $\nabla f = (6.4, 3.2)$, so $(x, y) = (2.56, 0.08)$, $f = 6.58$

Iteration 3: $\nabla f = (5.12, 0.64)$, so $(x, y) = (2.048, 0.016)$, $f = 4.20$

Iteration 4: $\nabla f = (4.096, 0.128)$, so $(x, y) = (1.638, 0.0032)$, $f = 2.68$

Iteration 5: $\nabla f = (3.277, 0.0256)$, so $(x, y) = (1.311, 0.00064)$, $f = 1.72$
After 5 iterations, $f$ dropped from 32 to 1.72. The $y$-component shrinks fast (multiplied by $1 - 0.1 \cdot 8 = 0.2$ each step) while the $x$-component shrinks slowly (multiplied by $1 - 0.1 \cdot 2 = 0.8$ each step). The path moves quickly toward the $x$-axis, then creeps along it toward the origin.

Why the difference? The $y$-direction has the larger eigenvalue (8), so the gradient component in $y$ is steeper. The step size $\alpha = 0.1$ nearly eliminates $y$ in a few steps but is too conservative for $x$.
Step size: too small, too large, just right
The step size $\alpha$ controls how far you move each iteration. Three things can happen:

Too small ($\alpha \ll 1/L$): Every step is safe but tiny. You converge, eventually. In Example 1, $\alpha = 0.1$ is on the conservative side. It works, but convergence is slow in the $x$-direction.

Too large ($\alpha > 2/L$): The iterates overshoot the minimum and oscillate with growing amplitude. The function value increases instead of decreasing. The method diverges.

Just right ($\alpha \approx 1/L$): The step is large enough to make good progress but small enough to stay stable. For our function, $L = 8$ (the largest eigenvalue of the Hessian), so the ideal constant step size is around $\alpha = 1/8 = 0.125$.
Learning rate effects
```mermaid
graph TD
    A["Step size too small"] -->|"alpha much less than 1/L"| B["Converges, but very slowly"]
    C["Step size just right"] -->|"alpha near 1/L"| D["Fast, stable convergence"]
    E["Step size too large"] -->|"alpha greater than 2/L"| F["Oscillates and diverges"]
    style B fill:#fff3cd
    style D fill:#ccffcc
    style F fill:#ffcccc
```
The threshold for stability is $\alpha < 2/L$. For our function, that means $\alpha < 2/8 = 0.25$. Anything above 0.25 will diverge.
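A small sketch makes the threshold concrete: on a quadratic, each coordinate is multiplied by $1 - \alpha \lambda_i$ per step, so the iteration is stable only while $|1 - \alpha \lambda_i| < 1$ for every eigenvalue. The step sizes 0.24 and 0.26 below are chosen to bracket the 0.25 threshold:

```python
import numpy as np

def run(alpha, n_iters=50):
    """Gradient descent on f(x, y) = x^2 + 4y^2 from (4, 2); returns final f."""
    x = np.array([4.0, 2.0])
    for _ in range(n_iters):
        x = x - alpha * np.array([2 * x[0], 8 * x[1]])
    return x[0]**2 + 4 * x[1]**2

# Just below 2/L = 0.25: the y-factor is 1 - 8*0.24 = -0.92, so y still shrinks
print(run(0.24))
# Just above it: the y-factor is 1 - 8*0.26 = -1.08, so y grows every step
print(run(0.26))
```

The first run ends with a tiny function value; the second blows up.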
Example 2: step size $\alpha = 0.3$ (too large)

Same function, same starting point, but now $\alpha = 0.3$. This exceeds the stability threshold of 0.25.

At each step the $x$-component is multiplied by $1 - 0.3 \cdot 2 = 0.4$, which shrinks it. But the $y$-component is multiplied by $1 - 0.3 \cdot 8 = -1.4$. The negative sign means it flips direction, and $|{-1.4}| > 1$ means it grows every step.
Iteration 1: $(x, y) = (1.6, -2.8)$, $f = 33.92$

The function went up from 32 to 33.92.

Iteration 2: $(x, y) = (0.64, 3.92)$, $f = 61.88$

Iteration 3: $(x, y) = (0.256, -5.488)$, $f = 120.54$

Iteration 4: $(x, y) = (0.102, 7.683)$, $f = 236.14$

Iteration 5: $(x, y) = (0.041, -10.756)$, $f = 462.81$
The $y$-coordinate bounces wildly: $2 \to -2.8 \to 3.92 \to -5.49 \to 7.68 \to -10.76$. Each time it crosses zero and grows. Meanwhile $x$ converges fine (shrinking by factor 0.4 each step), but the $y$-direction blows up because the step size exceeds the stability limit for that direction.
This is the core problem: the step size that works for one direction can be disastrous for another. The bottleneck is always the direction with the largest curvature (largest eigenvalue of the Hessian).
Convergence for convex functions
How fast does gradient descent converge when the step size is chosen properly?
General convex functions
A key quantity is the Lipschitz constant of the gradient, $L$. The gradient is $L$-Lipschitz continuous when:

$$\|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\| \quad \text{for all } x, y$$

This bounds how fast the gradient can change. For quadratics, $L$ equals the largest eigenvalue of the Hessian. For our test function, $L = 8$.
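A quick empirical check of this bound for our test function (a sketch over random point pairs, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

L = 8.0  # largest eigenvalue of the Hessian diag(2, 8)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(grad_f(x) - grad_f(y))
    # The gradient difference is (2*dx, 8*dy), so its norm is at most 8*||x - y||
    assert lhs <= L * np.linalg.norm(x - y) + 1e-12
```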
With step size $\alpha = 1/L$, gradient descent on a convex function satisfies:

$$f(x_k) - f(x^*) \le \frac{L \, \|x_0 - x^*\|^2}{2k}$$

This is an $O(1/k)$ rate. To halve the error, you need twice as many iterations. Slow, but guaranteed.
Strongly convex functions
If $f$ is also strongly convex with parameter $\mu > 0$ (the smallest eigenvalue of the Hessian, for quadratics), convergence is much faster. With step size $\alpha = 1/L$:

$$\|x_k - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x^*\|^2$$

This is linear convergence: the error shrinks by a constant factor each iteration. The rate depends on the condition number $\kappa = L/\mu$. When $\kappa$ is close to 1, convergence is fast. When $\kappa$ is large, each step barely dents the error.

For our function: $\mu = 2$, $L = 8$, $\kappa = 4$. The per-step factor is $1 - \mu/L = 1 - 1/4 = 0.75$, so the error drops by about 25% each iteration.
The optimal constant step size is $\alpha = \frac{2}{\mu + L}$, which gives a per-step factor of $\frac{\kappa - 1}{\kappa + 1}$ on the distance to the minimum. For our function: $\frac{4 - 1}{4 + 1} = 0.6$, a 40% reduction per step.
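These contraction factors are easy to verify numerically (a sketch: after enough steps, the ratio of successive distances to the minimum settles to the predicted factor):

```python
import numpy as np

def step(x, alpha):
    """One gradient descent step on f(x, y) = x^2 + 4y^2."""
    return x - alpha * np.array([2 * x[0], 8 * x[1]])

ratios = {}
for alpha in (1 / 8, 2 / 10):  # alpha = 1/L and alpha = 2/(mu + L)
    x = np.array([4.0, 2.0])
    for _ in range(20):  # run long enough to reach the asymptotic regime
        x = step(x, alpha)
    # ratio of successive distances to the minimum at the origin
    ratios[alpha] = np.linalg.norm(step(x, alpha)) / np.linalg.norm(x)

print(ratios)  # ~0.75 for alpha = 1/L, ~0.6 for alpha = 2/(mu + L)
```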
Zig-zag behavior
Gradient descent has a well-known weakness on elongated contours: it zig-zags.
The gradient always points perpendicular to the contour lines. When contours are circular (condition number 1), the gradient points straight at the minimum. When contours are elongated ellipses (large condition number), the gradient mostly points across the narrow direction, not toward the minimum.
For $f(x, y) = x^2 + 4y^2$, the contours are ellipses twice as wide in $x$ as in $y$. The gradient at most points has a large $y$-component relative to $x$, so each step makes big progress in $y$ but then overshoots, requiring a correction back.
This creates the zig-zag pattern: the iterates bounce back and forth across the narrow valley while making slow progress along it. The worse the condition number, the worse the zig-zagging.
The eigenvalues of the Hessian tell you exactly how bad it will be. If $\lambda_{\min}$ and $\lambda_{\max}$ are far apart, the contours are highly elongated and gradient descent wastes most of its effort bouncing between the walls of the valley.
Even exact line search (choosing the best $\alpha$ at each step) does not fix zig-zagging. It still bounces; it just picks the optimal step within each bounce. Fixing zig-zag requires second-order information, which is what Newton's method provides.
Convergence: well-conditioned vs ill-conditioned problems
```mermaid
graph TD
    A["Well-conditioned<br/>kappa near 1"] -->|"Circular contours"| B["Gradient points at minimum<br/>Fast, smooth convergence"]
    C["Ill-conditioned<br/>kappa much greater than 1"] -->|"Elongated ellipse contours"| D["Gradient points across valley<br/>Slow, zig-zag convergence"]
    style B fill:#ccffcc
    style D fill:#ffcccc
```
The condition number determines everything. When $\kappa$ is close to 1, the contours are nearly circular and gradient descent converges quickly. When $\kappa$ is large, the contours are stretched and each step wastes effort bouncing sideways.
Example 3: exact line search
Exact line search. At each step, choose the $\alpha_k$ that minimizes $f(x_k - \alpha \nabla f(x_k))$ along the search direction. This gives the best possible step at each iteration. For quadratics, the formula is known in closed form.

Backtracking line search. Start with a large $\alpha$ and shrink it until a descent condition (such as the Armijo condition) is met. Cheaper than exact search, and usually good enough.
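Backtracking is not worked through in the examples below, so here is a minimal sketch on our test function; the parameters `alpha0`, `rho`, and `c` are conventional defaults, not values from the text:

```python
import numpy as np

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

def backtracking_step(x, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
        alpha *= rho  # step too ambitious: halve it and retry
    return x - alpha * g, alpha

x = np.array([4.0, 2.0])
values = [f(x)]
for k in range(10):
    x, alpha = backtracking_step(x)
    values.append(f(x))
    print(f"k={k+1}: alpha={alpha:.4f}  f={f(x):.6f}")
```

Armijo guarantees the function value never increases, without any knowledge of $L$; the accepted step adapts from iteration to iteration.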
Exact vs backtracking line search
```mermaid
graph LR
    A["Exact line search"] -->|"Minimize f along direction"| B["Optimal step each time"]
    C["Backtracking line search"] -->|"Shrink alpha until sufficient decrease"| D["Good enough step, cheaper"]
    B --> E["Expensive per step, fewer steps"]
    D --> F["Cheap per step, more steps"]
    style B fill:#ccffcc
    style D fill:#fff3cd
```
Instead of a fixed step size, pick the $\alpha$ that minimizes $f$ along the search direction at each step. For a quadratic $f(x) = \frac{1}{2} x^\top A x$ with gradient $g = \nabla f(x_k) = A x_k$, the optimal step size is:

$$\alpha_k = \frac{g^\top g}{g^\top A g}$$

Derivation. Along the line $x_k - \alpha g$, the function value is:

$$\phi(\alpha) = f(x_k - \alpha g) = f(x_k) - \alpha \, g^\top g + \frac{\alpha^2}{2} \, g^\top A g$$

Set $\phi'(\alpha) = 0$:

$$-g^\top g + \alpha \, g^\top A g = 0 \quad \Longrightarrow \quad \alpha_k = \frac{g^\top g}{g^\top A g}$$

For our function, $A = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$, so $\alpha_k = \dfrac{g_1^2 + g_2^2}{2 g_1^2 + 8 g_2^2}$.
Starting from $(4, 2)$:

Iteration 1: $g = (8, 16)$, $\alpha = \frac{320}{2176} \approx 0.147$, $(x, y) = (2.824, -0.353)$, $f = 8.47$

Iteration 2: $g = (5.647, -2.824)$, $\alpha = 0.3125$, $(x, y) = (1.059, 0.529)$, $f = 2.24$

Iteration 3: $g = (2.118, 4.235)$, $\alpha \approx 0.147$, $(x, y) = (0.747, -0.093)$, $f = 0.59$

Iteration 4: $g = (1.494, -0.747)$, $\alpha = 0.3125$, $(x, y) = (0.280, 0.140)$, $f = 0.16$

Iteration 5: $g = (0.561, 1.121)$, $\alpha \approx 0.147$, $(x, y) = (0.198, -0.025)$, $f = 0.04$

The step size alternates between roughly 0.147 and 0.3125. The $y$-coordinate flips sign every iteration: $2 \to -0.353 \to 0.529 \to -0.093 \to 0.140 \to -0.025$. This is the zig-zag pattern in action, even with the optimal step size at each step.
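The alternation can be checked in a few lines (a sketch using the closed-form step size for this quadratic):

```python
import numpy as np

A = np.diag([2.0, 8.0])  # Hessian of f(x, y) = x^2 + 4y^2
x = np.array([4.0, 2.0])

alphas = []
for _ in range(4):
    g = A @ x                      # gradient of (1/2) x^T A x
    alpha = (g @ g) / (g @ A @ g)  # exact line-search step
    alphas.append(alpha)
    x = x - alpha * g

print(np.round(alphas, 4))  # [0.1471 0.3125 0.1471 0.3125]
```

The exact values are $5/34$ and $5/16$; on a 2D quadratic, exact line search settles into a two-step cycle of directions, so the step sizes repeat.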
Comparison
Here are the function values across all three approaches:
| Iteration | $\alpha = 0.1$ | $\alpha = 0.3$ | Exact line search |
|---|---|---|---|
| 0 | 32.00 | 32.00 | 32.00 |
| 1 | 10.88 | 33.92 | 8.47 |
| 2 | 6.58 | 61.88 | 2.24 |
| 3 | 4.20 | 120.54 | 0.59 |
| 4 | 2.68 | 236.14 | 0.16 |
| 5 | 1.72 | 462.81 | 0.04 |
- $\alpha = 0.1$: Steady convergence, but slow. Reduced $f$ by a factor of about 19 in 5 steps.
- $\alpha = 0.3$: Divergence. The function value grows every iteration because the step size exceeds $2/L = 0.25$.
- Exact line search: Fast convergence. Reduced $f$ by a factor of about 800 in 5 steps. Still zig-zags, but each bounce is optimal.
Python implementation
Here is a compact implementation that runs all three strategies:
```python
import numpy as np

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

def gd_fixed(x0, alpha, n_iters):
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        x = x - alpha * grad_f(x)
        path.append((x.copy(), f(x)))
    return path

def gd_exact(x0, n_iters):
    A = np.diag([2.0, 8.0])
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        g = grad_f(x)
        alpha = (g @ g) / (g @ A @ g)
        x = x - alpha * g
        path.append((x.copy(), f(x)))
    return path

x0 = np.array([4.0, 2.0])
for label, path in [
    ("alpha=0.1", gd_fixed(x0, 0.1, 5)),
    ("alpha=0.3", gd_fixed(x0, 0.3, 5)),
    ("exact", gd_exact(x0, 5)),
]:
    print(f"\n{label}:")
    for i, (xi, fi) in enumerate(path):
        print(f"  k={i}: x=({xi[0]:8.4f}, {xi[1]:8.4f})  f={fi:.4f}")
```
To visualize the trajectories on a contour plot:
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
xg = np.linspace(-1, 5, 300)
yg = np.linspace(-3, 3, 300)
X, Y = np.meshgrid(xg, yg)
Z = X**2 + 4 * Y**2
ax.contour(X, Y, Z, levels=20, cmap="coolwarm", alpha=0.6)

for label, path, color in [
    ("Fixed 0.1", gd_fixed(x0, 0.1, 20), "blue"),
    ("Exact", gd_exact(x0, 20), "green"),
]:
    pts = np.array([p[0] for p in path])
    ax.plot(pts[:, 0], pts[:, 1], "o-", color=color, label=label, ms=4)

ax.plot(0, 0, "r*", ms=12, label="Minimum")
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Gradient descent on f(x,y) = x² + 4y²")
plt.tight_layout()
plt.show()
```
The blue path (fixed $\alpha = 0.1$) dives toward the $x$-axis and then crawls along it. The green path (exact line search) zig-zags visibly but reaches the minimum much faster.
Convergence curve: f(x,y) value at each iteration of gradient descent
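The curve can be reproduced with a short self-contained snippet (a sketch; it recomputes the fixed-step trajectory rather than reusing the helpers above, and saves to a hypothetical `convergence.png`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

def f_values(x0, alpha, n):
    """f at each iterate of fixed-step gradient descent on x^2 + 4y^2."""
    x = x0.copy()
    vals = [x[0]**2 + 4 * x[1]**2]
    for _ in range(n):
        x = x - alpha * np.array([2 * x[0], 8 * x[1]])
        vals.append(x[0]**2 + 4 * x[1]**2)
    return vals

vals = f_values(np.array([4.0, 2.0]), 0.1, 20)
plt.semilogy(vals, "o-")  # log scale makes the linear convergence a straight line
plt.xlabel("Iteration k")
plt.ylabel("f(x_k)  (log scale)")
plt.title("Convergence of gradient descent, alpha = 0.1")
plt.savefig("convergence.png")
```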
What comes next
Gradient descent is the foundation, but it has clear limitations. Zig-zag behavior on poorly conditioned problems slows convergence, the step size requires tuning, and the convergence rate depends on the condition number $\kappa$.
Newton’s method fixes the zig-zag by using second-order information (the Hessian) to reshape the step direction. Instead of following the steepest slope, Newton’s method follows the curvature of the function directly toward the minimum. It converges much faster on smooth problems, but each step requires solving a linear system involving the Hessian.
For large-scale machine learning, computing the full gradient over the entire dataset is too expensive. Stochastic gradient descent and its variants like Adam and RMSProp approximate the gradient using small random batches of data, trading per-step accuracy for cheaper, more frequent updates.