
Steepest descent (gradient descent)

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article assumes you are comfortable with first-order optimality conditions, partial derivatives, and gradients.

The ball rolling downhill

Picture a ball on a hilly landscape. Release it, and it rolls downhill. It does not plan a route. It does not look ahead. At every instant, it simply moves in the steepest downward direction. That is gradient descent.

The ball moves fast on steep slopes and slow on gentle ones. If it lands in a narrow valley, it bounces from wall to wall, losing energy with each bounce. If the valley is wide and round, it rolls smoothly to the bottom.

This analogy captures the essential behavior. Gradient descent follows the steepest local slope, one step at a time, hoping to reach the bottom.

Three iterations of gradient descent on $f(x, y) = x^2 + 4y^2$, starting at $(4, 2)$ with step size $0.1$:

| Iteration | Position $(x, y)$ | Gradient $(g_x, g_y)$ | Step size | New position |
|---|---|---|---|---|
| 0 | $(4.00, 2.00)$ | $(8.0, 16.0)$ | $0.1$ | $(3.20, 0.40)$ |
| 1 | $(3.20, 0.40)$ | $(6.4, 3.2)$ | $0.1$ | $(2.56, 0.08)$ |
| 2 | $(2.56, 0.08)$ | $(5.12, 0.64)$ | $0.1$ | $(2.05, 0.02)$ |

Each step moves opposite to the gradient. The $y$-component shrinks fast (steep curvature), while $x$ shrinks slowly (gentle curvature). The path dives toward the valley floor, then creeps along it.

Descent trajectory on contour lines

graph TD
  S["Start: 4, 2 <br/> f = 32"] --> A["Step 1: 3.2, 0.4 <br/> f = 10.9"]
  A --> B["Step 2: 2.56, 0.08 <br/> f = 6.6"]
  B --> C["Step 3: 2.05, 0.02 <br/> f = 4.2"]
  C --> D["... more steps ..."]
  D --> MIN["Minimum: 0, 0 <br/> f = 0"]
  style S fill:#ffcccc
  style MIN fill:#ccffcc

The path zigzags: large jumps in $y$ early on, then slow progress in $x$. This is the signature of gradient descent on elongated contours.

Gradient descent on f(x,y) = x^2 + 10y^2 starting from (3,3). The zigzag pattern is caused by the elongated contours (condition number = 10).

Now let’s formalize this intuition.

The update rule

You have a function $f: \mathbb{R}^n \to \mathbb{R}$ and you want to find its minimum. Gradient descent does this by starting at some point and repeatedly moving in the direction that decreases $f$ the fastest.

Why is that direction the negative gradient? Consider a small step from your current point $\mathbf{x}$ in a unit-length direction $\mathbf{d}$. The first-order Taylor expansion gives:

$$f(\mathbf{x} + \alpha \mathbf{d}) \approx f(\mathbf{x}) + \alpha \nabla f(\mathbf{x})^T \mathbf{d}$$

To decrease $f$ as much as possible, you want to minimize $\nabla f(\mathbf{x})^T \mathbf{d}$ subject to $\|\mathbf{d}\| = 1$. By the Cauchy-Schwarz inequality:

$$\nabla f(\mathbf{x})^T \mathbf{d} \geq -\|\nabla f(\mathbf{x})\|$$

Equality holds when $\mathbf{d} = -\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|$. So the direction of steepest decrease is exactly the negative gradient direction.
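A quick numerical check of this claim (a sketch using NumPy; the sample count and the gradient value, taken from the test function used later at the point $(4, 2)$, are arbitrary choices):

```python
import numpy as np

grad = np.array([8.0, 16.0])          # gradient of x^2 + 4y^2 at (4, 2)

# Sample many unit directions d and compute the directional derivative g^T d
rng = np.random.default_rng(0)
d = rng.normal(size=(100_000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)
slopes = d @ grad

best = d[np.argmin(slopes)]           # sampled direction of steepest decrease
steepest = -grad / np.linalg.norm(grad)

print(slopes.min())                   # close to -||grad|| = -sqrt(320) ≈ -17.889
print(best, steepest)                 # nearly identical directions
```

The sampled minimum matches the Cauchy-Schwarz bound, and the minimizing direction lines up with the normalized negative gradient.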

This gives us the gradient descent update rule:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$$

where $\alpha_k > 0$ is the step size (also called the learning rate). Start at some initial guess $\mathbf{x}_0$ and repeat until convergence.

flowchart TD
  A["Pick starting point x₀ and step size α"] --> B["Compute gradient ∇f at xₖ"]
  B --> C{"Is the gradient small enough?"}
  C -- Yes --> D["Stop: xₖ is approximately optimal"]
  C -- No --> E["Update: xₖ₊₁ = xₖ − α ∇f"]
  E --> B

The algorithm is simple. The step size $\alpha$ makes or breaks it.
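The flowchart translates directly into a few lines of Python (a minimal sketch; the tolerance and iteration cap are arbitrary choices, not from this article):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-8, max_iters=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # the "gradient small enough?" test
            break
        x = x - alpha * g
    return x

# Minimize f(x, y) = x^2 + 4y^2 from (4, 2) with alpha = 0.1
x_min = gradient_descent(lambda x: np.array([2 * x[0], 8 * x[1]]), [4.0, 2.0], 0.1)
print(x_min)   # approaches (0, 0)
```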

A test function

We will use $f(x, y) = x^2 + 4y^2$ throughout this article. It is a quadratic with elliptical contours, simple enough to compute by hand but rich enough to show all the important behaviors.

The gradient is:

$$\nabla f(x, y) = \begin{pmatrix} 2x \\ 8y \end{pmatrix}$$

The minimum is at $(0, 0)$ with $f(0, 0) = 0$. The Hessian is:

$$H = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$$

The eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 8$. Their ratio $\kappa = \lambda_{\max} / \lambda_{\min} = 4$ is the condition number. It measures how elongated the contours are: the higher the condition number, the more elongated the contours and the harder the optimization. Keep this number in mind.
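These quantities are easy to confirm numerically (a small sketch using NumPy):

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 8.0]])           # Hessian of f(x, y) = x^2 + 4y^2

eigvals = np.linalg.eigvalsh(H)      # sorted ascending for symmetric matrices
kappa = eigvals[-1] / eigvals[0]     # condition number

print(eigvals)   # [2. 8.]
print(kappa)     # 4.0
```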

Example 1: fixed step size $\alpha = 0.1$

Start at $(4, 2)$ with step size $\alpha = 0.1$. At each iteration we compute the gradient, take a step, and evaluate the function.

Iteration 0 (starting point):

$$\mathbf{x}_0 = (4,\ 2), \quad f(\mathbf{x}_0) = 16 + 16 = 32$$

Iteration 1:

$$\nabla f = (2 \cdot 4,\ 8 \cdot 2) = (8,\ 16)$$

$$\mathbf{x}_1 = (4,\ 2) - 0.1 \cdot (8,\ 16) = (3.2,\ 0.4)$$

$$f(\mathbf{x}_1) = 10.24 + 0.64 = 10.88$$

Iteration 2:

$$\nabla f = (6.4,\ 3.2)$$

$$\mathbf{x}_2 = (3.2,\ 0.4) - 0.1 \cdot (6.4,\ 3.2) = (2.56,\ 0.08)$$

$$f(\mathbf{x}_2) = 6.5536 + 0.0256 = 6.5792$$

Iteration 3:

$$\nabla f = (5.12,\ 0.64)$$

$$\mathbf{x}_3 = (2.048,\ 0.016), \quad f(\mathbf{x}_3) = 4.1953$$

Iteration 4:

$$\nabla f = (4.096,\ 0.128)$$

$$\mathbf{x}_4 = (1.6384,\ 0.0032), \quad f(\mathbf{x}_4) = 2.6844$$

Iteration 5:

$$\nabla f = (3.2768,\ 0.0256)$$

$$\mathbf{x}_5 = (1.3107,\ 0.00064), \quad f(\mathbf{x}_5) = 1.7180$$

After 5 iterations, $f$ dropped from 32 to 1.72. The $y$-component shrinks fast (multiplied by 0.2 each step) while the $x$-component shrinks slowly (multiplied by 0.8 each step). The path moves quickly toward the $x$-axis, then creeps along it toward the origin.

Why the difference? The $y$-direction has the larger eigenvalue (8), so the gradient component in $y$ is steeper. The step size $\alpha = 0.1$ nearly eliminates $y$ in a few steps but is too conservative for $x$.
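Because the Hessian is diagonal, each coordinate evolves independently: $x_{k+1} = (1 - 2\alpha)x_k$ and $y_{k+1} = (1 - 8\alpha)y_k$, so with $\alpha = 0.1$ the iterates have the closed form $x_k = 4 \cdot 0.8^k$, $y_k = 2 \cdot 0.2^k$. A quick check against the iterates above:

```python
# Closed-form iterates for Example 1 (alpha = 0.1):
# x_k = 4 * 0.8**k, y_k = 2 * 0.2**k
for k in range(6):
    print(k, 4 * 0.8**k, 2 * 0.2**k)
# k=5 gives x = 1.31072, y = 0.00064, matching iteration 5 above
```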

Step size: too small, too large, just right

The step size controls how far you move each iteration. Three things can happen:

Too small ($\alpha \ll 1/L$): Every step is safe but tiny. You converge, eventually. In Example 1, $\alpha = 0.1$ is on the conservative side. It works, but convergence is slow in the $x$-direction.

Too large ($\alpha > 2/L$): The iterates overshoot the minimum and oscillate with growing amplitude. The function value increases instead of decreasing. The method diverges.

Just right ($\alpha \approx 1/L$): The step is large enough to make good progress but small enough to stay stable. For our function, $L = 8$ (the largest eigenvalue of the Hessian), so a reliable constant step size is around $\alpha = 1/8 = 0.125$.

Learning rate effects

graph TD
  A["Step size too small"] -->|"alpha much less than 1/L"| B["Converges, but very slowly"]
  C["Step size just right"] -->|"alpha near 1/L"| D["Fast, stable convergence"]
  E["Step size too large"] -->|"alpha greater than 2/L"| F["Oscillates and diverges"]
  style B fill:#fff3cd
  style D fill:#ccffcc
  style F fill:#ffcccc

The threshold for stability is $\alpha < 2/L$. For our function, that means $\alpha < 0.25$. Anything above 0.25 will diverge.

Example 2: step size $\alpha = 0.3$ (too large)

Same function, same starting point, but now $\alpha = 0.3$. This exceeds the stability threshold of 0.25.

At each step the $x$-component is multiplied by $(1 - 0.3 \cdot 2) = 0.4$, which shrinks it. But the $y$-component is multiplied by $(1 - 0.3 \cdot 8) = -1.4$. The negative sign means it flips direction, and $|-1.4| > 1$ means it grows every step.

Iteration 1:

$$\nabla f = (8,\ 16)$$

$$\mathbf{x}_1 = (4,\ 2) - 0.3 \cdot (8,\ 16) = (1.6,\ -2.8)$$

$$f(\mathbf{x}_1) = 2.56 + 31.36 = 33.92$$

The function went up from 32 to 33.92.

Iteration 2:

$$\nabla f = (3.2,\ -22.4)$$

$$\mathbf{x}_2 = (0.64,\ 3.92), \quad f(\mathbf{x}_2) = 61.88$$

Iteration 3:

$$\nabla f = (1.28,\ 31.36)$$

$$\mathbf{x}_3 = (0.256,\ -5.488), \quad f(\mathbf{x}_3) = 120.54$$

Iteration 4:

$$\nabla f = (0.512,\ -43.904)$$

$$\mathbf{x}_4 = (0.1024,\ 7.6832), \quad f(\mathbf{x}_4) = 236.14$$

Iteration 5:

$$\nabla f = (0.2048,\ 61.466)$$

$$\mathbf{x}_5 = (0.0410,\ -10.757), \quad f(\mathbf{x}_5) = 462.81$$

The $y$-coordinate bounces wildly: $2 \to -2.8 \to 3.92 \to -5.49 \to 7.68 \to -10.76$. Each time it crosses zero and grows. Meanwhile $x$ converges fine (shrinking by a factor of 0.4 each step), but the $y$-direction blows up because the step size exceeds the stability limit for that direction.

This is the core problem: the step size that works for one direction can be disastrous for another. The bottleneck is always the direction with the largest curvature (largest eigenvalue of the Hessian).
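The per-coordinate factors make the blow-up exactly predictable: with $\alpha = 0.3$, each step multiplies $x$ by 0.4 and $y$ by $-1.4$, so $x_k = 4 \cdot 0.4^k$ and $y_k = 2 \cdot (-1.4)^k$. A quick check against the iterates of Example 2:

```python
# Closed-form iterates for Example 2 (alpha = 0.3):
# x_k = 4 * 0.4**k, y_k = 2 * (-1.4)**k
for k in range(6):
    print(k, 4 * 0.4**k, 2 * (-1.4)**k)
# k=5: x = 0.04096, y = -10.75648, matching iteration 5 of Example 2
```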

Convergence for convex functions

How fast does gradient descent converge when the step size is chosen properly?

General convex functions

A key quantity is the Lipschitz constant of the gradient, $L$. The gradient $\nabla f$ is $L$-Lipschitz continuous when:

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\| \quad \text{for all } \mathbf{x}, \mathbf{y}$$

This bounds how fast the gradient can change. For quadratics, $L$ equals the largest eigenvalue of the Hessian. For our test function, $L = 8$.

With step size $\alpha = 1/L$, gradient descent on a convex function satisfies:

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{L \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2k}$$

This is an $O(1/k)$ rate. To halve the error, you need twice as many iterations. Slow, but guaranteed.
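You can verify the bound numerically for the test function (a sketch; here $\alpha = 1/L = 0.125$, $\mathbf{x}^* = \mathbf{0}$, $f(\mathbf{x}^*) = 0$, and the 20-iteration horizon is an arbitrary choice):

```python
import numpy as np

L = 8.0
x = np.array([4.0, 2.0])
x0_dist_sq = float(x @ x)             # ||x0 - x*||^2 = 20

for k in range(1, 21):
    x = x - (1 / L) * np.array([2 * x[0], 8 * x[1]])
    f_gap = x[0]**2 + 4 * x[1]**2     # f(x_k) - f(x*)
    bound = L * x0_dist_sq / (2 * k)  # O(1/k) guarantee
    assert f_gap <= bound             # the bound holds at every iteration
    print(k, round(f_gap, 6), round(bound, 4))
```

The actual gap shrinks much faster than $80/k$ here; the $O(1/k)$ rate is a worst-case guarantee over all convex functions, not a prediction for this quadratic.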

Strongly convex functions

If $f$ is also strongly convex with parameter $\mu > 0$ (the smallest eigenvalue of the Hessian, for quadratics), convergence is much faster. With step size $\alpha = 1/L$:

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^{2k} \|\mathbf{x}_0 - \mathbf{x}^*\|^2$$

This is linear convergence: the error shrinks by a constant factor each iteration. The rate depends on the condition number $\kappa = L / \mu$. When $\kappa$ is close to 1, convergence is fast. When $\kappa$ is large, each step barely dents the error.

For our function: $\mu = 2$, $L = 8$, $\kappa = 4$. The per-step factor is $(1 - 1/4) = 0.75$, so the error drops by about 25% each iteration.

The optimal constant step size is $\alpha = 2 / (L + \mu)$, which gives a per-step factor of $(\kappa - 1) / (\kappa + 1)$. For our function: $(4 - 1)/(4 + 1) = 0.6$, a 40% reduction per step.
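A sketch comparing the two step sizes on the test function. With $\alpha = 2/(L + \mu) = 0.2$, both coordinate factors have the same magnitude, $|1 - 0.2 \cdot 2| = |1 - 0.2 \cdot 8| = 0.6$, so neither direction lags behind the other:

```python
import numpy as np

def run(alpha, n):
    """Run n steps of gradient descent on x^2 + 4y^2 from (4, 2); return f."""
    x = np.array([4.0, 2.0])
    for _ in range(n):
        x = x - alpha * np.array([2 * x[0], 8 * x[1]])
    return x[0]**2 + 4 * x[1]**2

print(run(1 / 8, 10))    # alpha = 1/L
print(run(0.2, 10))      # alpha = 2/(L + mu): noticeably smaller error
```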

Zig-zag behavior

Gradient descent has a well-known weakness on elongated contours: it zig-zags.

The gradient always points perpendicular to the contour lines. When contours are circular (condition number 1), the gradient points straight at the minimum. When contours are elongated ellipses (large condition number), the gradient mostly points across the narrow direction, not toward the minimum.

For $f(x, y) = x^2 + 4y^2$, the contours are ellipses that are twice as wide in $x$ as in $y$. The gradient at most points has a large $y$-component relative to $x$, so each step makes big progress in $y$ but then overshoots, requiring a correction back.

This creates the zig-zag pattern: the iterates bounce back and forth across the narrow valley while making slow progress along it. The worse the condition number, the worse the zig-zagging.

The eigenvalues of the Hessian tell you exactly how bad it will be. If $\lambda_{\max}$ and $\lambda_{\min}$ are far apart, the contours are highly elongated and gradient descent wastes most of its effort bouncing between the walls of the valley.

Even exact line search (choosing the best $\alpha$ at each step) does not fix zig-zagging. It still bounces; it just picks the optimal step within each bounce. Fixing the zig-zag requires second-order information, which is what Newton's method provides.

Convergence: well-conditioned vs ill-conditioned problems

graph TD
  A["Well-conditioned<br/>kappa near 1"] -->|"Circular contours"| B["Gradient points at minimum<br/>Fast, smooth convergence"]
  C["Ill-conditioned<br/>kappa much greater than 1"] -->|"Elongated ellipse contours"| D["Gradient points across valley<br/>Slow, zig-zag convergence"]
  style B fill:#ccffcc
  style D fill:#ffcccc

The condition number $\kappa = L / \mu$ determines everything. When $\kappa$ is close to 1, the contours are nearly circular and gradient descent converges quickly. When $\kappa$ is large, the contours are stretched and each step wastes effort bouncing sideways.

Line search strategies

Exact line search. At each step, choose the $\alpha$ that minimizes $f$ along the search direction. This gives the best possible step at each iteration. For quadratics, the formula is known in closed form.

Backtracking line search. Start with a large $\alpha$ and shrink it until a descent condition (such as the Armijo condition) is met. Cheaper than exact search, and usually good enough.
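Backtracking with the Armijo condition can be sketched in a few lines (the constants $c = 10^{-4}$ and shrink factor $\rho = 0.5$ are conventional choices, not taken from this article):

```python
import numpy as np

def backtracking(f, grad_f, x, alpha0=1.0, c=1e-4, rho=0.5):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    # Require f(x - alpha g) <= f(x) - c * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
        alpha *= rho
    return alpha

f = lambda x: x[0]**2 + 4 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 8 * x[1]])

x = np.array([4.0, 2.0])
for _ in range(20):
    alpha = backtracking(f, grad_f, x)
    x = x - alpha * grad_f(x)
print(x, f(x))   # approaches the minimum at the origin
```

Each call costs a handful of extra function evaluations, but no Hessian and no closed-form knowledge of $f$: this is why backtracking is the default in practice.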

Exact vs backtracking line search

graph LR
  A["Exact line search"] -->|"Minimize f along direction"| B["Optimal step each time"]
  C["Backtracking line search"] -->|"Shrink alpha until sufficient decrease"| D["Good enough step, cheaper"]
  B --> E["Expensive per step, fewer steps"]
  D --> F["Cheap per step, more steps"]
  style B fill:#ccffcc
  style D fill:#fff3cd

Instead of a fixed step size, pick the $\alpha$ that minimizes $f$ along the search direction at each step. For a quadratic $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x}$ with gradient $\mathbf{g}_k = A \mathbf{x}_k$, the optimal step size is:

$$\alpha_k^* = \frac{\mathbf{g}_k^T \mathbf{g}_k}{\mathbf{g}_k^T A \, \mathbf{g}_k}$$

Derivation. Along the line $\mathbf{x}_k - \alpha \mathbf{g}_k$, the function value is:

$$\phi(\alpha) = f(\mathbf{x}_k) - \alpha \, \mathbf{g}_k^T \mathbf{g}_k + \frac{\alpha^2}{2} \, \mathbf{g}_k^T A \, \mathbf{g}_k$$

Set $\phi'(\alpha) = 0$:

$$-\mathbf{g}_k^T \mathbf{g}_k + \alpha \, \mathbf{g}_k^T A \, \mathbf{g}_k = 0 \quad \Longrightarrow \quad \alpha^* = \frac{\mathbf{g}_k^T \mathbf{g}_k}{\mathbf{g}_k^T A \, \mathbf{g}_k}$$

For our function with $A = \operatorname{diag}(2, 8)$:

$$\alpha^* = \frac{4x^2 + 64y^2}{8x^2 + 512y^2}$$
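A sanity check of the closed form: compare $\alpha^*$ at $(4, 2)$ against a dense grid of candidate step sizes (a sketch; the grid range and resolution are arbitrary choices):

```python
import numpy as np

A = np.diag([2.0, 8.0])
f = lambda x: 0.5 * x @ A @ x           # same as x^2 + 4y^2
x = np.array([4.0, 2.0])
g = A @ x                                # gradient (8, 16)

alpha_star = (g @ g) / (g @ A @ g)       # 320 / 2176 ≈ 0.1471
alphas = np.linspace(0.0, 0.5, 10_001)
values = [f(x - a * g) for a in alphas]

print(alpha_star)                        # ≈ 0.1471
print(alphas[np.argmin(values)])         # grid minimizer agrees
```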

Starting from $(4, 2)$:

Iteration 1:

$$\mathbf{g} = (8,\ 16), \quad \mathbf{g}^T\mathbf{g} = 320, \quad \mathbf{g}^T A \, \mathbf{g} = 2176$$

$$\alpha^* = 320 / 2176 \approx 0.1471$$

$$\mathbf{x}_1 = (4 - 0.1471 \cdot 8,\ 2 - 0.1471 \cdot 16) = (2.8235,\ -0.3529)$$

$$f(\mathbf{x}_1) = 7.97 + 0.50 = 8.47$$

Iteration 2:

$$\mathbf{g} = (5.6471,\ -2.8235), \quad \alpha^* \approx 0.3125$$

$$\mathbf{x}_2 = (1.0588,\ 0.5294), \quad f(\mathbf{x}_2) = 2.24$$

Iteration 3:

$$\mathbf{g} = (2.1176,\ 4.2353), \quad \alpha^* \approx 0.1471$$

$$\mathbf{x}_3 = (0.7474,\ -0.0934), \quad f(\mathbf{x}_3) = 0.59$$

Iteration 4:

$$\mathbf{g} = (1.4948,\ -0.7474), \quad \alpha^* \approx 0.3125$$

$$\mathbf{x}_4 = (0.2803,\ 0.1401), \quad f(\mathbf{x}_4) = 0.16$$

Iteration 5:

$$\mathbf{g} = (0.5606,\ 1.1211), \quad \alpha^* \approx 0.1471$$

$$\mathbf{x}_5 = (0.1979,\ -0.0248), \quad f(\mathbf{x}_5) = 0.04$$

The step size alternates between roughly 0.15 and 0.31. The $y$-coordinate flips sign every iteration: $2 \to -0.35 \to 0.53 \to -0.09 \to 0.14 \to -0.02$. This is the zig-zag pattern in action, even with the optimal step size at each step.

Comparison

Here are the function values across all three approaches:

| Iteration | $\alpha = 0.1$ | $\alpha = 0.3$ | Exact line search |
|---|---|---|---|
| 0 | 32.00 | 32.00 | 32.00 |
| 1 | 10.88 | 33.92 | 8.47 |
| 2 | 6.58 | 61.88 | 2.24 |
| 3 | 4.20 | 120.54 | 0.59 |
| 4 | 2.68 | 236.14 | 0.16 |
| 5 | 1.72 | 462.81 | 0.04 |

  • $\alpha = 0.1$: Steady convergence, but slow. The function value drops by about $18\times$ in 5 steps.
  • $\alpha = 0.3$: Divergence. The function value grows every iteration because the step size exceeds $2/L = 0.25$.
  • Exact line search: Fast convergence. The function value drops by about $800\times$ in 5 steps. Still zig-zags, but each bounce is optimal.

Python implementation

Here is a compact implementation that runs all three strategies:

import numpy as np

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

def gd_fixed(x0, alpha, n_iters):
    """Gradient descent with a fixed step size; records (point, value) each step."""
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        x = x - alpha * grad_f(x)
        path.append((x.copy(), f(x)))
    return path

def gd_exact(x0, n_iters):
    """Gradient descent with exact line search for the quadratic f = (1/2) x^T A x."""
    A = np.diag([2.0, 8.0])
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        g = grad_f(x)
        alpha = (g @ g) / (g @ A @ g)  # closed-form optimal step for quadratics
        x = x - alpha * g
        path.append((x.copy(), f(x)))
    return path

x0 = np.array([4.0, 2.0])

for label, path in [
    ("alpha=0.1", gd_fixed(x0, 0.1, 5)),
    ("alpha=0.3", gd_fixed(x0, 0.3, 5)),
    ("exact",     gd_exact(x0, 5)),
]:
    print(f"\n{label}:")
    for i, (xi, fi) in enumerate(path):
        print(f"  k={i}: x=({xi[0]:8.4f}, {xi[1]:8.4f})  f={fi:.4f}")

To visualize the trajectories on a contour plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
xg = np.linspace(-1, 5, 300)
yg = np.linspace(-3, 3, 300)
X, Y = np.meshgrid(xg, yg)
Z = X**2 + 4 * Y**2

ax.contour(X, Y, Z, levels=20, cmap="coolwarm", alpha=0.6)

for label, path, color in [
    ("Fixed 0.1", gd_fixed(x0, 0.1, 20), "blue"),
    ("Exact", gd_exact(x0, 20), "green"),
]:
    pts = np.array([p[0] for p in path])
    ax.plot(pts[:, 0], pts[:, 1], "o-", color=color, label=label, ms=4)

ax.plot(0, 0, "r*", ms=12, label="Minimum")
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Gradient descent on f(x,y) = x² + 4y²")
plt.tight_layout()
plt.show()

The blue path (fixed $\alpha = 0.1$) dives toward the $x$-axis and then crawls along it. The green path (exact line search) zig-zags visibly but reaches the minimum much faster.

Convergence curve: f(x,y) value at each iteration of gradient descent

What comes next

Gradient descent is the foundation, but it has clear limitations. Zig-zag behavior on poorly conditioned problems slows convergence, the step size requires tuning, and the convergence rate depends on the condition number $\kappa$.

Newton’s method fixes the zig-zag by using second-order information (the Hessian) to reshape the step direction. Instead of following the steepest slope, Newton’s method follows the curvature of the function directly toward the minimum. It converges much faster on smooth problems, but each step requires solving a linear system involving the Hessian.

For large-scale machine learning, computing the full gradient over the entire dataset is too expensive. Stochastic gradient descent and its variants like Adam and RMSProp approximate the gradient using small random batches of data, trading per-step accuracy for cheaper, more frequent updates.
