
Steepest descent (gradient descent)

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article assumes you are comfortable with first-order optimality conditions, partial derivatives, and gradients.

The ball rolling downhill

Picture a ball on a hilly landscape. Release it, and it rolls downhill. It does not plan a route. It does not look ahead. At every instant, it simply moves in the steepest downward direction. That is gradient descent.

The ball moves fast on steep slopes and slow on gentle ones. If it lands in a narrow valley, it bounces from wall to wall, losing energy with each bounce. If the valley is wide and round, it rolls smoothly to the bottom.

This analogy captures the essential behavior. Gradient descent follows the steepest local slope, one step at a time, hoping to reach the bottom.

Three iterations of gradient descent on $f(x, y) = x^2 + 4y^2$, starting at $(4, 2)$ with step size $0.1$:

| Iteration | Position $(x, y)$ | Gradient $(g_x, g_y)$ | Step size | New position |
|---|---|---|---|---|
| 0 | $(4.00, 2.00)$ | $(8.0, 16.0)$ | $0.1$ | $(3.20, 0.40)$ |
| 1 | $(3.20, 0.40)$ | $(6.4, 3.2)$ | $0.1$ | $(2.56, 0.08)$ |
| 2 | $(2.56, 0.08)$ | $(5.12, 0.64)$ | $0.1$ | $(2.05, 0.02)$ |

Each step moves opposite to the gradient. The $y$-component shrinks fast (steep curvature), while $x$ shrinks slowly (gentle curvature). The path dives toward the valley floor, then creeps along it.

Descent trajectory on contour lines

graph TD
  S["Start: 4, 2 <br/> f = 32"] --> A["Step 1: 3.2, 0.4 <br/> f = 10.9"]
  A --> B["Step 2: 2.56, 0.08 <br/> f = 6.6"]
  B --> C["Step 3: 2.05, 0.02 <br/> f = 4.2"]
  C --> D["... more steps ..."]
  D --> MIN["Minimum: 0, 0 <br/> f = 0"]
  style S fill:#ffcccc
  style MIN fill:#ccffcc

The path zigzags: large jumps in $y$ early on, then slow progress in $x$. This is the signature of gradient descent on elongated contours.

Gradient descent on f(x,y) = x^2 + 10y^2 starting from (3,3). The zigzag pattern is caused by the elongated contours (condition number = 10).

Now let’s formalize this intuition.

The update rule

You have a function $f: \mathbb{R}^n \to \mathbb{R}$ and you want to find its minimum. Gradient descent does this by starting at some point and repeatedly moving in the direction that decreases $f$ the fastest.

Why is that direction the negative gradient? Consider a small step from your current point $\mathbf{x}$ in a unit-length direction $\mathbf{d}$. The first-order Taylor expansion gives:

$$f(\mathbf{x} + \alpha \mathbf{d}) \approx f(\mathbf{x}) + \alpha \nabla f(\mathbf{x})^T \mathbf{d}$$

To decrease $f$ as much as possible, you want to minimize $\nabla f(\mathbf{x})^T \mathbf{d}$ subject to $\|\mathbf{d}\| = 1$. By the Cauchy-Schwarz inequality:

$$\nabla f(\mathbf{x})^T \mathbf{d} \geq -\|\nabla f(\mathbf{x})\|$$

Equality holds when $\mathbf{d} = -\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|$. So the direction of steepest decrease is exactly the negative gradient direction.
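A quick numerical check of this claim (a sketch using NumPy; the sample count and the gradient value, taken from the test function used later at the point $(4, 2)$, are arbitrary choices):

```python
import numpy as np

grad = np.array([8.0, 16.0])          # gradient of x^2 + 4y^2 at (4, 2)

# Sample many unit directions d and compute the directional derivative g^T d
rng = np.random.default_rng(0)
d = rng.normal(size=(100_000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)
slopes = d @ grad

best = d[np.argmin(slopes)]           # sampled direction of steepest decrease
steepest = -grad / np.linalg.norm(grad)

print(slopes.min())                   # close to -||grad|| = -sqrt(320) ≈ -17.889
print(best, steepest)                 # nearly identical directions
```

The sampled minimum matches the Cauchy-Schwarz bound, and the minimizing direction lines up with the normalized negative gradient.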

This gives us the gradient descent update rule:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$$

where $\alpha_k > 0$ is the step size (also called the learning rate). Start at some initial guess $\mathbf{x}_0$ and repeat until convergence.

flowchart TD
  A["Pick starting point x₀ and step size α"] --> B["Compute gradient ∇f at xₖ"]
  B --> C{"Is the gradient small enough?"}
  C -- Yes --> D["Stop: xₖ is approximately optimal"]
  C -- No --> E["Update: xₖ₊₁ = xₖ − α ∇f"]
  E --> B

The algorithm is simple. The step size $\alpha$ makes or breaks it.
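The flowchart translates directly into a few lines of Python (a minimal sketch; the tolerance and iteration cap are arbitrary choices, not from this article):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-8, max_iters=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # the "gradient small enough?" test
            break
        x = x - alpha * g
    return x

# Minimize f(x, y) = x^2 + 4y^2 from (4, 2) with alpha = 0.1
x_min = gradient_descent(lambda x: np.array([2 * x[0], 8 * x[1]]), [4.0, 2.0], 0.1)
print(x_min)   # approaches (0, 0)
```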

A test function

We will use $f(x, y) = x^2 + 4y^2$ throughout this article. It is a quadratic with elliptical contours, simple enough to compute by hand but rich enough to show all the important behaviors.

The gradient is:

$$\nabla f(x, y) = \begin{pmatrix} 2x \\ 8y \end{pmatrix}$$

The minimum is at $(0, 0)$ with $f(0, 0) = 0$. The Hessian is:

$$H = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$$

The eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 8$. Their ratio $\kappa = \lambda_{\max} / \lambda_{\min} = 4$ is the condition number. It measures how elongated the contours are: the higher the condition number, the more elongated the contours and the harder the optimization. Keep this number in mind.
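These quantities are easy to confirm numerically (a small sketch using NumPy):

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 8.0]])           # Hessian of f(x, y) = x^2 + 4y^2

eigvals = np.linalg.eigvalsh(H)      # sorted ascending for symmetric matrices
kappa = eigvals[-1] / eigvals[0]     # condition number

print(eigvals)   # [2. 8.]
print(kappa)     # 4.0
```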

Example 1: fixed step size $\alpha = 0.1$

Start at $(4, 2)$ with step size $\alpha = 0.1$. At each iteration we compute the gradient, take a step, and evaluate the function.

Iteration 0 (starting point):

$$\mathbf{x}_0 = (4,\ 2), \quad f(\mathbf{x}_0) = 16 + 16 = 32$$

Iteration 1:

$$\nabla f = (2 \cdot 4,\ 8 \cdot 2) = (8,\ 16)$$

$$\mathbf{x}_1 = (4,\ 2) - 0.1 \cdot (8,\ 16) = (3.2,\ 0.4)$$

$$f(\mathbf{x}_1) = 10.24 + 0.64 = 10.88$$

Iteration 2:

$$\nabla f = (6.4,\ 3.2)$$

$$\mathbf{x}_2 = (3.2,\ 0.4) - 0.1 \cdot (6.4,\ 3.2) = (2.56,\ 0.08)$$

$$f(\mathbf{x}_2) = 6.5536 + 0.0256 = 6.5792$$

Iteration 3:

$$\nabla f = (5.12,\ 0.64)$$

$$\mathbf{x}_3 = (2.048,\ 0.016), \quad f(\mathbf{x}_3) = 4.1953$$

Iteration 4:

$$\nabla f = (4.096,\ 0.128)$$

$$\mathbf{x}_4 = (1.6384,\ 0.0032), \quad f(\mathbf{x}_4) = 2.6844$$

Iteration 5:

$$\nabla f = (3.2768,\ 0.0256)$$

$$\mathbf{x}_5 = (1.3107,\ 0.00064), \quad f(\mathbf{x}_5) = 1.7180$$

After 5 iterations, $f$ dropped from 32 to 1.72. The $y$-component shrinks fast (multiplied by 0.2 each step) while the $x$-component shrinks slowly (multiplied by 0.8 each step). The path moves quickly toward the $x$-axis, then creeps along it toward the origin.

Why the difference? The $y$-direction has the larger eigenvalue (8), so the gradient component in $y$ is steeper. The step size $\alpha = 0.1$ nearly eliminates $y$ in a few steps but is too conservative for $x$.
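Because the Hessian is diagonal, each coordinate evolves independently: $x_{k+1} = (1 - 2\alpha)x_k$ and $y_{k+1} = (1 - 8\alpha)y_k$, so with $\alpha = 0.1$ the iterates have the closed form $x_k = 4 \cdot 0.8^k$, $y_k = 2 \cdot 0.2^k$. A quick check against the iterates above:

```python
# Closed-form iterates for Example 1 (alpha = 0.1):
# x_k = 4 * 0.8**k, y_k = 2 * 0.2**k
for k in range(6):
    print(k, 4 * 0.8**k, 2 * 0.2**k)
# k=5 gives x = 1.31072, y = 0.00064, matching iteration 5 above
```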

Step size: too small, too large, just right

The step size controls how far you move each iteration. Three things can happen:

Too small ($\alpha \ll 1/L$): Every step is safe but tiny. You converge, eventually. In Example 1, $\alpha = 0.1$ is on the conservative side. It works, but convergence is slow in the $x$-direction.

Too large ($\alpha > 2/L$): The iterates overshoot the minimum and oscillate with growing amplitude. The function value increases instead of decreasing. The method diverges.

Just right ($\alpha \approx 1/L$): The step is large enough to make good progress but small enough to stay stable. For our function, $L = 8$ (the largest eigenvalue of the Hessian), so a reliable constant step size is around $\alpha = 1/8 = 0.125$.

Learning rate effects

graph TD
  A["Step size too small"] -->|"alpha much less than 1/L"| B["Converges, but very slowly"]
  C["Step size just right"] -->|"alpha near 1/L"| D["Fast, stable convergence"]
  E["Step size too large"] -->|"alpha greater than 2/L"| F["Oscillates and diverges"]
  style B fill:#fff3cd
  style D fill:#ccffcc
  style F fill:#ffcccc

The threshold for stability is $\alpha < 2/L$. For our function, that means $\alpha < 0.25$. Anything above 0.25 will diverge.

Example 2: step size $\alpha = 0.3$ (too large)

Same function, same starting point, but now $\alpha = 0.3$. This exceeds the stability threshold of 0.25.

At each step the $x$-component is multiplied by $(1 - 0.3 \cdot 2) = 0.4$, which shrinks it. But the $y$-component is multiplied by $(1 - 0.3 \cdot 8) = -1.4$. The negative sign means it flips direction, and $|-1.4| > 1$ means it grows every step.

Iteration 1:

$$\nabla f = (8,\ 16)$$

$$\mathbf{x}_1 = (4,\ 2) - 0.3 \cdot (8,\ 16) = (1.6,\ -2.8)$$

$$f(\mathbf{x}_1) = 2.56 + 31.36 = 33.92$$

The function went up from 32 to 33.92.

Iteration 2:

$$\nabla f = (3.2,\ -22.4)$$

$$\mathbf{x}_2 = (0.64,\ 3.92), \quad f(\mathbf{x}_2) = 61.88$$

Iteration 3:

$$\nabla f = (1.28,\ 31.36)$$

$$\mathbf{x}_3 = (0.256,\ -5.488), \quad f(\mathbf{x}_3) = 120.54$$

Iteration 4:

$$\nabla f = (0.512,\ -43.904)$$

$$\mathbf{x}_4 = (0.1024,\ 7.6832), \quad f(\mathbf{x}_4) = 236.14$$

Iteration 5:

$$\nabla f = (0.2048,\ 61.466)$$

$$\mathbf{x}_5 = (0.0410,\ -10.757), \quad f(\mathbf{x}_5) = 462.81$$

The $y$-coordinate bounces wildly: $2 \to -2.8 \to 3.92 \to -5.49 \to 7.68 \to -10.76$. Each time it crosses zero and grows. Meanwhile $x$ converges fine (shrinking by a factor of 0.4 each step), but the $y$-direction blows up because the step size exceeds the stability limit for that direction.

This is the core problem: the step size that works for one direction can be disastrous for another. The bottleneck is always the direction with the largest curvature (largest eigenvalue of the Hessian).
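The per-coordinate factors make the blow-up exactly predictable: with $\alpha = 0.3$, each step multiplies $x$ by 0.4 and $y$ by $-1.4$, so $x_k = 4 \cdot 0.4^k$ and $y_k = 2 \cdot (-1.4)^k$. A quick check against the iterates of Example 2:

```python
# Closed-form iterates for Example 2 (alpha = 0.3):
# x_k = 4 * 0.4**k, y_k = 2 * (-1.4)**k
for k in range(6):
    print(k, 4 * 0.4**k, 2 * (-1.4)**k)
# k=5: x = 0.04096, y = -10.75648, matching iteration 5 of Example 2
```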

Convergence for convex functions

How fast does gradient descent converge when the step size is chosen properly?

General convex functions

A key quantity is the Lipschitz constant of the gradient, $L$. The gradient $\nabla f$ is $L$-Lipschitz continuous when:

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\| \quad \text{for all } \mathbf{x}, \mathbf{y}$$

This bounds how fast the gradient can change. For quadratics, $L$ equals the largest eigenvalue of the Hessian. For our test function, $L = 8$.

With step size $\alpha = 1/L$, gradient descent on a convex function satisfies:

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{L \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2k}$$

This is an $O(1/k)$ rate. To halve the error, you need twice as many iterations. Slow, but guaranteed.
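You can verify the bound numerically for the test function (a sketch; here $\alpha = 1/L = 0.125$, $\mathbf{x}^* = \mathbf{0}$, $f(\mathbf{x}^*) = 0$, and the 20-iteration horizon is an arbitrary choice):

```python
import numpy as np

L = 8.0
x = np.array([4.0, 2.0])
x0_dist_sq = float(x @ x)             # ||x0 - x*||^2 = 20

for k in range(1, 21):
    x = x - (1 / L) * np.array([2 * x[0], 8 * x[1]])
    f_gap = x[0]**2 + 4 * x[1]**2     # f(x_k) - f(x*)
    bound = L * x0_dist_sq / (2 * k)  # O(1/k) guarantee
    assert f_gap <= bound             # the bound holds at every iteration
    print(k, round(f_gap, 6), round(bound, 4))
```

The actual gap shrinks much faster than $80/k$ here; the $O(1/k)$ rate is a worst-case guarantee over all convex functions, not a prediction for this quadratic.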

Strongly convex functions

If $f$ is also strongly convex with parameter $\mu > 0$ (the smallest eigenvalue of the Hessian, for quadratics), convergence is much faster. With step size $\alpha = 1/L$:

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^{2k} \|\mathbf{x}_0 - \mathbf{x}^*\|^2$$

This is linear convergence: the error shrinks by a constant factor each iteration. The rate depends on the condition number $\kappa = L / \mu$. When $\kappa$ is close to 1, convergence is fast. When $\kappa$ is large, each step barely dents the error.

For our function: $\mu = 2$, $L = 8$, $\kappa = 4$. The per-step factor is $(1 - 1/4) = 0.75$, so the error drops by about 25% each iteration.

The optimal constant step size is $\alpha = 2 / (L + \mu)$, which gives a per-step factor of $(\kappa - 1) / (\kappa + 1)$. For our function: $(4 - 1)/(4 + 1) = 0.6$, a 40% reduction per step.
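A sketch comparing the two step sizes on the test function. With $\alpha = 2/(L + \mu) = 0.2$, both coordinate factors have the same magnitude, $|1 - 0.2 \cdot 2| = |1 - 0.2 \cdot 8| = 0.6$, so neither direction lags behind the other:

```python
import numpy as np

def run(alpha, n):
    """Run n steps of gradient descent on x^2 + 4y^2 from (4, 2); return f."""
    x = np.array([4.0, 2.0])
    for _ in range(n):
        x = x - alpha * np.array([2 * x[0], 8 * x[1]])
    return x[0]**2 + 4 * x[1]**2

print(run(1 / 8, 10))    # alpha = 1/L
print(run(0.2, 10))      # alpha = 2/(L + mu): noticeably smaller error
```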

Zig-zag behavior

Gradient descent has a well-known weakness on elongated contours: it zig-zags.

The gradient always points perpendicular to the contour lines. When contours are circular (condition number 1), the gradient points straight at the minimum. When contours are elongated ellipses (large condition number), the gradient mostly points across the narrow direction, not toward the minimum.

For $f(x, y) = x^2 + 4y^2$, the contours are ellipses that are twice as wide in $x$ as in $y$. The gradient at most points has a large $y$-component relative to $x$, so each step makes big progress in $y$ but then overshoots, requiring a correction back.

This creates the zig-zag pattern: the iterates bounce back and forth across the narrow valley while making slow progress along it. The worse the condition number, the worse the zig-zagging.

The eigenvalues of the Hessian tell you exactly how bad it will be. If $\lambda_{\max}$ and $\lambda_{\min}$ are far apart, the contours are highly elongated and gradient descent wastes most of its effort bouncing between the walls of the valley.

Even exact line search (choosing the best $\alpha$ at each step) does not fix zig-zagging. It still bounces; it just picks the optimal step within each bounce. Fixing the zig-zag requires second-order information, which is what Newton's method provides.

Convergence: well-conditioned vs ill-conditioned problems

graph TD
  A["Well-conditioned<br/>kappa near 1"] -->|"Circular contours"| B["Gradient points at minimum<br/>Fast, smooth convergence"]
  C["Ill-conditioned<br/>kappa much greater than 1"] -->|"Elongated ellipse contours"| D["Gradient points across valley<br/>Slow, zig-zag convergence"]
  style B fill:#ccffcc
  style D fill:#ffcccc

The condition number $\kappa = L / \mu$ determines everything. When $\kappa$ is close to 1, the contours are nearly circular and gradient descent converges quickly. When $\kappa$ is large, the contours are stretched and each step wastes effort bouncing sideways.

Line search strategies

Exact line search. At each step, choose the $\alpha$ that minimizes $f$ along the search direction. This gives the best possible step at each iteration. For quadratics, the formula is known in closed form.

Backtracking line search. Start with a large $\alpha$ and shrink it until a descent condition (such as the Armijo condition) is met. Cheaper than exact search, and usually good enough.
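Backtracking with the Armijo condition can be sketched in a few lines (the constants $c = 10^{-4}$ and shrink factor $\rho = 0.5$ are conventional choices, not taken from this article):

```python
import numpy as np

def backtracking(f, grad_f, x, alpha0=1.0, c=1e-4, rho=0.5):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    # Require f(x - alpha g) <= f(x) - c * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
        alpha *= rho
    return alpha

f = lambda x: x[0]**2 + 4 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 8 * x[1]])

x = np.array([4.0, 2.0])
for _ in range(20):
    alpha = backtracking(f, grad_f, x)
    x = x - alpha * grad_f(x)
print(x, f(x))   # approaches the minimum at the origin
```

Each call costs a handful of extra function evaluations, but no Hessian and no closed-form knowledge of $f$: this is why backtracking is the default in practice.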

Exact vs backtracking line search

graph LR
  A["Exact line search"] -->|"Minimize f along direction"| B["Optimal step each time"]
  C["Backtracking line search"] -->|"Shrink alpha until sufficient decrease"| D["Good enough step, cheaper"]
  B --> E["Expensive per step, fewer steps"]
  D --> F["Cheap per step, more steps"]
  style B fill:#ccffcc
  style D fill:#fff3cd

Instead of a fixed step size, pick the $\alpha$ that minimizes $f$ along the search direction at each step. For a quadratic $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x}$ with gradient $\mathbf{g}_k = A \mathbf{x}_k$, the optimal step size is:

$$\alpha_k^* = \frac{\mathbf{g}_k^T \mathbf{g}_k}{\mathbf{g}_k^T A \, \mathbf{g}_k}$$

Derivation. Along the line $\mathbf{x}_k - \alpha \mathbf{g}_k$, the function value is:

$$\phi(\alpha) = f(\mathbf{x}_k) - \alpha \, \mathbf{g}_k^T \mathbf{g}_k + \frac{\alpha^2}{2} \, \mathbf{g}_k^T A \, \mathbf{g}_k$$

Set $\phi'(\alpha) = 0$:

$$-\mathbf{g}_k^T \mathbf{g}_k + \alpha \, \mathbf{g}_k^T A \, \mathbf{g}_k = 0 \quad \Longrightarrow \quad \alpha^* = \frac{\mathbf{g}_k^T \mathbf{g}_k}{\mathbf{g}_k^T A \, \mathbf{g}_k}$$

For our function with $A = \operatorname{diag}(2, 8)$:

$$\alpha^* = \frac{4x^2 + 64y^2}{8x^2 + 512y^2}$$
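A sanity check of the closed form: compare $\alpha^*$ at $(4, 2)$ against a dense grid of candidate step sizes (a sketch; the grid range and resolution are arbitrary choices):

```python
import numpy as np

A = np.diag([2.0, 8.0])
f = lambda x: 0.5 * x @ A @ x           # same as x^2 + 4y^2
x = np.array([4.0, 2.0])
g = A @ x                                # gradient (8, 16)

alpha_star = (g @ g) / (g @ A @ g)       # 320 / 2176 ≈ 0.1471
alphas = np.linspace(0.0, 0.5, 10_001)
values = [f(x - a * g) for a in alphas]

print(alpha_star)                        # ≈ 0.1471
print(alphas[np.argmin(values)])         # grid minimizer agrees
```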

Starting from $(4, 2)$:

Iteration 1:

$$\mathbf{g} = (8,\ 16), \quad \mathbf{g}^T\mathbf{g} = 320, \quad \mathbf{g}^T A \, \mathbf{g} = 2176$$

$$\alpha^* = 320 / 2176 \approx 0.1471$$

$$\mathbf{x}_1 = (4 - 0.1471 \cdot 8,\ 2 - 0.1471 \cdot 16) = (2.8235,\ -0.3529)$$

$$f(\mathbf{x}_1) = 7.97 + 0.50 = 8.47$$

Iteration 2:

$$\mathbf{g} = (5.6471,\ -2.8235), \quad \alpha^* \approx 0.3125$$

$$\mathbf{x}_2 = (1.0588,\ 0.5294), \quad f(\mathbf{x}_2) = 2.24$$

Iteration 3:

$$\mathbf{g} = (2.1176,\ 4.2353), \quad \alpha^* \approx 0.1471$$

$$\mathbf{x}_3 = (0.7474,\ -0.0934), \quad f(\mathbf{x}_3) = 0.59$$

Iteration 4:

$$\mathbf{g} = (1.4948,\ -0.7474), \quad \alpha^* \approx 0.3125$$

$$\mathbf{x}_4 = (0.2803,\ 0.1401), \quad f(\mathbf{x}_4) = 0.16$$

Iteration 5:

$$\mathbf{g} = (0.5606,\ 1.1211), \quad \alpha^* \approx 0.1471$$

$$\mathbf{x}_5 = (0.1979,\ -0.0248), \quad f(\mathbf{x}_5) = 0.04$$

The step size alternates between roughly 0.15 and 0.31. The $y$-coordinate flips sign every iteration: $2 \to -0.35 \to 0.53 \to -0.09 \to 0.14 \to -0.02$. This is the zig-zag pattern in action, even with the optimal step size at each step.

Comparison

Here are the function values across all three approaches:

| Iteration | $\alpha = 0.1$ | $\alpha = 0.3$ | Exact line search |
|---|---|---|---|
| 0 | 32.00 | 32.00 | 32.00 |
| 1 | 10.88 | 33.92 | 8.47 |
| 2 | 6.58 | 61.88 | 2.24 |
| 3 | 4.20 | 120.54 | 0.59 |
| 4 | 2.68 | 236.14 | 0.16 |
| 5 | 1.72 | 462.81 | 0.04 |

  • $\alpha = 0.1$: Steady convergence, but slow. The function value drops by about $18\times$ in 5 steps.
  • $\alpha = 0.3$: Divergence. The function value grows every iteration because the step size exceeds $2/L = 0.25$.
  • Exact line search: Fast convergence. The function value drops by about $800\times$ in 5 steps. Still zig-zags, but each bounce is optimal.

Python implementation

Here is a compact implementation that runs all three strategies:

import numpy as np

def f(x):
    return x[0]**2 + 4 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 8 * x[1]])

def gd_fixed(x0, alpha, n_iters):
    """Gradient descent with a fixed step size; records (point, value) each step."""
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        x = x - alpha * grad_f(x)
        path.append((x.copy(), f(x)))
    return path

def gd_exact(x0, n_iters):
    """Gradient descent with exact line search for the quadratic f = (1/2) x^T A x."""
    A = np.diag([2.0, 8.0])
    x = x0.copy()
    path = [(x.copy(), f(x))]
    for _ in range(n_iters):
        g = grad_f(x)
        alpha = (g @ g) / (g @ A @ g)  # closed-form optimal step for quadratics
        x = x - alpha * g
        path.append((x.copy(), f(x)))
    return path

x0 = np.array([4.0, 2.0])

for label, path in [
    ("alpha=0.1", gd_fixed(x0, 0.1, 5)),
    ("alpha=0.3", gd_fixed(x0, 0.3, 5)),
    ("exact",     gd_exact(x0, 5)),
]:
    print(f"\n{label}:")
    for i, (xi, fi) in enumerate(path):
        print(f"  k={i}: x=({xi[0]:8.4f}, {xi[1]:8.4f})  f={fi:.4f}")

To visualize the trajectories on a contour plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
xg = np.linspace(-1, 5, 300)
yg = np.linspace(-3, 3, 300)
X, Y = np.meshgrid(xg, yg)
Z = X**2 + 4 * Y**2

ax.contour(X, Y, Z, levels=20, cmap="coolwarm", alpha=0.6)

for label, path, color in [
    ("Fixed 0.1", gd_fixed(x0, 0.1, 20), "blue"),
    ("Exact", gd_exact(x0, 20), "green"),
]:
    pts = np.array([p[0] for p in path])
    ax.plot(pts[:, 0], pts[:, 1], "o-", color=color, label=label, ms=4)

ax.plot(0, 0, "r*", ms=12, label="Minimum")
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Gradient descent on f(x,y) = x² + 4y²")
plt.tight_layout()
plt.show()

The blue path (fixed $\alpha = 0.1$) dives toward the $x$-axis and then crawls along it. The green path (exact line search) zig-zags visibly but reaches the minimum much faster.

Convergence curve: f(x,y) value at each iteration of gradient descent

What comes next

Gradient descent is the foundation, but it has clear limitations. Zig-zag behavior on poorly conditioned problems slows convergence, the step size requires tuning, and the convergence rate depends on the condition number $\kappa$.

Newton’s method fixes the zig-zag by using second-order information (the Hessian) to reshape the step direction. Instead of following the steepest slope, Newton’s method follows the curvature of the function directly toward the minimum. It converges much faster on smooth problems, but each step requires solving a linear system involving the Hessian.

For large-scale machine learning, computing the full gradient over the entire dataset is too expensive. Stochastic gradient descent and its variants like Adam and RMSProp approximate the gradient using small random batches of data, trading per-step accuracy for cheaper, more frequent updates.
