Line search methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Every iterative optimization algorithm follows the same pattern: pick a direction, then decide how far to go. The “how far” part is the line search problem. Get the step size wrong and your algorithm either crawls or overshoots. Line search methods give you principled ways to choose a good step size at each iteration.

Prerequisites

You should be familiar with first-order optimality conditions, particularly how the gradient indicates the direction of steepest descent.

The line search problem

Suppose you are at point $\mathbf{x}_k$ and you have chosen a descent direction $\mathbf{d}_k$ (typically $\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$ for gradient descent). The next iterate is:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$$

where $\alpha_k > 0$ is the step size (also called the learning rate). The line search problem is to choose $\alpha_k$ so that $f(\mathbf{x}_{k+1})$ is sufficiently smaller than $f(\mathbf{x}_k)$.

We define the univariate function:

$$\phi(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{d}_k)$$

Line search reduces the multidimensional optimization to a 1D problem: find a good $\alpha$ that makes $\phi(\alpha)$ small.

The ideal choice is to minimize $\phi(\alpha)$ exactly:

$$\alpha_k^* = \arg\min_{\alpha > 0} \phi(\alpha)$$

This is called exact line search. You solve a 1D optimization problem at every step.

For quadratic functions $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T\mathbf{x}$ with steepest descent ($\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$), the exact step size has a closed form:

$$\alpha_k^* = \frac{\nabla f(\mathbf{x}_k)^T \nabla f(\mathbf{x}_k)}{\nabla f(\mathbf{x}_k)^T A \, \nabla f(\mathbf{x}_k)}$$

For general functions, exact line search is expensive. You would need to solve a full 1D optimization at each iteration, which is wasteful when an approximate answer works just as well.
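For the quadratic case, the closed form is one line of NumPy. A minimal sketch (the matrix $A$, the zero vector $\mathbf{b}$, and the starting point are illustrative choices, not from any particular example):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b
A = np.array([[2.0, 0.0],
              [0.0, 8.0]])
b = np.zeros(2)

x = np.array([4.0, 2.0])
g = A @ x - b                     # gradient at x

# Exact step size for the steepest descent direction d = -g
alpha = (g @ g) / (g @ A @ g)

x_new = x - alpha * g
g_new = A @ x_new - b

print(alpha)                      # 320/2176 ≈ 0.147
print(g_new @ g)                  # ≈ 0: new gradient is orthogonal to the old direction
```

A useful sanity check: after an exact line search, the new gradient is orthogonal to the search direction, since $\phi'(\alpha^*) = 0$.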

Inexact line search: the Armijo condition

In practice, you do not need the exact minimum along the line. You just need “sufficient decrease.” The Armijo condition (also called the sufficient decrease condition) formalizes this.

The Armijo condition requires:

$$f(\mathbf{x}_k + \alpha \mathbf{d}_k) \leq f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

where $c_1 \in (0, 1)$ is a small constant, typically $c_1 = 10^{-4}$.

The right-hand side is a linear approximation of $f$ along the direction $\mathbf{d}_k$, with its slope scaled down by $c_1$. Since $\mathbf{d}_k$ is a descent direction, $\nabla f(\mathbf{x}_k)^T \mathbf{d}_k < 0$, so the right-hand side decreases as $\alpha$ increases. The Armijo condition says: the actual function value must drop at least as much as a fraction $c_1$ of the linear prediction.

graph TD
  A["Start with initial step size α"] --> B{"Does Armijo condition hold?"}
  B -->|Yes| C["Accept α"]
  B -->|No| D["Reduce α: α ← ρα"]
  D --> B

Figure: the 1D line search function $\phi(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{d}_k)$ and the Armijo sufficient decrease line. A step size is accepted where $\phi$ falls below the Armijo line.

The Wolfe conditions

The Armijo condition alone can accept step sizes that are too small. The Wolfe conditions add a second requirement to prevent this:

Condition 1 (Armijo / sufficient decrease):

$$f(\mathbf{x}_k + \alpha \mathbf{d}_k) \leq f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

Condition 2 (curvature condition):

$$\nabla f(\mathbf{x}_k + \alpha \mathbf{d}_k)^T \mathbf{d}_k \geq c_2 \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

where $0 < c_1 < c_2 < 1$. Typical values: $c_1 = 10^{-4}$, $c_2 = 0.9$ (for quasi-Newton methods) or $c_2 = 0.1$ (for conjugate gradient).

The curvature condition says the slope at the new point should be less negative than at the starting point (the gradient should have flattened out). This prevents accepting a step so short that you barely move.

There is also the strong Wolfe condition, which replaces the curvature condition with:

$$|\nabla f(\mathbf{x}_k + \alpha \mathbf{d}_k)^T \mathbf{d}_k| \leq c_2 \, |\nabla f(\mathbf{x}_k)^T \mathbf{d}_k|$$

This additionally prevents the step from being too long (overshooting past the minimum along the line).
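All three conditions are cheap to check numerically. A small sketch for the 1D case (the function, evaluation point, and step sizes are illustrative, and `wolfe_check` is a hypothetical helper, not a library API):

```python
def wolfe_check(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Return (armijo, curvature, strong) for a 1D line search candidate alpha."""
    slope0 = grad_f(x) * d                  # phi'(0), negative for a descent direction
    slope1 = grad_f(x + alpha * d) * d      # phi'(alpha)
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope0
    curvature = slope1 >= c2 * slope0       # slope must have flattened enough
    strong = abs(slope1) <= c2 * abs(slope0)
    return armijo, curvature, strong

f = lambda x: x**2
grad_f = lambda x: 2*x

# At x = 1 with steepest descent direction d = -2:
print(wolfe_check(f, grad_f, 1.0, -2.0, 0.5))   # (True, True, True): lands on the 1D minimum
print(wolfe_check(f, grad_f, 1.0, -2.0, 0.01))  # (True, False, False): Armijo holds, but the
                                                # step is too short and curvature rejects it
```

The second call shows exactly the failure mode the curvature condition exists for: a tiny step passes Armijo but barely moves.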

Backtracking line search

The most common inexact line search in practice is backtracking. It checks only the Armijo condition and finds a step size by starting large and shrinking:

Algorithm:

  1. Choose an initial step size $\bar{\alpha} > 0$, a shrink factor $\rho \in (0, 1)$, and an Armijo parameter $c_1 \in (0, 1)$.
  2. Set $\alpha \leftarrow \bar{\alpha}$.
  3. While $f(\mathbf{x}_k + \alpha \mathbf{d}_k) > f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$:
    • $\alpha \leftarrow \rho \alpha$
  4. Return $\alpha$.

Typical choices: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.

This is simple, robust, and cheap: each trial requires only one function evaluation. The loop terminates in finitely many steps whenever $\mathbf{d}_k$ is a descent direction, because the Armijo condition always holds for small enough $\alpha$.

Example 1: Backtracking on a 1D quadratic

Let $f(x) = 3x^2 - 12x + 15$. We start at $x_0 = 0$ and use steepest descent with backtracking.

Setup:

  • $f'(x) = 6x - 12$
  • At $x_0 = 0$: $f(0) = 15$, $f'(0) = -12$
  • Direction: $d_0 = -f'(0) = 12$
  • Directional derivative: $f'(x_0) \, d_0 = (-12)(12) = -144$

The update is $x_1 = x_0 + \alpha \, d_0$.

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 0.3$.

Iteration 1: $\alpha = 1$

$$x_1 = 0 + 1 \cdot 12 = 12$$

$$f(12) = 3(144) - 12(12) + 15 = 432 - 144 + 15 = 303$$

Armijo threshold: $f(0) + c_1 \alpha \, f'(0) \, d_0 = 15 + 0.3 \cdot 1 \cdot (-144) = 15 - 43.2 = -28.2$

Check: $f(12) = 303 \leq -28.2$? No. ✗ Reduce $\alpha$.

Iteration 2: $\alpha = 0.5$

$$x_1 = 0 + 0.5 \cdot 12 = 6$$

$$f(6) = 108 - 72 + 15 = 51$$

$$15 + 0.3 \cdot 0.5 \cdot (-144) = 15 - 21.6 = -6.6$$

Check: $51 \leq -6.6$? No. ✗ Reduce.

Iteration 3: $\alpha = 0.25$

$$x_1 = 0 + 0.25 \cdot 12 = 3$$

$$f(3) = 27 - 36 + 15 = 6$$

$$15 + 0.3 \cdot 0.25 \cdot (-144) = 15 - 10.8 = 4.2$$

Check: $6 \leq 4.2$? No. ✗ Reduce.

Iteration 4: $\alpha = 0.125$

$$x_1 = 0 + 0.125 \cdot 12 = 1.5$$

$$f(1.5) = 3(2.25) - 18 + 15 = 6.75 - 18 + 15 = 3.75$$

$$15 + 0.3 \cdot 0.125 \cdot (-144) = 15 - 5.4 = 9.6$$

Check: $3.75 \leq 9.6$? Yes. ✓ Accept $\alpha = 0.125$.

The new point is $x_1 = 1.5$ with $f(1.5) = 3.75$, down from $f(0) = 15$.

The true minimum is at $x^* = 2$ with $f(2) = 3$. We did not land exactly there, but we made good progress. The next iteration of gradient descent would get closer.

def backtracking(f, grad_f, x, d, alpha_init=1.0, rho=0.5, c1=0.3):
    alpha = alpha_init
    fx = f(x)
    slope = grad_f(x) * d  # directional derivative (1D)
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

f = lambda x: 3*x**2 - 12*x + 15
grad_f = lambda x: 6*x - 12

x0 = 0.0
d0 = -grad_f(x0)  # = 12
alpha = backtracking(f, grad_f, x0, d0)
print(f"Step size: {alpha}")      # 0.125
print(f"New x: {x0 + alpha*d0}")  # 1.5
print(f"f(new x): {f(x0 + alpha*d0)}")  # 3.75

Example 2: Backtracking on a 2D function

Let $f(x, y) = x^2 + 4y^2$. Start at $\mathbf{x}_0 = (4, 2)$.

Setup:

  • Gradient: $\nabla f = (2x, 8y)$
  • At $(4, 2)$: $\nabla f = (8, 16)$, $f(4, 2) = 16 + 16 = 32$
  • Direction: $\mathbf{d} = -\nabla f = (-8, -16)$
  • Directional derivative: $\nabla f^T \mathbf{d} = (8)(-8) + (16)(-16) = -64 - 256 = -320$

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 0.1$.

Iteration 1: $\alpha = 1$

$$\mathbf{x}_1 = (4, 2) + 1 \cdot (-8, -16) = (-4, -14)$$

$$f(-4, -14) = 16 + 4(196) = 16 + 784 = 800$$

$$32 + 0.1 \cdot 1 \cdot (-320) = 32 - 32 = 0$$

Check: $800 \leq 0$? No. ✗

Iteration 2: $\alpha = 0.5$

$$\mathbf{x}_1 = (4, 2) + 0.5 \cdot (-8, -16) = (0, -6)$$

$$f(0, -6) = 0 + 144 = 144$$

$$32 + 0.1 \cdot 0.5 \cdot (-320) = 32 - 16 = 16$$

Check: $144 \leq 16$? No. ✗

Iteration 3: $\alpha = 0.25$

$$\mathbf{x}_1 = (4, 2) + 0.25 \cdot (-8, -16) = (2, -2)$$

$$f(2, -2) = 4 + 16 = 20$$

$$32 + 0.1 \cdot 0.25 \cdot (-320) = 32 - 8 = 24$$

Check: $20 \leq 24$? Yes. ✓ Accept $\alpha = 0.25$.

New point: $(2, -2)$ with $f = 20$, down from $32$. The minimum is at the origin with $f = 0$. The full step $\alpha = 1$ overshot badly because this function has different curvatures along $x$ and $y$: the Hessian is $\mathrm{diag}(2, 8)$, giving a condition number of $8/2 = 4$. The more ill-conditioned the problem, the more backtracking iterations it tends to need.

import numpy as np

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])

x0 = np.array([4.0, 2.0])
d = -grad_f(x0)

rho, c1 = 0.5, 0.1
alpha = 1.0
fx0 = f(x0)
slope = grad_f(x0) @ d

while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho

x1 = x0 + alpha * d
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# alpha = 0.25
# x1 = [ 2. -2.]
# f(x1) = 20.0

Example 3: Comparing step sizes on Rosenbrock’s function

The Rosenbrock function is a classic test for optimization:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

Start at $(-1, 1)$.

Setup:

  • $\frac{\partial f}{\partial x} = -2(1-x) - 400x(y - x^2)$
  • $\frac{\partial f}{\partial y} = 200(y - x^2)$
  • At $(-1, 1)$: $\frac{\partial f}{\partial x} = -2(2) - 400(-1)(1 - 1) = -4$, $\frac{\partial f}{\partial y} = 200(1 - 1) = 0$
  • $\nabla f = (-4, 0)$, $f(-1, 1) = 4 + 0 = 4$
  • $\mathbf{d} = (4, 0)$
  • $\nabla f^T \mathbf{d} = (-4)(4) = -16$

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.

Iteration 1: $\alpha = 1$

$$\mathbf{x}_1 = (-1, 1) + 1 \cdot (4, 0) = (3, 1)$$

$$f(3, 1) = (1-3)^2 + 100(1 - 9)^2 = 4 + 100(64) = 6404$$

$$4 + 10^{-4} \cdot 1 \cdot (-16) = 4 - 0.0016 = 3.9984$$

Check: $6404 \leq 3.9984$? No. ✗

Iteration 2: $\alpha = 0.5$

$$\mathbf{x}_1 = (-1 + 2, 1) = (1, 1)$$

$$f(1, 1) = 0 + 0 = 0$$

$$4 - 0.0008 = 3.9992$$

Check: $0 \leq 3.9992$? Yes. ✓

We got lucky here. With $\alpha = 0.5$, we landed exactly on the global minimum $(1, 1)$. This is a coincidence of the starting point and direction, not something that happens in general. But notice how the full step $\alpha = 1$ overshot spectacularly ($f$ went from $4$ to $6404$), while the halved step was perfect.

import numpy as np

f = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
grad_f = lambda x: np.array([
    -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
    200*(x[1] - x[0]**2)
])

x0 = np.array([-1.0, 1.0])
g = grad_f(x0)
d = -g

rho, c1 = 0.5, 1e-4
alpha = 1.0
fx0 = f(x0)
slope = g @ d

iterations = 0
while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho
    iterations += 1

x1 = x0 + alpha * d
print(f"Backtracking iterations: {iterations}")
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# Backtracking iterations: 1
# alpha = 0.5
# x1 = [1. 1.]
# f(x1) = 0.0

Choosing parameters

Backtracking is usually not very sensitive to the parameters $\rho$ and $c_1$:

| Parameter | Typical value | Role |
|---|---|---|
| $\bar{\alpha}$ | $1.0$ | Initial step size. Use $1.0$ for Newton-type methods. |
| $\rho$ | $0.5$ (range: $0.1$ to $0.8$) | Shrink factor. Smaller means more aggressive shrinkage. |
| $c_1$ | $10^{-4}$ | Armijo parameter. Very small, so almost any decrease is accepted. |
| $c_2$ | $0.9$ (quasi-Newton) or $0.1$ (CG) | Curvature parameter for the Wolfe conditions. |

For gradient descent, the initial step size $\bar{\alpha}$ matters more. Starting too large wastes function evaluations on backtracking. Starting too small means you take tiny steps. A common heuristic is to use the step size from the previous iteration as the starting point for the next one.
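That heuristic is only a few extra lines. A sketch on the quadratic from Example 2, seeding each line search with the previous accepted step divided by $\rho$ so the step can grow again (the seeding rule and the 10-iteration budget are illustrative choices, not a prescribed method):

```python
import numpy as np

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])

rho, c1 = 0.5, 0.1
x = np.array([4.0, 2.0])
alpha = 1.0

for k in range(10):
    g = grad_f(x)
    d = -g
    slope = g @ d
    alpha = alpha / rho                # warm start: let the step grow past last time's value
    while f(x + alpha * d) > f(x) + c1 * alpha * slope:
        alpha *= rho
    x = x + alpha * d

print(x)     # ≈ [0, 0], the minimizer
print(f(x))  # ≈ 0
```

The warm start avoids re-shrinking all the way down from $\bar{\alpha} = 1$ at every iteration, at the cost of one extra trial when the step cannot grow.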

When to use which method

Backtracking (Armijo only): Simple, cheap, good default. Works well with gradient descent and most algorithms. The most common choice in practice.

Wolfe conditions: Needed for quasi-Newton methods like L-BFGS, where the curvature condition ensures the Hessian approximation stays positive definite. Also important for nonlinear conjugate gradient.

Exact line search: Rarely used in practice. The computational cost of solving the 1D subproblem exactly is almost never worth it. The exception is when the 1D subproblem has a cheap closed-form solution (as with quadratics).

Connection to learning rates in deep learning

In deep learning, people typically use a fixed learning rate (or a schedule) instead of line search. Why?

  1. Cost. Line search requires multiple forward passes per step. With large datasets and models, each forward pass is expensive.
  2. Stochasticity. SGD uses mini-batch gradients, which are noisy estimates. Running a line search on a noisy function value does not make much sense.
  3. Good enough. Fixed learning rates with schedules (warmup, cosine decay) or adaptive methods (Adam, RMSProp) work well enough in practice.

That said, understanding line search is important because it is the foundation of classical optimization. When you move beyond deep learning to operations research, scientific computing, or any setting where function evaluations are cheap and gradients are exact, line search becomes essential.

Summary

| Method | Condition checked | Cost | When to use |
|---|---|---|---|
| Exact line search | $\min_\alpha \phi(\alpha)$ | Expensive | Quadratics or cheap 1D problems |
| Backtracking | Armijo | Cheap | Default choice |
| Wolfe line search | Armijo + curvature | Moderate | Quasi-Newton, conjugate gradient |

Line search answers “how far?” once you have answered “which direction?” Together, these two decisions define an iterative optimization algorithm.

What comes next

With line search in hand, we can now tackle specific optimization algorithms. The next article covers least squares, the most important problem with a closed-form solution, and the foundation of linear regression.
