Line search methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Every iterative optimization algorithm follows the same pattern: pick a direction, then decide how far to go. The “how far” part is the line search problem. Get the step size wrong and your algorithm either crawls or overshoots. Line search methods give you principled ways to choose a good step size at each iteration.

Prerequisites

You should be familiar with first-order optimality conditions, particularly how the gradient indicates the direction of steepest descent.

The line search problem

Suppose you are at point $\mathbf{x}_k$ and you have chosen a descent direction $\mathbf{d}_k$ (typically $\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$ for gradient descent). The next iterate is:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$$

where $\alpha_k > 0$ is the step size (also called the learning rate). The line search problem is to choose $\alpha_k$ so that $f(\mathbf{x}_{k+1})$ is sufficiently smaller than $f(\mathbf{x}_k)$.

We define the univariate function:

$$\phi(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{d}_k)$$

Line search reduces the multidimensional optimization to a 1D problem: find a good $\alpha$ that makes $\phi(\alpha)$ small.

The ideal choice is to minimize $\phi(\alpha)$ exactly:

$$\alpha_k^* = \arg\min_{\alpha > 0} \phi(\alpha)$$

This is called exact line search. You solve a 1D optimization problem at every step.

For quadratic functions $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T\mathbf{x}$ with steepest descent ($\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$), the exact step size has a closed form:

$$\alpha_k^* = \frac{\nabla f(\mathbf{x}_k)^T \nabla f(\mathbf{x}_k)}{\nabla f(\mathbf{x}_k)^T A \, \nabla f(\mathbf{x}_k)}$$

For general functions, exact line search is expensive. You would need to solve a full 1D optimization at each iteration, which is wasteful when an approximate answer works just as well.
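For the quadratic case, the closed form is one line of NumPy. A minimal sketch (the matrix $A$, the zero vector $\mathbf{b}$, and the starting point are illustrative choices, not from any particular example):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b
A = np.array([[2.0, 0.0],
              [0.0, 8.0]])
b = np.zeros(2)

x = np.array([4.0, 2.0])
g = A @ x - b                     # gradient at x

# Exact step size for the steepest descent direction d = -g
alpha = (g @ g) / (g @ A @ g)

x_new = x - alpha * g
g_new = A @ x_new - b

print(alpha)                      # 320/2176 ≈ 0.147
print(g_new @ g)                  # ≈ 0: new gradient is orthogonal to the old direction
```

A useful sanity check: after an exact line search, the new gradient is orthogonal to the search direction, since $\phi'(\alpha^*) = 0$.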

Inexact line search: the Armijo condition

In practice, you do not need the exact minimum along the line. You just need “sufficient decrease.” The Armijo condition (also called the sufficient decrease condition) formalizes this.

The Armijo condition requires:

$$f(\mathbf{x}_k + \alpha \mathbf{d}_k) \leq f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

where $c_1 \in (0, 1)$ is a small constant, typically $c_1 = 10^{-4}$.

The right-hand side is a linear approximation of $f$ along the direction $\mathbf{d}_k$, with its slope scaled down by $c_1$. Since $\mathbf{d}_k$ is a descent direction, $\nabla f(\mathbf{x}_k)^T \mathbf{d}_k < 0$, so the right-hand side decreases as $\alpha$ increases. The Armijo condition says: the actual function value must drop at least as much as a fraction $c_1$ of the linear prediction.

graph TD
  A["Start with initial step size α"] --> B{"Does Armijo condition hold?"}
  B -->|Yes| C["Accept α"]
  B -->|No| D["Reduce α: α ← ρα"]
  D --> B

Figure: the 1D line search function $\phi(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{d}_k)$ and the Armijo sufficient decrease line. A step size is accepted where $\phi$ falls below the Armijo line.

The Wolfe conditions

The Armijo condition alone can accept step sizes that are too small. The Wolfe conditions add a second requirement to prevent this:

Condition 1 (Armijo / sufficient decrease):

$$f(\mathbf{x}_k + \alpha \mathbf{d}_k) \leq f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

Condition 2 (curvature condition):

$$\nabla f(\mathbf{x}_k + \alpha \mathbf{d}_k)^T \mathbf{d}_k \geq c_2 \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$$

where $0 < c_1 < c_2 < 1$. Typical values: $c_1 = 10^{-4}$, $c_2 = 0.9$ (for quasi-Newton methods) or $c_2 = 0.1$ (for conjugate gradient).

The curvature condition says the slope at the new point should be less negative than at the starting point (the gradient should have flattened out). This prevents accepting a step so short that you barely move.

There is also the strong Wolfe condition, which replaces the curvature condition with:

$$|\nabla f(\mathbf{x}_k + \alpha \mathbf{d}_k)^T \mathbf{d}_k| \leq c_2 \, |\nabla f(\mathbf{x}_k)^T \mathbf{d}_k|$$

This additionally prevents the step from being too long (overshooting past the minimum along the line).
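All three conditions are cheap to check numerically. A small sketch for the 1D case (the function, evaluation point, and step sizes are illustrative, and `wolfe_check` is a hypothetical helper, not a library API):

```python
def wolfe_check(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Return (armijo, curvature, strong) for a 1D line search candidate alpha."""
    slope0 = grad_f(x) * d                  # phi'(0), negative for a descent direction
    slope1 = grad_f(x + alpha * d) * d      # phi'(alpha)
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope0
    curvature = slope1 >= c2 * slope0       # slope must have flattened enough
    strong = abs(slope1) <= c2 * abs(slope0)
    return armijo, curvature, strong

f = lambda x: x**2
grad_f = lambda x: 2*x

# At x = 1 with steepest descent direction d = -2:
print(wolfe_check(f, grad_f, 1.0, -2.0, 0.5))   # (True, True, True): lands on the 1D minimum
print(wolfe_check(f, grad_f, 1.0, -2.0, 0.01))  # (True, False, False): Armijo holds, but the
                                                # step is too short and curvature rejects it
```

The second call shows exactly the failure mode the curvature condition exists for: a tiny step passes Armijo but barely moves.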

Backtracking line search

The most common inexact line search in practice is backtracking. It checks only the Armijo condition and finds a step size by starting large and shrinking:

Algorithm:

  1. Choose an initial step size $\bar{\alpha} > 0$, a shrink factor $\rho \in (0, 1)$, and an Armijo parameter $c_1 \in (0, 1)$.
  2. Set $\alpha \leftarrow \bar{\alpha}$.
  3. While $f(\mathbf{x}_k + \alpha \mathbf{d}_k) > f(\mathbf{x}_k) + c_1 \alpha \nabla f(\mathbf{x}_k)^T \mathbf{d}_k$:
    • $\alpha \leftarrow \rho \alpha$
  4. Return $\alpha$.

Typical choices: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.

This is simple, robust, and cheap: each trial requires only one function evaluation. The loop terminates in finitely many steps whenever $\mathbf{d}_k$ is a descent direction, because the Armijo condition always holds for small enough $\alpha$.

Example 1: Backtracking on a 1D quadratic

Let $f(x) = 3x^2 - 12x + 15$. We start at $x_0 = 0$ and use steepest descent with backtracking.

Setup:

  • $f'(x) = 6x - 12$
  • At $x_0 = 0$: $f(0) = 15$, $f'(0) = -12$
  • Direction: $d_0 = -f'(0) = 12$
  • Directional derivative: $f'(x_0) \, d_0 = (-12)(12) = -144$

The update is $x_1 = x_0 + \alpha \, d_0$.

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 0.3$.

Iteration 1: $\alpha = 1$

$$x_1 = 0 + 1 \cdot 12 = 12$$

$$f(12) = 3(144) - 12(12) + 15 = 432 - 144 + 15 = 303$$

Armijo threshold: $f(0) + c_1 \alpha \, f'(0) \, d_0 = 15 + 0.3 \cdot 1 \cdot (-144) = 15 - 43.2 = -28.2$

Check: $f(12) = 303 \leq -28.2$? No. ✗ Reduce $\alpha$.

Iteration 2: $\alpha = 0.5$

$$x_1 = 0 + 0.5 \cdot 12 = 6$$

$$f(6) = 108 - 72 + 15 = 51$$

$$15 + 0.3 \cdot 0.5 \cdot (-144) = 15 - 21.6 = -6.6$$

Check: $51 \leq -6.6$? No. ✗ Reduce.

Iteration 3: $\alpha = 0.25$

$$x_1 = 0 + 0.25 \cdot 12 = 3$$

$$f(3) = 27 - 36 + 15 = 6$$

$$15 + 0.3 \cdot 0.25 \cdot (-144) = 15 - 10.8 = 4.2$$

Check: $6 \leq 4.2$? No. ✗ Reduce.

Iteration 4: $\alpha = 0.125$

$$x_1 = 0 + 0.125 \cdot 12 = 1.5$$

$$f(1.5) = 3(2.25) - 18 + 15 = 6.75 - 18 + 15 = 3.75$$

$$15 + 0.3 \cdot 0.125 \cdot (-144) = 15 - 5.4 = 9.6$$

Check: $3.75 \leq 9.6$? Yes. ✓ Accept $\alpha = 0.125$.

The new point is $x_1 = 1.5$ with $f(1.5) = 3.75$, down from $f(0) = 15$.

The true minimum is at $x^* = 2$ with $f(2) = 3$. We did not land exactly there, but we made good progress. The next iteration of gradient descent would get closer.

def backtracking(f, grad_f, x, d, alpha_init=1.0, rho=0.5, c1=0.3):
    alpha = alpha_init
    fx = f(x)
    slope = grad_f(x) * d  # directional derivative (1D)
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

f = lambda x: 3*x**2 - 12*x + 15
grad_f = lambda x: 6*x - 12

x0 = 0.0
d0 = -grad_f(x0)  # = 12
alpha = backtracking(f, grad_f, x0, d0)
print(f"Step size: {alpha}")      # 0.125
print(f"New x: {x0 + alpha*d0}")  # 1.5
print(f"f(new x): {f(x0 + alpha*d0)}")  # 3.75

Example 2: Backtracking on a 2D function

Let $f(x, y) = x^2 + 4y^2$. Start at $\mathbf{x}_0 = (4, 2)$.

Setup:

  • Gradient: $\nabla f = (2x, 8y)$
  • At $(4, 2)$: $\nabla f = (8, 16)$, $f(4, 2) = 16 + 16 = 32$
  • Direction: $\mathbf{d} = -\nabla f = (-8, -16)$
  • Directional derivative: $\nabla f^T \mathbf{d} = (8)(-8) + (16)(-16) = -64 - 256 = -320$

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 0.1$.

Iteration 1: $\alpha = 1$

$$\mathbf{x}_1 = (4, 2) + 1 \cdot (-8, -16) = (-4, -14)$$

$$f(-4, -14) = 16 + 4(196) = 16 + 784 = 800$$

$$32 + 0.1 \cdot 1 \cdot (-320) = 32 - 32 = 0$$

Check: $800 \leq 0$? No. ✗

Iteration 2: $\alpha = 0.5$

$$\mathbf{x}_1 = (4, 2) + 0.5 \cdot (-8, -16) = (0, -6)$$

$$f(0, -6) = 0 + 144 = 144$$

$$32 + 0.1 \cdot 0.5 \cdot (-320) = 32 - 16 = 16$$

Check: $144 \leq 16$? No. ✗

Iteration 3: $\alpha = 0.25$

$$\mathbf{x}_1 = (4, 2) + 0.25 \cdot (-8, -16) = (2, -2)$$

$$f(2, -2) = 4 + 16 = 20$$

$$32 + 0.1 \cdot 0.25 \cdot (-320) = 32 - 8 = 24$$

Check: $20 \leq 24$? Yes. ✓ Accept $\alpha = 0.25$.

New point: $(2, -2)$ with $f = 20$, down from $32$. The minimum is at the origin with $f = 0$. The full step $\alpha = 1$ overshot badly because this function has different curvatures along $x$ and $y$: the Hessian is $\mathrm{diag}(2, 8)$, giving a condition number of $8/2 = 4$. The more ill-conditioned the problem, the more backtracking iterations it tends to need.

import numpy as np

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])

x0 = np.array([4.0, 2.0])
d = -grad_f(x0)

rho, c1 = 0.5, 0.1
alpha = 1.0
fx0 = f(x0)
slope = grad_f(x0) @ d

while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho

x1 = x0 + alpha * d
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# alpha = 0.25
# x1 = [ 2. -2.]
# f(x1) = 20.0

Example 3: Comparing step sizes on Rosenbrock’s function

The Rosenbrock function is a classic test for optimization:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

Start at $(-1, 1)$.

Setup:

  • $\frac{\partial f}{\partial x} = -2(1-x) - 400x(y - x^2)$
  • $\frac{\partial f}{\partial y} = 200(y - x^2)$
  • At $(-1, 1)$: $\frac{\partial f}{\partial x} = -2(2) - 400(-1)(1 - 1) = -4$, $\frac{\partial f}{\partial y} = 200(1 - 1) = 0$
  • $\nabla f = (-4, 0)$, $f(-1, 1) = 4 + 0 = 4$
  • $\mathbf{d} = (4, 0)$
  • $\nabla f^T \mathbf{d} = (-4)(4) = -16$

Parameters: $\bar{\alpha} = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.

Iteration 1: $\alpha = 1$

$$\mathbf{x}_1 = (-1, 1) + 1 \cdot (4, 0) = (3, 1)$$

$$f(3, 1) = (1-3)^2 + 100(1 - 9)^2 = 4 + 100(64) = 6404$$

$$4 + 10^{-4} \cdot 1 \cdot (-16) = 4 - 0.0016 = 3.9984$$

Check: $6404 \leq 3.9984$? No. ✗

Iteration 2: $\alpha = 0.5$

$$\mathbf{x}_1 = (-1 + 2, 1) = (1, 1)$$

$$f(1, 1) = 0 + 0 = 0$$

$$4 - 0.0008 = 3.9992$$

Check: $0 \leq 3.9992$? Yes. ✓

We got lucky here. With $\alpha = 0.5$, we landed exactly on the global minimum $(1, 1)$. This is a coincidence of the starting point and direction, not something that happens in general. But notice how the full step $\alpha = 1$ overshot spectacularly ($f$ went from $4$ to $6404$), while the halved step was perfect.

import numpy as np

f = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
grad_f = lambda x: np.array([
    -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
    200*(x[1] - x[0]**2)
])

x0 = np.array([-1.0, 1.0])
g = grad_f(x0)
d = -g

rho, c1 = 0.5, 1e-4
alpha = 1.0
fx0 = f(x0)
slope = g @ d

iterations = 0
while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho
    iterations += 1

x1 = x0 + alpha * d
print(f"Backtracking iterations: {iterations}")
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# Backtracking iterations: 1
# alpha = 0.5
# x1 = [1. 1.]
# f(x1) = 0.0

Choosing parameters

Backtracking is usually not very sensitive to the parameters $\rho$ and $c_1$:

| Parameter | Typical value | Role |
|---|---|---|
| $\bar{\alpha}$ | $1.0$ | Initial step size. Use $1.0$ for Newton-type methods. |
| $\rho$ | $0.5$ (range: $0.1$ to $0.8$) | Shrink factor. Smaller means more aggressive shrinkage. |
| $c_1$ | $10^{-4}$ | Armijo parameter. Very small, so almost any decrease is accepted. |
| $c_2$ | $0.9$ (quasi-Newton) or $0.1$ (CG) | Curvature parameter for the Wolfe conditions. |

For gradient descent, the initial step size $\bar{\alpha}$ matters more. Starting too large wastes function evaluations on backtracking. Starting too small means you take tiny steps. A common heuristic is to use the step size from the previous iteration as the starting point for the next one.
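That heuristic is only a few extra lines. A sketch on the quadratic from Example 2, seeding each line search with the previous accepted step divided by $\rho$ so the step can grow again (the seeding rule and the 10-iteration budget are illustrative choices, not a prescribed method):

```python
import numpy as np

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])

rho, c1 = 0.5, 0.1
x = np.array([4.0, 2.0])
alpha = 1.0

for k in range(10):
    g = grad_f(x)
    d = -g
    slope = g @ d
    alpha = alpha / rho                # warm start: let the step grow past last time's value
    while f(x + alpha * d) > f(x) + c1 * alpha * slope:
        alpha *= rho
    x = x + alpha * d

print(x)     # ≈ [0, 0], the minimizer
print(f(x))  # ≈ 0
```

The warm start avoids re-shrinking all the way down from $\bar{\alpha} = 1$ at every iteration, at the cost of one extra trial when the step cannot grow.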

When to use which method

Backtracking (Armijo only): Simple, cheap, good default. Works well with gradient descent and most algorithms. The most common choice in practice.

Wolfe conditions: Needed for quasi-Newton methods like L-BFGS, where the curvature condition ensures the Hessian approximation stays positive definite. Also important for nonlinear conjugate gradient.

Exact line search: Rarely used in practice. The computational cost of solving the 1D subproblem exactly is almost never worth it. The exception is when the 1D subproblem has a cheap closed-form solution (as with quadratics).

Connection to learning rates in deep learning

In deep learning, people typically use a fixed learning rate (or a schedule) instead of line search. Why?

  1. Cost. Line search requires multiple forward passes per step. With large datasets and models, each forward pass is expensive.
  2. Stochasticity. SGD uses mini-batch gradients, which are noisy estimates. Running a line search on a noisy function value does not make much sense.
  3. Good enough. Fixed learning rates with schedules (warmup, cosine decay) or adaptive methods (Adam, RMSProp) work well enough in practice.

That said, understanding line search is important because it is the foundation of classical optimization. When you move beyond deep learning to operations research, scientific computing, or any setting where function evaluations are cheap and gradients are exact, line search becomes essential.

Summary

| Method | Condition checked | Cost | When to use |
|---|---|---|---|
| Exact line search | $\min_\alpha \phi(\alpha)$ | Expensive | Quadratics or cheap 1D problems |
| Backtracking | Armijo | Cheap | Default choice |
| Wolfe line search | Armijo + curvature | Moderate | Quasi-Newton, conjugate gradient |

Line search answers “how far?” once you have answered “which direction?” Together, these two decisions define an iterative optimization algorithm.

What comes next

With line search in hand, we can now tackle specific optimization algorithms. The next article covers least squares, the most important problem with a closed-form solution, and the foundation of linear regression.
