Line search methods
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Every iterative optimization algorithm follows the same pattern: pick a direction, then decide how far to go. The “how far” part is the line search problem. Get the step size wrong and your algorithm either crawls or overshoots. Line search methods give you principled ways to choose a good step size at each iteration.
Prerequisites
You should be familiar with first-order optimality conditions, particularly how the gradient indicates the direction of steepest descent.
The line search problem
Suppose you are at point $x_k$ and you have chosen a descent direction $d_k$ (typically $d_k = -\nabla f(x_k)$ for gradient descent). The next iterate is:

$$x_{k+1} = x_k + \alpha_k d_k$$

where $\alpha_k > 0$ is the step size (also called the learning rate). The line search problem is to choose $\alpha_k$ so that $f(x_{k+1})$ is sufficiently smaller than $f(x_k)$.

We define the univariate function:

$$\phi(\alpha) = f(x_k + \alpha d_k)$$

Line search reduces the multidimensional optimization to a 1D problem: find a good $\alpha > 0$ that makes $\phi(\alpha)$ small.
Exact line search
The ideal choice is to minimize $\phi$ exactly:

$$\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)$$

This is called exact line search. You solve a 1D optimization problem at every step.

For quadratic functions $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ (with $A$ symmetric positive definite), the exact step size has a closed form:

$$\alpha_k = \frac{-\nabla f(x_k)^\top d_k}{d_k^\top A d_k}$$
For general functions, exact line search is expensive. You would need to solve a full 1D optimization at each iteration, which is wasteful when an approximate answer works just as well.
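As a quick sketch of the quadratic case, the closed-form step can be computed and checked numerically (the matrix $A$, vector $b$, and starting point below are illustrative choices, not from the text):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite
# (A, b, and x here are illustrative).
A = np.array([[2.0, 0.0], [0.0, 8.0]])
b = np.array([0.0, 0.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.array([4.0, 2.0])
d = -grad(x)  # steepest descent direction

# Exact step: alpha* = -grad(x)^T d / (d^T A d), from setting phi'(alpha) = 0
alpha_star = -(grad(x) @ d) / (d @ A @ d)

# At the exact step, the directional derivative along d vanishes
print(alpha_star)
print(grad(x + alpha_star * d) @ d)  # ~0
```

Setting $\phi'(\alpha) = \nabla f(x + \alpha d)^\top d = 0$ and expanding the quadratic gives the formula directly, which is why no iteration is needed here.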
Inexact line search: the Armijo condition
In practice, you do not need the exact minimum along the line. You just need “sufficient decrease.” The Armijo condition (also called the sufficient decrease condition) formalizes this.
The Armijo condition requires:

$$f(x_k + \alpha d_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top d_k$$

where $c_1 \in (0, 1)$ is a small constant, typically $c_1 = 10^{-4}$.

The right-hand side is a linear approximation of $f$ along the direction $d_k$, scaled down by $c_1$. Since $d_k$ is a descent direction, $\nabla f(x_k)^\top d_k < 0$, so the right-hand side decreases as $\alpha$ increases. The Armijo condition says: the actual function value must drop at least as much as a fraction $c_1$ of the linear prediction.
```mermaid
graph TD
    A["Start with initial step size α"] --> B{"Does Armijo condition hold?"}
    B -->|Yes| C["Accept α"]
    B -->|No| D["Reduce α: α ← ρα"]
    D --> B
```
*Figure: the 1D line search function $\phi(\alpha) = f(x_k + \alpha d_k)$ and the Armijo sufficient decrease line. The step size is accepted where $\phi$ falls below the Armijo line.*
The Wolfe conditions
The Armijo condition alone can accept step sizes that are too small. The Wolfe conditions add a second requirement to prevent this:
Condition 1 (Armijo / sufficient decrease):

$$f(x_k + \alpha d_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top d_k$$

Condition 2 (curvature condition):

$$\nabla f(x_k + \alpha d_k)^\top d_k \ge c_2 \nabla f(x_k)^\top d_k$$

where $0 < c_1 < c_2 < 1$. Typical values: $c_1 = 10^{-4}$, $c_2 = 0.9$ (for quasi-Newton methods) or $c_2 = 0.1$ (for conjugate gradient).
The curvature condition says the slope at the new point should be less negative than at the starting point (the gradient should have flattened out). This prevents accepting a step so short that you barely move.
There is also the strong Wolfe condition, which replaces the curvature condition with:

$$\left| \nabla f(x_k + \alpha d_k)^\top d_k \right| \le c_2 \left| \nabla f(x_k)^\top d_k \right|$$
This additionally prevents the step from being too long (overshooting past the minimum along the line).
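To make the two conditions concrete, here is a minimal checker. The test function and trial step sizes are illustrative (reusing the quadratic $f(x) = x_1^2 + 4x_2^2$ that appears later in this article), not a production line search:

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step size alpha.
    Returns (armijo_ok, curvature_ok)."""
    slope0 = grad_f(x) @ d          # phi'(0), negative for a descent direction
    x_new = x + alpha * d
    armijo = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = grad_f(x_new) @ d >= c2 * slope0
    return bool(armijo), bool(curvature)

# Quadratic test function, starting point, and steepest descent direction
f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])
x = np.array([4.0, 2.0])
d = -grad_f(x)

for alpha in [1.0, 0.25, 0.01]:
    print(alpha, wolfe_conditions(f, grad_f, x, d, alpha))
# 1.0  -> (False, True): too long, fails sufficient decrease
# 0.25 -> (True, True):  both conditions hold
# 0.01 -> (True, False): too short, slope still steeply negative
```

The third case shows exactly what the curvature condition is for: the tiny step decreases $f$ (so Armijo accepts it), but the slope along $d$ has barely changed, signaling that a longer step would do better.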
Backtracking line search
The most common inexact line search in practice is backtracking. It only checks the Armijo condition and finds a step size by starting large and shrinking:
Algorithm:
- Choose initial step size $\bar\alpha > 0$, shrink factor $\rho \in (0, 1)$, and Armijo parameter $c_1 \in (0, 1)$.
- Set $\alpha \leftarrow \bar\alpha$.
- While $f(x_k + \alpha d_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^\top d_k$: set $\alpha \leftarrow \rho \alpha$.
- Return $\alpha$.

Typical choices: $\bar\alpha = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.
This is simple, robust, and cheap. Each backtracking step requires one function evaluation. The loop terminates in finitely many steps as long as $d_k$ is a descent direction, because for small enough $\alpha$ the Armijo condition always holds.
Example 1: Backtracking on a 1D quadratic
Let $f(x) = 3x^2 - 12x + 15$. We start at $x_0 = 0$ and use steepest descent with backtracking.

Setup:
- At $x_0 = 0$: $f(0) = 15$, $f'(0) = -12$
- Direction: $d = -f'(x_0) = 12$
- Directional derivative: $f'(x_0) \, d = (-12)(12) = -144$

For 1D, the update is $x_{k+1} = x_k + \alpha d$.

Parameters: $\bar\alpha = 1$, $\rho = 0.5$, $c_1 = 0.3$ (larger than the typical $10^{-4}$, so the backtracking loop runs for several iterations).

Iteration 1: $\alpha = 1$, candidate $x = 0 + 1 \cdot 12 = 12$, $f(12) = 303$.
Check: $303 \le 15 + 0.3 \cdot 1 \cdot (-144) = -28.2$? No. ✗ Reduce $\alpha$.

Iteration 2: $\alpha = 0.5$, candidate $x = 6$, $f(6) = 51$.
Check: $51 \le 15 + 0.3 \cdot 0.5 \cdot (-144) = -6.6$? No. ✗ Reduce.

Iteration 3: $\alpha = 0.25$, candidate $x = 3$, $f(3) = 6$.
Check: $6 \le 15 + 0.3 \cdot 0.25 \cdot (-144) = 4.2$? No. ✗ Reduce.

Iteration 4: $\alpha = 0.125$, candidate $x = 1.5$, $f(1.5) = 3.75$.
Check: $3.75 \le 15 + 0.3 \cdot 0.125 \cdot (-144) = 9.6$? Yes. ✓ Accept $\alpha = 0.125$.

The new point is $x_1 = 1.5$ with $f(x_1) = 3.75$, down from $f(x_0) = 15$.

The true minimum is at $x^* = 2$ with $f(2) = 3$. We did not land exactly there, but we made good progress. The next iteration of gradient descent would get closer.
```python
def backtracking(f, grad_f, x, d, alpha_init=1.0, rho=0.5, c1=0.3):
    alpha = alpha_init
    fx = f(x)
    slope = grad_f(x) * d  # directional derivative (1D)
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

f = lambda x: 3*x**2 - 12*x + 15
grad_f = lambda x: 6*x - 12
x0 = 0.0
d0 = -grad_f(x0)  # = 12
alpha = backtracking(f, grad_f, x0, d0)
print(f"Step size: {alpha}")            # 0.125
print(f"New x: {x0 + alpha*d0}")        # 1.5
print(f"f(new x): {f(x0 + alpha*d0)}")  # 3.75
```
Example 2: Backtracking on a 2D function
Let $f(x) = x_1^2 + 4x_2^2$. Start at $x_0 = (4, 2)$.

Setup:
- Gradient: $\nabla f(x) = (2x_1, 8x_2)$
- At $x_0$: $f(x_0) = 32$, $\nabla f(x_0) = (8, 16)$
- Direction: $d = -(8, 16) = (-8, -16)$
- Directional derivative: $\nabla f(x_0)^\top d = -(8^2 + 16^2) = -320$

Parameters: $\bar\alpha = 1$, $\rho = 0.5$, $c_1 = 0.1$.

Iteration 1: $\alpha = 1$, candidate $x = (-4, -14)$, $f = 800$.
Check: $800 \le 32 + 0.1 \cdot 1 \cdot (-320) = 0$? No. ✗

Iteration 2: $\alpha = 0.5$, candidate $x = (0, -6)$, $f = 144$.
Check: $144 \le 32 + 0.1 \cdot 0.5 \cdot (-320) = 16$? No. ✗

Iteration 3: $\alpha = 0.25$, candidate $x = (2, -2)$, $f = 20$.
Check: $20 \le 32 + 0.1 \cdot 0.25 \cdot (-320) = 24$? Yes. ✓ Accept $\alpha = 0.25$.

New point: $x_1 = (2, -2)$ with $f(x_1) = 20$, down from $32$. The minimum is at the origin with $f = 0$. The full step $\alpha = 1$ overshot badly because this function has different curvatures along $x_1$ and $x_2$. The Hessian is $\operatorname{diag}(2, 8)$, giving a condition number of $4$. Ill-conditioned problems need more backtracking iterations.
```python
import numpy as np

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])

x0 = np.array([4.0, 2.0])
d = -grad_f(x0)
rho, c1 = 0.5, 0.1

alpha = 1.0
fx0 = f(x0)
slope = grad_f(x0) @ d
while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho

x1 = x0 + alpha * d
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# alpha = 0.25
# x1 = [ 2. -2.]
# f(x1) = 20.0
```
Example 3: Comparing step sizes on Rosenbrock’s function
The Rosenbrock function is a classic test for optimization:

$$f(x) = (1 - x_1)^2 + 100 (x_2 - x_1^2)^2$$

Start at $x_0 = (-1, 1)$.

Setup:
- At $x_0$: $f(x_0) = 4$, $\nabla f(x_0) = (-4, 0)$
- $d = (4, 0)$, $\nabla f(x_0)^\top d = -16$

Parameters: $\bar\alpha = 1$, $\rho = 0.5$, $c_1 = 10^{-4}$.

Iteration 1: $\alpha = 1$, candidate $x = (3, 1)$, $f = 6404$.
Check: $6404 \le 4 + 10^{-4} \cdot 1 \cdot (-16) = 3.9984$? No. ✗

Iteration 2: $\alpha = 0.5$, candidate $x = (1, 1)$, $f = 0$.
Check: $0 \le 4 + 10^{-4} \cdot 0.5 \cdot (-16) = 3.9992$? Yes. ✓

We got lucky here. With $\alpha = 0.5$, we landed exactly on the global minimum $(1, 1)$. This is a coincidence of the starting point and direction, not something that happens in general. But notice how the full step overshot spectacularly ($f$ went from $4$ to $6404$), while the halved step was perfect.
```python
import numpy as np

f = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
grad_f = lambda x: np.array([
    -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
    200*(x[1] - x[0]**2)
])

x0 = np.array([-1.0, 1.0])
g = grad_f(x0)
d = -g
rho, c1 = 0.5, 1e-4

alpha = 1.0
fx0 = f(x0)
slope = g @ d
iterations = 0
while f(x0 + alpha * d) > fx0 + c1 * alpha * slope:
    alpha *= rho
    iterations += 1

x1 = x0 + alpha * d
print(f"Backtracking iterations: {iterations}")
print(f"alpha = {alpha}")
print(f"x1 = {x1}")
print(f"f(x1) = {f(x1)}")
# Backtracking iterations: 1
# alpha = 0.5
# x1 = [1. 1.]
# f(x1) = 0.0
```
Choosing parameters
The backtracking parameters $\rho$ and $c_1$ are usually not sensitive:

| Parameter | Typical value | Role |
|---|---|---|
| $\bar\alpha$ | $1.0$ | Initial step size. Use 1.0 for Newton-type methods. |
| $\rho$ | $0.5$ (range: $0.1$ to $0.8$) | Shrink factor. Smaller means more aggressive shrinkage. |
| $c_1$ | $10^{-4}$ | Armijo parameter. Very small, so almost any decrease is accepted. |
| $c_2$ | $0.9$ (quasi-Newton) or $0.1$ (CG) | Curvature parameter for Wolfe conditions. |
For gradient descent, the initial step size matters more. Starting too large wastes function evaluations on backtracking. Starting too small means you take tiny steps. A common heuristic is to use the step size from the previous iteration as the starting point for the next one.
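One way this warm-start heuristic might look in code. This is a sketch: the doubling factor and the cap at 1.0 are illustrative choices, not a standard, and the test function reuses the ill-conditioned quadratic from Example 2:

```python
import numpy as np

def gradient_descent_warmstart(f, grad_f, x0, iters=50, rho=0.5, c1=1e-4):
    """Gradient descent with backtracking, where each line search starts
    from a slightly enlarged copy of the previously accepted step size."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0
    for _ in range(iters):
        g = grad_f(x)
        d = -g
        slope = g @ d
        alpha = min(1.0, 2.0 * alpha)  # warm start: grow last step, cap at 1
        while f(x + alpha * d) > f(x) + c1 * alpha * slope:
            alpha *= rho
        x = x + alpha * d
    return x

f = lambda x: x[0]**2 + 4*x[1]**2
grad_f = lambda x: np.array([2*x[0], 8*x[1]])
x_star = gradient_descent_warmstart(f, grad_f, [4.0, 2.0])
print(x_star)  # converges to the minimizer at the origin
```

Growing the previous step before backtracking means the search can both recover from overly small steps and avoid restarting from $\bar\alpha = 1$ every iteration.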
When to use which method
Backtracking (Armijo only): Simple, cheap, good default. Works well with gradient descent and most algorithms. The most common choice in practice.
Wolfe conditions: Needed for quasi-Newton methods like L-BFGS, where the curvature condition ensures the Hessian approximation stays positive definite. Also important for nonlinear conjugate gradient.
Exact line search: Rarely used in practice. The computational cost of solving the 1D subproblem exactly is almost never worth it. The exception is when the 1D subproblem has a cheap closed-form solution (as with quadratics).
Connection to learning rates in deep learning
In deep learning, people typically use a fixed learning rate (or a schedule) instead of line search. Why?
- Cost. Line search requires multiple forward passes per step. With large datasets and models, each forward pass is expensive.
- Stochasticity. SGD uses mini-batch gradients, which are noisy estimates. Running a line search on a noisy function value does not make much sense.
- Good enough. Fixed learning rates with schedules (warmup, cosine decay) or adaptive methods (Adam, RMSProp) work well enough in practice.
That said, understanding line search is important because it is the foundation of classical optimization. When you move beyond deep learning to operations research, scientific computing, or any setting where function evaluations are cheap and gradients are exact, line search becomes essential.
Summary
| Method | Condition checked | Cost | When to use |
|---|---|---|---|
| Exact line search | $\min_\alpha \phi(\alpha)$ | Expensive | Quadratics or cheap 1D problems |
| Backtracking | Armijo | Cheap | Default choice |
| Wolfe line search | Armijo + curvature | Moderate | Quasi-Newton, conjugate gradient |
Line search answers “how far?” once you have answered “which direction?” Together, these two decisions define an iterative optimization algorithm.
What comes next
With line search in hand, we can now tackle specific optimization algorithms. The next article covers least squares, the most important problem with a closed-form solution, and the foundation of linear regression.