Quasi-Newton methods: BFGS and L-BFGS
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites: This article builds on Newton’s method for optimization. You should be comfortable with gradients, Hessians, and how Newton’s method uses them to find minima.
The problem with Newton’s method at scale
Newton's method converges fast. On a smooth function with a good starting point, it reaches the minimum in very few iterations. The catch: each iteration requires the full Hessian matrix, which has $n^2$ entries for a problem with $n$ variables. You also need to solve a linear system involving it, which costs $O(n^3)$ operations.
If you are optimizing a model with 10 million parameters, the Hessian is a $10^7 \times 10^7$ matrix. Storing it alone, at four bytes per entry, would need roughly 400 terabytes of memory. That is not happening.
Gradient descent avoids this entirely by using only the gradient, but it converges slowly. The worse the condition number of the problem, the worse gradient descent performs.
Quasi-Newton methods sit between these two extremes. They build an approximation to the Hessian (or its inverse) using only gradient information. You never compute second derivatives. The approximation starts rough but improves at every step.
The most successful quasi-Newton method is BFGS, named after Broyden, Fletcher, Goldfarb, and Shanno, who all independently discovered it in 1970.
The secant condition
The idea behind quasi-Newton methods comes from a simple observation. After taking a step from $x_k$ to $x_{k+1}$, you know the gradient at both points. Define:

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

For a quadratic function $f(x) = \tfrac{1}{2} x^T A x - b^T x$, the true Hessian $A$ satisfies:

$$y_k = A s_k$$

This is exact for quadratics and approximately true near a minimum for general smooth functions. The secant condition says: our Hessian approximation $B_{k+1}$ should satisfy the same relationship:

$$B_{k+1} s_k = y_k$$

This gives us $n$ equations for an $n \times n$ symmetric matrix, which has $n(n+1)/2$ unknowns. The secant condition alone does not pin down $B_{k+1}$ uniquely. Different choices for the remaining degrees of freedom give different quasi-Newton methods. BFGS makes a specific choice: keep $B_{k+1}$ symmetric and positive definite, and make it the closest such matrix to $B_k$ (in a certain weighted norm) that satisfies the secant condition.
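To make the secant condition concrete, here is a quick NumPy check (the matrix $A$, vector $b$, and the two points are illustrative values chosen for this sketch) that the gradient change of a quadratic equals the Hessian times the step:

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T A x - b^T x with an illustrative symmetric A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

x_old = np.array([2.0, 0.5])
x_new = np.array([1.0, 1.5])       # an arbitrary next iterate
s = x_new - x_old                  # step s_k
y = grad(x_new) - grad(x_old)      # gradient change y_k

print(np.allclose(y, A @ s))       # True: y_k = A s_k holds exactly for a quadratic
```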
The BFGS update rule
In practice, we work with the inverse Hessian approximation $H_k \approx B_k^{-1}$ directly. This lets us compute the search direction as a simple matrix-vector product, $p_k = -H_k \nabla f(x_k)$, instead of solving a linear system.

The BFGS update for the inverse Hessian approximation is:

$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T$$

where:

$$\rho_k = \frac{1}{y_k^T s_k}, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$
Here is what each piece does:
- $\rho_k$ is a scalar that normalizes the update. It measures how much curvature information the step provided.
- The sandwich term $\left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right)$ transforms the previous approximation to be consistent with the new curvature information.
- The rank-one correction $\rho_k s_k s_k^T$ adds information along the step direction.

You typically start with $H_0 = I$ (the identity matrix). This makes the first step identical to gradient descent. After each step, the BFGS update folds in new curvature information, and $H_k$ moves closer to the true inverse Hessian.

On a quadratic function in $n$ dimensions, BFGS with exact line search recovers the exact inverse Hessian in at most $n$ steps. On general smooth functions, it builds an increasingly accurate local approximation.
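As a small sketch (the function name and signature are illustrative choices, not a reference implementation), the update fits in a few lines of NumPy:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation H,
    given step s = x_new - x_old and gradient change y = g_new - g_old."""
    rho = 1.0 / (y @ s)                      # assumes the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)             # left factor (I - rho s y^T); V.T is the right factor
    return V @ H @ V.T + rho * np.outer(s, s)
```

With a line search that enforces the Wolfe conditions, $y_k^T s_k > 0$ holds automatically, which is what keeps every $H_k$ positive definite.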
The full BFGS algorithm:
flowchart TD
A["Initialize: x₀, H₀ = I"] --> B["Compute gradient gₖ = ∇f(xₖ)"]
B --> C{"‖gₖ‖ < tolerance?"}
C -->|Yes| D["Return xₖ"]
C -->|No| E["Direction: pₖ = −Hₖ gₖ"]
E --> F["Line search: find step size αₖ"]
F --> G["Update: xₖ₊₁ = xₖ + αₖ pₖ"]
G --> H["Compute sₖ = xₖ₊₁ − xₖ, yₖ = gₖ₊₁ − gₖ"]
H --> I["BFGS update: Hₖ₊₁"]
I --> B
Example 1: One BFGS update step (2×2)
Let us perform one BFGS update by hand with small numbers (the vectors below are chosen purely to keep the arithmetic clean).

Given:
- $H_0 = I$ (start with the identity)
- $s_0 = (1, 0)^T$ (the step we took)
- $y_0 = (2, 1)^T$ (the gradient change)

Step 1: Compute $\rho_0$.

$$\rho_0 = \frac{1}{y_0^T s_0} = \frac{1}{2 \cdot 1 + 1 \cdot 0} = 0.5$$

Step 2: Compute the outer products.

$$s_0 y_0^T = \begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix}, \qquad y_0 s_0^T = \begin{pmatrix} 2 & 0 \\ 1 & 0 \end{pmatrix}, \qquad s_0 s_0^T = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$

Step 3: Build the left and right factors.

$$I - \rho_0 s_0 y_0^T = \begin{pmatrix} 0 & -0.5 \\ 0 & 1 \end{pmatrix}, \qquad I - \rho_0 y_0 s_0^T = \begin{pmatrix} 0 & 0 \\ -0.5 & 1 \end{pmatrix}$$

Step 4: Compute the sandwich product. Since $H_0 = I$, this simplifies to multiplying the two factors:

Row 1: $\left(0 \cdot 0 + (-0.5)(-0.5),\ \ 0 \cdot 0 + (-0.5)(1)\right) = (0.25,\ -0.5)$

Row 2: $\left(0 \cdot 0 + 1 \cdot (-0.5),\ \ 0 \cdot 0 + 1 \cdot 1\right) = (-0.5,\ 1)$

Step 5: Add the rank-one term.

$$\rho_0 s_0 s_0^T = \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix}$$

Step 6: Combine.

$$H_1 = \begin{pmatrix} 0.25 & -0.5 \\ -0.5 & 1 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Verification: Check that $H_1 y_0 = s_0$ (the inverse secant condition):

$$H_1 y_0 = \begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1.5 - 0.5 \\ -1 + 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = s_0$$

Starting from the identity, one BFGS step produced an inverse Hessian approximation that correctly maps the gradient change $y_0$ back to the step $s_0$.
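You can double-check the arithmetic with NumPy, using the same vectors as above:

```python
import numpy as np

H0 = np.eye(2)
s0 = np.array([1.0, 0.0])   # the step from the worked example
y0 = np.array([2.0, 1.0])   # the gradient change from the worked example

rho = 1.0 / (y0 @ s0)
V = np.eye(2) - rho * np.outer(s0, y0)
H1 = V @ H0 @ V.T + rho * np.outer(s0, s0)

print(H1)          # [[ 0.75 -0.5 ]
                   #  [-0.5   1.  ]]
print(H1 @ y0)     # [1. 0.] -- equal to s0, so the secant condition holds
```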
Example 2: How many iterations does each method need?
Consider the function:

$$f(x) = x_1^2 + 4 x_2^2$$

The minimum is at $x^* = (0, 0)$. We start from $x_0 = (4, 2)$, where $f(x_0) = 32$.

The gradient is $\nabla f(x) = (2 x_1,\ 8 x_2)$ and the Hessian is constant:

$$\nabla^2 f = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$$

The condition number is the ratio of the largest to smallest eigenvalue: $\kappa = 8/2 = 4$. The function curves 4 times more steeply in the $x_2$ direction than in $x_1$, which makes life hard for gradient descent.

Newton's method uses the exact Hessian:

$$x_0 - (\nabla^2 f)^{-1} \nabla f(x_0) = \begin{pmatrix} 4 \\ 2 \end{pmatrix} - \begin{pmatrix} 1/2 & 0 \\ 0 & 1/8 \end{pmatrix} \begin{pmatrix} 8 \\ 16 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} = x^*$$
Newton reaches the exact minimum in 1 iteration. For any pure quadratic, Newton always converges in a single step because the Hessian is constant everywhere.
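A quick NumPy check of that single Newton step (a sketch reusing the function defined above):

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 8.0]])                 # constant Hessian of f(x) = x1^2 + 4*x2^2
grad = lambda x: np.array([2 * x[0], 8 * x[1]])

x0 = np.array([4.0, 2.0])
x_new = x0 - np.linalg.solve(H, grad(x0))  # one Newton step
print(x_new)                               # [0. 0.] -- the exact minimum
```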
Steepest descent (with exact line search) zigzags toward the minimum:
| Iteration | $x_1$ | $x_2$ | $f(x)$ |
|---|---|---|---|
| 0 | 4.000 | 2.000 | 32.000 |
| 1 | 2.824 | -0.353 | 8.471 |
| 2 | 1.059 | 0.529 | 2.242 |
| 3 | 0.747 | -0.093 | 0.594 |
Notice the $x_2$ coordinate flipping sign each step. That is the classic zigzag. The worst-case convergence rate per iteration is $\left(\frac{\kappa - 1}{\kappa + 1}\right)^2 = \left(\frac{3}{5}\right)^2 = 0.36$, so in the worst case each step only removes about 64% of the remaining error. Steepest descent needs 14 iterations to converge on this problem.
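The table can be reproduced with a short script (a sketch; the step-size formula below is the exact line search for a quadratic):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 8.0]])                # Hessian of f(x) = x1^2 + 4*x2^2
f = lambda x: x[0]**2 + 4 * x[1]**2
grad = lambda x: A @ x

x = np.array([4.0, 2.0])
for k in range(4):
    print(f"{k}: x = ({x[0]:.3f}, {x[1]:.3f}), f = {f(x):.3f}")
    g = grad(x)
    alpha = (g @ g) / (g @ A @ g)         # exact line search along -g for a quadratic
    x = x - alpha * g
```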
BFGS starts with $H_0 = I$, so its first step matches steepest descent. But it updates its Hessian approximation after each step. On a 2D quadratic, BFGS with exact line search converges in 2 iterations, the same as the number of variables.
| Method | Iterations to converge | Cost per iteration | Hessian needed? |
|---|---|---|---|
| Newton | 1 | $O(n^3)$ | Yes |
| BFGS | 2 | $O(n^2)$ | No |
| Steepest descent | 14 | $O(n)$ | No |
Newton needs the fewest iterations but each one is the most expensive. Steepest descent is cheapest per iteration but needs many more. BFGS sits in the sweet spot: near-Newton convergence without computing the Hessian.
Convergence comparison: gradient descent converges slowly, Newton’s method converges in few iterations, and BFGS closely tracks Newton without computing the exact Hessian.
Example 3: Full BFGS run on a 2D quadratic
Let us trace BFGS on $f(x) = 3 x_1^2 + x_2^2$ starting from $x_0 = (3, 3)$.

The gradient is $\nabla f(x) = (6 x_1,\ 2 x_2)$. The true Hessian is $\begin{pmatrix} 6 & 0 \\ 0 & 2 \end{pmatrix}$ and the true inverse Hessian is:

$$(\nabla^2 f)^{-1} = \begin{pmatrix} 1/6 & 0 \\ 0 & 1/2 \end{pmatrix} \approx \begin{pmatrix} 0.167 & 0 \\ 0 & 0.500 \end{pmatrix}$$
We will watch the BFGS approximation converge to this matrix.
Iteration 1:
Starting values: $x_0 = (3, 3)$, $g_0 = \nabla f(x_0) = (18, 6)$, $H_0 = I$, $f(x_0) = 36$.

Search direction: $p_0 = -H_0 g_0 = (-18, -6)$.

Exact line search gives $\alpha_0 = 5/28 \approx 0.179$, which takes us to $x_1 \approx (-0.214,\ 1.929)$ with $f(x_1) \approx 3.857$.

Compute the update vectors:

$$s_0 = x_1 - x_0 \approx \begin{pmatrix} -3.214 \\ -1.071 \end{pmatrix}, \qquad y_0 = g_1 - g_0 \approx \begin{pmatrix} -19.286 \\ -2.143 \end{pmatrix}$$

After applying the BFGS formula:

$$H_1 \approx \begin{pmatrix} 0.173 & -0.061 \\ -0.061 & 1.051 \end{pmatrix}$$

The $(1,1)$ entry is 0.173, already close to the true value 0.167. But the $(2,2)$ entry is 1.051, still far from 0.500. One step captured curvature in one direction but not the other.
Iteration 2:
Using $H_1$ to compute the new direction: $p_1 = -H_1 g_1 \approx (0.459,\ -4.133)$.

Exact line search gives $\alpha_1 \approx 0.467$, which lands exactly on the minimum: $x_2 = (0, 0)$.

The BFGS approximation after this step:

$$H_2 = \begin{pmatrix} 0.167 & 0 \\ 0 & 0.500 \end{pmatrix}$$

After just 2 iterations, BFGS found the exact minimum and recovered the true inverse Hessian. This is the key property of BFGS on quadratics: in $n$ dimensions, it converges in at most $n$ steps, building up the exact inverse Hessian one direction at a time.
Summary table:
| Iteration | $x_k$ | $f(x_k)$ | $H_k$ diagonal | True diagonal |
|---|---|---|---|---|
| 0 | (3.000, 3.000) | 36.000 | (1.000, 1.000) | (0.167, 0.500) |
| 1 | (-0.214, 1.929) | 3.857 | (0.173, 1.051) | (0.167, 0.500) |
| 2 | (0.000, 0.000) | 0.000 | (0.167, 0.500) | (0.167, 0.500) |
You can verify these numbers with a few lines of Python:
import numpy as np
def f(x):
    return 3 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([6 * x[0], 2 * x[1]])

def line_search(x, p):
    """Exact line search for f(x + alpha * p) = 3*(x0+a*p0)^2 + (x1+a*p1)^2."""
    num = -(6 * x[0] * p[0] + 2 * x[1] * p[1])
    den = 6 * p[0]**2 + 2 * p[1]**2
    return num / den

x = np.array([3.0, 3.0])
H = np.eye(2)
for k in range(3):
    g = grad(x)
    print(f"Iter {k}: x=({x[0]:.3f}, {x[1]:.3f}), f={f(x):.3f}, "
          f"H diag=({H[0, 0]:.3f}, {H[1, 1]:.3f})")
    if np.linalg.norm(g) < 1e-10:
        print("Converged!")
        break
    p = -H @ g
    alpha = line_search(x, p)
    x_new = x + alpha * p
    g_new = grad(x_new)
    s = (x_new - x).reshape(-1, 1)
    y = (g_new - g).reshape(-1, 1)
    rho = 1.0 / (y.T @ s).item()
    I = np.eye(2)
    H = (I - rho * s @ y.T) @ H @ (I - rho * y @ s.T) + rho * s @ s.T
    x = x_new
L-BFGS: the memory-limited version
BFGS stores the full $n \times n$ matrix $H_k$. For a problem with $n = 10{,}000$ variables, that matrix takes 800 MB at 8 bytes per entry. For $n = 1{,}000{,}000$, it takes 8 TB. This does not scale.

L-BFGS (Limited-memory BFGS) solves this by never storing the full matrix. Instead, it keeps only the last $m$ pairs of vectors $(s_i, y_i)$, typically with $m$ between 3 and 20.

When you need the product $H_k g_k$, L-BFGS computes it using a two-loop recursion that touches only those vector pairs. Storage drops from $O(n^2)$ to $O(mn)$, and per-iteration cost drops from $O(n^2)$ to $O(mn)$.
The two-loop recursion works as follows:
- Backward loop: Starting from the most recent pair $i = k-1$ down to $i = k-m$, compute scalars $\alpha_i = \rho_i\, s_i^T q$ and update $q \leftarrow q - \alpha_i y_i$, starting from $q = \nabla f(x_k)$.
- Scaling: Set $r = \gamma_k q$, where $\gamma_k = \dfrac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$ gives a good initial scaling.
- Forward loop: Starting from the oldest pair $i = k-m$ up to $i = k-1$, compute $\beta_i = \rho_i\, y_i^T r$ and update $r \leftarrow r + (\alpha_i - \beta_i)\, s_i$.

The output is the search direction $p_k = -r = -H_k \nabla f(x_k)$, computed without ever forming the full matrix. In practice, $m = 10$ works well for most problems; going much beyond $m = 20$ rarely helps. A short code sketch of the recursion follows.
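Here is a compact sketch of the two-loop recursion (the function name, argument layout, and storage order are illustrative choices; it assumes every stored pair satisfies $y_i^T s_i > 0$):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Compute the L-BFGS search direction -H_k @ g from the stored
    (s_i, y_i) pairs (oldest first), without ever forming H_k."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []

    # Backward loop: newest pair first
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y

    # Initial scaling gamma_k = s^T y / y^T y from the most recent pair
    s_last, y_last = s_list[-1], y_list[-1]
    r = (s_last @ y_last) / (y_last @ y_last) * q

    # Forward loop: oldest pair first
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s

    return -r  # search direction p_k
```

Each stored pair contributes only a few dot products and vector updates, which is where the $O(mn)$ per-iteration cost comes from.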
BFGS vs L-BFGS:
| | BFGS | L-BFGS |
|---|---|---|
| Storage | $O(n^2)$ | $O(mn)$ |
| Per-iteration cost | $O(n^2)$ | $O(mn)$ |
| Hessian quality | Full history | Last $m$ steps |
| Best for | Small to medium problems | Large-scale problems |
L-BFGS is the default optimizer in many large-scale applications: logistic regression, conditional random fields, and neural network fine-tuning. If you have used scipy.optimize.minimize with method='L-BFGS-B', you have already used it.
Practical guidance: when to use which method
| Problem size | Recommended method | Reason |
|---|---|---|
| Small (up to a few hundred variables) | Newton's method | Hessian is small enough to compute and invert; quadratic convergence pays off |
| Medium (thousands to tens of thousands of variables) | BFGS | Near-Newton convergence without computing the Hessian |
| Large (hundreds of thousands of variables and up) | L-BFGS | Full BFGS matrix does not fit in memory |
Other things to keep in mind:
- Noisy gradients (stochastic optimization): Quasi-Newton methods struggle because the $(s_k, y_k)$ vectors are noisy. Use SGD with momentum or Adam instead.
- Sparse Hessians: If the Hessian has a known sparsity pattern, specialized Newton or trust-region methods can exploit that structure and outperform BFGS.
- Bound constraints: L-BFGS-B handles simple box constraints ($l \le x \le u$) natively.
- Non-smooth functions: BFGS and L-BFGS assume smoothness. For non-smooth objectives, consider subgradient methods or proximal methods.
Python implementation
Here is how to use both methods in practice with SciPy:
import numpy as np
from scipy.optimize import minimize
# Rosenbrock function: a classic test problem
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

x0 = np.array([-1.0, 1.0])

# BFGS: good for small to medium problems
result_bfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                       method='BFGS')
print(f"BFGS: x = {result_bfgs.x}, iterations = {result_bfgs.nit}")

# L-BFGS-B: good for large-scale problems
result_lbfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                        method='L-BFGS-B')
print(f"L-BFGS-B: x = {result_lbfgs.x}, iterations = {result_lbfgs.nit}")
Typical output:
BFGS: x = [1. 1.], iterations = 30
L-BFGS-B: x = [1. 1.], iterations = 24
Both find the minimum at $(1, 1)$. L-BFGS-B uses far less memory while taking a similar number of iterations.
You can tune the memory parameter for L-BFGS-B:
result = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                  method='L-BFGS-B',
                  options={'maxcor': 20})  # store 20 correction pairs (default is 10)
Increasing maxcor uses more memory but can improve convergence on difficult problems.
What comes next
BFGS and L-BFGS are excellent general-purpose optimizers for smooth, unconstrained problems. They work well when you can compute (or approximate) the gradient cheaply and the function is reasonably smooth.
Conjugate gradient methods take a different approach. Instead of building a Hessian approximation, they generate search directions that are conjugate with respect to the curvature of the function. This gives $O(n)$ storage, similar to L-BFGS, with strong convergence properties on quadratics. In the next article, we cover conjugate gradient methods and compare them with L-BFGS on practical problems.