
Quasi-Newton methods: BFGS and L-BFGS

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article builds on Newton’s method for optimization. You should be comfortable with gradients, Hessians, and how Newton’s method uses them to find minima.

The problem with Newton’s method at scale

Newton’s method converges fast. On a smooth function with a good starting point, it reaches the minimum in very few iterations. The catch: each iteration requires the full Hessian matrix, which has $n^2$ entries for a problem with $n$ variables. You also need to solve a linear system involving it, which costs $O(n^3)$ operations.

If you are optimizing a model with 10 million parameters, the Hessian is a $10^7 \times 10^7$ matrix. Storing it alone would need roughly 400 terabytes of memory in single precision. That is not happening.

Gradient descent avoids this entirely by using only the gradient, but it converges slowly. The worse the condition number of the problem, the worse gradient descent performs.

Quasi-Newton methods sit between these two extremes. They build an approximation to the Hessian (or its inverse) using only gradient information. You never compute second derivatives. The approximation starts rough but improves at every step.

The most successful quasi-Newton method is BFGS, named after Broyden, Fletcher, Goldfarb, and Shanno, who all independently discovered it in 1970.

The secant condition

The idea behind quasi-Newton methods comes from a simple observation. After taking a step from $x_k$ to $x_{k+1}$, you know the gradient at both points. Define:

$$s_k = x_{k+1} - x_k \quad \text{(the step taken)}$$

$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k) \quad \text{(the gradient change)}$$

For a quadratic function $f(x) = \frac{1}{2} x^T A x + b^T x$, the true Hessian $A$ satisfies:

$$A \, s_k = y_k$$

This is exact for quadratics and approximately true near a minimum for general smooth functions. The secant condition says: our Hessian approximation $B_{k+1}$ should satisfy the same relationship:

$$B_{k+1} \, s_k = y_k$$

This gives us $n$ equations for an $n \times n$ symmetric matrix, which has $n(n+1)/2$ unknowns. The secant condition alone does not pin down $B_{k+1}$ uniquely. Different choices for the remaining degrees of freedom give different quasi-Newton methods. BFGS makes a specific choice: keep $B_{k+1}$ symmetric and positive definite, and make it the closest such matrix to $B_k$ (in a weighted Frobenius norm) that satisfies the secant condition.
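
To make the secant condition concrete, here is a quick numerical check on a small quadratic (the matrix $A$, the vector $b$, and the two points below are arbitrary example values): for any pair of points, the gradient change $y_k$ equals $A \, s_k$ exactly.

```python
import numpy as np

# A small quadratic f(x) = 0.5 x^T A x + b^T x with example values
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x + b  # gradient of the quadratic

x_k = np.array([1.0, 2.0])
x_next = np.array([0.5, -1.0])

s = x_next - x_k              # the step taken
y = grad(x_next) - grad(x_k)  # the gradient change

# The true Hessian A maps the step exactly onto the gradient change
print(np.allclose(A @ s, y))  # True
```

For a non-quadratic function the same check holds only approximately, which is why the secant condition is an approximation away from the minimum.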

The BFGS update rule

In practice, we work with the inverse Hessian approximation $H_k \approx B_k^{-1}$ directly. This lets us compute the search direction as a simple matrix-vector product $p_k = -H_k \nabla f(x_k)$ instead of solving a linear system.

The BFGS update for the inverse Hessian approximation is:

$$H_{k+1} = \left(I - \rho_k \, s_k \, y_k^T\right) H_k \left(I - \rho_k \, y_k \, s_k^T\right) + \rho_k \, s_k \, s_k^T$$

where:

$$\rho_k = \frac{1}{y_k^T \, s_k}$$

Here is what each piece does:

  • $\rho_k$ is a scalar that normalizes the update. It measures how much curvature information the step provided.
  • The sandwich term $(I - \rho_k s_k y_k^T) \, H_k \, (I - \rho_k y_k s_k^T)$ transforms the previous approximation to be consistent with the new curvature information.
  • The rank-one correction $\rho_k \, s_k \, s_k^T$ adds information along the step direction.

You typically start with $H_0 = I$ (the identity matrix). This makes the first step identical to gradient descent. After each step, the BFGS update folds in new curvature information, and $H_k$ moves closer to the true inverse Hessian.

On a quadratic function in $n$ dimensions, BFGS with exact line search recovers the exact inverse Hessian in at most $n$ steps. On general smooth functions, it builds an increasingly accurate local approximation.

The full BFGS algorithm:

flowchart TD
  A["Initialize: x₀, H₀ = I"] --> B["Compute gradient gₖ = ∇f(xₖ)"]
  B --> C{"‖gₖ‖ < tolerance?"}
  C -->|Yes| D["Return xₖ"]
  C -->|No| E["Direction: pₖ = −Hₖ gₖ"]
  E --> F["Line search: find step size αₖ"]
  F --> G["Update: xₖ₊₁ = xₖ + αₖ pₖ"]
  G --> H["Compute sₖ = xₖ₊₁ − xₖ, yₖ = gₖ₊₁ − gₖ"]
  H --> I["BFGS update: Hₖ₊₁"]
  I --> B

Example 1: One BFGS update step (2×2)

Let us perform one BFGS update by hand with small numbers.

Given:

  • $H_0 = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ (start with the identity)
  • $s_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ (the step we took)
  • $y_0 = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ (the gradient change)

Step 1: Compute $\rho_0$.

$$\rho_0 = \frac{1}{y_0^T s_0} = \frac{1}{(2)(1) + (1)(0)} = \frac{1}{2}$$

Step 2: Compute the outer products.

$$s_0 \, y_0^T = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 2 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix}$$

$$y_0 \, s_0^T = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 1 & 0 \end{pmatrix}$$

Step 3: Build the left and right factors.

$$I - \rho_0 \, s_0 \, y_0^T = I - \frac{1}{2}\begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} 1 & 0.5 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -0.5 \\ 0 & 1 \end{pmatrix}$$

$$I - \rho_0 \, y_0 \, s_0^T = I - \frac{1}{2}\begin{pmatrix} 2 & 0 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ -0.5 & 1 \end{pmatrix}$$

Step 4: Compute the sandwich product. Since $H_0 = I$, this simplifies to multiplying the two factors:

$$\begin{pmatrix} 0 & -0.5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ -0.5 & 1 \end{pmatrix}$$

Row 1: $(0 \cdot 0 + (-0.5)(-0.5), \;\; 0 \cdot 0 + (-0.5)(1)) = (0.25, \; -0.5)$

Row 2: $(0 \cdot 0 + 1 \cdot (-0.5), \;\; 0 \cdot 0 + 1 \cdot 1) = (-0.5, \; 1)$

$$\text{Sandwich} = \begin{pmatrix} 0.25 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Step 5: Add the rank-one term.

$$\rho_0 \, s_0 \, s_0^T = \frac{1}{2} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix}$$

Step 6: Combine.

$$H_1 = \begin{pmatrix} 0.25 & -0.5 \\ -0.5 & 1 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Verification: Check that $H_1 y_0 = s_0$ (the inverse secant condition):

$$\begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1.5 - 0.5 \\ -1 + 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = s_0 \quad ✓$$

Starting from the identity, one BFGS step produced an inverse Hessian approximation that correctly maps the gradient change $y_0$ back to the step $s_0$.
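
As a sanity check, the hand calculation above can be reproduced in a few lines of NumPy:

```python
import numpy as np

H0 = np.eye(2)
s = np.array([[1.0], [0.0]])  # s_0 as a column vector
y = np.array([[2.0], [1.0]])  # y_0 as a column vector

rho = 1.0 / float(y.T @ s)    # rho_0 = 1/2
I = np.eye(2)

# BFGS update for the inverse Hessian approximation
H1 = (I - rho * s @ y.T) @ H0 @ (I - rho * y @ s.T) + rho * s @ s.T
print(H1)  # entries 0.75, -0.5, -0.5, 1.0, matching the hand calculation

# Inverse secant condition: H1 y0 = s0
print(np.allclose(H1 @ y, s))  # True
```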

Example 2: How many iterations does each method need?

Consider the function:

$$f(x, y) = x^2 + 4y^2$$

The minimum is at $(0, 0)$. We start from $(4, 2)$, where $f(4, 2) = 16 + 16 = 32$.

The gradient is $\nabla f = (2x, \; 8y)$ and the Hessian is constant: $H = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$.

The condition number is the ratio of the largest to smallest eigenvalue: $\kappa = 8/2 = 4$. The function curves 4 times more steeply in the $y$ direction than in $x$, which makes life hard for gradient descent.

Newton’s method uses the exact Hessian:

$$x_1 = x_0 - H^{-1} \nabla f(x_0) = \begin{pmatrix} 4 \\ 2 \end{pmatrix} - \begin{pmatrix} 0.5 & 0 \\ 0 & 0.125 \end{pmatrix} \begin{pmatrix} 8 \\ 16 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix} - \begin{pmatrix} 4 \\ 2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Newton reaches the exact minimum in 1 iteration. For any pure quadratic, Newton always converges in a single step because the Hessian is constant everywhere.

Steepest descent (with exact line search) zigzags toward the minimum:

| Iteration | $x$ | $y$ | $f(x, y)$ |
| --- | --- | --- | --- |
| 0 | 4.000 | 2.000 | 32.000 |
| 1 | 2.824 | -0.353 | 8.471 |
| 2 | 1.059 | 0.529 | 2.242 |
| 3 | 0.747 | -0.093 | 0.594 |
Notice the $y$ coordinate flipping sign each step. That is the classic zigzag. The worst-case convergence rate per iteration is $\left(\frac{\kappa - 1}{\kappa + 1}\right)^2 = \left(\frac{3}{5}\right)^2 = 0.36$, so each step removes only about 64% of the remaining error in $f$. Steepest descent needs 14 iterations to bring $f$ below $10^{-6}$.
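
The zigzag table can be reproduced with a short script. On a quadratic with Hessian $A$, exact line search has the closed-form step size $\alpha = \frac{g^T g}{g^T A g}$; the loop below is a minimal sketch using that formula.

```python
import numpy as np

A = np.diag([2.0, 8.0])  # Hessian of f(x, y) = x^2 + 4y^2

def f(x):
    return x[0]**2 + 4 * x[1]**2

x = np.array([4.0, 2.0])
for k in range(4):
    print(f"{k}: x={x[0]:.3f}, y={x[1]:.3f}, f={f(x):.3f}")
    g = A @ x                      # gradient (2x, 8y)
    alpha = (g @ g) / (g @ A @ g)  # exact line search on a quadratic
    x = x - alpha * g
```

The printed rows match the table, sign flips and all.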

BFGS starts with $H_0 = I$, so its first step matches steepest descent. But it updates its Hessian approximation after each step. On a 2D quadratic, BFGS with exact line search converges in 2 iterations, the same as the number of variables.

| Method | Iterations to converge | Cost per iteration | Hessian needed? |
| --- | --- | --- | --- |
| Newton | 1 | $O(n^3)$ | Yes |
| BFGS | 2 | $O(n^2)$ | No |
| Steepest descent | 14 | $O(n)$ | No |

Newton is fastest per iteration but most expensive per step. Steepest descent is cheapest per step but needs many more. BFGS sits in the sweet spot: near-Newton convergence without computing the Hessian.

Convergence comparison: gradient descent converges slowly, Newton’s method converges in few iterations, and BFGS closely tracks Newton without computing the exact Hessian.

Example 3: Full BFGS run on a 2D quadratic

Let us trace BFGS on $f(x, y) = 3x^2 + y^2$ starting from $(3, 3)$.

The gradient is $\nabla f = (6x, \; 2y)$. The true Hessian is $\begin{pmatrix} 6 & 0 \\ 0 & 2 \end{pmatrix}$ and the true inverse Hessian is:

$$H^{-1}_{\text{true}} = \begin{pmatrix} 1/6 & 0 \\ 0 & 1/2 \end{pmatrix} \approx \begin{pmatrix} 0.167 & 0 \\ 0 & 0.500 \end{pmatrix}$$

We will watch the BFGS approximation converge to this matrix.

Iteration 1:

Starting values: $x_0 = (3, 3)$, $f_0 = 36$, $g_0 = (18, 6)$, $H_0 = I$.

Search direction: $p_0 = -H_0 \, g_0 = (-18, -6)$.

Exact line search gives $\alpha_0 = 5/28 \approx 0.179$.

$$x_1 = (3 - 18 \times 0.179, \;\; 3 - 6 \times 0.179) \approx (-0.214, \;\; 1.929)$$

$$f_1 \approx 3.857, \quad g_1 \approx (-1.286, \;\; 3.857)$$

Compute the update vectors:

$$s_0 = x_1 - x_0 \approx (-3.214, \;\; -1.071)$$

$$y_0 = g_1 - g_0 \approx (-19.286, \;\; -2.143)$$

$$\rho_0 = \frac{1}{y_0^T s_0} \approx 0.01556$$

After applying the BFGS formula:

$$H_1 \approx \begin{pmatrix} 0.173 & -0.061 \\ -0.061 & 1.051 \end{pmatrix}$$

The $(1,1)$ entry is 0.173, already close to the true value 0.167. But the $(2,2)$ entry is 1.051, still far from 0.500. One step captured curvature in one direction but not the other.

Iteration 2:

Using $H_1$ to compute the new direction: $p_1 = -H_1 \, g_1 \approx (0.459, \;\; -4.133)$.

Exact line search gives $\alpha_1 \approx 0.467$.

$$x_2 \approx (0.000, \;\; 0.000), \quad f_2 \approx 0.000$$

The BFGS approximation after this step:

$$H_2 \approx \begin{pmatrix} 0.167 & 0.000 \\ 0.000 & 0.500 \end{pmatrix} = H^{-1}_{\text{true}}$$

After just 2 iterations, BFGS found the exact minimum and recovered the true inverse Hessian. This is the key property of BFGS on quadratics: in $n$ dimensions, it converges in at most $n$ steps, building up the exact inverse Hessian one direction at a time.

Summary table:

| Iteration | $x$ | $f(x)$ | $H_k$ diagonal | True $H^{-1}$ diagonal |
| --- | --- | --- | --- | --- |
| 0 | $(3.000, \; 3.000)$ | 36.000 | $(1.000, \; 1.000)$ | $(0.167, \; 0.500)$ |
| 1 | $(-0.214, \; 1.929)$ | 3.857 | $(0.173, \; 1.051)$ | $(0.167, \; 0.500)$ |
| 2 | $(0.000, \; 0.000)$ | 0.000 | $(0.167, \; 0.500)$ | $(0.167, \; 0.500)$ |

You can verify these numbers with a few lines of Python:

import numpy as np

def f(x):
    return 3 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([6 * x[0], 2 * x[1]])

def line_search(x, p):
    """Exact line search for f(x + alpha * p) = 3*(x0+a*p0)^2 + (x1+a*p1)^2."""
    num = -(6 * x[0] * p[0] + 2 * x[1] * p[1])
    den = 6 * p[0]**2 + 2 * p[1]**2
    return num / den

x = np.array([3.0, 3.0])
H = np.eye(2)

for k in range(3):
    g = grad(x)
    print(f"Iter {k}: x=({x[0]:.3f}, {x[1]:.3f}), f={f(x):.3f}")
    if np.linalg.norm(g) < 1e-10:
        print("Converged!")
        break
    p = -H @ g
    alpha = line_search(x, p)
    x_new = x + alpha * p
    g_new = grad(x_new)
    s = (x_new - x).reshape(-1, 1)
    y = (g_new - g).reshape(-1, 1)
    rho = 1.0 / float(y.T @ s)
    I = np.eye(2)
    H = (I - rho * s @ y.T) @ H @ (I - rho * y @ s.T) + rho * s @ s.T
    x = x_new

L-BFGS: the memory-limited version

BFGS stores the full $n \times n$ matrix $H_k$. For a problem with $n = 10{,}000$ variables, that matrix takes 800 MB. For $n = 1{,}000{,}000$, it takes 8 TB. This does not scale.

L-BFGS (Limited-memory BFGS) solves this by never storing the full matrix. Instead, it keeps only the last $m$ pairs of vectors $(s_i, y_i)$, typically with $m$ between 3 and 20.

When you need the product $H_k \, g_k$, L-BFGS computes it using a two-loop recursion that touches only those $m$ vector pairs. Storage drops from $O(n^2)$ to $O(mn)$, and per-iteration cost drops from $O(n^2)$ to $O(mn)$.

The two-loop recursion starts from $q = g_k = \nabla f(x_k)$ and works as follows:

  1. Backward loop: Starting from $i = k-1$ down to $k-m$, compute scalars $\alpha_i = \rho_i \, s_i^T q$ and update $q \leftarrow q - \alpha_i \, y_i$.
  2. Scaling: Set $r = \gamma_k \, q$ where $\gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$ gives a good initial scaling.
  3. Forward loop: Starting from $i = k-m$ up to $k-1$, compute $\beta = \rho_i \, y_i^T r$ and update $r \leftarrow r + (\alpha_i - \beta) \, s_i$.

The output $r$ equals $H_k \, g_k$, computed without ever forming the full matrix; the search direction is $-r$. In practice, $m = 10$ works well for most problems. Going beyond $m = 20$ rarely helps.
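
The recursion translates almost line by line into code. Here is a minimal sketch (the function name and the list-based storage are illustrative choices, not a library API); `s_list` and `y_list` hold the last $m$ pairs, oldest first:

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """Compute r = H_k g using only the stored (s_i, y_i) pairs."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)

    # Backward loop: newest pair first
    for i in reversed(range(len(s_list))):
        alphas[i] = rhos[i] * (s_list[i] @ q)
        q = q - alphas[i] * y_list[i]

    # Initial scaling gamma = s^T y / y^T y from the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q

    # Forward loop: oldest pair first
    for i in range(len(s_list)):
        beta = rhos[i] * (y_list[i] @ r)
        r = r + (alphas[i] - beta) * s_list[i]

    return r  # the search direction is -r
```

With a single stored pair, this reproduces the full BFGS direction computed from $H_0 = \gamma I$, which makes a handy unit test if you implement it yourself.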

BFGS vs L-BFGS:

| | BFGS | L-BFGS |
| --- | --- | --- |
| Storage | $O(n^2)$ | $O(mn)$ |
| Per-iteration cost | $O(n^2)$ | $O(mn)$ |
| Hessian quality | Full history | Last $m$ steps |
| Best for | $n < 10{,}000$ | $n > 10{,}000$ |

L-BFGS is the default optimizer in many large-scale applications: logistic regression, conditional random fields, and neural network fine-tuning. If you have used scipy.optimize.minimize with method='L-BFGS-B', you have already used it.

Practical guidance: when to use which method

| Problem size | Recommended method | Reason |
| --- | --- | --- |
| $n < 100$ | Newton’s method | Hessian is small enough to compute and invert; quadratic convergence pays off |
| $100 < n < 10{,}000$ | BFGS | Near-Newton convergence without computing the Hessian |
| $n > 10{,}000$ | L-BFGS | Full BFGS matrix does not fit in memory |

Other things to keep in mind:

  • Noisy gradients (stochastic optimization): Quasi-Newton methods struggle because the $y_k$ vectors are noisy. Use SGD with momentum or Adam instead.
  • Sparse Hessians: If the Hessian has a known sparsity pattern, specialized Newton or trust-region methods can exploit that structure and outperform BFGS.
  • Bound constraints: L-BFGS-B handles simple box constraints ($l \leq x \leq u$) natively.
  • Non-smooth functions: BFGS and L-BFGS assume smoothness. For non-smooth objectives, consider subgradient methods or proximal methods.

Python implementation

Here is how to use both methods in practice with SciPy:

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic test problem
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

x0 = np.array([-1.0, 1.0])

# BFGS: good for small to medium problems
result_bfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                       method='BFGS')
print(f"BFGS:     x = {result_bfgs.x}, iterations = {result_bfgs.nit}")

# L-BFGS-B: good for large-scale problems
result_lbfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                        method='L-BFGS-B')
print(f"L-BFGS-B: x = {result_lbfgs.x}, iterations = {result_lbfgs.nit}")

Typical output:

BFGS:     x = [1. 1.], iterations = 30
L-BFGS-B: x = [1. 1.], iterations = 24

Both find the minimum at $(1, 1)$. L-BFGS-B uses far less memory while taking a similar number of iterations.

You can tune the memory parameter for L-BFGS-B:

result = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                  method='L-BFGS-B',
                  options={'maxcor': 20})  # store 20 correction pairs (default is 10)

Increasing maxcor uses more memory but can improve convergence on difficult problems.

What comes next

BFGS and L-BFGS are excellent general-purpose optimizers for smooth, unconstrained problems. They work well when you can compute (or approximate) the gradient cheaply and the function is reasonably smooth.

Conjugate gradient methods take a different approach. Instead of building a Hessian approximation, they generate search directions that are conjugate with respect to the curvature of the function. This gives $O(n)$ storage, similar to L-BFGS, with strong convergence properties on quadratics. In the next article, we cover conjugate gradient methods and compare them with L-BFGS on practical problems.
