
Quasi-Newton methods: BFGS and L-BFGS

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article builds on Newton’s method for optimization. You should be comfortable with gradients, Hessians, and how Newton’s method uses them to find minima.

The problem with Newton’s method at scale

Newton’s method converges fast. On a smooth function with a good starting point, it reaches the minimum in very few iterations. The catch: each iteration requires the full Hessian matrix, which has $n^2$ entries for a problem with $n$ variables. You also need to solve a linear system involving it, which costs $O(n^3)$ operations.

If you are optimizing a model with 10 million parameters, the Hessian is a $10^7 \times 10^7$ matrix. Storing it alone would need roughly 400 terabytes of memory in single precision. That is not happening.

Gradient descent avoids this entirely by using only the gradient, but it converges slowly. The worse the condition number of the problem, the worse gradient descent performs.

Quasi-Newton methods sit between these two extremes. They build an approximation to the Hessian (or its inverse) using only gradient information. You never compute second derivatives. The approximation starts rough but improves at every step.

The most successful quasi-Newton method is BFGS, named after Broyden, Fletcher, Goldfarb, and Shanno, who all independently discovered it in 1970.

The secant condition

The idea behind quasi-Newton methods comes from a simple observation. After taking a step from $x_k$ to $x_{k+1}$, you know the gradient at both points. Define:

$$s_k = x_{k+1} - x_k \quad \text{(the step taken)}$$

$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k) \quad \text{(the gradient change)}$$

For a quadratic function $f(x) = \frac{1}{2} x^T A x + b^T x$, the true Hessian $A$ satisfies:

$$A \, s_k = y_k$$

This is exact for quadratics and approximately true near a minimum for general smooth functions. The secant condition says: our Hessian approximation $B_{k+1}$ should satisfy the same relationship:

$$B_{k+1} \, s_k = y_k$$

This gives us $n$ equations for an $n \times n$ symmetric matrix, which has $n(n+1)/2$ unknowns. The secant condition alone does not pin down $B_{k+1}$ uniquely. Different choices for the remaining degrees of freedom give different quasi-Newton methods. BFGS makes a specific choice: keep $B_{k+1}$ symmetric and positive definite, and make it the closest such matrix to $B_k$ (in a weighted Frobenius norm) that satisfies the secant condition.
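
To make the secant condition concrete, here is a quick numerical check on a small quadratic (the matrix $A$, the vector $b$, and the two points below are arbitrary example values): for any pair of points, the gradient change $y_k$ equals $A \, s_k$ exactly.

```python
import numpy as np

# A small quadratic f(x) = 0.5 x^T A x + b^T x with example values
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x + b  # gradient of the quadratic

x_k = np.array([1.0, 2.0])
x_next = np.array([0.5, -1.0])

s = x_next - x_k              # the step taken
y = grad(x_next) - grad(x_k)  # the gradient change

# The true Hessian A maps the step exactly onto the gradient change
print(np.allclose(A @ s, y))  # True
```

For a non-quadratic function the same check holds only approximately, which is why the secant condition is an approximation away from the minimum.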

The BFGS update rule

In practice, we work with the inverse Hessian approximation $H_k \approx B_k^{-1}$ directly. This lets us compute the search direction as a simple matrix-vector product $p_k = -H_k \nabla f(x_k)$ instead of solving a linear system.

The BFGS update for the inverse Hessian approximation is:

$$H_{k+1} = \left(I - \rho_k \, s_k \, y_k^T\right) H_k \left(I - \rho_k \, y_k \, s_k^T\right) + \rho_k \, s_k \, s_k^T$$

where:

$$\rho_k = \frac{1}{y_k^T \, s_k}$$

Here is what each piece does:

  • $\rho_k$ is a scalar that normalizes the update. It measures how much curvature information the step provided.
  • The sandwich term $(I - \rho_k s_k y_k^T) \, H_k \, (I - \rho_k y_k s_k^T)$ transforms the previous approximation to be consistent with the new curvature information.
  • The rank-one correction $\rho_k \, s_k \, s_k^T$ adds information along the step direction.

You typically start with $H_0 = I$ (the identity matrix). This makes the first step identical to gradient descent. After each step, the BFGS update folds in new curvature information, and $H_k$ moves closer to the true inverse Hessian.

On a quadratic function in $n$ dimensions, BFGS with exact line search recovers the exact inverse Hessian in at most $n$ steps. On general smooth functions, it builds an increasingly accurate local approximation.

The full BFGS algorithm:

flowchart TD
  A["Initialize: x₀, H₀ = I"] --> B["Compute gradient gₖ = ∇f(xₖ)"]
  B --> C{"‖gₖ‖ < tolerance?"}
  C -->|Yes| D["Return xₖ"]
  C -->|No| E["Direction: pₖ = −Hₖ gₖ"]
  E --> F["Line search: find step size αₖ"]
  F --> G["Update: xₖ₊₁ = xₖ + αₖ pₖ"]
  G --> H["Compute sₖ = xₖ₊₁ − xₖ, yₖ = gₖ₊₁ − gₖ"]
  H --> I["BFGS update: Hₖ₊₁"]
  I --> B

Example 1: One BFGS update step (2×2)

Let us perform one BFGS update by hand with small numbers.

Given:

  • $H_0 = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ (start with the identity)
  • $s_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ (the step we took)
  • $y_0 = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ (the gradient change)

Step 1: Compute $\rho_0$.

$$\rho_0 = \frac{1}{y_0^T s_0} = \frac{1}{(2)(1) + (1)(0)} = \frac{1}{2}$$

Step 2: Compute the outer products.

$$s_0 \, y_0^T = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 2 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix}$$

$$y_0 \, s_0^T = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 1 & 0 \end{pmatrix}$$

Step 3: Build the left and right factors.

$$I - \rho_0 \, s_0 \, y_0^T = I - \frac{1}{2}\begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} 1 & 0.5 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -0.5 \\ 0 & 1 \end{pmatrix}$$

$$I - \rho_0 \, y_0 \, s_0^T = I - \frac{1}{2}\begin{pmatrix} 2 & 0 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ -0.5 & 1 \end{pmatrix}$$

Step 4: Compute the sandwich product. Since $H_0 = I$, this simplifies to multiplying the two factors:

$$\begin{pmatrix} 0 & -0.5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ -0.5 & 1 \end{pmatrix}$$

Row 1: $(0 \cdot 0 + (-0.5)(-0.5), \;\; 0 \cdot 0 + (-0.5)(1)) = (0.25, \; -0.5)$

Row 2: $(0 \cdot 0 + 1 \cdot (-0.5), \;\; 0 \cdot 0 + 1 \cdot 1) = (-0.5, \; 1)$

$$\text{Sandwich} = \begin{pmatrix} 0.25 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Step 5: Add the rank-one term.

$$\rho_0 \, s_0 \, s_0^T = \frac{1}{2} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix}$$

Step 6: Combine.

$$H_1 = \begin{pmatrix} 0.25 & -0.5 \\ -0.5 & 1 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Verification: Check that $H_1 y_0 = s_0$ (the inverse secant condition):

$$\begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1.5 - 0.5 \\ -1 + 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = s_0 \quad ✓$$

Starting from the identity, one BFGS step produced an inverse Hessian approximation that correctly maps the gradient change $y_0$ back to the step $s_0$.
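
As a sanity check, the hand calculation above can be reproduced in a few lines of NumPy:

```python
import numpy as np

H0 = np.eye(2)
s = np.array([[1.0], [0.0]])  # s_0 as a column vector
y = np.array([[2.0], [1.0]])  # y_0 as a column vector

rho = 1.0 / float(y.T @ s)    # rho_0 = 1/2
I = np.eye(2)

# BFGS update for the inverse Hessian approximation
H1 = (I - rho * s @ y.T) @ H0 @ (I - rho * y @ s.T) + rho * s @ s.T
print(H1)  # entries 0.75, -0.5, -0.5, 1.0, matching the hand calculation

# Inverse secant condition: H1 y0 = s0
print(np.allclose(H1 @ y, s))  # True
```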

Example 2: How many iterations does each method need?

Consider the function:

$$f(x, y) = x^2 + 4y^2$$

The minimum is at $(0, 0)$. We start from $(4, 2)$, where $f(4, 2) = 16 + 16 = 32$.

The gradient is $\nabla f = (2x, \; 8y)$ and the Hessian is constant: $H = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}$.

The condition number is the ratio of the largest to smallest eigenvalue: $\kappa = 8/2 = 4$. The function curves 4 times more steeply in the $y$ direction than in $x$, which makes life hard for gradient descent.

Newton’s method uses the exact Hessian:

$$x_1 = x_0 - H^{-1} \nabla f(x_0) = \begin{pmatrix} 4 \\ 2 \end{pmatrix} - \begin{pmatrix} 0.5 & 0 \\ 0 & 0.125 \end{pmatrix} \begin{pmatrix} 8 \\ 16 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix} - \begin{pmatrix} 4 \\ 2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Newton reaches the exact minimum in 1 iteration. For any pure quadratic, Newton always converges in a single step because the Hessian is constant everywhere.

Steepest descent (with exact line search) zigzags toward the minimum:

| Iteration | $x$ | $y$ | $f(x, y)$ |
| --- | --- | --- | --- |
| 0 | 4.000 | 2.000 | 32.000 |
| 1 | 2.824 | -0.353 | 8.471 |
| 2 | 1.059 | 0.529 | 2.242 |
| 3 | 0.747 | -0.093 | 0.594 |
Notice the $y$ coordinate flipping sign each step. That is the classic zigzag. The worst-case convergence rate per iteration is $\left(\frac{\kappa - 1}{\kappa + 1}\right)^2 = \left(\frac{3}{5}\right)^2 = 0.36$, so each step removes only about 64% of the remaining error in $f$. Steepest descent needs 14 iterations to bring $f$ below $10^{-6}$.
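
The zigzag table can be reproduced with a short script. On a quadratic with Hessian $A$, exact line search has the closed-form step size $\alpha = \frac{g^T g}{g^T A g}$; the loop below is a minimal sketch using that formula.

```python
import numpy as np

A = np.diag([2.0, 8.0])  # Hessian of f(x, y) = x^2 + 4y^2

def f(x):
    return x[0]**2 + 4 * x[1]**2

x = np.array([4.0, 2.0])
for k in range(4):
    print(f"{k}: x={x[0]:.3f}, y={x[1]:.3f}, f={f(x):.3f}")
    g = A @ x                      # gradient (2x, 8y)
    alpha = (g @ g) / (g @ A @ g)  # exact line search on a quadratic
    x = x - alpha * g
```

The printed rows match the table, sign flips and all.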

BFGS starts with $H_0 = I$, so its first step matches steepest descent. But it updates its Hessian approximation after each step. On a 2D quadratic, BFGS with exact line search converges in 2 iterations, the same as the number of variables.

| Method | Iterations to converge | Cost per iteration | Hessian needed? |
| --- | --- | --- | --- |
| Newton | 1 | $O(n^3)$ | Yes |
| BFGS | 2 | $O(n^2)$ | No |
| Steepest descent | 14 | $O(n)$ | No |

Newton is fastest per iteration but most expensive per step. Steepest descent is cheapest per step but needs many more. BFGS sits in the sweet spot: near-Newton convergence without computing the Hessian.

Convergence comparison: gradient descent converges slowly, Newton’s method converges in few iterations, and BFGS closely tracks Newton without computing the exact Hessian.

Example 3: Full BFGS run on a 2D quadratic

Let us trace BFGS on $f(x, y) = 3x^2 + y^2$ starting from $(3, 3)$.

The gradient is $\nabla f = (6x, \; 2y)$. The true Hessian is $\begin{pmatrix} 6 & 0 \\ 0 & 2 \end{pmatrix}$ and the true inverse Hessian is:

$$H^{-1}_{\text{true}} = \begin{pmatrix} 1/6 & 0 \\ 0 & 1/2 \end{pmatrix} \approx \begin{pmatrix} 0.167 & 0 \\ 0 & 0.500 \end{pmatrix}$$

We will watch the BFGS approximation converge to this matrix.

Iteration 1:

Starting values: $x_0 = (3, 3)$, $f_0 = 36$, $g_0 = (18, 6)$, $H_0 = I$.

Search direction: $p_0 = -H_0 \, g_0 = (-18, -6)$.

Exact line search gives $\alpha_0 = 5/28 \approx 0.179$.

$$x_1 = (3 - 18 \times 0.179, \;\; 3 - 6 \times 0.179) \approx (-0.214, \;\; 1.929)$$

$$f_1 \approx 3.857, \quad g_1 \approx (-1.286, \;\; 3.857)$$

Compute the update vectors:

$$s_0 = x_1 - x_0 \approx (-3.214, \;\; -1.071)$$

$$y_0 = g_1 - g_0 \approx (-19.286, \;\; -2.143)$$

$$\rho_0 = \frac{1}{y_0^T s_0} \approx 0.01556$$

After applying the BFGS formula:

$$H_1 \approx \begin{pmatrix} 0.173 & -0.061 \\ -0.061 & 1.051 \end{pmatrix}$$

The $(1,1)$ entry is 0.173, already close to the true value 0.167. But the $(2,2)$ entry is 1.051, still far from 0.500. One step captured curvature in one direction but not the other.

Iteration 2:

Using $H_1$ to compute the new direction: $p_1 = -H_1 \, g_1 \approx (0.459, \;\; -4.133)$.

Exact line search gives $\alpha_1 \approx 0.467$.

$$x_2 \approx (0.000, \;\; 0.000), \quad f_2 \approx 0.000$$

The BFGS approximation after this step:

$$H_2 \approx \begin{pmatrix} 0.167 & 0.000 \\ 0.000 & 0.500 \end{pmatrix} = H^{-1}_{\text{true}}$$

After just 2 iterations, BFGS found the exact minimum and recovered the true inverse Hessian. This is the key property of BFGS on quadratics: in $n$ dimensions, it converges in at most $n$ steps, building up the exact inverse Hessian one direction at a time.

Summary table:

| Iteration | $x$ | $f(x)$ | $H_k$ diagonal | True $H^{-1}$ diagonal |
| --- | --- | --- | --- | --- |
| 0 | $(3.000, \; 3.000)$ | 36.000 | $(1.000, \; 1.000)$ | $(0.167, \; 0.500)$ |
| 1 | $(-0.214, \; 1.929)$ | 3.857 | $(0.173, \; 1.051)$ | $(0.167, \; 0.500)$ |
| 2 | $(0.000, \; 0.000)$ | 0.000 | $(0.167, \; 0.500)$ | $(0.167, \; 0.500)$ |

You can verify these numbers with a few lines of Python:

import numpy as np

def f(x):
    return 3 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([6 * x[0], 2 * x[1]])

def line_search(x, p):
    """Exact line search for f(x + alpha * p) = 3*(x0+a*p0)^2 + (x1+a*p1)^2."""
    num = -(6 * x[0] * p[0] + 2 * x[1] * p[1])
    den = 6 * p[0]**2 + 2 * p[1]**2
    return num / den

x = np.array([3.0, 3.0])
H = np.eye(2)

for k in range(3):
    g = grad(x)
    print(f"Iter {k}: x=({x[0]:.3f}, {x[1]:.3f}), f={f(x):.3f}")
    if np.linalg.norm(g) < 1e-10:
        print("Converged!")
        break
    p = -H @ g
    alpha = line_search(x, p)
    x_new = x + alpha * p
    g_new = grad(x_new)
    s = (x_new - x).reshape(-1, 1)
    y = (g_new - g).reshape(-1, 1)
    rho = 1.0 / float(y.T @ s)
    I = np.eye(2)
    H = (I - rho * s @ y.T) @ H @ (I - rho * y @ s.T) + rho * s @ s.T
    x = x_new

L-BFGS: the memory-limited version

BFGS stores the full $n \times n$ matrix $H_k$. For a problem with $n = 10{,}000$ variables, that matrix takes 800 MB. For $n = 1{,}000{,}000$, it takes 8 TB. This does not scale.

L-BFGS (Limited-memory BFGS) solves this by never storing the full matrix. Instead, it keeps only the last $m$ pairs of vectors $(s_i, y_i)$, typically with $m$ between 3 and 20.

When you need the product $H_k \, g_k$, L-BFGS computes it using a two-loop recursion that touches only those $m$ vector pairs. Storage drops from $O(n^2)$ to $O(mn)$, and per-iteration cost drops from $O(n^2)$ to $O(mn)$.

The two-loop recursion starts from $q = g_k = \nabla f(x_k)$ and works as follows:

  1. Backward loop: Starting from $i = k-1$ down to $k-m$, compute scalars $\alpha_i = \rho_i \, s_i^T q$ and update $q \leftarrow q - \alpha_i \, y_i$.
  2. Scaling: Set $r = \gamma_k \, q$ where $\gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$ gives a good initial scaling.
  3. Forward loop: Starting from $i = k-m$ up to $k-1$, compute $\beta = \rho_i \, y_i^T r$ and update $r \leftarrow r + (\alpha_i - \beta) \, s_i$.

The output $r$ equals $H_k \, g_k$, computed without ever forming the full matrix; the search direction is $-r$. In practice, $m = 10$ works well for most problems. Going beyond $m = 20$ rarely helps.
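
The recursion translates almost line by line into code. Here is a minimal sketch (the function name and the list-based storage are illustrative choices, not a library API); `s_list` and `y_list` hold the last $m$ pairs, oldest first:

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """Compute r = H_k g using only the stored (s_i, y_i) pairs."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)

    # Backward loop: newest pair first
    for i in reversed(range(len(s_list))):
        alphas[i] = rhos[i] * (s_list[i] @ q)
        q = q - alphas[i] * y_list[i]

    # Initial scaling gamma = s^T y / y^T y from the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q

    # Forward loop: oldest pair first
    for i in range(len(s_list)):
        beta = rhos[i] * (y_list[i] @ r)
        r = r + (alphas[i] - beta) * s_list[i]

    return r  # the search direction is -r
```

With a single stored pair, this reproduces the full BFGS direction computed from $H_0 = \gamma I$, which makes a handy unit test if you implement it yourself.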

BFGS vs L-BFGS:

| | BFGS | L-BFGS |
| --- | --- | --- |
| Storage | $O(n^2)$ | $O(mn)$ |
| Per-iteration cost | $O(n^2)$ | $O(mn)$ |
| Hessian quality | Full history | Last $m$ steps |
| Best for | $n < 10{,}000$ | $n > 10{,}000$ |

L-BFGS is the default optimizer in many large-scale applications: logistic regression, conditional random fields, and neural network fine-tuning. If you have used scipy.optimize.minimize with method='L-BFGS-B', you have already used it.

Practical guidance: when to use which method

| Problem size | Recommended method | Reason |
| --- | --- | --- |
| $n < 100$ | Newton’s method | Hessian is small enough to compute and invert; quadratic convergence pays off |
| $100 < n < 10{,}000$ | BFGS | Near-Newton convergence without computing the Hessian |
| $n > 10{,}000$ | L-BFGS | Full BFGS matrix does not fit in memory |

Other things to keep in mind:

  • Noisy gradients (stochastic optimization): Quasi-Newton methods struggle because the $y_k$ vectors are noisy. Use SGD with momentum or Adam instead.
  • Sparse Hessians: If the Hessian has a known sparsity pattern, specialized Newton or trust-region methods can exploit that structure and outperform BFGS.
  • Bound constraints: L-BFGS-B handles simple box constraints ($l \leq x \leq u$) natively.
  • Non-smooth functions: BFGS and L-BFGS assume smoothness. For non-smooth objectives, consider subgradient methods or proximal methods.

Python implementation

Here is how to use both methods in practice with SciPy:

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic test problem
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

x0 = np.array([-1.0, 1.0])

# BFGS: good for small to medium problems
result_bfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                       method='BFGS')
print(f"BFGS:     x = {result_bfgs.x}, iterations = {result_bfgs.nit}")

# L-BFGS-B: good for large-scale problems
result_lbfgs = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                        method='L-BFGS-B')
print(f"L-BFGS-B: x = {result_lbfgs.x}, iterations = {result_lbfgs.nit}")

Typical output:

BFGS:     x = [1. 1.], iterations = 30
L-BFGS-B: x = [1. 1.], iterations = 24

Both find the minimum at $(1, 1)$. L-BFGS-B uses far less memory while taking a similar number of iterations.

You can tune the memory parameter for L-BFGS-B:

result = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                  method='L-BFGS-B',
                  options={'maxcor': 20})  # store 20 correction pairs (default is 10)

Increasing maxcor uses more memory but can improve convergence on difficult problems.

What comes next

BFGS and L-BFGS are excellent general-purpose optimizers for smooth, unconstrained problems. They work well when you can compute (or approximate) the gradient cheaply and the function is reasonably smooth.

Conjugate gradient methods take a different approach. Instead of building a Hessian approximation, they generate search directions that are conjugate with respect to the curvature of the function. This gives $O(n)$ storage, similar to L-BFGS, with strong convergence properties on quadratics. In the next article, we cover conjugate gradient methods and compare them with L-BFGS on practical problems.
