
Conjugate gradient methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This post builds on steepest descent and matrix operations. You should be comfortable with matrix-vector products, gradients, and the idea of iterative minimization before reading further.

Why conjugate gradient exists

Solving large linear systems Ax = b shows up everywhere in ML. Least squares, Gaussian processes, Newton steps, and Hessian-vector products all reduce to linear solves. Direct methods like Gaussian elimination cost O(n^3). When n is in the millions, that cost is brutal.

Gradient descent can solve these systems iteratively, but it zigzags badly on elongated bowls. Each step undoes some of the progress from the previous step, wasting iterations.

| Method | Cost per step | Behavior | Steps for n-dim quadratic |
| --- | --- | --- | --- |
| Direct solve (Gaussian elimination) | O(n^3) | Exact | 1 (but expensive) |
| Gradient descent | O(n) | Zigzags on ill-conditioned problems | Many (depends on condition number) |
| Conjugate gradient | O(n) + one matrix-vector product | No zigzag, steady progress | At most n |

The zigzag problem with gradient descent

graph LR
  A["Start"] --> B["Step 1: move toward minimum"]
  B --> C["Step 2: overshoot, correct sideways"]
  C --> D["Step 3: correct back"]
  D --> E["Step 4: still zigzagging..."]
  E --> F["Eventually converges (slowly)"]

Conjugate gradient picks search directions that do not undo previous progress.

Contour plot comparing steepest descent (red, zigzag) vs conjugate gradient (green, direct) on f(x,y) = x^2 + 10y^2. CG reaches the exact minimum in just 2 steps.

Each step fully solves the problem along its direction, and later steps never interfere. On an n-dimensional quadratic, CG reaches the exact solution in at most n steps.

Now let’s see how it works.

The problem: solving linear systems cheaply

Suppose you need to solve Ax = b where A is an n \times n symmetric positive definite matrix. Direct methods like Gaussian elimination cost O(n^3) operations. When n is large (millions of variables in engineering or ML problems), that cost is prohibitive.

Here is the key insight. Solving Ax = b is the same as minimizing the quadratic function:

f(x) = \frac{1}{2}x^T A x - b^T x

Take the gradient and set it to zero:

\nabla f(x) = Ax - b = 0 \implies Ax = b

So any method that minimizes f also solves the linear system. Gradient descent can do this, but it zig-zags badly when the eigenvalues of A are spread out. The conjugate gradient method (CG) fixes this: it solves an n-dimensional quadratic in at most n iterations, using only matrix-vector products, with no matrix factorization needed.
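To make the equivalence concrete, here is a quick numerical sketch (the matrix and vectors are arbitrary choices for illustration) checking that the gradient of the quadratic really is Ax - b:

```python
import numpy as np

# Sketch: confirm numerically that for f(x) = 0.5 x^T A x - b^T x,
# the gradient is Ax - b. A and b are arbitrary; A is made symmetric
# positive definite by construction.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)
b = rng.standard_normal(3)

f = lambda v: 0.5 * v @ A @ v - b @ v
x = rng.standard_normal(3)

# Central finite differences along each coordinate axis
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, A @ x - b, atol=1e-5))  # True
```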

Conjugate directions: the core idea

Two vectors d_i and d_j are conjugate (or A-conjugate) with respect to A if:

d_i^T A \, d_j = 0 \quad \text{for } i \neq j

This looks like orthogonality, but twisted by A. Ordinary orthogonality means d_i^T d_j = 0. Conjugacy means orthogonality in a space stretched by A.

Why does this matter? If you search along conjugate directions, each step solves the problem completely along that direction. Later steps never undo the progress of earlier ones. This is exactly what steepest descent gets wrong: it keeps revisiting the same directions, zig-zagging toward the solution.

Think of it this way. For a 2D quadratic with elliptical contours, steepest descent bounces back and forth across the narrow valley. CG picks two directions that are “independent” in the geometry of the ellipse, so it nails the answer in just 2 steps.

If you have n mutually conjugate directions d_0, d_1, \ldots, d_{n-1}, you can express the solution as a linear combination along those directions. Each coefficient is found by a single line search. That gives you the exact answer in n steps.
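As a sketch of this claim: the eigenvectors of a symmetric A are one valid set of conjugate directions (they are orthogonal, and A merely scales them), so one exact line search along each recovers the solution:

```python
import numpy as np

# Sketch: for symmetric A, orthonormal eigenvectors are A-conjugate,
# and one exact line search along each one solves Ax = b in n steps.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

_, V = np.linalg.eigh(A)             # columns are orthonormal eigenvectors
d0, d1 = V[:, 0], V[:, 1]
print(np.isclose(d0 @ A @ d1, 0.0))  # True: A-conjugate

x = np.zeros(2)
for d in (d0, d1):
    alpha = d @ (b - A @ x) / (d @ A @ d)  # exact line search along d
    x = x + alpha * d
print(np.allclose(A @ x, b))         # True: exact solution in n = 2 steps
```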

Conjugate directions vs steepest descent directions

graph TD
  A["Steepest descent: each direction is the negative gradient"] --> B["Directions overlap, undo previous work"]
  C["Conjugate gradient: each direction is A-orthogonal to all previous ones"] --> D["Directions are independent in the geometry of A"]
  B --> E["Slow convergence, zigzagging"]
  D --> F["Fast convergence, at most n steps"]

The CG algorithm for linear systems

Given: symmetric positive definite A, right-hand side b, initial guess x_0.

Initialize:

r_0 = b - Ax_0 \quad \text{(residual)}

d_0 = r_0 \quad \text{(first search direction = steepest descent direction)}

For k = 0, 1, 2, \ldots:

\alpha_k = \frac{r_k^T r_k}{d_k^T A \, d_k} \quad \text{(step size)}

x_{k+1} = x_k + \alpha_k \, d_k \quad \text{(update position)}

r_{k+1} = r_k - \alpha_k \, A \, d_k \quad \text{(update residual)}

\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \quad \text{(conjugacy parameter)}

d_{k+1} = r_{k+1} + \beta_{k+1} \, d_k \quad \text{(new search direction)}

Stop when \|r_{k+1}\| is small enough.

Each iteration costs one matrix-vector product (A d_k), two dot products, and a few vector additions. No matrix inversions. No storing previous directions. The \beta formula automatically builds conjugacy into the new search direction.

The residual r_k = b - Ax_k is the negative gradient of f at x_k. It tells you how far you are from the solution. The search direction d_k is the residual corrected by \beta_k d_{k-1} to maintain conjugacy.

CG algorithm flow

graph TD
  A["Start: compute residual r0, set direction d0 = r0"] --> B["Compute step size alpha"]
  B --> C["Update position: x = x + alpha * d"]
  C --> D["Update residual: r = r - alpha * A * d"]
  D --> E["Residual small enough?"]
  E -- Yes --> F["Done: x is the solution"]
  E -- No --> G["Compute beta (conjugacy parameter)"]
  G --> H["New direction: d = r + beta * d_prev"]
  H --> B

Worked examples

Example 1: CG on a 2x2 linear system

Solve Ax = b where:

A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad x_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

Iteration 0: Setup

r_0 = b - Ax_0 = \begin{pmatrix}1\\2\end{pmatrix} - \begin{pmatrix}0\\0\end{pmatrix} = \begin{pmatrix}1\\2\end{pmatrix}

d_0 = r_0 = \begin{pmatrix}1\\2\end{pmatrix}

Iteration 1: First step

Compute the step size:

r_0^T r_0 = 1^2 + 2^2 = 5

A \, d_0 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}4\\7\end{pmatrix}

d_0^T A \, d_0 = 1 \cdot 4 + 2 \cdot 7 = 18

\alpha_0 = \frac{5}{18}

Update position and residual:

x_1 = \begin{pmatrix}0\\0\end{pmatrix} + \frac{5}{18}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}5/18\\5/9\end{pmatrix}

r_1 = r_0 - \alpha_0 \, A \, d_0 = \begin{pmatrix}1\\2\end{pmatrix} - \frac{5}{18}\begin{pmatrix}4\\7\end{pmatrix} = \begin{pmatrix}1 - 20/18 \\ 2 - 35/18\end{pmatrix} = \begin{pmatrix}-1/9\\1/18\end{pmatrix}

Build the next search direction:

r_1^T r_1 = \frac{1}{81} + \frac{1}{324} = \frac{4}{324} + \frac{1}{324} = \frac{5}{324}

\beta_1 = \frac{r_1^T r_1}{r_0^T r_0} = \frac{5/324}{5} = \frac{1}{324}

d_1 = r_1 + \beta_1 \, d_0 = \begin{pmatrix}-1/9\\1/18\end{pmatrix} + \frac{1}{324}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}-35/324\\20/324\end{pmatrix}

Iteration 2: Second step

A \, d_1 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}-35/324\\20/324\end{pmatrix} = \begin{pmatrix}-50/324\\25/324\end{pmatrix}

d_1^T A \, d_1 = \frac{(-35)(-50) + (20)(25)}{324^2} = \frac{1750 + 500}{324^2} = \frac{2250}{324^2}

\alpha_1 = \frac{r_1^T r_1}{d_1^T A \, d_1} = \frac{5/324}{2250/324^2} = \frac{5 \cdot 324}{2250} = \frac{1620}{2250} = \frac{18}{25}

x_2 = x_1 + \alpha_1 \, d_1 = \begin{pmatrix}5/18\\5/9\end{pmatrix} + \frac{18}{25}\begin{pmatrix}-35/324\\20/324\end{pmatrix}

For the first component: \frac{5}{18} + \frac{18}{25} \cdot \frac{-35}{324} = \frac{5}{18} - \frac{630}{8100} = \frac{5}{18} - \frac{7}{90} = \frac{25}{90} - \frac{7}{90} = \frac{18}{90} = \frac{1}{5}

For the second component: \frac{5}{9} + \frac{18}{25} \cdot \frac{20}{324} = \frac{5}{9} + \frac{360}{8100} = \frac{5}{9} + \frac{2}{45} = \frac{25}{45} + \frac{2}{45} = \frac{27}{45} = \frac{3}{5}

x_2 = \begin{pmatrix}1/5\\3/5\end{pmatrix}

Verify: Ax_2 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}1/5\\3/5\end{pmatrix} = \begin{pmatrix}2/5 + 3/5\\1/5 + 9/5\end{pmatrix} = \begin{pmatrix}1\\2\end{pmatrix} = b

CG solved a 2D system in exactly 2 iterations, as guaranteed.
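The hand computation above can be checked with a few lines of NumPy, using the same update rules as the algorithm section:

```python
import numpy as np

# Reproduce Example 1's two CG iterations numerically.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
r = b - A @ x
d = r.copy()
for _ in range(2):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)        # step size
    x = x + alpha * d                 # update position
    r_new = r - alpha * Ad            # update residual
    d = r_new + (r_new @ r_new) / (r @ r) * d  # new conjugate direction
    r = r_new
print(np.allclose(x, [0.2, 0.6]))     # True: matches x_2 = (1/5, 3/5)
```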

Example 2: CG on a quadratic function

Minimize f(x, y) = x^2 + 4y^2 starting from x_0 = (4, 2)^T.

This quadratic corresponds to A = \begin{pmatrix}2&0\\0&8\end{pmatrix} and b = \begin{pmatrix}0\\0\end{pmatrix}, since f(x) = \frac{1}{2}x^T A x.

Iteration 0:

r_0 = b - Ax_0 = \begin{pmatrix}0\\0\end{pmatrix} - \begin{pmatrix}8\\16\end{pmatrix} = \begin{pmatrix}-8\\-16\end{pmatrix}

d_0 = r_0 = \begin{pmatrix}-8\\-16\end{pmatrix}

Iteration 1:

r_0^T r_0 = 64 + 256 = 320

A \, d_0 = \begin{pmatrix}2&0\\0&8\end{pmatrix}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}-16\\-128\end{pmatrix}

d_0^T A \, d_0 = (-8)(-16) + (-16)(-128) = 128 + 2048 = 2176

\alpha_0 = \frac{320}{2176} = \frac{5}{34}

x_1 = \begin{pmatrix}4\\2\end{pmatrix} + \frac{5}{34}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}4 - 40/34\\2 - 80/34\end{pmatrix} = \begin{pmatrix}48/17\\-6/17\end{pmatrix}

r_1 = r_0 - \alpha_0 \, A \, d_0 = \begin{pmatrix}-8\\-16\end{pmatrix} - \frac{5}{34}\begin{pmatrix}-16\\-128\end{pmatrix} = \begin{pmatrix}-96/17\\48/17\end{pmatrix}

r_1^T r_1 = \frac{9216 + 2304}{289} = \frac{11520}{289}

\beta_1 = \frac{11520/289}{320} = \frac{36}{289}

d_1 = r_1 + \beta_1 d_0 = \begin{pmatrix}-96/17\\48/17\end{pmatrix} + \frac{36}{289}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}-1920/289\\240/289\end{pmatrix}

Iteration 2:

A \, d_1 = \begin{pmatrix}-3840/289\\1920/289\end{pmatrix}

d_1^T A \, d_1 = \frac{1920(3840 + 240)}{289^2} = \frac{1920 \cdot 4080}{289^2} = \frac{7{,}833{,}600}{83{,}521}

\alpha_1 = \frac{11520/289}{7{,}833{,}600/83{,}521} = \frac{11520 \cdot 289}{7{,}833{,}600} = \frac{17}{40}

x_2 = \begin{pmatrix}48/17\\-6/17\end{pmatrix} + \frac{17}{40}\begin{pmatrix}-1920/289\\240/289\end{pmatrix} = \begin{pmatrix}48/17 - 48/17\\-6/17 + 6/17\end{pmatrix} = \begin{pmatrix}0\\0\end{pmatrix}

CG reaches the exact minimum (0, 0) in 2 steps. Gradient descent, by contrast, would keep zig-zagging on this problem because the condition number \kappa = 8/2 = 4 spreads the eigenvalues apart. The larger the condition number, the worse gradient descent performs relative to CG.
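A quick numerical sketch confirming the two-step convergence of Example 2:

```python
import numpy as np

# Check Example 2: CG on f(x, y) = x^2 + 4y^2, i.e. A = diag(2, 8), b = 0.
A = np.diag([2.0, 8.0])
x = np.array([4.0, 2.0])
r = -A @ x                  # b = 0, so the residual is -Ax
d = r.copy()
for _ in range(2):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new
print(np.allclose(x, 0.0))  # True: exact minimum in 2 steps
```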

Example 3: Verifying the conjugacy property

Let us verify that the search directions from Example 1 are A-conjugate, meaning d_0^T A \, d_1 = 0.

From Example 1:

d_0 = \begin{pmatrix}1\\2\end{pmatrix}, \quad d_1 = \begin{pmatrix}-35/324\\20/324\end{pmatrix}

First, compute A \, d_1 (we already did this):

A \, d_1 = \begin{pmatrix}-50/324\\25/324\end{pmatrix}

Now take the dot product d_0^T (A \, d_1):

d_0^T A \, d_1 = 1 \cdot \frac{-50}{324} + 2 \cdot \frac{25}{324} = \frac{-50}{324} + \frac{50}{324} = 0 \quad \checkmark

The two search directions are A-conjugate. The CG algorithm guarantees this by construction: the \beta formula is chosen precisely so each new direction is conjugate to all previous ones.

You can also verify that the residuals are orthogonal in the ordinary sense:

r_0^T r_1 = 1 \cdot \frac{-1}{9} + 2 \cdot \frac{1}{18} = \frac{-1}{9} + \frac{1}{9} = 0 \quad \checkmark

This is another property CG maintains. The residuals form an orthogonal set, and the search directions form a conjugate set.
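Both checks can be scripted directly from the Example 1 quantities:

```python
import numpy as np

# Verify conjugacy of the directions and orthogonality of the residuals,
# using the exact fractions computed in Example 1.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
d0 = np.array([1.0, 2.0])
d1 = np.array([-35.0, 20.0]) / 324
r0 = np.array([1.0, 2.0])
r1 = np.array([-1.0 / 9, 1.0 / 18])

print(np.isclose(d0 @ A @ d1, 0.0))  # True: directions are A-conjugate
print(np.isclose(r0 @ r1, 0.0))      # True: residuals are orthogonal
```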

CG for nonlinear optimization

The linear CG algorithm assumes a quadratic objective. For general nonlinear functions, we cannot compute the residual r_k = b - Ax_k because there is no A or b. Instead, we use the gradient g_k = \nabla f(x_k) in place of -r_k and adapt the \beta formula.

The algorithm becomes:

  1. Choose x_0. Set g_0 = \nabla f(x_0) and d_0 = -g_0.
  2. For k = 0, 1, 2, \ldots:
    • Find \alpha_k by line search: \alpha_k = \arg\min_\alpha f(x_k + \alpha \, d_k)
    • Update: x_{k+1} = x_k + \alpha_k \, d_k
    • Compute: g_{k+1} = \nabla f(x_{k+1})
    • Compute \beta_{k+1} (see below)
    • Update direction: d_{k+1} = -g_{k+1} + \beta_{k+1} \, d_k

The two most common formulas for \beta are:

Fletcher-Reeves:

\beta_{k+1}^{FR} = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}

Polak-Ribiere:

\beta_{k+1}^{PR} = \frac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}

On a quadratic with exact line search, both formulas give identical results and reduce to the linear CG algorithm. On general nonlinear functions, Polak-Ribiere tends to work better in practice. The reason: when the gradient changes slowly (g_{k+1} \approx g_k), Polak-Ribiere gives \beta \approx 0, which effectively restarts CG as steepest descent. Fletcher-Reeves can keep a poor search direction going too long.

A common practical detail is to set \beta_{k+1} = \max(\beta_{k+1}^{PR}, 0). This prevents \beta from going negative, which could reverse the search direction.

Restart strategy: Every n iterations (or when |g_{k+1}^T g_k| / \|g_{k+1}\|^2 > 0.1), reset d_{k+1} = -g_{k+1}. This guards against loss of conjugacy from inexact line searches.
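That restart rule can be sketched as a small helper. Everything here is illustrative (the function name and the 0.1 threshold come from the rule above, not from any library API):

```python
import numpy as np

# Illustrative helper for the restart rule: fall back to steepest descent
# every n iterations, or when successive gradients are far from orthogonal.
def next_direction(g_new, g, d, k, n):
    beta = max(g_new @ (g_new - g) / (g @ g), 0.0)   # Polak-Ribiere+
    lost_orthogonality = abs(g_new @ g) / (g_new @ g_new) > 0.1
    if (k + 1) % n == 0 or lost_orthogonality:
        return -g_new                # restart: discard the old direction
    return -g_new + beta * d

# Orthogonal successive gradients keep the CG correction term...
d1 = next_direction(np.array([0.0, 1.0]), np.array([1.0, 0.0]),
                    np.array([-1.0, 0.0]), k=0, n=10)
# ...while nearly parallel gradients trigger a restart to -g_new:
d2 = next_direction(np.array([0.99, 0.0]), np.array([1.0, 0.0]),
                    np.array([-1.0, 0.0]), k=0, n=10)
print(d1, d2)
```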

Comparison: steepest descent vs CG vs Newton

| Property | Steepest descent | Conjugate gradient | Newton's method |
| --- | --- | --- | --- |
| Convergence rate | Linear | Superlinear (finite on quadratics) | Quadratic |
| Cost per iteration | O(n) | O(n) + one mat-vec | O(n^3) (Hessian solve) |
| Storage | O(n) | O(n) | O(n^2) |
| Quadratic in n dims | Many iterations | At most n iterations | 1 iteration |
| Zig-zagging | Yes | No (on quadratics) | No |
| Needs Hessian | No | No | Yes |
| Best for | Small/simple problems | Large sparse systems | Small problems, fast convergence needed |

CG sits in a sweet spot. It is nearly as cheap as steepest descent per iteration, but converges much faster. It avoids the O(n^2) storage and O(n^3) per-step cost of Newton's method, making it practical for very large problems.

Convergence comparison

graph LR
  A["Steepest descent: linear rate, slow on ill-conditioned problems"] --> D["Many iterations"]
  B["Conjugate gradient: superlinear rate, at most n steps on quadratics"] --> E["Fast convergence"]
  C["Newton: quadratic rate, 1 step on quadratics"] --> F["Fastest, but O(n^3) per step"]

Convergence comparison: CG reaches f = 0 in 2 steps vs steepest descent still converging after 10

Python implementation

Here is a compact CG implementation for quadratic objectives f(x) = \frac{1}{2}x^T A x - b^T x:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b using conjugate gradient."""
    n = len(b)
    if max_iter is None:
        max_iter = n

    x = x0.copy().astype(float)
    r = b - A @ x
    d = r.copy()
    rs_old = r @ r

    for k in range(max_iter):
        Ad = A @ d
        alpha = rs_old / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r

        if np.sqrt(rs_new) < tol:
            print(f"Converged in {k+1} iterations")
            return x

        beta = rs_new / rs_old
        d = r + beta * d
        rs_old = rs_new

    print(f"Reached max iterations ({max_iter})")
    return x


# Example 1: solve the 2x2 system
A = np.array([[2, 1], [1, 3]], dtype=float)
b = np.array([1, 2], dtype=float)
x0 = np.array([0, 0], dtype=float)

x = conjugate_gradient(A, b, x0)
print(f"Solution: {x}")          # [0.2, 0.6]
print(f"Ax = {A @ x}")           # [1.0, 2.0]
print(f"Residual: {b - A @ x}")  # [0.0, 0.0]

And the nonlinear variant using Polak-Ribiere:

def cg_polak_ribiere(grad_f, x0, line_search, tol=1e-8, max_iter=1000):
    """Nonlinear CG with Polak-Ribiere and automatic restarts."""
    x = x0.copy().astype(float)
    g = grad_f(x)
    d = -g

    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            print(f"Converged in {k} iterations")
            return x

        alpha = line_search(x, d)
        x = x + alpha * d
        g_new = grad_f(x)

        beta = max(g_new @ (g_new - g) / (g @ g), 0)  # PR+: clipping at 0 restarts toward steepest descent
        d = -g_new + beta * d
        g = g_new

    return x
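To see the nonlinear variant in action, here is a self-contained sketch on the Rosenbrock function with a simple Armijo backtracking line search. The descent-direction safeguard is an added practical detail, not part of the algorithm above:

```python
import numpy as np

# Sketch: Polak-Ribiere+ CG on the Rosenbrock function with Armijo
# backtracking as the line search.
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

def armijo(x, d, g, alpha=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until the sufficient-decrease condition holds.
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha *= rho
    return alpha

x = np.array([-1.2, 1.0])
g = grad(x)
d = -g
for k in range(5000):
    if np.linalg.norm(g) < 1e-6:
        break
    alpha = armijo(x, d, g)
    x = x + alpha * d
    g_new = grad(x)
    beta = max(g_new @ (g_new - g) / (g @ g), 0.0)  # PR+
    d = -g_new + beta * d
    if g_new @ d >= 0:       # safeguard: keep d a descent direction
        d = -g_new
    g = g_new
print(np.round(x, 3))
```

With these settings the iterate typically lands near the minimizer (1, 1), though the exact iteration count depends on the line search quality.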

What comes next

CG handles unconstrained problems well, but most real optimization problems have constraints. When you add equality or inequality constraints, you enter the world of constrained optimization. The main tool there is Lagrangian duality, which converts constrained problems into unconstrained ones by introducing dual variables.

CG also shows up inside other algorithms. Truncated Newton methods use CG to approximately solve the Newton system H \, \Delta x = -g without ever forming the full Hessian matrix. This combination gives you Newton-like convergence at CG-like cost per iteration. If you work with large-scale optimization (training neural networks, solving PDEs), you will see CG everywhere.
