
Conjugate gradient methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This post builds on steepest descent and matrix operations. You should be comfortable with matrix-vector products, gradients, and the idea of iterative minimization before reading further.

Why conjugate gradient exists

Solving large linear systems Ax = b shows up everywhere in ML. Least squares, Gaussian processes, Newton steps, and Hessian-vector products all reduce to linear solves. Direct methods like Gaussian elimination cost O(n^3). When n is in the millions, that cost is brutal.

Gradient descent can solve these systems iteratively, but it zigzags badly on elongated bowls. Each step undoes some of the progress from the previous step, wasting iterations.

| Method | Cost per step | Behavior | Steps for n-dim quadratic |
| --- | --- | --- | --- |
| Direct solve (Gaussian elimination) | O(n^3) | Exact | 1 (but expensive) |
| Gradient descent | O(n) | Zigzags on ill-conditioned problems | Many (depends on condition number) |
| Conjugate gradient | O(n) + one matrix-vector product | No zigzag, steady progress | At most n |

The zigzag problem with gradient descent

graph LR
  A["Start"] --> B["Step 1: move toward minimum"]
  B --> C["Step 2: overshoot, correct sideways"]
  C --> D["Step 3: correct back"]
  D --> E["Step 4: still zigzagging..."]
  E --> F["Eventually converges (slowly)"]

Conjugate gradient picks search directions that do not undo previous progress.

Contour plot comparing steepest descent (red, zigzag) vs conjugate gradient (green, direct) on f(x,y) = x^2 + 10y^2. CG reaches the exact minimum in just 2 steps.

Each step fully solves the problem along its direction, and later steps never interfere. On an n-dimensional quadratic, CG reaches the exact solution in at most n steps.

Now let’s see how it works.

The problem: solving linear systems cheaply

Suppose you need to solve Ax = b where A is an n \times n symmetric positive definite matrix. Direct methods like Gaussian elimination cost O(n^3) operations. When n is large (millions of variables in engineering or ML problems), that cost is prohibitive.

Here is the key insight. Solving Ax = b is the same as minimizing the quadratic function:

f(x) = \frac{1}{2}x^T A x - b^T x

Take the gradient and set it to zero:

\nabla f(x) = Ax - b = 0 \implies Ax = b

So any method that minimizes f also solves the linear system. Gradient descent can do this, but it zig-zags badly when the eigenvalues of A are spread out. The conjugate gradient method (CG) fixes this: it solves an n-dimensional quadratic in at most n iterations, using only matrix-vector products, with no matrix factorization needed.
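To make the equivalence concrete, here is a quick numerical sketch (the matrix and vectors are arbitrary choices for illustration) checking that the gradient of the quadratic really is Ax - b:

```python
import numpy as np

# Sketch: confirm numerically that for f(x) = 0.5 x^T A x - b^T x,
# the gradient is Ax - b. A and b are arbitrary; A is made symmetric
# positive definite by construction.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)
b = rng.standard_normal(3)

f = lambda v: 0.5 * v @ A @ v - b @ v
x = rng.standard_normal(3)

# Central finite differences along each coordinate axis
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, A @ x - b, atol=1e-5))  # True
```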

Conjugate directions: the core idea

Two vectors d_i and d_j are conjugate (or A-conjugate) with respect to A if:

d_i^T A \, d_j = 0 \quad \text{for } i \neq j

This looks like orthogonality, but twisted by A. Ordinary orthogonality means d_i^T d_j = 0. Conjugacy means orthogonality in a space stretched by A.

Why does this matter? If you search along conjugate directions, each step solves the problem completely along that direction. Later steps never undo the progress of earlier ones. This is exactly what steepest descent gets wrong: it keeps revisiting the same directions, zig-zagging toward the solution.

Think of it this way. For a 2D quadratic with elliptical contours, steepest descent bounces back and forth across the narrow valley. CG picks two directions that are “independent” in the geometry of the ellipse, so it nails the answer in just 2 steps.

If you have n mutually conjugate directions d_0, d_1, \ldots, d_{n-1}, you can express the solution as a linear combination along those directions. Each coefficient is found by a single line search. That gives you the exact answer in n steps.
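As a sketch of this claim: the eigenvectors of a symmetric A are one valid set of conjugate directions (they are orthogonal, and A merely scales them), so one exact line search along each recovers the solution:

```python
import numpy as np

# Sketch: for symmetric A, orthonormal eigenvectors are A-conjugate,
# and one exact line search along each one solves Ax = b in n steps.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

_, V = np.linalg.eigh(A)             # columns are orthonormal eigenvectors
d0, d1 = V[:, 0], V[:, 1]
print(np.isclose(d0 @ A @ d1, 0.0))  # True: A-conjugate

x = np.zeros(2)
for d in (d0, d1):
    alpha = d @ (b - A @ x) / (d @ A @ d)  # exact line search along d
    x = x + alpha * d
print(np.allclose(A @ x, b))         # True: exact solution in n = 2 steps
```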

Conjugate directions vs steepest descent directions

graph TD
  A["Steepest descent: each direction is the negative gradient"] --> B["Directions overlap, undo previous work"]
  C["Conjugate gradient: each direction is A-orthogonal to all previous ones"] --> D["Directions are independent in the geometry of A"]
  B --> E["Slow convergence, zigzagging"]
  D --> F["Fast convergence, at most n steps"]

The CG algorithm for linear systems

Given: symmetric positive definite A, right-hand side b, initial guess x_0.

Initialize:

r_0 = b - Ax_0 \quad \text{(residual)}

d_0 = r_0 \quad \text{(first search direction = steepest descent direction)}

For k = 0, 1, 2, \ldots:

\alpha_k = \frac{r_k^T r_k}{d_k^T A \, d_k} \quad \text{(step size)}

x_{k+1} = x_k + \alpha_k \, d_k \quad \text{(update position)}

r_{k+1} = r_k - \alpha_k \, A \, d_k \quad \text{(update residual)}

\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \quad \text{(conjugacy parameter)}

d_{k+1} = r_{k+1} + \beta_{k+1} \, d_k \quad \text{(new search direction)}

Stop when \|r_{k+1}\| is small enough.

Each iteration costs one matrix-vector product (A d_k), two dot products, and a few vector additions. No matrix inversions. No storing previous directions. The \beta formula automatically builds conjugacy into the new search direction.

The residual r_k = b - Ax_k is the negative gradient of f at x_k. It tells you how far you are from the solution. The search direction d_k is the residual corrected by \beta_k d_{k-1} to maintain conjugacy.

CG algorithm flow

graph TD
  A["Start: compute residual r0, set direction d0 = r0"] --> B["Compute step size alpha"]
  B --> C["Update position: x = x + alpha * d"]
  C --> D["Update residual: r = r - alpha * A * d"]
  D --> E["Residual small enough?"]
  E -- Yes --> F["Done: x is the solution"]
  E -- No --> G["Compute beta (conjugacy parameter)"]
  G --> H["New direction: d = r + beta * d_prev"]
  H --> B

Worked examples

Example 1: CG on a 2x2 linear system

Solve Ax = b where:

A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad x_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

Iteration 0: Setup

r_0 = b - Ax_0 = \begin{pmatrix}1\\2\end{pmatrix} - \begin{pmatrix}0\\0\end{pmatrix} = \begin{pmatrix}1\\2\end{pmatrix}

d_0 = r_0 = \begin{pmatrix}1\\2\end{pmatrix}

Iteration 1: First step

Compute the step size:

r_0^T r_0 = 1^2 + 2^2 = 5

A \, d_0 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}4\\7\end{pmatrix}

d_0^T A \, d_0 = 1 \cdot 4 + 2 \cdot 7 = 18

\alpha_0 = \frac{5}{18}

Update position and residual:

x_1 = \begin{pmatrix}0\\0\end{pmatrix} + \frac{5}{18}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}5/18\\5/9\end{pmatrix}

r_1 = r_0 - \alpha_0 \, A \, d_0 = \begin{pmatrix}1\\2\end{pmatrix} - \frac{5}{18}\begin{pmatrix}4\\7\end{pmatrix} = \begin{pmatrix}1 - 20/18 \\ 2 - 35/18\end{pmatrix} = \begin{pmatrix}-1/9\\1/18\end{pmatrix}

Build the next search direction:

r_1^T r_1 = \frac{1}{81} + \frac{1}{324} = \frac{4}{324} + \frac{1}{324} = \frac{5}{324}

\beta_1 = \frac{r_1^T r_1}{r_0^T r_0} = \frac{5/324}{5} = \frac{1}{324}

d_1 = r_1 + \beta_1 \, d_0 = \begin{pmatrix}-1/9\\1/18\end{pmatrix} + \frac{1}{324}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}-35/324\\20/324\end{pmatrix}

Iteration 2: Second step

A \, d_1 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}-35/324\\20/324\end{pmatrix} = \begin{pmatrix}-50/324\\25/324\end{pmatrix}

d_1^T A \, d_1 = \frac{(-35)(-50) + (20)(25)}{324^2} = \frac{1750 + 500}{324^2} = \frac{2250}{324^2}

\alpha_1 = \frac{r_1^T r_1}{d_1^T A \, d_1} = \frac{5/324}{2250/324^2} = \frac{5 \cdot 324}{2250} = \frac{1620}{2250} = \frac{18}{25}

x_2 = x_1 + \alpha_1 \, d_1 = \begin{pmatrix}5/18\\5/9\end{pmatrix} + \frac{18}{25}\begin{pmatrix}-35/324\\20/324\end{pmatrix}

For the first component: \frac{5}{18} + \frac{18}{25} \cdot \frac{-35}{324} = \frac{5}{18} - \frac{630}{8100} = \frac{5}{18} - \frac{7}{90} = \frac{25}{90} - \frac{7}{90} = \frac{18}{90} = \frac{1}{5}

For the second component: \frac{5}{9} + \frac{18}{25} \cdot \frac{20}{324} = \frac{5}{9} + \frac{360}{8100} = \frac{5}{9} + \frac{2}{45} = \frac{25}{45} + \frac{2}{45} = \frac{27}{45} = \frac{3}{5}

x_2 = \begin{pmatrix}1/5\\3/5\end{pmatrix}

Verify: Ax_2 = \begin{pmatrix}2&1\\1&3\end{pmatrix}\begin{pmatrix}1/5\\3/5\end{pmatrix} = \begin{pmatrix}2/5 + 3/5\\1/5 + 9/5\end{pmatrix} = \begin{pmatrix}1\\2\end{pmatrix} = b

CG solved a 2D system in exactly 2 iterations, as guaranteed.
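The hand computation above can be checked with a few lines of NumPy, using the same update rules as the algorithm section:

```python
import numpy as np

# Reproduce Example 1's two CG iterations numerically.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
r = b - A @ x
d = r.copy()
for _ in range(2):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)        # step size
    x = x + alpha * d                 # update position
    r_new = r - alpha * Ad            # update residual
    d = r_new + (r_new @ r_new) / (r @ r) * d  # new conjugate direction
    r = r_new
print(np.allclose(x, [0.2, 0.6]))     # True: matches x_2 = (1/5, 3/5)
```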

Example 2: CG on a quadratic function

Minimize f(x, y) = x^2 + 4y^2 starting from x_0 = (4, 2)^T.

This quadratic corresponds to A = \begin{pmatrix}2&0\\0&8\end{pmatrix} and b = \begin{pmatrix}0\\0\end{pmatrix}, since f(x) = \frac{1}{2}x^T A x.

Iteration 0:

r_0 = b - Ax_0 = \begin{pmatrix}0\\0\end{pmatrix} - \begin{pmatrix}8\\16\end{pmatrix} = \begin{pmatrix}-8\\-16\end{pmatrix}

d_0 = r_0 = \begin{pmatrix}-8\\-16\end{pmatrix}

Iteration 1:

r_0^T r_0 = 64 + 256 = 320

A \, d_0 = \begin{pmatrix}2&0\\0&8\end{pmatrix}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}-16\\-128\end{pmatrix}

d_0^T A \, d_0 = (-8)(-16) + (-16)(-128) = 128 + 2048 = 2176

\alpha_0 = \frac{320}{2176} = \frac{5}{34}

x_1 = \begin{pmatrix}4\\2\end{pmatrix} + \frac{5}{34}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}4 - 40/34\\2 - 80/34\end{pmatrix} = \begin{pmatrix}48/17\\-6/17\end{pmatrix}

r_1 = r_0 - \alpha_0 \, A \, d_0 = \begin{pmatrix}-8\\-16\end{pmatrix} - \frac{5}{34}\begin{pmatrix}-16\\-128\end{pmatrix} = \begin{pmatrix}-96/17\\48/17\end{pmatrix}

r_1^T r_1 = \frac{9216 + 2304}{289} = \frac{11520}{289}

\beta_1 = \frac{11520/289}{320} = \frac{36}{289}

d_1 = r_1 + \beta_1 d_0 = \begin{pmatrix}-96/17\\48/17\end{pmatrix} + \frac{36}{289}\begin{pmatrix}-8\\-16\end{pmatrix} = \begin{pmatrix}-1920/289\\240/289\end{pmatrix}

Iteration 2:

A \, d_1 = \begin{pmatrix}-3840/289\\1920/289\end{pmatrix}

d_1^T A \, d_1 = \frac{1920(3840 + 240)}{289^2} = \frac{1920 \cdot 4080}{289^2} = \frac{7{,}833{,}600}{83{,}521}

\alpha_1 = \frac{11520/289}{7{,}833{,}600/83{,}521} = \frac{11520 \cdot 289}{7{,}833{,}600} = \frac{17}{40}

x_2 = \begin{pmatrix}48/17\\-6/17\end{pmatrix} + \frac{17}{40}\begin{pmatrix}-1920/289\\240/289\end{pmatrix} = \begin{pmatrix}48/17 - 48/17\\-6/17 + 6/17\end{pmatrix} = \begin{pmatrix}0\\0\end{pmatrix}

CG reaches the exact minimum (0, 0) in 2 steps. Gradient descent, by contrast, would keep zig-zagging on this problem because the condition number \kappa = 8/2 = 4 spreads the eigenvalues apart. The larger the condition number, the worse gradient descent performs relative to CG.
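A quick numerical sketch confirming the two-step convergence of Example 2:

```python
import numpy as np

# Check Example 2: CG on f(x, y) = x^2 + 4y^2, i.e. A = diag(2, 8), b = 0.
A = np.diag([2.0, 8.0])
x = np.array([4.0, 2.0])
r = -A @ x                  # b = 0, so the residual is -Ax
d = r.copy()
for _ in range(2):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new
print(np.allclose(x, 0.0))  # True: exact minimum in 2 steps
```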

Example 3: Verifying the conjugacy property

Let us verify that the search directions from Example 1 are A-conjugate, meaning d_0^T A \, d_1 = 0.

From Example 1:

d_0 = \begin{pmatrix}1\\2\end{pmatrix}, \quad d_1 = \begin{pmatrix}-35/324\\20/324\end{pmatrix}

First, compute A \, d_1 (we already did this):

A \, d_1 = \begin{pmatrix}-50/324\\25/324\end{pmatrix}

Now take the dot product d_0^T (A \, d_1):

d_0^T A \, d_1 = 1 \cdot \frac{-50}{324} + 2 \cdot \frac{25}{324} = \frac{-50}{324} + \frac{50}{324} = 0 \quad \checkmark

The two search directions are A-conjugate. The CG algorithm guarantees this by construction: the \beta formula is chosen precisely so each new direction is conjugate to all previous ones.

You can also verify that the residuals are orthogonal in the ordinary sense:

r_0^T r_1 = 1 \cdot \frac{-1}{9} + 2 \cdot \frac{1}{18} = \frac{-1}{9} + \frac{1}{9} = 0 \quad \checkmark

This is another property CG maintains. The residuals form an orthogonal set, and the search directions form a conjugate set.
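Both checks can be scripted directly from the Example 1 quantities:

```python
import numpy as np

# Verify conjugacy of the directions and orthogonality of the residuals,
# using the exact fractions computed in Example 1.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
d0 = np.array([1.0, 2.0])
d1 = np.array([-35.0, 20.0]) / 324
r0 = np.array([1.0, 2.0])
r1 = np.array([-1.0 / 9, 1.0 / 18])

print(np.isclose(d0 @ A @ d1, 0.0))  # True: directions are A-conjugate
print(np.isclose(r0 @ r1, 0.0))      # True: residuals are orthogonal
```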

CG for nonlinear optimization

The linear CG algorithm assumes a quadratic objective. For general nonlinear functions, we cannot compute the residual r_k = b - Ax_k because there is no A or b. Instead, we use the gradient g_k = \nabla f(x_k) in place of -r_k and adapt the \beta formula.

The algorithm becomes:

  1. Choose x_0. Set g_0 = \nabla f(x_0) and d_0 = -g_0.
  2. For k = 0, 1, 2, \ldots:
    • Find \alpha_k by line search: \alpha_k = \arg\min_\alpha f(x_k + \alpha \, d_k)
    • Update: x_{k+1} = x_k + \alpha_k \, d_k
    • Compute: g_{k+1} = \nabla f(x_{k+1})
    • Compute \beta_{k+1} (see below)
    • Update direction: d_{k+1} = -g_{k+1} + \beta_{k+1} \, d_k

The two most common formulas for \beta are:

Fletcher-Reeves:

\beta_{k+1}^{FR} = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}

Polak-Ribiere:

\beta_{k+1}^{PR} = \frac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}

On a quadratic with exact line search, both formulas give identical results and reduce to the linear CG algorithm. On general nonlinear functions, Polak-Ribiere tends to work better in practice. The reason: when the gradient changes slowly (g_{k+1} \approx g_k), Polak-Ribiere gives \beta \approx 0, which effectively restarts CG as steepest descent. Fletcher-Reeves can keep a poor search direction going too long.

A common practical detail is to set \beta_{k+1} = \max(\beta_{k+1}^{PR}, 0). This prevents \beta from going negative, which could reverse the search direction.

Restart strategy: Every n iterations (or when |g_{k+1}^T g_k| / \|g_{k+1}\|^2 > 0.1), reset d_{k+1} = -g_{k+1}. This guards against loss of conjugacy from inexact line searches.
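That restart rule can be sketched as a small helper. Everything here is illustrative (the function name and the 0.1 threshold come from the rule above, not from any library API):

```python
import numpy as np

# Illustrative helper for the restart rule: fall back to steepest descent
# every n iterations, or when successive gradients are far from orthogonal.
def next_direction(g_new, g, d, k, n):
    beta = max(g_new @ (g_new - g) / (g @ g), 0.0)   # Polak-Ribiere+
    lost_orthogonality = abs(g_new @ g) / (g_new @ g_new) > 0.1
    if (k + 1) % n == 0 or lost_orthogonality:
        return -g_new                # restart: discard the old direction
    return -g_new + beta * d

# Orthogonal successive gradients keep the CG correction term...
d1 = next_direction(np.array([0.0, 1.0]), np.array([1.0, 0.0]),
                    np.array([-1.0, 0.0]), k=0, n=10)
# ...while nearly parallel gradients trigger a restart to -g_new:
d2 = next_direction(np.array([0.99, 0.0]), np.array([1.0, 0.0]),
                    np.array([-1.0, 0.0]), k=0, n=10)
print(d1, d2)
```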

Comparison: steepest descent vs CG vs Newton

| Property | Steepest descent | Conjugate gradient | Newton's method |
| --- | --- | --- | --- |
| Convergence rate | Linear | Superlinear (finite on quadratics) | Quadratic |
| Cost per iteration | O(n) | O(n) + one mat-vec | O(n^3) (Hessian solve) |
| Storage | O(n) | O(n) | O(n^2) |
| Quadratic in n dims | Many iterations | At most n iterations | 1 iteration |
| Zig-zagging | Yes | No (on quadratics) | No |
| Needs Hessian | No | No | Yes |
| Best for | Small/simple problems | Large sparse systems | Small problems, fast convergence needed |

CG sits in a sweet spot. It is nearly as cheap as steepest descent per iteration, but converges much faster. It avoids the O(n^2) storage and O(n^3) per-step cost of Newton's method, making it practical for very large problems.

Convergence comparison

graph LR
  A["Steepest descent: linear rate, slow on ill-conditioned problems"] --> D["Many iterations"]
  B["Conjugate gradient: superlinear rate, at most n steps on quadratics"] --> E["Fast convergence"]
  C["Newton: quadratic rate, 1 step on quadratics"] --> F["Fastest, but O(n^3) per step"]

Convergence comparison: CG reaches f = 0 in 2 steps vs steepest descent still converging after 10

Python implementation

Here is a compact CG implementation for quadratic objectives f(x) = \frac{1}{2}x^T A x - b^T x:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b using conjugate gradient."""
    n = len(b)
    if max_iter is None:
        max_iter = n

    x = x0.copy().astype(float)
    r = b - A @ x
    d = r.copy()
    rs_old = r @ r

    for k in range(max_iter):
        Ad = A @ d
        alpha = rs_old / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r

        if np.sqrt(rs_new) < tol:
            print(f"Converged in {k+1} iterations")
            return x

        beta = rs_new / rs_old
        d = r + beta * d
        rs_old = rs_new

    print(f"Reached max iterations ({max_iter})")
    return x


# Example 1: solve the 2x2 system
A = np.array([[2, 1], [1, 3]], dtype=float)
b = np.array([1, 2], dtype=float)
x0 = np.array([0, 0], dtype=float)

x = conjugate_gradient(A, b, x0)
print(f"Solution: {x}")          # [0.2, 0.6]
print(f"Ax = {A @ x}")           # [1.0, 2.0]
print(f"Residual: {b - A @ x}")  # [0.0, 0.0]

And the nonlinear variant using Polak-Ribiere:

def cg_polak_ribiere(grad_f, x0, line_search, tol=1e-8, max_iter=1000):
    """Nonlinear CG with Polak-Ribiere and automatic restarts."""
    x = x0.copy().astype(float)
    g = grad_f(x)
    d = -g

    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            print(f"Converged in {k} iterations")
            return x

        alpha = line_search(x, d)
        x = x + alpha * d
        g_new = grad_f(x)

        beta = max(g_new @ (g_new - g) / (g @ g), 0)  # PR+: clipping at 0 restarts toward steepest descent
        d = -g_new + beta * d
        g = g_new

    return x
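To see the nonlinear variant in action, here is a self-contained sketch on the Rosenbrock function with a simple Armijo backtracking line search. The descent-direction safeguard is an added practical detail, not part of the algorithm above:

```python
import numpy as np

# Sketch: Polak-Ribiere+ CG on the Rosenbrock function with Armijo
# backtracking as the line search.
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

def armijo(x, d, g, alpha=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until the sufficient-decrease condition holds.
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha *= rho
    return alpha

x = np.array([-1.2, 1.0])
g = grad(x)
d = -g
for k in range(5000):
    if np.linalg.norm(g) < 1e-6:
        break
    alpha = armijo(x, d, g)
    x = x + alpha * d
    g_new = grad(x)
    beta = max(g_new @ (g_new - g) / (g @ g), 0.0)  # PR+
    d = -g_new + beta * d
    if g_new @ d >= 0:       # safeguard: keep d a descent direction
        d = -g_new
    g = g_new
print(np.round(x, 3))
```

With these settings the iterate typically lands near the minimizer (1, 1), though the exact iteration count depends on the line search quality.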

What comes next

CG handles unconstrained problems well, but most real optimization problems have constraints. When you add equality or inequality constraints, you enter the world of constrained optimization. The main tool there is Lagrangian duality, which converts constrained problems into unconstrained ones by introducing dual variables.

CG also shows up inside other algorithms. Truncated Newton methods use CG to approximately solve the Newton system H \, \Delta x = -g without ever forming the full Hessian matrix. This combination gives you Newton-like convergence at CG-like cost per iteration. If you work with large-scale optimization (training neural networks, solving PDEs), you will see CG everywhere.
