Conjugate gradient methods
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites: This post builds on steepest descent and matrix operations. You should be comfortable with matrix-vector products, gradients, and the idea of iterative minimization before reading further.
Why conjugate gradient exists
Solving large linear systems shows up everywhere in ML. Least squares, Gaussian processes, Newton steps, and Hessian-vector products all reduce to linear solves. Direct methods like Gaussian elimination cost $O(n^3)$. When $n$ is in the millions, that cost is brutal.
Gradient descent can solve these systems iteratively, but it zigzags badly on elongated bowls. Each step undoes some of the progress from the previous step, wasting iterations.
| Method | Cost per step | Behavior | Steps for $n$-dim quadratic |
|---|---|---|---|
| Direct solve (Gaussian elimination) | $O(n^3)$ total | Exact | 1 (but expensive) |
| Gradient descent | $O(n)$ + one matrix-vector product | Zigzags on ill-conditioned problems | Many (depends on condition number) |
| Conjugate gradient | $O(n)$ + one matrix-vector product | No zigzag, steady progress | At most $n$ |
The zigzag problem with gradient descent
```mermaid
graph LR
    A["Start"] --> B["Step 1: move toward minimum"]
    B --> C["Step 2: overshoot, correct sideways"]
    C --> D["Step 3: correct back"]
    D --> E["Step 4: still zigzagging..."]
    E --> F["Eventually converges (slowly)"]
```
Conjugate gradient picks search directions that do not undo previous progress.
*Figure: Contour plot comparing steepest descent (red, zigzag) vs conjugate gradient (green, direct) on f(x,y) = x^2 + 10y^2. CG reaches the exact minimum in just 2 steps.*
Each step fully solves the problem along its direction, and later steps never interfere. On an $n$-dimensional quadratic, CG reaches the exact solution in at most $n$ steps.
Now let’s see how it works.
The problem: solving linear systems cheaply
Suppose you need to solve $Ax = b$ where $A$ is an $n \times n$ symmetric positive definite matrix. Direct methods like Gaussian elimination cost $O(n^3)$ operations. When $n$ is large (millions of variables in engineering or ML problems), that cost is brutal.
Here is the key insight. Solving $Ax = b$ is the same as minimizing the quadratic function:

$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$

Take the gradient and set it to zero:

$$\nabla f(x) = Ax - b = 0 \quad \Longrightarrow \quad Ax = b$$

So any method that minimizes $f$ also solves the linear system. Gradient descent can do this, but it zig-zags badly when the eigenvalues of $A$ are spread out. The conjugate gradient method (CG) fixes this. It solves an $n$-dimensional quadratic in at most $n$ iterations, using only matrix-vector products, no matrix factorization needed.
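This equivalence is easy to check numerically. A minimal numpy sketch (the small matrix here is chosen for illustration): the gradient $Ax - b$ vanishes exactly at the solution of the linear system.

```python
import numpy as np

# A small symmetric positive definite system, chosen for illustration
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve A x = b directly
x_star = np.linalg.solve(A, b)

# The gradient of f(x) = 0.5 x^T A x - b^T x is A x - b,
# so it should vanish at x_star
grad = A @ x_star - b
print(np.allclose(grad, 0))  # True
```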
Conjugate directions: the core idea
Two vectors $d_i$ and $d_j$ are conjugate (or $A$-conjugate) with respect to a symmetric positive definite matrix $A$ if:

$$d_i^\top A d_j = 0$$

This looks like orthogonality, but twisted by $A$. Ordinary orthogonality means $d_i^\top d_j = 0$. Conjugacy means orthogonality in a space stretched by $A$.
Why does this matter? If you search along conjugate directions, each step solves the problem completely along that direction. Later steps never undo the progress of earlier ones. This is exactly what steepest descent gets wrong: it keeps revisiting the same directions, zig-zagging toward the solution.
Think of it this way. For a 2D quadratic with elliptical contours, steepest descent bounces back and forth across the narrow valley. CG picks two directions that are “independent” in the geometry of the ellipse, so it nails the answer in just 2 steps.
If you have $n$ mutually conjugate directions $d_0, d_1, \ldots, d_{n-1}$, you can express the solution as a linear combination along those directions:

$$x^* = x_0 + \sum_{k=0}^{n-1} \alpha_k d_k$$

Each coefficient $\alpha_k$ is found by a single line search. That gives you the exact answer in $n$ steps.
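This claim can be verified in a few lines. A sketch (using eigenvectors as the conjugate directions is an illustrative choice, not part of the examples below): eigenvectors of a symmetric matrix are orthogonal and therefore also $A$-conjugate, and one exact line search per direction recovers the exact solution.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Eigenvectors of a symmetric matrix are orthogonal, and since
# A v = lambda v, also A-conjugate: v_i^T A v_j = lambda_j v_i^T v_j = 0
_, V = np.linalg.eigh(A)
d = [V[:, 0], V[:, 1]]
print(abs(d[0] @ A @ d[1]) < 1e-12)  # True: directions are A-conjugate

# One exact line search per direction:
# alpha_k = d_k^T r_k / (d_k^T A d_k), with r_k the current residual
x = np.zeros(2)
for dk in d:
    alpha = dk @ (b - A @ x) / (dk @ A @ dk)
    x = x + alpha * dk

print(np.allclose(x, np.linalg.solve(A, b)))  # True: exact answer in n steps
```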
Conjugate directions vs steepest descent directions
```mermaid
graph TD
    A["Steepest descent: each direction is the negative gradient"] --> B["Directions overlap, undo previous work"]
    C["Conjugate gradient: each direction is A-orthogonal to all previous ones"] --> D["Directions are independent in the geometry of A"]
    B --> E["Slow convergence, zigzagging"]
    D --> F["Fast convergence, at most n steps"]
```
The CG algorithm for linear systems
Given: symmetric positive definite $A$, right-hand side $b$, initial guess $x_0$.

Initialize:

$$r_0 = b - A x_0, \qquad d_0 = r_0$$

For $k = 0, 1, 2, \ldots$:

$$\alpha_k = \frac{r_k^\top r_k}{d_k^\top A d_k}$$

$$x_{k+1} = x_k + \alpha_k d_k$$

$$r_{k+1} = r_k - \alpha_k A d_k$$

$$\beta_{k+1} = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$$

$$d_{k+1} = r_{k+1} + \beta_{k+1} d_k$$

Stop when $\|r_{k+1}\|$ is small enough.
Each iteration costs one matrix-vector product ($A d_k$), two dot products, and a few vector additions. No matrix inversions. No storing all previous directions. The $\beta$ formula automatically builds conjugacy into the new search direction.

The residual $r_k = b - A x_k$ is the negative gradient of $f$ at $x_k$. It tells you how far you are from the solution. The search direction $d_k$ is the residual corrected by $\beta_k d_{k-1}$ to maintain conjugacy.
CG algorithm flow
```mermaid
graph TD
    A["Start: compute residual r0, set direction d0 = r0"] --> B["Compute step size alpha"]
    B --> C["Update position: x = x + alpha * d"]
    C --> D["Update residual: r = r - alpha * A * d"]
    D --> E["Residual small enough?"]
    E -- Yes --> F["Done: x is the solution"]
    E -- No --> G["Compute beta (conjugacy parameter)"]
    G --> H["New direction: d = r + beta * d_prev"]
    H --> B
```
Worked examples
Example 1: CG on a 2x2 linear system
Solve $Ax = b$ where:

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad x_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Iteration 0: Setup

$$r_0 = b - A x_0 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad d_0 = r_0 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Iteration 1: First step

Compute the step size:

$$A d_0 = \begin{pmatrix} 4 \\ 7 \end{pmatrix}, \qquad \alpha_0 = \frac{r_0^\top r_0}{d_0^\top A d_0} = \frac{5}{18}$$

Update position and residual:

$$x_1 = x_0 + \alpha_0 d_0 = \begin{pmatrix} 5/18 \\ 5/9 \end{pmatrix}, \qquad r_1 = r_0 - \alpha_0 A d_0 = \begin{pmatrix} -1/9 \\ 1/18 \end{pmatrix}$$

Build the next search direction:

$$\beta_1 = \frac{r_1^\top r_1}{r_0^\top r_0} = \frac{5/324}{5} = \frac{1}{324}, \qquad d_1 = r_1 + \beta_1 d_0 = \frac{1}{324} \begin{pmatrix} -35 \\ 20 \end{pmatrix}$$

Iteration 2: Second step

$$A d_1 = \frac{1}{324} \begin{pmatrix} -50 \\ 25 \end{pmatrix}, \qquad \alpha_1 = \frac{r_1^\top r_1}{d_1^\top A d_1} = \frac{18}{25}$$

For the first component:

$$x_2^{(1)} = \frac{5}{18} + \frac{18}{25} \cdot \left(-\frac{35}{324}\right) = \frac{25}{90} - \frac{7}{90} = \frac{18}{90} = 0.2$$

For the second component:

$$x_2^{(2)} = \frac{5}{9} + \frac{18}{25} \cdot \frac{20}{324} = \frac{25}{45} + \frac{2}{45} = \frac{27}{45} = 0.6$$

Verify: $A x_2 = \begin{pmatrix} 2(0.2) + 0.6 \\ 0.2 + 3(0.6) \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} = b$ ✓
CG solved a 2D system in exactly 2 iterations, as guaranteed.
Example 2: CG on a quadratic function
Minimize $f(x, y) = x^2 + 10y^2$ starting from $(x_0, y_0) = (10, 1)$. (The starting point here is a representative choice; any generic start shows the same two-step behavior.)

This quadratic corresponds to $A = \begin{pmatrix} 2 & 0 \\ 0 & 20 \end{pmatrix}$ and $b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, since $f(x) = \frac{1}{2} x^\top A x - b^\top x$.

Iteration 0:

$$r_0 = b - A x_0 = -\nabla f(x_0) = \begin{pmatrix} -20 \\ -20 \end{pmatrix}, \qquad d_0 = r_0$$

Iteration 1:

$$\alpha_0 = \frac{r_0^\top r_0}{d_0^\top A d_0} = \frac{800}{8800} = \frac{1}{11}, \qquad x_1 = x_0 + \alpha_0 d_0 = \begin{pmatrix} 90/11 \\ -9/11 \end{pmatrix}$$

$$r_1 = \frac{180}{11} \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \qquad \beta_1 = \frac{r_1^\top r_1}{r_0^\top r_0} = \frac{81}{121}, \qquad d_1 = r_1 + \beta_1 d_0 = \frac{360}{121} \begin{pmatrix} -10 \\ 1 \end{pmatrix}$$

Iteration 2:

$$\alpha_1 = \frac{r_1^\top r_1}{d_1^\top A d_1} = \frac{121}{440}, \qquad x_2 = x_1 + \alpha_1 d_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

CG reaches the exact minimum in 2 steps. Compare this to gradient descent, which would need many iterations on this problem because the condition number $\kappa(A) = 20/2 = 10$ causes zig-zagging. The higher the condition number, the worse gradient descent performs relative to CG.
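The two-step convergence is easy to confirm in code. A minimal sketch, using $A = \mathrm{diag}(2, 20)$ and $b = 0$ from the quadratic above (the starting point is an arbitrary illustrative choice):

```python
import numpy as np

# f(x, y) = x^2 + 10 y^2  corresponds to  A = diag(2, 20), b = 0
A = np.diag([2.0, 20.0])
b = np.zeros(2)

x = np.array([10.0, 1.0])   # arbitrary starting point
r = b - A @ x               # residual = negative gradient
d = r.copy()
rs_old = r @ r

for k in range(2):          # exactly two CG iterations
    Ad = A @ d
    alpha = rs_old / (d @ Ad)
    x = x + alpha * d
    r = r - alpha * Ad
    rs_new = r @ r
    d = r + (rs_new / rs_old) * d
    rs_old = rs_new

print(np.allclose(x, 0))    # True: exact minimum reached in 2 steps
```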
Example 3: Verifying the conjugacy property
Let us verify that the search directions from Example 1 are $A$-conjugate, meaning $d_0^\top A d_1 = 0$.

From Example 1:

$$d_0 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad d_1 = \frac{1}{324} \begin{pmatrix} -35 \\ 20 \end{pmatrix}$$

First, compute $A d_0$ (we already did this):

$$A d_0 = \begin{pmatrix} 4 \\ 7 \end{pmatrix}$$

Now take the dot product $d_1^\top (A d_0)$:

$$d_1^\top A d_0 = \frac{1}{324} \big( (-35)(4) + (20)(7) \big) = \frac{-140 + 140}{324} = 0$$

The two search directions are $A$-conjugate. The CG algorithm guarantees this by construction: the $\beta$ formula is chosen precisely so each new direction is conjugate to all previous ones.

You can also verify that the residuals are orthogonal (not just conjugate):

$$r_0^\top r_1 = (1)\left(-\tfrac{1}{9}\right) + (2)\left(\tfrac{1}{18}\right) = -\tfrac{1}{9} + \tfrac{1}{9} = 0$$
This is another property CG maintains. The residuals form an orthogonal set, and the search directions form a conjugate set.
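Both properties can be checked numerically with a short sketch that runs one CG step by hand on the system from Example 1 (variable names are illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# One CG step by hand, keeping the intermediate quantities
x0 = np.zeros(2)
r0 = b - A @ x0
d0 = r0.copy()
alpha0 = (r0 @ r0) / (d0 @ A @ d0)
r1 = r0 - alpha0 * (A @ d0)
beta1 = (r1 @ r1) / (r0 @ r0)
d1 = r1 + beta1 * d0

print(abs(d0 @ A @ d1) < 1e-12)  # True: search directions are A-conjugate
print(abs(r0 @ r1) < 1e-12)      # True: residuals are orthogonal
```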
CG for nonlinear optimization
The linear CG algorithm assumes a quadratic objective. For general nonlinear functions, we cannot compute the residual $r = b - Ax$ because there is no $A$ or $b$. Instead, we use the gradient $g_k = \nabla f(x_k)$ in place of $-r_k$ and adapt the $\beta$ formula.
The algorithm becomes:
- Choose $x_0$. Set $g_0 = \nabla f(x_0)$ and $d_0 = -g_0$.
- For $k = 0, 1, 2, \ldots$:
  - Find $\alpha_k$ by line search: $\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)$
  - Update: $x_{k+1} = x_k + \alpha_k d_k$
  - Compute: $g_{k+1} = \nabla f(x_{k+1})$
  - Compute $\beta_{k+1}$ (see below)
  - Update direction: $d_{k+1} = -g_{k+1} + \beta_{k+1} d_k$
The two most common formulas for $\beta_{k+1}$ are:

Fletcher-Reeves:

$$\beta_{k+1}^{FR} = \frac{g_{k+1}^\top g_{k+1}}{g_k^\top g_k}$$

Polak-Ribiere:

$$\beta_{k+1}^{PR} = \frac{g_{k+1}^\top (g_{k+1} - g_k)}{g_k^\top g_k}$$
On a quadratic with exact line search, both formulas give identical results and reduce to the linear CG algorithm. On general nonlinear functions, Polak-Ribiere tends to work better in practice. The reason: when the gradient changes slowly ($g_{k+1} \approx g_k$), Polak-Ribiere gives $\beta_{k+1} \approx 0$, which effectively restarts CG as steepest descent. Fletcher-Reeves can keep a poor search direction going too long.

A common practical detail is to set $\beta_{k+1} = \max(\beta_{k+1}^{PR}, 0)$. This prevents $\beta$ from going negative, which could reverse the search direction.

Restart strategy: Every $n$ iterations (or when $\beta_{k+1} < 0$), reset $d_{k+1} = -g_{k+1}$. This guards against loss of conjugacy from inexact line searches.
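A tiny numerical illustration of the difference (the gradient values here are made up for the demonstration): when the gradient barely changes between iterations, Fletcher-Reeves keeps $\beta \approx 1$ while Polak-Ribiere collapses to $\beta \approx 0$.

```python
import numpy as np

# Suppose the gradient barely changed between two iterations
g_old = np.array([1.0, -2.0])
g_new = g_old + 1e-6 * np.array([0.3, 0.4])

beta_fr = (g_new @ g_new) / (g_old @ g_old)
beta_pr = (g_new @ (g_new - g_old)) / (g_old @ g_old)

print(round(beta_fr, 3))    # 1.0  -- Fletcher-Reeves keeps the old direction
print(abs(beta_pr) < 1e-5)  # True -- Polak-Ribiere effectively restarts
```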
Comparison: steepest descent vs CG vs Newton
| Property | Steepest descent | Conjugate gradient | Newton’s method |
|---|---|---|---|
| Convergence rate | Linear | Superlinear (quadratics: finite) | Quadratic |
| Cost per iteration | $O(n)$ + gradient evaluation | $O(n)$ + one mat-vec | $O(n^3)$ (Hessian solve) |
| Storage | $O(n)$ | $O(n)$ | $O(n^2)$ |
| Quadratic in $n$ dims | Many iterations | At most $n$ iterations | 1 iteration |
| Zig-zagging | Yes | No (on quadratics) | No |
| Needs Hessian | No | No | Yes |
| Best for | Small/simple problems | Large sparse systems | Small problems, fast convergence needed |
CG sits in a sweet spot. It is nearly as cheap as steepest descent per iteration, but converges much faster. It avoids the $O(n^2)$ storage and $O(n^3)$ per-step cost of Newton’s method, making it practical for very large problems.
Convergence comparison
```mermaid
graph LR
    A["Steepest descent: linear rate, slow on ill-conditioned problems"] --> D["Many iterations"]
    B["Conjugate gradient: superlinear rate, at most n steps on quadratics"] --> E["Fast convergence"]
    C["Newton: quadratic rate, 1 step on quadratics"] --> F["Fastest, but O(n^3) per step"]
```
*Figure: Convergence comparison: CG reaches f = 0 in 2 steps vs steepest descent still converging after 10.*
Python implementation
Here is a compact CG implementation for quadratic objectives $f(x) = \frac{1}{2} x^\top A x - b^\top x$, i.e. for solving $Ax = b$:
```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b using conjugate gradient."""
    n = len(b)
    if max_iter is None:
        max_iter = n
    x = x0.copy().astype(float)
    r = b - A @ x
    d = r.copy()
    rs_old = r @ r
    for k in range(max_iter):
        Ad = A @ d
        alpha = rs_old / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            print(f"Converged in {k+1} iterations")
            return x
        beta = rs_new / rs_old
        d = r + beta * d
        rs_old = rs_new
    print(f"Reached max iterations ({max_iter})")
    return x

# Example 1: solve the 2x2 system
A = np.array([[2, 1], [1, 3]], dtype=float)
b = np.array([1, 2], dtype=float)
x0 = np.array([0, 0], dtype=float)
x = conjugate_gradient(A, b, x0)
print(f"Solution: {x}")          # [0.2, 0.6]
print(f"Ax = {A @ x}")           # [1.0, 2.0]
print(f"Residual: {b - A @ x}")  # [0.0, 0.0]
```
And the nonlinear variant using Polak-Ribiere:
```python
def cg_polak_ribiere(grad_f, x0, line_search, tol=1e-8, max_iter=1000):
    """Nonlinear CG with Polak-Ribiere and automatic restarts."""
    x = x0.copy().astype(float)
    g = grad_f(x)
    d = -g
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            print(f"Converged in {k} iterations")
            return x
        alpha = line_search(x, d)
        x = x + alpha * d
        g_new = grad_f(x)
        # Polak-Ribiere beta, clipped at zero: a negative beta resets the
        # direction to steepest descent (this is the automatic restart)
        beta = max(g_new @ (g_new - g) / (g @ g), 0)
        d = -g_new + beta * d
        g = g_new
    return x
```
What comes next
CG handles unconstrained problems well, but most real optimization problems have constraints. When you add equality or inequality constraints, you enter the world of constrained optimization. The main tool there is Lagrangian duality, which converts constrained problems into unconstrained ones by introducing dual variables.
CG also shows up inside other algorithms. Truncated Newton methods use CG to approximately solve the Newton system without ever forming the full Hessian matrix. This combination gives you Newton-like convergence at CG-like cost per iteration. If you work with large-scale optimization (training neural networks, solving PDEs), you will see CG everywhere.
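As an illustration of that combination, here is a minimal Hessian-free Newton step: CG approximately solves $H p = -g$ using only Hessian-vector products, here approximated by finite differences of the gradient. The function names and the finite-difference scheme are an illustrative sketch, not a reference implementation.

```python
import numpy as np

def hess_vec(grad_f, x, v, eps=1e-6):
    """Approximate H(x) @ v by central differences of the gradient."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

def newton_step_cg(grad_f, x, iters=20, tol=1e-8):
    """Approximately solve H p = -g with CG, never forming H."""
    g = grad_f(x)
    p = np.zeros_like(x)
    r = -g              # residual of H p = -g at p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hd = hess_vec(grad_f, x, d)
        alpha = rs / (d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# On a quadratic, a single (exact) Newton step lands on the minimizer
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b          # gradient of 0.5 x^T A x - b^T x
x = np.array([5.0, -3.0])
x_new = x + newton_step_cg(grad_f, x)
print(np.allclose(x_new, np.linalg.solve(A, b), atol=1e-4))  # True
```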