Frank-Wolfe method
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
The projection problem
Sometimes the constraint set in an optimization problem is so complex that projecting onto it is expensive. Projected gradient descent works by taking a gradient step, then projecting back onto the feasible set. But for structured constraints like nuclear norm balls or flow polytopes, that projection is itself a hard optimization problem.
Frank-Wolfe avoids projection entirely.
| Step | Projected gradient descent | Frank-Wolfe |
|---|---|---|
| Descent direction | Negative gradient | Negative gradient (linearized) |
| Stay feasible via | Projection onto constraint set | Linear minimization over constraint set |
| Cost of staying feasible | Solve a QP (can be expensive) | Solve an LP or use closed form (often cheap) |
| Iterates | Dense | Naturally sparse |
The Frank-Wolfe idea
```mermaid
graph LR
    A["Current point x_k"] --> B["Linearize objective at x_k"]
    B --> C["Find cheapest point v_k in constraint set (LMO)"]
    C --> D["Move toward v_k"]
    D --> E["New point x_{k+1}"]
```
Instead of projecting back onto the constraint set after each step, Frank-Wolfe finds the best point inside the constraint set using a simpler linear problem. The linear minimization oracle (LMO) replaces the projection step and is often orders of magnitude cheaper.
Now let’s formalize this.
Prerequisites
You should be comfortable with gradient descent and how the gradient points in the direction of steepest increase. Understanding Lagrangian duality helps with the convergence analysis, but is not strictly required to follow the algorithm.
The problem
We want to solve:

$$\min_{x \in C} f(x)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth convex function and $C \subseteq \mathbb{R}^n$ is a compact convex set.
The standard approach is projected gradient descent: take a gradient step, then project back onto $C$. This works, but the projection step can be expensive. For many constraint sets (nuclear norm balls, flow polytopes, matroid polytopes), projection is hard while linear minimization over the set is easy.
Frank-Wolfe takes a different route: instead of projecting, it solves a linear subproblem at each step.
The algorithm
Frank-Wolfe (Conditional Gradient) Method:
```text
Input: starting point x_0 ∈ C, number of iterations T
For k = 0, 1, ..., T-1:
    1. Compute the gradient: g_k = ∇f(x_k)
    2. Linear minimization oracle (LMO):
       v_k = argmin_{v ∈ C} g_k^T v
    3. Step size: γ_k = 2/(k+2)  (or line search)
    4. Update: x_{k+1} = (1 - γ_k) x_k + γ_k v_k
```
That is the entire algorithm. Step 2 is the key: you minimize a linear function over $C$. For a polytope, this always returns a vertex. Step 4 moves toward that vertex with a step size that decreases over time.
Why “conditional gradient”?
The direction $d_k = v_k - x_k$ is the Frank-Wolfe direction. You can think of it as finding the point in $C$ that the negative gradient most wants to reach, then stepping toward it. The name "conditional gradient" comes from the fact that you optimize a linear approximation of the objective conditioned on staying in $C$.
Frank-Wolfe iteration flow
```mermaid
graph TD
    A["Compute gradient at current point"] --> B["LMO: find vertex v minimizing gradient dot v"]
    B --> C["Choose step size gamma"]
    C --> D["Update: move from x toward v"]
    D --> E["Frank-Wolfe gap small?"]
    E -- No --> A
    E -- Yes --> F["Converged"]
```
Why no projection?
Projected gradient descent requires solving:

$$\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2^2$$

This is itself an optimization problem. For simple sets (boxes, balls), projection is cheap. But for structured sets, it can be as hard as the original problem.

Frank-Wolfe replaces projection with a linear minimization oracle (LMO):

$$\mathrm{LMO}_C(g) = \arg\min_{v \in C} g^\top v$$

For polytopes, this is a linear program. For many special polytopes, it has a closed-form solution or a very efficient algorithm.
| Constraint set | Projection cost | LMO cost |
|---|---|---|
| $\ell_1$ ball | $O(n \log n)$ (sort-based) | $O(n)$ |
| Simplex | $O(n \log n)$ (sort-based) | $O(n)$ |
| Nuclear norm ball | Full SVD: $O(mn \min(m, n))$ | Top singular vector pair: roughly $O(mn)$ per power iteration |
| Flow polytope | Expensive QP | Shortest path |
When the LMO is much cheaper than projection, Frank-Wolfe wins.
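To make the cost gap concrete, here is a sketch for the probability simplex: Euclidean projection uses the classic sort-based $O(n \log n)$ routine, while the LMO is a single pass over the gradient.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based, O(n log n))."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)        # threshold so the result sums to 1
    return np.maximum(y - tau, 0.0)

def simplex_lmo(g):
    """LMO over the simplex: one pass, O(n)."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

y = np.array([0.3, 1.2, -0.4])
print(project_simplex(y))  # a dense point on the simplex: [0.05 0.95 0.  ]
print(simplex_lmo(y))      # a single vertex: [0. 0. 1.]
```

Both routines are cheap here, but the asymmetry grows with structure: for a flow polytope the projection becomes a QP while the LMO stays a shortest-path call.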
The linear minimization oracle (LMO)
```mermaid
graph TD
    A["Input: gradient direction g"] --> B["LMO solves: min g^T v over constraint set C"]
    B --> C["For a polytope: returns a vertex"]
    B --> D["For L1 ball: pick coordinate with largest abs gradient"]
    B --> E["For simplex: pick coordinate with smallest gradient"]
```
Example 1: Minimizing a quadratic over the simplex
Problem:

$$\min_{x \in \Delta_3} \; f(x) = \tfrac{1}{2} x^\top Q x - b^\top x$$

where $\Delta_3 = \{x \in \mathbb{R}^3 : x \geq 0, \; \sum_i x_i = 1\}$ is the 3-simplex (probability simplex), and:

$$Q = I, \qquad b = (0.1, \; 0.2, \; 1.5)^\top$$

The LMO on the simplex is trivial: minimizing a linear function over the simplex always returns a vertex $e_i$ (the point with $x_i = 1$). Just pick the coordinate with the smallest gradient component.

Iteration 0: Start at $x_0 = (1/3, 1/3, 1/3)$.

Compute gradient: $\nabla f(x_0) = Q x_0 - b \approx (0.233, \; 0.133, \; -1.167)$.

LMO: the smallest gradient component is at index 2 (zero-based). So $v_0 = (0, 0, 1)$.

Step size: $\gamma_0 = 2/(0+2) = 1$.

Update: $x_1 = (1 - \gamma_0)\, x_0 + \gamma_0 v_0 = (0, 0, 1)$.

Iteration 1: $x_1 = (0, 0, 1)$, with $\nabla f(x_1) = x_1 - b = (-0.1, \; -0.2, \; -0.5)$.

LMO: the minimum gradient component is again at index 2. So $v_1 = (0, 0, 1)$. Same as $x_1$!

The direction $d_1 = v_1 - x_1 = 0$, which means no progress, and the Frank-Wolfe gap $\nabla f(x_1)^\top (x_1 - v_1) = 0$, confirming optimality on the simplex (global, since $f$ is convex).

The algorithm has converged in just one step because $v_0 = (0, 0, 1)$ happens to be optimal on the simplex.

Verify: $f((0,0,1)) = \tfrac{1}{2} - 1.5 = -1$. Check the other vertices: $f((1,0,0)) = \tfrac{1}{2} - 0.1 = 0.4$, $f((0,1,0)) = \tfrac{1}{2} - 0.2 = 0.3$. Indeed $(0, 0, 1)$ is optimal.
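As a numerical check, here is a sketch on a concrete instance with exactly this one-step behavior: $f(x) = \tfrac{1}{2} x^\top x - b^\top x$ with $b = (0.1, 0.2, 1.5)$.

```python
import numpy as np

b = np.array([0.1, 0.2, 1.5])
grad = lambda x: x - b                             # gradient of (1/2)x'x - b'x

x0 = np.full(3, 1/3)
v0 = np.zeros(3); v0[np.argmin(grad(x0))] = 1.0    # LMO picks the vertex (0, 0, 1)
x1 = (1 - 1.0) * x0 + 1.0 * v0                     # step size 2/(0+2) = 1

v1 = np.zeros(3); v1[np.argmin(grad(x1))] = 1.0    # LMO returns the same vertex
gap = grad(x1) @ (x1 - v1)                         # Frank-Wolfe gap

print(x1, gap)  # -> [0. 0. 1.] 0.0
```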
Figure: Frank-Wolfe iterates on min (x-1.8)^2 + (y-1.8)^2 over the triangle x ≥ 0, y ≥ 0, x + y ≤ 2. Each step moves toward a vertex of the constraint set, producing a zigzag path toward the optimum.
Example 2: Multiple iterations with line search
Problem:

$$\min \; f(x, y) = (x - 1.8)^2 + (y - 1.8)^2$$

where $C = \{(x, y) : x \geq 0, \; y \geq 0, \; x + y \leq 2\}$, the triangle with vertices $(0,0)$, $(2,0)$, $(0,2)$.

The unconstrained minimum is $(1.8, 1.8)$, but $1.8 + 1.8 = 3.6 > 2$, so it is infeasible. The constrained optimum lies on the edge $x + y = 2$, at $(1, 1)$.

Iteration 0: Start at $x_0 = (0, 0)$, where $\nabla f(x_0) = (-3.6, -3.6)$.

LMO over $C$: Minimize $\nabla f(x_0)^\top v$ over the triangle. The vertices are $(0,0)$, $(2,0)$, and $(0,2)$. Values: $0$, $-7.2$, $-7.2$. The last two tie; take the minimum at $(2, 0)$.

So $v_0 = (2, 0)$ and $d_0 = v_0 - x_0 = (2, 0)$.

Line search: Minimize $\varphi(\gamma) = f(x_0 + \gamma d_0) = (2\gamma - 1.8)^2 + 3.24$ over $\gamma \in [0, 1]$.

Minimize: $\varphi'(\gamma) = 4(2\gamma - 1.8) = 0$, so $\gamma_0 = 0.9$.

$x_1 = (1.8, 0)$.

Iteration 1: $\nabla f(x_1) = (0, -3.6)$.

LMO: Values at the vertices: $0$, $0$, $-7.2$. Minimum at $(0, 2)$.

$v_1 = (0, 2)$. Direction: $d_1 = v_1 - x_1 = (-1.8, 2)$.

Line search: $\varphi(\gamma) = f(x_1 + \gamma d_1) = (1.8\gamma)^2 + (2\gamma - 1.8)^2 = 7.24\gamma^2 - 7.2\gamma + 3.24$

Taking the derivative and setting to zero: $14.48\gamma - 7.2 = 0$, so $\gamma_1 \approx 0.497$.

$x_2 = x_1 + \gamma_1 d_1 \approx (0.905, 0.994)$.

The iterates are approaching the optimal point $(1, 1)$ on the constraint boundary. Notice how each iterate is a convex combination of previous iterates and vertices.
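These iterations can be reproduced in a few lines. A sketch, assuming the setup from the figure ($f(x, y) = (x - 1.8)^2 + (y - 1.8)^2$ over the triangle with vertices $(0,0)$, $(2,0)$, $(0,2)$); since $f$ is quadratic, the line search has the closed form $\gamma = -\nabla f(x)^\top d / (2\, d^\top d)$, clipped to $[0, 1]$.

```python
import numpy as np

grad = lambda p: 2 * (p - 1.8)           # gradient of (x-1.8)^2 + (y-1.8)^2
verts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

def lmo(g):
    return verts[np.argmin(verts @ g)]   # minimize g.v by checking the three vertices

def exact_gamma(x, d):
    # phi(gamma) = f(x + gamma d) is quadratic; closed-form minimizer, clipped to [0, 1]
    return float(np.clip(-(grad(x) @ d) / (2 * d @ d), 0.0, 1.0))

x = np.array([0.0, 0.0])
v = np.array([2.0, 0.0])                 # iteration 0: the gradient ties; break toward (2, 0)
gamma = exact_gamma(x, v - x)            # 0.9
x = x + gamma * (v - x)                  # (1.8, 0)

v = lmo(grad(x))                         # iteration 1: LMO returns (0, 2)
gamma = exact_gamma(x, v - x)            # about 0.497
x = x + gamma * (v - x)
print(x)                                 # about (0.905, 0.994), heading toward (1, 1)
```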
Sparsity of iterates
A key property of Frank-Wolfe: after $k$ iterations, the iterate $x_k$ is a convex combination of at most $k + 1$ points: the starting point plus one vertex of $C$ per iteration. This is because each update mixes the current point with one new vertex.

For problems where $C$ is a polytope, this means $x_k$ is supported on at most $k + 1$ vertices. If you run 50 iterations on a polytope with millions of vertices, your solution involves at most 51 vertices. This sparsity is valuable in:
- Machine learning: Sparse models are interpretable. Training an SVM with Frank-Wolfe produces a solution supported on few support vectors.
- Signal processing: Recovering sparse signals from measurements.
- Combinatorial optimization: Expressing solutions as combinations of few extreme points.
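The sparsity is easy to observe directly. A sketch: run Frank-Wolfe with the default step size on an $\ell_1$ ball in $\mathbb{R}^{1000}$ (with an arbitrary quadratic objective, assumed purely for illustration) and count nonzero coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
c = rng.normal(size=n)                      # assumed toy objective f(x) = (1/2)||x - c||^2

def l1_lmo(g):
    # LMO on the unit L1 ball: a signed coordinate vertex
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(n)
for k in range(20):
    g = x - c                               # gradient
    v = l1_lmo(g)
    gamma = 2.0 / (k + 2)                   # default step size
    x = (1 - gamma) * x + gamma * v         # adds at most one new nonzero per step

print(np.count_nonzero(x))  # at most 20 nonzeros, out of 1000 coordinates
```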
Convergence
Frank-Wolfe converges at a rate of $O(1/k)$ for smooth convex functions. After $k$ iterations:

$$f(x_k) - f(x^*) \leq \frac{2 L D^2}{k + 2}$$

where $L$ is the Lipschitz constant of $\nabla f$ and $D$ is the diameter of $C$.

This is generally slower than projected gradient descent (whose $O(1/k)$ rate can be improved to $O(1/k^2)$ with acceleration, and which converges linearly for strongly convex problems), but each Frank-Wolfe iteration is cheaper when the LMO is cheaper than projection.
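The $O(1/k)$ rate is easy to observe empirically. A sketch on an assumed toy quadratic over the unit $\ell_1$ ball, using the default step size (the optimum $(0.75, 0.25)$ is the $\ell_1$ projection of the target):

```python
import numpy as np

c = np.array([1.0, 0.5])                     # assumed target; f(x) = (1/2)||x - c||^2
f = lambda x: 0.5 * np.sum((x - c) ** 2)
f_star = f(np.array([0.75, 0.25]))           # optimal value on the L1 ball

def l1_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(2)
errs = []
for k in range(200):
    v = l1_lmo(x - c)                        # LMO at the current gradient
    gamma = 2.0 / (k + 2)                    # default step size, no line search
    x = (1 - gamma) * x + gamma * v
    errs.append(f(x) - f_star)

print(errs[9], errs[99])  # the error shrinks roughly like 1/k (with some zigzag)
```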
The Frank-Wolfe gap
A useful convergence certificate is the Frank-Wolfe gap:

$$g_k = \nabla f(x_k)^\top (x_k - v_k)$$

This is the inner product of the gradient with the direction from the LMO solution $v_k$ back to the current point. By convexity, it upper bounds the suboptimality: $f(x_k) - f(x^*) \leq g_k$. When $g_k$ is small, you are close to optimal.
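Since $v_k$ already comes out of the LMO, the gap costs only one extra dot product. A sketch checking the certificate on a simplex-constrained quadratic (instance values assumed for illustration):

```python
import numpy as np

b = np.array([0.1, 0.2, 1.5])            # assumed instance: f(x) = (1/2)x'x - b'x
f = lambda x: 0.5 * x @ x - b @ x
grad = lambda x: x - b

x = np.full(3, 1/3)                      # a feasible but suboptimal point
g = grad(x)
v = np.zeros(3); v[np.argmin(g)] = 1.0   # LMO over the simplex
gap = g @ (x - v)                        # Frank-Wolfe gap

x_star = np.array([0.0, 0.0, 1.0])       # the optimum of this instance
print(gap, f(x) - f(x_star))             # the gap upper-bounds the suboptimality
```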
Frank-Wolfe vs projected gradient descent convergence
```mermaid
graph TD
    A["Frank-Wolfe: O(1/k) rate"] --> C["Slower per-iteration progress"]
    B["Projected gradient descent: O(1/k) or O(1/k^2) with acceleration"] --> D["Faster per-iteration progress"]
    C --> E["But each FW iteration is cheaper when LMO is cheap"]
    D --> F["Each PGD iteration pays the projection cost"]
```
Frank-Wolfe converges at an O(1/k) rate, while projected gradient descent can be accelerated to O(1/k^2) (and converges linearly for strongly convex objectives); in exchange, FW avoids the cost of projections.
Example 3: Frank-Wolfe on a quadratic over an ball
Problem:

$$\min \; f(x) = \tfrac{1}{2} \|x - c\|_2^2, \qquad c = (1, \; 0.5)$$

where $C = \{x \in \mathbb{R}^2 : \|x\|_1 \leq 1\}$. The $\ell_1$ ball in 2D is the diamond with vertices $(\pm 1, 0)$ and $(0, \pm 1)$.

The unconstrained minimum is $c = (1, 0.5)$ with $\|c\|_1 = 1.5 > 1$, so the constraint is active.

LMO on the $\ell_1$ ball: Minimize $g^\top v$ over $\|v\|_1 \leq 1$. The solution is $v = -\operatorname{sign}(g_j)\, e_j$ where $j = \arg\max_i |g_i|$. Pick the coordinate with the largest absolute gradient and go to the corresponding vertex.

Iteration 0: $x_0 = (0, 0)$.

LMO: $g = \nabla f(x_0) = x_0 - c = (-1, -0.5)$. The first coordinate dominates, so $v_0 = (1, 0)$.

$\gamma_0 = 2/(0+2) = 1$, so $x_1 = (1, 0)$.

$f(x_1) = \tfrac{1}{2}(0.5)^2 = 0.125$.

Iteration 1: $x_1 = (1, 0)$.

LMO: $g = x_1 - c = (0, -0.5)$. The second coordinate dominates, so $v_1 = (0, 1)$.

$\gamma_1 = 2/(1+2) = 2/3$.

$x_2 = \tfrac{1}{3}(1, 0) + \tfrac{2}{3}(0, 1) = (1/3, \; 2/3)$.

$f(x_2) = \tfrac{1}{2}\left[(1/3 - 1)^2 + (2/3 - 0.5)^2\right] \approx 0.236$.

Hmm, the objective increased! This happens because the default step size is not always optimal. Let us use line search instead.

With line search:

$\varphi(\gamma) = f(x_1 + \gamma (v_1 - x_1)) = f(1 - \gamma, \; \gamma) = \tfrac{1}{2}\left[\gamma^2 + (\gamma - 0.5)^2\right]$, minimized at $\gamma_1 = 0.25$.

$x_2 = (0.75, \; 0.25)$, with $f(x_2) = 0.0625$.

That is much better. Note: $\|x_2\|_1 = 1$, so the constraint is active.

Iteration 2: $x_2 = (0.75, 0.25)$, with $g = x_2 - c = (-0.25, -0.25)$.

LMO: Both components are equal in magnitude. Pick the first (say). $v_2 = (1, 0)$.

Line search along $d_2 = v_2 - x_2 = (0.25, -0.25)$:

$$\varphi(\gamma) = \tfrac{1}{2}\left[(0.25\gamma - 0.25)^2 + (0.25\gamma + 0.25)^2\right] = 0.0625\,(\gamma^2 + 1)$$

Minimized at $\gamma = 0$. The algorithm stays at $x_2$. Frank-Wolfe gap:

$$g_2 = \nabla f(x_2)^\top (x_2 - v_2) = (-0.25)(-0.25) + (-0.25)(0.25) = 0$$

Converged. The optimal solution is $(0.75, \; 0.25)$.

The true optimum (the projection of $c$ onto the $\ell_1$ ball, i.e., soft-thresholding with threshold $0.25$) is $(0.75, 0.25)$. Frank-Wolfe found it exactly.
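A complete run with exact line search recovers this in two steps. A sketch, assuming the quadratic $f(x) = \tfrac{1}{2}\|x - (1, 0.5)\|^2$; for this objective the line-search minimizer is $\gamma = -g^\top d / (d^\top d)$, clipped to $[0, 1]$.

```python
import numpy as np

c = np.array([1.0, 0.5])                 # assumed target: f(x) = (1/2)||x - c||^2

def l1_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(2)
for k in range(50):
    g = x - c                            # gradient
    v = l1_lmo(g)
    d = v - x
    if -(g @ d) < 1e-12:                 # Frank-Wolfe gap ~ 0: converged
        break
    # exact line search for this quadratic, clipped to [0, 1]
    gamma = float(np.clip(-(g @ d) / (d @ d), 0.0, 1.0))
    x = x + gamma * d

print(x)  # -> [0.75 0.25], the projection of c onto the L1 ball
```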
Variants and improvements
Away-step Frank-Wolfe
Standard Frank-Wolfe can zigzag near the optimum. The away-step variant can also move away from a vertex already in the active set, reducing zigzagging and improving convergence to linear rate for strongly convex objectives.
Pairwise Frank-Wolfe
Combines the toward step and away step into a single pairwise swap between vertices. Converges linearly for strongly convex functions on polytopes.
Stochastic Frank-Wolfe
When the objective is an expectation, $f(x) = \mathbb{E}_{\xi}[F(x, \xi)]$, use a stochastic gradient estimate in the LMO step. This connects Frank-Wolfe to stochastic optimization.
When to use Frank-Wolfe
✓ The constraint set has a cheap LMO but expensive projection.
✓ You want sparse solutions (few active vertices).
✓ The problem is large-scale and you need simple iterations.
⚠ Convergence is $O(1/k)$, slower than projected gradient methods for strongly convex problems. The away-step variant fixes this.
⚠ Not suitable for unconstrained problems (the whole point is the constraint structure).
Python implementation
```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, f=None, max_iter=100, tol=1e-6):
    """
    Frank-Wolfe method with optional line search.

    grad_f: gradient function
    lmo: linear minimization oracle, returns argmin_{v in C} g^T v
    x0: starting point in C
    f: objective (needed for line search; if None, use the default step size)
    """
    x = x0.copy()
    for k in range(max_iter):
        g = grad_f(x)
        v = lmo(g)
        d = v - x
        gap = -g @ d  # Frank-Wolfe gap: certifies suboptimality
        if gap < tol:
            break
        # Default step size, optionally refined by line search
        gamma = 2.0 / (k + 2)
        if f is not None:
            # Simple grid search over [0, 1]; swap in an exact line search
            # for quadratics, or backtracking for general f
            best_gamma = gamma
            best_val = f(x + gamma * d)
            for g_try in np.linspace(0, 1, 50):
                val = f(x + g_try * d)
                if val < best_val:
                    best_val = val
                    best_gamma = g_try
            gamma = best_gamma
        x = x + gamma * d
    return x

# LMO for the probability simplex
def simplex_lmo(g):
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

# LMO for the L1 ball
def l1_ball_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v
```
What comes next
Frank-Wolfe is great for constrained optimization with special structure. But what about problems where the “constraint” is really a sequential decision process? Dynamic programming and optimal control shows how optimization principles apply to multi-stage decision problems, from shortest paths to reinforcement learning.