Frank-Wolfe method
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
The projection problem
Sometimes the constraint set in an optimization problem is so complex that projecting onto it is expensive. Projected gradient descent works by taking a gradient step, then projecting back onto the feasible set. But for structured constraints like nuclear norm balls or flow polytopes, that projection is itself a hard optimization problem.
Frank-Wolfe avoids projection entirely.
| Step | Projected gradient descent | Frank-Wolfe |
|---|---|---|
| Descent direction | Negative gradient | Negative gradient (linearized) |
| Stay feasible via | Projection onto constraint set | Linear minimization over constraint set |
| Cost of staying feasible | Solve a QP (can be expensive) | Solve an LP or use closed form (often cheap) |
| Iterates | Dense | Naturally sparse |
The Frank-Wolfe idea
```mermaid
graph LR
    A["Current point x_k"] --> B["Linearize objective at x_k"]
    B --> C["Find cheapest point v_k in constraint set (LMO)"]
    C --> D["Move toward v_k"]
    D --> E["New point x_{k+1}"]
```
Instead of projecting back onto the constraint set after each step, Frank-Wolfe finds the best point inside the constraint set using a simpler linear problem. The linear minimization oracle (LMO) replaces the projection step and is often orders of magnitude cheaper.
Now let’s formalize this.
Prerequisites
You should be comfortable with gradient descent and how the gradient points in the direction of steepest increase. Understanding Lagrangian duality helps with the convergence analysis, but is not strictly required to follow the algorithm.
The problem
We want to solve:

$$\min_{x \in C} f(x)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth convex function and $C \subseteq \mathbb{R}^n$ is a compact convex set.
The standard approach is projected gradient descent: take a gradient step, then project back onto $C$. This works, but the projection step can be expensive. For many constraint sets (nuclear norm balls, flow polytopes, matroid polytopes), projection is hard while linear minimization over the set is easy.
Frank-Wolfe takes a different route: instead of projecting, it solves a linear subproblem at each step.
The algorithm
Frank-Wolfe (Conditional Gradient) Method:
```text
Input: starting point x_0 ∈ C, number of iterations T
For k = 0, 1, ..., T-1:
    1. Compute the gradient: g_k = ∇f(x_k)
    2. Linear minimization oracle (LMO):
       v_k = argmin_{v ∈ C} g_k^T v
    3. Step size: γ_k = 2/(k+2)  (or line search)
    4. Update: x_{k+1} = (1 - γ_k) x_k + γ_k v_k
```
That is the entire algorithm. Step 2 is the key: you minimize a linear function over $C$. For a polytope, this always returns a vertex. Step 4 moves toward that vertex with a step size that decreases over time.
Why “conditional gradient”?
The direction $d_k = v_k - x_k$ is the Frank-Wolfe direction. You can think of it as finding the point in $C$ that the negative gradient most wants to reach, then stepping toward it. The name "conditional gradient" comes from the fact that you optimize a linear approximation of the objective conditioned on staying in $C$.
Frank-Wolfe iteration flow
```mermaid
graph TD
    A["Compute gradient at current point"] --> B["LMO: find vertex v minimizing gradient dot v"]
    B --> C["Choose step size gamma"]
    C --> D["Update: move from x toward v"]
    D --> E["Frank-Wolfe gap small?"]
    E -- No --> A
    E -- Yes --> F["Converged"]
```
Why no projection?
Projected gradient descent requires solving:

$$\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2^2$$

This is itself an optimization problem. For simple sets (boxes, balls), projection is cheap. But for structured sets, it can be as hard as the original problem.

Frank-Wolfe replaces projection with a linear minimization oracle (LMO):

$$\mathrm{LMO}_C(g) = \arg\min_{v \in C} g^\top v$$

For polytopes, this is a linear program. For many special polytopes, it has a closed-form solution or a very efficient algorithm.
| Constraint set | Projection cost | LMO cost |
|---|---|---|
| $\ell_1$ ball | $O(n \log n)$ (sort-based) | $O(n)$ |
| Simplex | $O(n \log n)$ (sort-based) | $O(n)$ |
| Nuclear norm ball | Full SVD: $O(mn \min(m, n))$ | Top singular vector pair: roughly $O(mn)$ per power iteration |
| Flow polytope | Expensive QP | Shortest path |
When the LMO is much cheaper than projection, Frank-Wolfe wins.
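To make the cost gap concrete, here is a sketch for the probability simplex: Euclidean projection uses the classic sort-based $O(n \log n)$ routine, while the LMO is a single pass over the gradient.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based, O(n log n))."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)        # threshold so the result sums to 1
    return np.maximum(y - tau, 0.0)

def simplex_lmo(g):
    """LMO over the simplex: one pass, O(n)."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

y = np.array([0.3, 1.2, -0.4])
print(project_simplex(y))  # a dense point on the simplex: [0.05 0.95 0.  ]
print(simplex_lmo(y))      # a single vertex: [0. 0. 1.]
```

Both routines are cheap here, but the asymmetry grows with structure: for a flow polytope the projection becomes a QP while the LMO stays a shortest-path call.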
The linear minimization oracle (LMO)
```mermaid
graph TD
    A["Input: gradient direction g"] --> B["LMO solves: min g^T v over constraint set C"]
    B --> C["For a polytope: returns a vertex"]
    B --> D["For L1 ball: pick coordinate with largest abs gradient"]
    B --> E["For simplex: pick coordinate with smallest gradient"]
```
Example 1: Minimizing a quadratic over the simplex
Problem:

$$\min_{x \in \Delta_3} \; f(x) = \tfrac{1}{2} x^\top Q x - b^\top x$$

where $\Delta_3 = \{x \in \mathbb{R}^3 : x \geq 0, \; \sum_i x_i = 1\}$ is the 3-simplex (probability simplex), and:

$$Q = I, \qquad b = (0.1, \; 0.2, \; 1.5)^\top$$

The LMO on the simplex is trivial: minimizing a linear function over the simplex always returns a vertex $e_i$ (the point with $x_i = 1$). Just pick the coordinate with the smallest gradient component.

Iteration 0: Start at $x_0 = (1/3, 1/3, 1/3)$.

Compute gradient: $\nabla f(x_0) = Q x_0 - b \approx (0.233, \; 0.133, \; -1.167)$.

LMO: the smallest gradient component is at index 2 (zero-based). So $v_0 = (0, 0, 1)$.

Step size: $\gamma_0 = 2/(0+2) = 1$.

Update: $x_1 = (1 - \gamma_0)\, x_0 + \gamma_0 v_0 = (0, 0, 1)$.

Iteration 1: $x_1 = (0, 0, 1)$, with $\nabla f(x_1) = x_1 - b = (-0.1, \; -0.2, \; -0.5)$.

LMO: the minimum gradient component is again at index 2. So $v_1 = (0, 0, 1)$. Same as $x_1$!

The direction $d_1 = v_1 - x_1 = 0$, which means no progress, and the Frank-Wolfe gap $\nabla f(x_1)^\top (x_1 - v_1) = 0$, confirming optimality on the simplex (global, since $f$ is convex).

The algorithm has converged in just one step because $v_0 = (0, 0, 1)$ happens to be optimal on the simplex.

Verify: $f((0,0,1)) = \tfrac{1}{2} - 1.5 = -1$. Check the other vertices: $f((1,0,0)) = \tfrac{1}{2} - 0.1 = 0.4$, $f((0,1,0)) = \tfrac{1}{2} - 0.2 = 0.3$. Indeed $(0, 0, 1)$ is optimal.
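As a numerical check, here is a sketch on a concrete instance with exactly this one-step behavior: $f(x) = \tfrac{1}{2} x^\top x - b^\top x$ with $b = (0.1, 0.2, 1.5)$.

```python
import numpy as np

b = np.array([0.1, 0.2, 1.5])
grad = lambda x: x - b                             # gradient of (1/2)x'x - b'x

x0 = np.full(3, 1/3)
v0 = np.zeros(3); v0[np.argmin(grad(x0))] = 1.0    # LMO picks the vertex (0, 0, 1)
x1 = (1 - 1.0) * x0 + 1.0 * v0                     # step size 2/(0+2) = 1

v1 = np.zeros(3); v1[np.argmin(grad(x1))] = 1.0    # LMO returns the same vertex
gap = grad(x1) @ (x1 - v1)                         # Frank-Wolfe gap

print(x1, gap)  # -> [0. 0. 1.] 0.0
```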
Figure: Frank-Wolfe iterates on min (x-1.8)^2 + (y-1.8)^2 over the triangle x ≥ 0, y ≥ 0, x + y ≤ 2. Each step moves toward a vertex of the constraint set, producing a zigzag path toward the optimum.
Example 2: Multiple iterations with line search
Problem:

$$\min \; f(x, y) = (x - 1.8)^2 + (y - 1.8)^2$$

where $C = \{(x, y) : x \geq 0, \; y \geq 0, \; x + y \leq 2\}$, the triangle with vertices $(0,0)$, $(2,0)$, $(0,2)$.

The unconstrained minimum is $(1.8, 1.8)$, but $1.8 + 1.8 = 3.6 > 2$, so it is infeasible. The constrained optimum lies on the edge $x + y = 2$, at $(1, 1)$.

Iteration 0: Start at $x_0 = (0, 0)$, where $\nabla f(x_0) = (-3.6, -3.6)$.

LMO over $C$: Minimize $\nabla f(x_0)^\top v$ over the triangle. The vertices are $(0,0)$, $(2,0)$, and $(0,2)$. Values: $0$, $-7.2$, $-7.2$. The last two tie; take the minimum at $(2, 0)$.

So $v_0 = (2, 0)$ and $d_0 = v_0 - x_0 = (2, 0)$.

Line search: Minimize $\varphi(\gamma) = f(x_0 + \gamma d_0) = (2\gamma - 1.8)^2 + 3.24$ over $\gamma \in [0, 1]$.

Minimize: $\varphi'(\gamma) = 4(2\gamma - 1.8) = 0$, so $\gamma_0 = 0.9$.

$x_1 = (1.8, 0)$.

Iteration 1: $\nabla f(x_1) = (0, -3.6)$.

LMO: Values at the vertices: $0$, $0$, $-7.2$. Minimum at $(0, 2)$.

$v_1 = (0, 2)$. Direction: $d_1 = v_1 - x_1 = (-1.8, 2)$.

Line search: $\varphi(\gamma) = f(x_1 + \gamma d_1) = (1.8\gamma)^2 + (2\gamma - 1.8)^2 = 7.24\gamma^2 - 7.2\gamma + 3.24$

Taking the derivative and setting to zero: $14.48\gamma - 7.2 = 0$, so $\gamma_1 \approx 0.497$.

$x_2 = x_1 + \gamma_1 d_1 \approx (0.905, 0.994)$.

The iterates are approaching the optimal point $(1, 1)$ on the constraint boundary. Notice how each iterate is a convex combination of previous iterates and vertices.
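These iterations can be reproduced in a few lines. A sketch, assuming the setup from the figure ($f(x, y) = (x - 1.8)^2 + (y - 1.8)^2$ over the triangle with vertices $(0,0)$, $(2,0)$, $(0,2)$); since $f$ is quadratic, the line search has the closed form $\gamma = -\nabla f(x)^\top d / (2\, d^\top d)$, clipped to $[0, 1]$.

```python
import numpy as np

grad = lambda p: 2 * (p - 1.8)           # gradient of (x-1.8)^2 + (y-1.8)^2
verts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

def lmo(g):
    return verts[np.argmin(verts @ g)]   # minimize g.v by checking the three vertices

def exact_gamma(x, d):
    # phi(gamma) = f(x + gamma d) is quadratic; closed-form minimizer, clipped to [0, 1]
    return float(np.clip(-(grad(x) @ d) / (2 * d @ d), 0.0, 1.0))

x = np.array([0.0, 0.0])
v = np.array([2.0, 0.0])                 # iteration 0: the gradient ties; break toward (2, 0)
gamma = exact_gamma(x, v - x)            # 0.9
x = x + gamma * (v - x)                  # (1.8, 0)

v = lmo(grad(x))                         # iteration 1: LMO returns (0, 2)
gamma = exact_gamma(x, v - x)            # about 0.497
x = x + gamma * (v - x)
print(x)                                 # about (0.905, 0.994), heading toward (1, 1)
```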
Sparsity of iterates
A key property of Frank-Wolfe: after $k$ iterations, the iterate $x_k$ is a convex combination of at most $k + 1$ points: the starting point plus one vertex of $C$ per iteration. This is because each update mixes the current point with one new vertex.

For problems where $C$ is a polytope, this means $x_k$ is supported on at most $k + 1$ vertices. If you run 50 iterations on a polytope with millions of vertices, your solution involves at most 51 vertices. This sparsity is valuable in:
- Machine learning: Sparse models are interpretable. Training an SVM with Frank-Wolfe produces a solution supported on few support vectors.
- Signal processing: Recovering sparse signals from measurements.
- Combinatorial optimization: Expressing solutions as combinations of few extreme points.
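The sparsity is easy to observe directly. A sketch: run Frank-Wolfe with the default step size on an $\ell_1$ ball in $\mathbb{R}^{1000}$ (with an arbitrary quadratic objective, assumed purely for illustration) and count nonzero coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
c = rng.normal(size=n)                      # assumed toy objective f(x) = (1/2)||x - c||^2

def l1_lmo(g):
    # LMO on the unit L1 ball: a signed coordinate vertex
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(n)
for k in range(20):
    g = x - c                               # gradient
    v = l1_lmo(g)
    gamma = 2.0 / (k + 2)                   # default step size
    x = (1 - gamma) * x + gamma * v         # adds at most one new nonzero per step

print(np.count_nonzero(x))  # at most 20 nonzeros, out of 1000 coordinates
```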
Convergence
Frank-Wolfe converges at a rate of $O(1/k)$ for smooth convex functions. After $k$ iterations:

$$f(x_k) - f(x^*) \leq \frac{2 L D^2}{k + 2}$$

where $L$ is the Lipschitz constant of $\nabla f$ and $D$ is the diameter of $C$.

This is generally slower than projected gradient descent (whose $O(1/k)$ rate can be improved to $O(1/k^2)$ with acceleration, and which converges linearly for strongly convex problems), but each Frank-Wolfe iteration is cheaper when the LMO is cheaper than projection.
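The $O(1/k)$ rate is easy to observe empirically. A sketch on an assumed toy quadratic over the unit $\ell_1$ ball, using the default step size (the optimum $(0.75, 0.25)$ is the $\ell_1$ projection of the target):

```python
import numpy as np

c = np.array([1.0, 0.5])                     # assumed target; f(x) = (1/2)||x - c||^2
f = lambda x: 0.5 * np.sum((x - c) ** 2)
f_star = f(np.array([0.75, 0.25]))           # optimal value on the L1 ball

def l1_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(2)
errs = []
for k in range(200):
    v = l1_lmo(x - c)                        # LMO at the current gradient
    gamma = 2.0 / (k + 2)                    # default step size, no line search
    x = (1 - gamma) * x + gamma * v
    errs.append(f(x) - f_star)

print(errs[9], errs[99])  # the error shrinks roughly like 1/k (with some zigzag)
```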
The Frank-Wolfe gap
A useful convergence certificate is the Frank-Wolfe gap:

$$g_k = \nabla f(x_k)^\top (x_k - v_k)$$

This is the inner product of the gradient with the direction from the LMO solution $v_k$ back to the current point. By convexity, it upper bounds the suboptimality: $f(x_k) - f(x^*) \leq g_k$. When $g_k$ is small, you are close to optimal.
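Since $v_k$ already comes out of the LMO, the gap costs only one extra dot product. A sketch checking the certificate on a simplex-constrained quadratic (instance values assumed for illustration):

```python
import numpy as np

b = np.array([0.1, 0.2, 1.5])            # assumed instance: f(x) = (1/2)x'x - b'x
f = lambda x: 0.5 * x @ x - b @ x
grad = lambda x: x - b

x = np.full(3, 1/3)                      # a feasible but suboptimal point
g = grad(x)
v = np.zeros(3); v[np.argmin(g)] = 1.0   # LMO over the simplex
gap = g @ (x - v)                        # Frank-Wolfe gap

x_star = np.array([0.0, 0.0, 1.0])       # the optimum of this instance
print(gap, f(x) - f(x_star))             # the gap upper-bounds the suboptimality
```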
Frank-Wolfe vs projected gradient descent convergence
```mermaid
graph TD
    A["Frank-Wolfe: O(1/k) rate"] --> C["Slower per-iteration progress"]
    B["Projected gradient descent: O(1/k) or O(1/k^2) with acceleration"] --> D["Faster per-iteration progress"]
    C --> E["But each FW iteration is cheaper when LMO is cheap"]
    D --> F["Each PGD iteration pays the projection cost"]
```
Frank-Wolfe converges at an O(1/k) rate, while projected gradient descent can be accelerated to O(1/k^2) (and converges linearly for strongly convex objectives); in exchange, FW avoids the cost of projections.
Example 3: Frank-Wolfe on a quadratic over an ball
Problem:

$$\min \; f(x) = \tfrac{1}{2} \|x - c\|_2^2, \qquad c = (1, \; 0.5)$$

where $C = \{x \in \mathbb{R}^2 : \|x\|_1 \leq 1\}$. The $\ell_1$ ball in 2D is the diamond with vertices $(\pm 1, 0)$ and $(0, \pm 1)$.

The unconstrained minimum is $c = (1, 0.5)$ with $\|c\|_1 = 1.5 > 1$, so the constraint is active.

LMO on the $\ell_1$ ball: Minimize $g^\top v$ over $\|v\|_1 \leq 1$. The solution is $v = -\operatorname{sign}(g_j)\, e_j$ where $j = \arg\max_i |g_i|$. Pick the coordinate with the largest absolute gradient and go to the corresponding vertex.

Iteration 0: $x_0 = (0, 0)$.

LMO: $g = \nabla f(x_0) = x_0 - c = (-1, -0.5)$. The first coordinate dominates, so $v_0 = (1, 0)$.

$\gamma_0 = 2/(0+2) = 1$, so $x_1 = (1, 0)$.

$f(x_1) = \tfrac{1}{2}(0.5)^2 = 0.125$.

Iteration 1: $x_1 = (1, 0)$.

LMO: $g = x_1 - c = (0, -0.5)$. The second coordinate dominates, so $v_1 = (0, 1)$.

$\gamma_1 = 2/(1+2) = 2/3$.

$x_2 = \tfrac{1}{3}(1, 0) + \tfrac{2}{3}(0, 1) = (1/3, \; 2/3)$.

$f(x_2) = \tfrac{1}{2}\left[(1/3 - 1)^2 + (2/3 - 0.5)^2\right] \approx 0.236$.

Hmm, the objective increased! This happens because the default step size is not always optimal. Let us use line search instead.

With line search:

$\varphi(\gamma) = f(x_1 + \gamma (v_1 - x_1)) = f(1 - \gamma, \; \gamma) = \tfrac{1}{2}\left[\gamma^2 + (\gamma - 0.5)^2\right]$, minimized at $\gamma_1 = 0.25$.

$x_2 = (0.75, \; 0.25)$, with $f(x_2) = 0.0625$.

That is much better. Note: $\|x_2\|_1 = 1$, so the constraint is active.

Iteration 2: $x_2 = (0.75, 0.25)$, with $g = x_2 - c = (-0.25, -0.25)$.

LMO: Both components are equal in magnitude. Pick the first (say). $v_2 = (1, 0)$.

Line search along $d_2 = v_2 - x_2 = (0.25, -0.25)$:

$$\varphi(\gamma) = \tfrac{1}{2}\left[(0.25\gamma - 0.25)^2 + (0.25\gamma + 0.25)^2\right] = 0.0625\,(\gamma^2 + 1)$$

Minimized at $\gamma = 0$. The algorithm stays at $x_2$. Frank-Wolfe gap:

$$g_2 = \nabla f(x_2)^\top (x_2 - v_2) = (-0.25)(-0.25) + (-0.25)(0.25) = 0$$

Converged. The optimal solution is $(0.75, \; 0.25)$.

The true optimum (the projection of $c$ onto the $\ell_1$ ball, i.e., soft-thresholding with threshold $0.25$) is $(0.75, 0.25)$. Frank-Wolfe found it exactly.
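A complete run with exact line search recovers this in two steps. A sketch, assuming the quadratic $f(x) = \tfrac{1}{2}\|x - (1, 0.5)\|^2$; for this objective the line-search minimizer is $\gamma = -g^\top d / (d^\top d)$, clipped to $[0, 1]$.

```python
import numpy as np

c = np.array([1.0, 0.5])                 # assumed target: f(x) = (1/2)||x - c||^2

def l1_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v

x = np.zeros(2)
for k in range(50):
    g = x - c                            # gradient
    v = l1_lmo(g)
    d = v - x
    if -(g @ d) < 1e-12:                 # Frank-Wolfe gap ~ 0: converged
        break
    # exact line search for this quadratic, clipped to [0, 1]
    gamma = float(np.clip(-(g @ d) / (d @ d), 0.0, 1.0))
    x = x + gamma * d

print(x)  # -> [0.75 0.25], the projection of c onto the L1 ball
```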
Variants and improvements
Away-step Frank-Wolfe
Standard Frank-Wolfe can zigzag near the optimum. The away-step variant can also move away from a vertex already in the active set, reducing zigzagging and improving convergence to linear rate for strongly convex objectives.
Pairwise Frank-Wolfe
Combines the toward step and away step into a single pairwise swap between vertices. Converges linearly for strongly convex functions on polytopes.
Stochastic Frank-Wolfe
When the objective is an expectation, $f(x) = \mathbb{E}_{\xi}[F(x, \xi)]$, use a stochastic gradient estimate in the LMO step. This connects Frank-Wolfe to stochastic optimization.
When to use Frank-Wolfe
✓ The constraint set has a cheap LMO but expensive projection.
✓ You want sparse solutions (few active vertices).
✓ The problem is large-scale and you need simple iterations.
⚠ Convergence is $O(1/k)$, slower than projected gradient methods for strongly convex problems. The away-step variant fixes this.
⚠ Not suitable for unconstrained problems (the whole point is the constraint structure).
Python implementation
```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, f=None, max_iter=100, tol=1e-6):
    """
    Frank-Wolfe method with optional line search.

    grad_f: gradient function
    lmo: linear minimization oracle, returns argmin_{v in C} g^T v
    x0: starting point in C
    f: objective (needed for line search; if None, use the default step size)
    """
    x = x0.copy()
    for k in range(max_iter):
        g = grad_f(x)
        v = lmo(g)
        d = v - x
        gap = -g @ d  # Frank-Wolfe gap: certifies suboptimality
        if gap < tol:
            break
        # Default step size, optionally refined by line search
        gamma = 2.0 / (k + 2)
        if f is not None:
            # Simple grid search over [0, 1]; swap in an exact line search
            # for quadratics, or backtracking for general f
            best_gamma = gamma
            best_val = f(x + gamma * d)
            for g_try in np.linspace(0, 1, 50):
                val = f(x + g_try * d)
                if val < best_val:
                    best_val = val
                    best_gamma = g_try
            gamma = best_gamma
        x = x + gamma * d
    return x

# LMO for the probability simplex
def simplex_lmo(g):
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

# LMO for the L1 ball
def l1_ball_lmo(g):
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -np.sign(g[j])
    return v
```
What comes next
Frank-Wolfe is great for constrained optimization with special structure. But what about problems where the “constraint” is really a sequential decision process? Dynamic programming and optimal control shows how optimization principles apply to multi-stage decision problems, from shortest paths to reinforcement learning.