
Constrained optimization and Lagrangian duality

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article assumes you are comfortable with first-order optimality conditions. You should also know how to compute gradients and understand convex functions.

Why constrained optimization matters

Most real problems have constraints. You want to minimize cost, but your budget is fixed. You want to maximize return, but risk cannot exceed a threshold. Unconstrained methods like gradient descent do not handle these situations directly. You need a framework that bakes constraints into the optimization itself.

Equality constraints and geometric intuition

Consider the basic constrained problem:

$$\min_x f(x) \quad \text{subject to} \quad h(x) = 0$$

You want to minimize $f(x)$, but only among points that satisfy $h(x) = 0$. The constraint $h(x) = 0$ defines a surface (or a curve, in 2D). You are restricted to moving along that surface.

Picture the level curves of $f$ and the constraint surface $h(x) = 0$. As you walk along the constraint surface, $f$ decreases until you reach a point where the constraint surface is tangent to a level curve of $f$. At that point, any small step along the constraint either increases $f$ or leaves it unchanged. You have found the optimum.

What does tangency mean mathematically? It means the gradient of $f$ is parallel to the gradient of $h$ at the optimal point. If $\nabla f$ had any component along the constraint surface, you could move in that direction and decrease $f$ further. So at the optimum:

$$\nabla f(x^*) = -\lambda \nabla h(x^*)$$

for some scalar $\lambda$. This scalar is the Lagrange multiplier.

Figure: contour plot of $f(x, y) = x^2 + y^2$ with the equality constraint $x + y = 2$. The constrained minimum occurs at $(1, 1)$, where a contour is tangent to the constraint line.

The Lagrangian function

Instead of solving the constrained problem directly, we build a single function that encodes both the objective and the constraint. The Lagrangian is:

$$L(x, \lambda) = f(x) + \lambda \, h(x)$$

Here $\lambda$ is the Lagrange multiplier. Setting the partial derivatives of $L$ to zero gives us the optimality conditions:

$$\frac{\partial L}{\partial x} = \nabla f(x) + \lambda \nabla h(x) = 0, \qquad \frac{\partial L}{\partial \lambda} = h(x) = 0$$

The first equation says the gradients are parallel (the geometric condition). The second equation says the constraint is satisfied. Together, they form a system of equations you can solve for $x^*$ and $\lambda^*$.

What the multiplier means

The multiplier $\lambda^*$ has a concrete interpretation. It is the shadow price of the constraint: it tells you how much the optimal value of $f$ changes if you relax the constraint slightly. If the constraint is $h(x) = 0$ and you change it to $h(x) = \epsilon$, then:

$$f(x^*_\epsilon) \approx f(x^*) - \lambda^* \epsilon$$

This is extremely useful in practice. It tells you the “cost” of each constraint, which constraints are worth relaxing, and by how much.

Worked example 1: minimizing distance to the origin

Problem: Minimize $f(x, y) = x^2 + y^2$ subject to $x + y = 1$.

We want the point on the line $x + y = 1$ that is closest to the origin.

Step 1: write the Lagrangian.

The constraint is $h(x, y) = x + y - 1 = 0$.

$$L(x, y, \lambda) = x^2 + y^2 + \lambda(x + y - 1)$$

Step 2: take partial derivatives and set them to zero.

$$\frac{\partial L}{\partial x} = 2x + \lambda = 0, \qquad \frac{\partial L}{\partial y} = 2y + \lambda = 0, \qquad \frac{\partial L}{\partial \lambda} = x + y - 1 = 0$$

Step 3: solve the system.

From the first equation: $x = -\lambda/2$.

From the second equation: $y = -\lambda/2$.

So $x = y$. Substituting into the constraint:

$$x + x = 1 \implies 2x = 1 \implies x = \frac{1}{2}$$

Therefore $y = 1/2$ and $\lambda = -2x = -1$.

Step 4: verify and interpret.

The optimal point is $(x^*, y^*) = (1/2, 1/2)$ with optimal value $f^* = (1/2)^2 + (1/2)^2 = 1/2$.

The multiplier $\lambda^* = -1$ tells us: if we change the constraint to $x + y = 1 + \epsilon$, the optimal value changes by approximately $-(-1)\epsilon = \epsilon$. Relaxing the constraint (making the line farther from the origin) increases the minimum distance squared, which makes geometric sense.
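The small system above is easy to check symbolically. Here is a minimal sketch with sympy (assuming it is available; the variable names are ours):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)

# Lagrangian for Example 1: f = x^2 + y^2, constraint h = x + y - 1.
L = x**2 + y**2 + lam * (x + y - 1)

# Stationarity in x and y, plus the constraint (dL/dlam = 0).
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(eqs, [x, y, lam], dict=True)[0]

print(sol[x], sol[y], sol[lam])  # 1/2 1/2 -1
print(sol[x]**2 + sol[y]**2)     # 1/2
```

Because the objective is quadratic and the constraint is linear, the stationarity conditions form a linear system, so `sp.solve` returns the unique solution exactly.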

The dual problem

The Lagrangian gives us more than just optimality conditions. It opens the door to a completely different way of solving the problem: duality.

Define the dual function:

$$g(\lambda) = \min_x L(x, \lambda) = \min_x \left[ f(x) + \lambda \, h(x) \right]$$

For each fixed $\lambda$, you minimize the Lagrangian over $x$ with no constraints. The result $g(\lambda)$ depends only on $\lambda$.

Why the dual matters

The dual function provides a lower bound on the optimal value of the original (primal) problem. For any $\lambda$:

$$g(\lambda) \leq f(x^*)$$

Here is why. Let $x^*$ be the primal optimum, so $h(x^*) = 0$. Then:

$$g(\lambda) = \min_x \left[ f(x) + \lambda \, h(x) \right] \leq f(x^*) + \lambda \, h(x^*) = f(x^*)$$

The minimum over all $x$ is at most the value at $x^*$, and the constraint term vanishes at $x^*$. This holds for every $\lambda$, so we get the tightest bound by maximizing:

$$\max_\lambda \, g(\lambda) \leq f(x^*)$$

This is the dual problem: maximize $g(\lambda)$ over $\lambda$.
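You can see the lower bound concretely by evaluating $g(\lambda)$ with an unconstrained solver for the inner minimization. A sketch using Example 1 from earlier ($f = x^2 + y^2$, $h = x + y - 1$, $p^* = 1/2$), with scipy doing the inner minimization:

```python
from scipy.optimize import minimize

# Example 1: f(x, y) = x^2 + y^2, h(x, y) = x + y - 1, primal optimum p* = 1/2.
def g(lam):
    # Inner unconstrained minimization of the Lagrangian over (x, y).
    lagrangian = lambda v: v[0]**2 + v[1]**2 + lam * (v[0] + v[1] - 1)
    return minimize(lagrangian, x0=[0.0, 0.0]).fun

p_star = 0.5
for lam in [-2.0, -1.0, 0.0, 1.0]:
    print(f"g({lam:+.1f}) = {g(lam):+.4f} <= p* = {p_star}")
```

Every value of $g(\lambda)$ stays below $p^* = 1/2$, and the bound is tight at $\lambda = -1$, the multiplier we found when solving the primal.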

Weak and strong duality

Weak duality says the dual optimal value $d^*$ is never larger than the primal optimal value $p^*$:

$$d^* = \max_\lambda \, g(\lambda) \leq \min_{x:\, h(x)=0} f(x) = p^*$$

This always holds, regardless of convexity. The gap $p^* - d^*$ is called the duality gap.

Strong duality says $d^* = p^*$, meaning the duality gap is zero. This does not always hold, but it does hold under nice conditions. For convex problems with equality constraints that are affine (linear), strong duality holds automatically. More generally, Slater’s condition guarantees strong duality: if the problem is convex and there exists a point strictly inside the feasible region (a strictly feasible point), then $d^* = p^*$.

Why does this matter? When strong duality holds, you can solve the dual problem instead of the primal. Sometimes the dual is easier, especially when it has fewer variables or a simpler structure.

Worked example 2: duality in action

Problem: Minimize $f(x, y) = x^2 + 2y^2$ subject to $x + y = 3$.

Primal solution

Step 1: write the Lagrangian.

$$L(x, y, \lambda) = x^2 + 2y^2 + \lambda(x + y - 3)$$

Step 2: take partial derivatives.

$$\frac{\partial L}{\partial x} = 2x + \lambda = 0 \implies x = -\frac{\lambda}{2}, \qquad \frac{\partial L}{\partial y} = 4y + \lambda = 0 \implies y = -\frac{\lambda}{4}, \qquad \frac{\partial L}{\partial \lambda} = x + y - 3 = 0$$

Step 3: solve.

Substituting $x$ and $y$ into the constraint:

$$-\frac{\lambda}{2} - \frac{\lambda}{4} = 3 \implies -\frac{3\lambda}{4} = 3 \implies \lambda^* = -4$$

Therefore:

$$x^* = -\frac{-4}{2} = 2, \quad y^* = -\frac{-4}{4} = 1$$

The optimal value is:

$$p^* = f(2, 1) = 2^2 + 2(1)^2 = 4 + 2 = 6$$

Dual solution

Now let’s solve the dual problem and verify strong duality.

Step 1: compute $g(\lambda)$.

We already found that for a given $\lambda$, the minimizers of $L$ over $(x, y)$ are $x = -\lambda/2$ and $y = -\lambda/4$. Substituting back:

$$g(\lambda) = \left(-\frac{\lambda}{2}\right)^2 + 2\left(-\frac{\lambda}{4}\right)^2 + \lambda\left(-\frac{\lambda}{2} - \frac{\lambda}{4} - 3\right) = \frac{\lambda^2}{4} + 2 \cdot \frac{\lambda^2}{16} + \lambda\left(-\frac{3\lambda}{4} - 3\right) = \frac{\lambda^2}{4} + \frac{\lambda^2}{8} - \frac{3\lambda^2}{4} - 3\lambda$$

Combine the $\lambda^2$ terms with a common denominator of 8:

$$= \frac{2\lambda^2}{8} + \frac{\lambda^2}{8} - \frac{6\lambda^2}{8} - 3\lambda = -\frac{3\lambda^2}{8} - 3\lambda$$

Step 2: maximize $g(\lambda)$.

$$g'(\lambda) = -\frac{6\lambda}{8} - 3 = -\frac{3\lambda}{4} - 3$$

Set $g'(\lambda) = 0$:

$$-\frac{3\lambda}{4} = 3 \implies \lambda^* = -4$$

Step 3: compute $d^*$.

$$d^* = g(-4) = \frac{-3(16)}{8} - 3(-4) = -6 + 12 = 6$$

Verification: $d^* = 6 = p^*$. Strong duality holds. The dual optimal equals the primal optimal.

This is a convex problem (quadratic objective, linear constraint), so strong duality is guaranteed.
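The dual maximization can also be done numerically. A small sketch with scipy, using the closed-form dual $g(\lambda) = -3\lambda^2/8 - 3\lambda$ derived above and maximizing it by minimizing $-g$:

```python
from scipy.optimize import minimize_scalar

# Closed-form dual function from Example 2.
def g(lam):
    return -3 * lam**2 / 8 - 3 * lam

# Maximize g by minimizing its negation.
res = minimize_scalar(lambda lam: -g(lam))
print(f"lambda* = {res.x:.4f}")  # close to -4
print(f"d* = {-res.fun:.4f}")    # close to 6, matching p*
```

Since $g$ is a one-dimensional concave quadratic, `minimize_scalar` converges to the unique maximizer without needing a starting bracket.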

Multiple equality constraints

The framework extends naturally. If you have $m$ equality constraints $h_1(x) = 0, \ldots, h_m(x) = 0$, the Lagrangian becomes:

$$L(x, \lambda_1, \ldots, \lambda_m) = f(x) + \sum_{i=1}^{m} \lambda_i \, h_i(x)$$

The optimality conditions are:

$$\nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla h_i(x) = 0, \qquad h_i(x) = 0 \quad \text{for } i = 1, \ldots, m$$

This gives you $n + m$ equations in $n + m$ unknowns (the $n$ components of $x$ plus the $m$ multipliers). When $f$ is quadratic and the constraints are linear, the system is linear and you can solve it with standard linear algebra.
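Here is a minimal sketch of that observation for objectives of the form $x^\top Q x$ with constraints $Ax = b$: stationarity and feasibility stack into one linear system. The helper name `kkt_solve` is ours, not a library function:

```python
import numpy as np

def kkt_solve(Q, A, b):
    """Solve min x^T Q x subject to A x = b via the stacked linear system.

    Stationarity: 2 Q x + A^T lam = 0.  Feasibility: A x = b.
    """
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[2 * Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]  # (x*, lambda*)

# Example 3 below: f = x^2 + y^2 + z^2 (Q = I), constraints x + y = 2, y + z = 3.
A = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
x_star, lam_star = kkt_solve(np.eye(3), A, np.array([2.0, 3.0]))
print(x_star)    # approx [0.3333, 1.6667, 1.3333]
print(lam_star)  # approx [-0.6667, -2.6667]
```

One `np.linalg.solve` call recovers both the optimal point and the multipliers at once.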

Worked example 3: two equality constraints

Problem: Minimize $f(x, y, z) = x^2 + y^2 + z^2$ subject to $x + y = 2$ and $y + z = 3$.

We want the point closest to the origin that lies on the intersection of two planes.

Step 1: write the Lagrangian.

$$L = x^2 + y^2 + z^2 + \lambda_1(x + y - 2) + \lambda_2(y + z - 3)$$

Step 2: take partial derivatives.

$$\frac{\partial L}{\partial x} = 2x + \lambda_1 = 0, \qquad \frac{\partial L}{\partial y} = 2y + \lambda_1 + \lambda_2 = 0, \qquad \frac{\partial L}{\partial z} = 2z + \lambda_2 = 0$$
$$\frac{\partial L}{\partial \lambda_1} = x + y - 2 = 0, \qquad \frac{\partial L}{\partial \lambda_2} = y + z - 3 = 0$$

Step 3: express variables in terms of multipliers.

From the first three equations:

$$x = -\frac{\lambda_1}{2}, \quad y = -\frac{\lambda_1 + \lambda_2}{2}, \quad z = -\frac{\lambda_2}{2}$$

Step 4: substitute into the constraints.

Constraint 1 ($x + y = 2$):

$$-\frac{\lambda_1}{2} - \frac{\lambda_1 + \lambda_2}{2} = 2 \implies -\frac{2\lambda_1 + \lambda_2}{2} = 2 \implies 2\lambda_1 + \lambda_2 = -4 \quad \cdots (i)$$

Constraint 2 ($y + z = 3$):

$$-\frac{\lambda_1 + \lambda_2}{2} - \frac{\lambda_2}{2} = 3 \implies -\frac{\lambda_1 + 2\lambda_2}{2} = 3 \implies \lambda_1 + 2\lambda_2 = -6 \quad \cdots (ii)$$

Step 5: solve the 2×2 system.

From (i): $\lambda_2 = -4 - 2\lambda_1$.

Substitute into (ii):

$$\lambda_1 + 2(-4 - 2\lambda_1) = -6 \implies \lambda_1 - 8 - 4\lambda_1 = -6 \implies -3\lambda_1 = 2 \implies \lambda_1 = -\frac{2}{3}$$

Then:

$$\lambda_2 = -4 - 2\left(-\frac{2}{3}\right) = -4 + \frac{4}{3} = -\frac{8}{3}$$

Step 6: recover the optimal point.

$$x^* = -\frac{-2/3}{2} = \frac{1}{3}, \qquad y^* = -\frac{(-2/3) + (-8/3)}{2} = \frac{10}{6} = \frac{5}{3}, \qquad z^* = -\frac{-8/3}{2} = \frac{8}{6} = \frac{4}{3}$$

Step 7: verify.

Check constraint 1: $x^* + y^* = 1/3 + 5/3 = 6/3 = 2$. ✓

Check constraint 2: $y^* + z^* = 5/3 + 4/3 = 9/3 = 3$. ✓

Optimal value:

$$f^* = \left(\frac{1}{3}\right)^2 + \left(\frac{5}{3}\right)^2 + \left(\frac{4}{3}\right)^2 = \frac{1}{9} + \frac{25}{9} + \frac{16}{9} = \frac{42}{9} = \frac{14}{3} \approx 4.667$$

The multiplier $\lambda_1 = -2/3$ tells us: relaxing the first constraint ($x + y = 2$) by a small $\epsilon$ changes the optimal value by approximately $(2/3)\epsilon$. Similarly, $\lambda_2 = -8/3$ tells us the second constraint is more “expensive” to enforce.
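You can check this shadow-price prediction numerically: re-solve the problem with the first constraint relaxed to $x + y = 2 + \epsilon$ and take a finite difference. A sketch using scipy's SLSQP solver (the helper name `solve_relaxed` is ours):

```python
from scipy.optimize import minimize

def solve_relaxed(eps):
    # Example 3 with the first constraint relaxed to x + y = 2 + eps.
    cons = [
        {"type": "eq", "fun": lambda v: v[0] + v[1] - (2 + eps)},
        {"type": "eq", "fun": lambda v: v[1] + v[2] - 3},
    ]
    res = minimize(lambda v: v[0]**2 + v[1]**2 + v[2]**2,
                   x0=[0.0, 0.0, 0.0], method="SLSQP", constraints=cons)
    return res.fun

eps = 1e-3
rate = (solve_relaxed(eps) - solve_relaxed(0.0)) / eps
print(f"observed rate {rate:.4f} vs predicted -lambda_1 = {2/3:.4f}")
```

The finite-difference rate lands close to $-\lambda_1 = 2/3$, matching the shadow-price formula $f(x^*_\epsilon) \approx f(x^*) - \lambda^* \epsilon$.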

Python implementation

Here is how you solve constrained optimization problems with scipy.optimize.minimize. We will solve Example 2: minimize $x^2 + 2y^2$ subject to $x + y = 3$.

import numpy as np
from scipy.optimize import minimize

def objective(vars):
    x, y = vars
    return x**2 + 2 * y**2

def constraint_eq(vars):
    x, y = vars
    return x + y - 3  # must equal zero

result = minimize(
    objective,
    x0=[0.0, 0.0],
    method="SLSQP",
    constraints={"type": "eq", "fun": constraint_eq},
)

print(f"Optimal x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Optimal value = {result.fun:.4f}")
# Output: Optimal x = 2.0000, y = 1.0000
# Output: Optimal value = 6.0000

For the two-constraint problem (Example 3):

def objective_3d(vars):
    x, y, z = vars
    return x**2 + y**2 + z**2

constraints = [
    {"type": "eq", "fun": lambda v: v[0] + v[1] - 2},
    {"type": "eq", "fun": lambda v: v[1] + v[2] - 3},
]

result = minimize(
    objective_3d,
    x0=[0.0, 0.0, 0.0],
    method="SLSQP",
    constraints=constraints,
)

print(f"Optimal (x, y, z) = ({result.x[0]:.4f}, {result.x[1]:.4f}, {result.x[2]:.4f})")
print(f"Optimal value = {result.fun:.4f}")
# Output: Optimal (x, y, z) = (0.3333, 1.6667, 1.3333)
# Output: Optimal value = 4.6667

The SLSQP method (Sequential Least Squares Programming) handles both equality and inequality constraints. Internally it solves a sequence of quadratic subproblems via linear least squares, building on ideas from least squares optimization.

A brief note on inequality constraints

So far we have only handled equality constraints of the form $h(x) = 0$. But many real problems have inequality constraints:

$$\min_x f(x) \quad \text{subject to} \quad g(x) \leq 0, \quad h(x) = 0$$

Inequality constraints introduce additional complexity. The multipliers for inequality constraints must be non-negative, and a complementary slackness condition determines which constraints are “active” at the optimum. These ideas lead to the Karush-Kuhn-Tucker (KKT) conditions, which generalize everything in this article.
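scipy's SLSQP already handles inequality constraints, so you can experiment before the formal treatment. One caution on conventions: a constraint dict of type `"ineq"` means `fun(x) >= 0`, the opposite sign of the $g(x) \leq 0$ form above. A minimal sketch, minimizing $x^2 + y^2$ subject to $x + y \geq 1$:

```python
from scipy.optimize import minimize

# Minimize x^2 + y^2 subject to x + y >= 1.
# In the g(x) <= 0 convention this is g(x, y) = 1 - x - y <= 0;
# scipy's "ineq" type expects fun(v) >= 0, so we flip the sign.
result = minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[0.0, 0.0],
    method="SLSQP",
    constraints={"type": "ineq", "fun": lambda v: v[0] + v[1] - 1},
)
print(result.x)  # approx [0.5, 0.5]
```

The constraint is active at the solution (the unconstrained minimum at the origin violates it), so the optimizer lands on the boundary line $x + y = 1$, at the same point as equality-constrained Example 1.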

What comes next

This article covered equality-constrained optimization and duality. The natural next step is handling inequality constraints, which requires the KKT conditions. KKT conditions combine the Lagrangian framework with complementary slackness to handle both equality and inequality constraints in a unified way. They are the foundation for understanding support vector machines, interior-point methods, and most modern optimization algorithms.
