
KKT conditions

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article builds directly on Lagrangian duality. You should be comfortable with Lagrange multipliers for equality-constrained problems before continuing.

A factory with resource constraints

A factory makes two products, A and B. Each unit of A earns $5 profit, each unit of B earns $4. But resources are limited:

| Resource | A uses per unit | B uses per unit | Available |
|---|---|---|---|
| Material | 2 kg | 1 kg | 10 kg |
| Labor | 1 hr | 2 hr | 8 hr |

The factory wants to maximize profit. The unconstrained answer would be “make infinitely many of both,” but the constraints fence in the possibilities. Only certain combinations of A and B are feasible.

The optimal point sits on the boundary of the feasible region, where the profit contours just touch the constraint walls. KKT conditions tell you exactly when a point is optimal for a constrained problem. They are a checklist: if all four conditions pass, you have found the answer (at least for convex problems).
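
The factory problem is a linear program, so you can check the intuition numerically. A minimal sketch with scipy.optimize.linprog (scipy is assumed available, as in the code later in this article); linprog minimizes, so the profit is negated:

```python
from scipy.optimize import linprog

# Maximize 5A + 4B  <=>  minimize -5A - 4B
c = [-5, -4]

# Resource limits: 2A + B <= 10 (material), A + 2B <= 8 (labor)
A_ub = [[2, 1], [1, 2]]
b_ub = [10, 8]

# Production quantities cannot be negative
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

print(res.x)     # optimal production plan
print(-res.fun)  # maximum profit
```

The optimal plan is A = 4, B = 2 for a profit of 28, and both resource constraints hold with equality there: the optimum sits exactly where the two constraint walls meet.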

Feasible region with objective contours

graph TD
  A["Feasible region:<br/>all constraint inequalities satisfied"] --> B["Objective contours<br/>sweep across the region"]
  B --> C["Optimal point: where the best contour<br/>just touches the feasible boundary"]
  C --> D["KKT conditions:<br/>four checks that confirm optimality"]
  style C fill:#ccffcc
  style D fill:#ccffcc

The relationship: unconstrained to equality to inequality constraints

graph LR
  A["Unconstrained:<br/>set gradient = 0"] --> B["Equality constraints:<br/>Lagrange multipliers"]
  B --> C["Inequality constraints:<br/>KKT conditions"]
  A -->|"Simplest case"| D["No constraints to worry about"]
  B -->|"Adds multipliers for equalities"| E["Gradient of Lagrangian = 0"]
  C -->|"Adds sign constraints<br/>and complementary slackness"| F["Four conditions to check"]

KKT conditions are the most general first-order optimality conditions for smooth constrained problems. Everything else is a special case.

From equality to inequality constraints

The Lagrangian method handles equality constraints cleanly. You set up a Lagrangian, take gradients, set them to zero, and solve. But real optimization problems rarely come with only equalities. Budget limits, capacity bounds, non-negativity requirements: these are all inequalities.

The general constrained optimization problem looks like this:

$$\min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \ldots, m \quad \text{and} \quad h_j(x) = 0, \; j = 1, \ldots, p$$

The Karush-Kuhn-Tucker (KKT) conditions extend the Lagrangian framework to handle these inequality constraints. They give you a checklist: if a point passes all four conditions, you know it is a candidate for optimality. For convex problems, passing the checklist is enough to guarantee optimality.

The four KKT conditions

Given the problem above, form the Lagrangian:

$$\mathcal{L}(x, \mu, \lambda) = f(x) + \sum_{i=1}^{m} \mu_i g_i(x) + \sum_{j=1}^{p} \lambda_j h_j(x)$$

Here $\mu_i$ are the multipliers for inequality constraints and $\lambda_j$ are the multipliers for equality constraints. A point $x^*$ with multipliers $\mu^*$ and $\lambda^*$ satisfies the KKT conditions if all four of the following hold:

1. Stationarity

$$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \lambda_j^* \nabla h_j(x^*) = 0$$

The gradient of the objective plus the weighted gradients of all constraints must sum to zero. This is the same idea as setting the gradient of the Lagrangian to zero.

2. Primal feasibility

$$g_i(x^*) \leq 0 \quad \text{for all } i, \qquad h_j(x^*) = 0 \quad \text{for all } j$$

The solution must actually satisfy all the constraints. No cheating.

3. Dual feasibility

$$\mu_i^* \geq 0 \quad \text{for all } i$$

The multipliers on inequality constraints must be non-negative. This is the key difference from equality constraints, where $\lambda_j$ can be any real number. The sign restriction on $\mu_i$ reflects the fact that inequality constraints can only “push” the solution in one direction.

4. Complementary slackness

$$\mu_i^* g_i(x^*) = 0 \quad \text{for all } i$$

For each inequality constraint, either the constraint is active ($g_i(x^*) = 0$) or its multiplier is zero ($\mu_i^* = 0$), or both. This is the most interesting condition, and it deserves its own section.
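
The four conditions translate directly into a numeric checker. The sketch below is my own illustration (the function name check_kkt, the tolerance, and the toy problem are all assumptions, not from any library): it takes the objective gradient, the inequality constraints with their gradients, and a candidate point with multipliers, then tests each condition in turn.

```python
import numpy as np

def check_kkt(grad_f, gs, grad_gs, x, mu, tol=1e-8):
    """Check the four KKT conditions (inequality constraints only)."""
    x = np.atleast_1d(x).astype(float)
    mu = np.atleast_1d(mu).astype(float)

    # 1. Stationarity: grad f + sum_i mu_i grad g_i = 0
    r = grad_f(x) + sum(m * gg(x) for m, gg in zip(mu, grad_gs))
    stationarity = np.allclose(r, 0, atol=tol)

    # 2. Primal feasibility: g_i(x) <= 0
    primal = all(g(x) <= tol for g in gs)

    # 3. Dual feasibility: mu_i >= 0
    dual = all(m >= -tol for m in mu)

    # 4. Complementary slackness: mu_i * g_i(x) = 0
    slack = all(abs(m * g(x)) <= tol for m, g in zip(mu, gs))

    return stationarity and primal and dual and slack

# Toy problem: minimize x^2 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
# Stationarity 2x - mu = 0 on the active boundary x = 1 gives mu = 2.
grad_f = lambda x: 2 * x
g      = lambda x: 1 - x[0]
grad_g = lambda x: np.array([-1.0])

print(check_kkt(grad_f, [g], [grad_g], [1.0], [2.0]))  # True: a KKT point
print(check_kkt(grad_f, [g], [grad_g], [0.0], [0.0]))  # False: infeasible
```

In practice you rarely write such a checker yourself, but it makes concrete what a solver verifies at termination.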

Complementary slackness explained

Think about it this way. If a constraint is inactive, meaning $g_i(x^*) < 0$, the solution is strictly inside the feasible region with respect to that constraint. The constraint is not “touching” the solution. It might as well not exist. So its multiplier should be zero, because removing that constraint would not change the optimal value.

On the other hand, if a constraint is active, meaning $g_i(x^*) = 0$, the solution sits right on the boundary of that constraint. The constraint is binding. Its multiplier $\mu_i^*$ can be positive, reflecting the “cost” of that constraint being there.

The product $\mu_i^* g_i(x^*) = 0$ captures both cases in one equation. It is called complementary slackness because “slack” ($g_i < 0$) and a positive multiplier ($\mu_i > 0$) cannot coexist.

Complementary slackness: active vs inactive constraints

graph TD
  A["Constraint i"] --> B{"Is constraint active?<br/>g_i = 0?"}
  B -->|"Yes: active, tight"| C["Constraint is binding<br/>Multiplier mu_i can be positive"]
  B -->|"No: inactive, slack"| D["Constraint has room<br/>Multiplier mu_i must be 0"]
  C --> E["mu_i * g_i = mu_i * 0 = 0 ✓"]
  D --> F["mu_i * g_i = 0 * g_i = 0 ✓"]
  style C fill:#ffcccc
  style D fill:#ccffcc

This gives you a powerful solving strategy: for each inequality constraint, consider two cases. Either the constraint is active, or $\mu_i = 0$. Try each combination and see which one leads to a consistent solution.
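
For small problems with a quadratic objective and linear constraints, this case enumeration can be automated: for each subset of constraints assumed active, stationarity plus the active constraints form a linear system. The sketch below is my own illustration (the problem is the same quadratic program solved as Example 3, Part B later in this article): it enumerates every active set for minimizing $(x-3)^2 + (y-3)^2$ subject to $x + y \leq 4$, $x \geq 0$, $y \geq 0$, and keeps only the candidates that pass primal and dual feasibility.

```python
import itertools
import numpy as np

# Constraints g_i(z) <= 0 written as a_i . z + b_i <= 0 for z = (x, y)
A = np.array([[ 1.0,  1.0],   # g1: x + y - 4 <= 0
              [-1.0,  0.0],   # g2: -x <= 0
              [ 0.0, -1.0]])  # g3: -y <= 0
b = np.array([-4.0, 0.0, 0.0])

tol = 1e-9
kkt_points = []

# Try every possible active set: each constraint is either active or mu_i = 0
for size in range(4):
    for active in itertools.combinations(range(3), size):
        # Unknowns: x, y, and mu_i for each active i (inactive mu_i = 0).
        # Stationarity: 2(x - 3) + sum mu_i a_ix = 0 and 2(y - 3) + sum mu_i a_iy = 0.
        n = 2 + len(active)
        M = np.zeros((n, n))
        rhs = np.zeros(n)
        M[0, 0] = M[1, 1] = 2.0
        rhs[:2] = 6.0
        for k, i in enumerate(active):
            M[:2, 2 + k] = A[i]   # mu_i multiplies grad g_i in stationarity
            M[2 + k, :2] = A[i]   # active constraint: a_i . z + b_i = 0
            rhs[2 + k] = -b[i]
        try:
            sol = np.linalg.solve(M, rhs)
        except np.linalg.LinAlgError:
            continue  # this combination of active constraints is degenerate
        z, mu = sol[:2], sol[2:]
        # Keep only candidates that are primal and dual feasible
        if np.all(A @ z + b <= tol) and np.all(mu >= -tol):
            kkt_points.append((z, active, mu))

for z, active, mu in kkt_points:
    print(f"KKT point {z} with active set {active} and multipliers {mu}")
```

Only one combination survives: the point $(2, 2)$ with the first constraint active. This brute-force approach scales as $2^m$ in the number of constraints, which is exactly why practical solvers use smarter active-set or interior point strategies.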

Interpreting the multipliers

The multiplier $\mu_i^*$ tells you the sensitivity of the optimal value to constraint $i$. If you relax the constraint slightly, changing $g_i(x) \leq 0$ to $g_i(x) \leq \epsilon$, the optimal objective value changes by approximately $-\mu_i^* \epsilon$. A large multiplier means the constraint is “expensive”: it is costing you a lot in objective value.

If $\mu_i^* = 0$, the constraint is not affecting the solution at all. You could remove it and nothing would change.

Worked Example 1: Minimizing distance with a linear constraint

Problem: Minimize $f(x, y) = x^2 + y^2$ subject to $x + y \geq 1$.

Contour plot of $f(x, y) = x^2 + y^2$ with the inequality constraint $x + y \geq 1$. The shaded region is feasible. The KKT point $(1/2, 1/2)$ lies on the constraint boundary, where the gradient of $f$ is proportional to the gradient of the constraint.

Step 1: Rewrite in standard form. The KKT framework uses $g_i(x) \leq 0$, so rewrite the constraint:

$$g(x, y) = -x - y + 1 \leq 0$$

Step 2: Write the KKT conditions.

Stationarity:

$$\nabla f + \mu \nabla g = 0 \implies \begin{bmatrix} 2x \\ 2y \end{bmatrix} + \mu \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

So: $2x - \mu = 0$ and $2y - \mu = 0$.

Primal feasibility: $-x - y + 1 \leq 0$, i.e., $x + y \geq 1$.

Dual feasibility: $\mu \geq 0$.

Complementary slackness: $\mu(-x - y + 1) = 0$.

Step 3: Consider two cases.

Case A: Constraint inactive ($\mu = 0$).

From stationarity: $2x = 0$ and $2y = 0$, giving $x = 0$, $y = 0$.

Check primal feasibility: $0 + 0 = 0 \geq 1$? No. ✗

The unconstrained minimum $(0, 0)$ violates the constraint. This case fails.

Case B: Constraint active ($g = 0$, so $x + y = 1$).

From stationarity: $x = \mu/2$ and $y = \mu/2$, so $x = y$.

Substituting into $x + y = 1$: $2x = 1$, so $x = 1/2$, $y = 1/2$.

From stationarity: $\mu = 2x = 1$.

Step 4: Verify all four conditions at $(x^*, y^*) = (1/2, 1/2)$, $\mu^* = 1$.

  • ✓ Stationarity: $2(1/2) - 1 = 0$ and $2(1/2) - 1 = 0$.
  • ✓ Primal feasibility: $-(1/2) - (1/2) + 1 = 0 \leq 0$.
  • ✓ Dual feasibility: $\mu^* = 1 \geq 0$.
  • ✓ Complementary slackness: $1 \cdot 0 = 0$.

The optimal value is $f(1/2, 1/2) = 1/4 + 1/4 = 1/2$.
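
As a sanity check, a numerical solver should land on the same point. A minimal sketch using scipy.optimize.minimize with SLSQP (the same setup as the implementation section later in this article), with the constraint written in scipy's fun(x) >= 0 convention:

```python
from scipy.optimize import minimize

# Example 1: minimize x^2 + y^2 subject to x + y >= 1
result = minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[0.0, 0.0],
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": lambda v: v[0] + v[1] - 1}],
)

print(result.x)    # expect approximately [0.5, 0.5]
print(result.fun)  # expect approximately 0.5
```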

Worked Example 2: A 1D bound constraint

Problem: Minimize $f(x) = (x - 3)^2$ subject to $x \leq 2$.

This is a simple one-dimensional problem. The objective wants $x = 3$, but the constraint says $x$ cannot exceed 2.

Step 1: Standard form. The constraint is already in standard form: $g(x) = x - 2 \leq 0$.

Step 2: Write the KKT conditions.

Stationarity:

$$f'(x) + \mu g'(x) = 0 \implies 2(x - 3) + \mu = 0$$

Primal feasibility: $x - 2 \leq 0$.

Dual feasibility: $\mu \geq 0$.

Complementary slackness: $\mu(x - 2) = 0$.

Step 3: Consider two cases.

Case A: Constraint inactive ($\mu = 0$).

From stationarity: $2(x - 3) = 0$, so $x = 3$.

Check primal feasibility: $3 - 2 = 1 \leq 0$? No. ✗

The unconstrained optimum is infeasible.

Case B: Constraint active ($x = 2$).

From stationarity: $2(2 - 3) + \mu = 0$, so $-2 + \mu = 0$, giving $\mu = 2$.

Step 4: Verify all four conditions at $x^* = 2$, $\mu^* = 2$.

  • ✓ Stationarity: $2(2 - 3) + 2 = -2 + 2 = 0$.
  • ✓ Primal feasibility: $2 - 2 = 0 \leq 0$.
  • ✓ Dual feasibility: $\mu^* = 2 \geq 0$.
  • ✓ Complementary slackness: $2 \cdot (2 - 2) = 2 \cdot 0 = 0$.

The optimal value is $f(2) = (2 - 3)^2 = 1$.

The multiplier $\mu^* = 2$ tells you something concrete. If you relax the constraint to $x \leq 2.1$, the optimal value decreases by approximately $2 \times 0.1 = 0.2$, from $1.0$ to about $0.81$ (since $(2.1 - 3)^2 = 0.81$).
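
This sensitivity prediction is easy to test numerically. A small sketch (my own check, using the same SLSQP setup as later in this article) solves the problem with the original and the relaxed bound, then compares the change in optimal value to the first-order estimate $\mu^* \epsilon$:

```python
from scipy.optimize import minimize

def solve(bound):
    """Minimize (x - 3)^2 subject to x <= bound."""
    res = minimize(
        lambda v: (v[0] - 3) ** 2,
        x0=[0.0],
        method="SLSQP",
        constraints=[{"type": "ineq", "fun": lambda v: bound - v[0]}],
    )
    return res.fun

f_tight = solve(2.0)    # ~1.00
f_relaxed = solve(2.1)  # ~0.81

print(f_tight - f_relaxed)  # ~0.19, close to mu * eps = 2 * 0.1 = 0.2
```

The actual change (0.19) differs slightly from the first-order estimate (0.2) because the multiplier only gives the instantaneous rate of change; for a finite relaxation the curvature of the objective matters too.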

Worked Example 3: A constrained quadratic program

Part A: Unconstrained minimum is feasible

Problem: Minimize $f(x, y) = x^2 + y^2$ subject to $x + y \leq 4$, $x \geq 0$, $y \geq 0$.

Step 1: Standard form. Rewrite all constraints as $g_i \leq 0$:

$$g_1(x, y) = x + y - 4 \leq 0, \quad g_2(x, y) = -x \leq 0, \quad g_3(x, y) = -y \leq 0$$

Step 2: Check the unconstrained minimum. $\nabla f = (2x, 2y) = 0$ gives $(x, y) = (0, 0)$.

Is $(0, 0)$ feasible?

  • $g_1 = 0 + 0 - 4 = -4 \leq 0$
  • $g_2 = 0 \leq 0$
  • $g_3 = 0 \leq 0$

Yes. The unconstrained minimum is feasible. (Note that $g_2$ and $g_3$ hold with equality at $(0, 0)$, so those constraints are technically active, but since $\nabla f$ already vanishes there, no constraint needs to push on the solution.) All multipliers can be taken as zero: $\mu_1 = \mu_2 = \mu_3 = 0$.

Step 3: Verify KKT conditions at $(0, 0)$ with $\mu_1 = \mu_2 = \mu_3 = 0$.

  • ✓ Stationarity: $\nabla f + \mu_1 \nabla g_1 + \mu_2 \nabla g_2 + \mu_3 \nabla g_3 = (0, 0) + 0 + 0 + 0 = (0, 0)$.
  • ✓ Primal feasibility: all constraints satisfied as shown above.
  • ✓ Dual feasibility: all $\mu_i = 0 \geq 0$.
  • ✓ Complementary slackness: $0 \cdot (-4) = 0$, $0 \cdot 0 = 0$, $0 \cdot 0 = 0$.

Optimal value: $f(0, 0) = 0$.
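
A quick numerical check of Part A (the same SLSQP setup as the implementation section later in this article) confirms that the solver stays at the interior optimum:

```python
from scipy.optimize import minimize

# Part A: minimize x^2 + y^2 subject to x + y <= 4, x >= 0, y >= 0
result = minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[1.0, 1.0],
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda v: 4 - v[0] - v[1]},  # x + y <= 4
        {"type": "ineq", "fun": lambda v: v[0]},             # x >= 0
        {"type": "ineq", "fun": lambda v: v[1]},             # y >= 0
    ],
)

print(result.x)    # expect approximately [0, 0]
print(result.fun)  # expect approximately 0
```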

Part B: Unconstrained minimum is infeasible

Problem: Minimize $f(x, y) = (x - 3)^2 + (y - 3)^2$ subject to $x + y \leq 4$, $x \geq 0$, $y \geq 0$.

The unconstrained minimum is $(3, 3)$, but $3 + 3 = 6 > 4$. The point violates the first constraint.

Step 1: Which constraints are active? By symmetry (the objective and the binding constraint $x + y = 4$ treat $x$ and $y$ identically), expect $x = y$ at the solution. If $x = y$ and $x + y = 4$, then $x = y = 2$. Both $x \geq 0$ and $y \geq 0$ are satisfied but not active.

So assume $g_1$ is active ($x + y = 4$) and $\mu_2 = \mu_3 = 0$.

Step 2: KKT stationarity with only $\mu_1$ active.

$$\nabla f + \mu_1 \nabla g_1 = 0 \implies \begin{bmatrix} 2(x - 3) \\ 2(y - 3) \end{bmatrix} + \mu_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

From the first equation: $2(x - 3) + \mu_1 = 0$.

From the second equation: $2(y - 3) + \mu_1 = 0$.

Subtracting: $2(x - 3) = 2(y - 3)$, so $x = y$.

With $x + y = 4$: $x = y = 2$.

Substituting back: $2(2 - 3) + \mu_1 = 0$, so $-2 + \mu_1 = 0$, giving $\mu_1 = 2$.

Step 3: Verify all KKT conditions at $(2, 2)$, $\mu_1 = 2$, $\mu_2 = \mu_3 = 0$.

  • ✓ Stationarity: $2(2 - 3) + 2 = 0$ for both components.
  • ✓ Primal feasibility: $2 + 2 - 4 = 0 \leq 0$; $-2 \leq 0$; $-2 \leq 0$.
  • ✓ Dual feasibility: $\mu_1 = 2 \geq 0$, $\mu_2 = 0 \geq 0$, $\mu_3 = 0 \geq 0$.
  • ✓ Complementary slackness: $2 \cdot 0 = 0$; $0 \cdot (-2) = 0$; $0 \cdot (-2) = 0$.

Optimal value: $f(2, 2) = (2 - 3)^2 + (2 - 3)^2 = 1 + 1 = 2$.

Again, $\mu_1 = 2$ tells you that relaxing the constraint $x + y \leq 4$ to $x + y \leq 4.1$ would decrease the optimal value by about $0.2$.

KKT for convex vs non-convex problems

For convex problems (convex objective, convex inequality constraints, affine equality constraints), the KKT conditions are both necessary and sufficient for global optimality. If you find a point satisfying all four conditions, you have the global minimum. No further checking needed.

All three examples above were convex, so verifying KKT was enough to confirm optimality.

For non-convex problems, KKT conditions are only necessary (under a technical condition called constraint qualification). A point can satisfy all four KKT conditions and still fail to be a global minimum: it might be only a local minimum, a saddle point, or even a local maximum. You would need second-order conditions involving the Hessian to distinguish these cases.

Constraint qualification is a regularity condition that ensures the KKT conditions actually apply. The most common one is the Linear Independence Constraint Qualification (LICQ): the gradients of the active constraints at xx^* must be linearly independent. In practice, LICQ holds for most well-posed problems. When it fails, you can get situations where the KKT conditions have no solution even though an optimal point exists.
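
A standard textbook illustration of this failure (a well-known example, not specific to this series): minimize $f(x) = x$ subject to the single constraint $g(x) = x^2 \leq 0$. The only feasible point is $x^* = 0$, so it is trivially optimal. But stationarity would require

$$f'(x^*) + \mu \, g'(x^*) = 1 + \mu \cdot 2x^* = 1 + \mu \cdot 0 = 0,$$

which no multiplier $\mu$ can satisfy. The culprit is $\nabla g(x^*) = 0$: the active constraint gradient is not linearly independent, LICQ fails, and the KKT system has no solution even though the optimum exists.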

Connection to machine learning

KKT conditions are not just theoretical. They appear directly in several ML algorithms.

Support vector machines. The SVM optimization problem minimizes $\|w\|^2$ subject to the constraint that each training point is correctly classified with a margin: $y_i(w \cdot x_i + b) \geq 1$. The KKT conditions for this problem tell you that:

  • Most data points have $\mu_i = 0$ (they are not support vectors).
  • The support vectors are the points where the constraint is active ($\mu_i > 0$).

Complementary slackness is what makes SVMs sparse: only a few training points actually determine the decision boundary.
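
You can see this sparsity on a toy dataset. The sketch below is my own illustration (the data points and tolerances are arbitrary choices): it solves the hard-margin SVM primal directly with SLSQP and then counts which margin constraints are active.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset: two points per class
X = np.array([[0.0, 0.0], [-1.0, -1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Variables v = (w1, w2, b); minimize ||w||^2
# subject to y_i (w . x_i + b) >= 1 for every training point
def margin(v, i):
    w, b = v[:2], v[2]
    return y[i] * (X[i] @ w + b) - 1.0

result = minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[0.0, 0.0, 0.0],
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin, "args": (i,)}
                 for i in range(len(X))],
)

w, b = result.x[:2], result.x[2]
# Active margin constraints = support vectors
active = [i for i in range(len(X)) if abs(margin(result.x, i)) < 1e-3]
print(f"w = {w}, b = {b:.3f}")
print(f"active (support) points: {active}")
```

Only the two points closest to the boundary come out active; the outer points have slack in their constraints, so by complementary slackness their multipliers are zero and they could be deleted without moving the separating hyperplane.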

Regularized regression. L1 regularization (lasso) can be formulated as a constrained optimization problem: minimize the squared error subject to $\|w\|_1 \leq t$. The KKT conditions explain why lasso produces sparse solutions: stationarity involves the subgradient of the L1 norm, which is a whole interval at zero rather than a single value, so a coefficient can sit exactly at zero whenever its feature’s correlation with the residual is small enough. Geometrically, the optimum tends to land on a corner of the L1 ball, where some coordinates are exactly zero.
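
For the special case of an orthonormal design matrix, this KKT reasoning yields a closed form: each lasso coefficient is the ordinary least-squares coefficient pushed toward zero by soft thresholding. A minimal numpy sketch (the coefficient values and threshold below are made-up numbers for illustration):

```python
import numpy as np

def soft_threshold(z, thresh):
    """Soft-thresholding operator: the per-coordinate lasso solution
    when the design matrix is orthonormal (thresh comes from the L1 penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

ols_coefs = np.array([2.5, -0.3, 0.05, -1.2])  # hypothetical OLS estimates
lasso_coefs = soft_threshold(ols_coefs, 0.5)

print(lasso_coefs)  # -> [ 2.   0.   0.  -0.7]: small coefficients hit exactly zero
```

Coefficients whose magnitude falls below the threshold are set exactly to zero, not merely shrunk, which is the sparsity that ridge (L2) regularization cannot produce.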

Constrained neural networks. Modern applications sometimes add constraints to neural network training, such as fairness constraints or robustness constraints. These are solved using KKT-based methods, often through augmented Lagrangian or penalty approaches.

Python implementation

You can solve constrained optimization problems in Python using scipy.optimize.minimize with the SLSQP method, which handles inequality constraints directly.

import numpy as np
from scipy.optimize import minimize

# Example 3B: Minimize (x-3)^2 + (y-3)^2
# subject to x + y <= 4, x >= 0, y >= 0

def objective(xy):
    x, y = xy
    return (x - 3)**2 + (y - 3)**2

# Inequality constraints use the convention: constraint(x) >= 0
constraints = [
    {"type": "ineq", "fun": lambda xy: 4 - xy[0] - xy[1]},  # x + y <= 4
    {"type": "ineq", "fun": lambda xy: xy[0]},                # x >= 0
    {"type": "ineq", "fun": lambda xy: xy[1]},                # y >= 0
]

result = minimize(objective, x0=[0.0, 0.0], method="SLSQP", constraints=constraints)

print(f"Optimal point: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
print(f"Optimal value: {result.fun:.4f}")

Output:

Optimal point: x = 2.0000, y = 2.0000
Optimal value: 2.0000

This matches our hand calculation from Example 3B. The solver internally uses KKT conditions to find and verify the solution.

For problems where you need the multiplier values, scipy's result object does not expose them directly. For a full KKT analysis of convex problems, you can use CVXPY:

import cvxpy as cp

x = cp.Variable()
y = cp.Variable()

objective = cp.Minimize((x - 3)**2 + (y - 3)**2)
constraints = [x + y <= 4, x >= 0, y >= 0]

problem = cp.Problem(objective, constraints)
problem.solve()

print(f"Optimal point: x = {x.value:.4f}, y = {y.value:.4f}")
print(f"Optimal value: {problem.value:.4f}")
print(f"Multiplier for x + y <= 4: {constraints[0].dual_value:.4f}")
print(f"Multiplier for x >= 0:     {constraints[1].dual_value:.4f}")
print(f"Multiplier for y >= 0:     {constraints[2].dual_value:.4f}")

Output:

Optimal point: x = 2.0000, y = 2.0000
Optimal value: 2.0000
Multiplier for x + y <= 4: 2.0000
Multiplier for x >= 0:     0.0000
Multiplier for y >= 0:     0.0000

The multipliers confirm what we found by hand: $\mu_1 = 2$ for the active constraint, and $\mu_2 = \mu_3 = 0$ for the inactive ones.

Summary

Checking all KKT conditions

flowchart TD
  A["Candidate point x*"] --> B["1. Stationarity:<br/>gradient of Lagrangian = 0?"]
  B --> C["2. Primal feasibility:<br/>all constraints satisfied?"]
  C --> D["3. Dual feasibility:<br/>multipliers mu_i >= 0?"]
  D --> E["4. Complementary slackness:<br/>mu_i * g_i = 0 for all i?"]
  E --> F{"All four pass?"}
  F -->|"Yes + convex problem"| G["Global optimum confirmed"]
  F -->|"Yes + non-convex"| H["Candidate for local optimum"]
  F -->|"No"| I["Not a KKT point"]
  style G fill:#ccffcc
  style I fill:#ffcccc

The KKT conditions boil down to four checks:

| Condition | What it says | Why it matters |
|---|---|---|
| Stationarity | Gradients balance out | The point is a critical point of the Lagrangian |
| Primal feasibility | All constraints satisfied | The point is actually feasible |
| Dual feasibility | $\mu_i \geq 0$ | Inequality multipliers have the right sign |
| Complementary slackness | $\mu_i g_i = 0$ | Inactive constraints have zero multipliers |

For convex problems, these four conditions are the whole story. Find a point that satisfies them, and you have the global optimum.

What comes next

The KKT conditions tell you what the solution looks like, but they do not directly give you an algorithm to find it. For large-scale problems where you cannot solve the KKT system by hand, you need iterative methods. Penalty and barrier methods convert constrained problems into a sequence of unconstrained problems that you can solve with standard tools like gradient descent. They are the bridge between the theory of KKT and practical large-scale optimization.
