
Penalty and barrier methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should be comfortable with KKT conditions before reading this post. We will reference the Lagrangian frequently, so a quick refresher on that will help too.


The fence around the forbidden zone

Imagine you are searching for the lowest point in a landscape, but certain areas are off limits. A fence surrounds the forbidden zones. You cannot cross the fence.

One approach: ignore the fence, find the lowest point anywhere, then check whether you ended up in a forbidden zone. If you did, add an electric charge to the fence that shocks you proportionally to how far you trespass. Crank up the voltage and search again. Eventually, the shock becomes so painful that you stay on the legal side. That is the penalty method.

Another approach: stay inside the legal zone from the start. As you walk toward the fence, the ground slopes upward more and more steeply, repelling you from the boundary. The closer you get, the steeper the repulsion. That is the barrier method.

Both approaches convert a constrained problem into a sequence of unconstrained ones.

Penalty method: a concrete example

|  | Value |
| --- | --- |
| Objective | Minimize $(x - 3)^2$ |
| Constraint | $x \le 1$ |
| Unconstrained minimum | $x = 3$ (violates constraint) |
| With penalty, $\mu = 10$ | $x \approx 1.33$ (still violates, but less) |
| With penalty, $\mu = 100$ | $x \approx 1.04$ (almost feasible) |
| With penalty, $\mu = 1000$ | $x \approx 1.004$ (converging to $x = 1$) |

The constrained optimum here is $x = 1$, the boundary. The penalty pushes the solution from the unconstrained minimum at $x = 3$ toward the constraint boundary at $x = 1$, but approaches it from the infeasible side.

From constrained to unconstrained

flowchart LR
  A["Original problem:<br/>minimize f subject to constraints"] --> B["Replace constraints<br/>with penalty or barrier term"]
  B --> C["Solve unconstrained<br/>penalized problem"]
  C --> D{"Solution close enough<br/>to feasible?"}
  D -->|"No"| E["Increase penalty weight"]
  E --> C
  D -->|"Yes"| F["Done: approximate<br/>constrained solution"]

Now let’s make this precise.


The core idea

Constrained optimization is harder than unconstrained optimization. You have a perfectly good objective function, but then there are constraints that fence in where you can go. Penalty and barrier methods sidestep this by folding the constraints into the objective itself. You solve a sequence of unconstrained (or simpler) problems, and the solutions gradually converge to the true constrained optimum.

There are two main families:

  • Exterior penalty methods add a cost for violating constraints. They work from outside the feasible region, pushing solutions inward.
  • Interior barrier methods add a cost for approaching constraint boundaries. They work from inside the feasible region, keeping solutions away from the walls.

Both transform one hard problem into many easier ones.


Exterior penalty methods

Setup

Consider a generic constrained problem:

$$\min_{x} f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad i = 1, \dots, m$$

The quadratic penalty approach replaces this with:

$$\min_{x} \; f(x) + \frac{\mu}{2} \sum_{i=1}^{m} \bigl[\max(0, g_i(x))\bigr]^2$$

Here $\mu > 0$ is the penalty parameter. When a constraint $g_i(x) \le 0$ is satisfied, its penalty term is zero. When it is violated, you pay a cost proportional to $\mu$ times the squared violation.

How it works

  1. Pick an initial $\mu_0$ (say 1).
  2. Solve the unconstrained penalized problem to get $x^*(\mu_0)$.
  3. Increase $\mu$: set $\mu_1 = 10\mu_0$.
  4. Solve again, warm-starting from $x^*(\mu_0)$.
  5. Repeat until the constraint violations are small enough.

As $\mu \to \infty$, the solution $x^*(\mu)$ converges to the true constrained optimum. The catch: large $\mu$ makes the penalized problem ill-conditioned. The Hessian of the penalty term grows with $\mu$, so the subproblem gets harder to solve numerically.

Why quadratic?

You could use $\max(0, g_i(x))$ directly (an $\ell_1$ penalty), but it has a kink at the boundary $g_i(x) = 0$. Squaring makes the penalty continuously differentiable everywhere, including at the boundary (only its second derivative jumps there). Smoothness lets you use gradient descent or Newton's method on the subproblems without worrying about kinks.
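To see the difference concretely, here is a tiny check (plain Python; the helper names are just for illustration) comparing one-sided difference quotients of the two penalties at the boundary $g = 0$:

```python
# Compare the l1 penalty max(0, g) and the quadratic penalty max(0, g)^2
# at the constraint boundary g = 0 using one-sided difference quotients.
def p_l1(g): return max(0.0, g)
def p_quad(g): return max(0.0, g) ** 2

h = 1e-6
for name, p in [("l1", p_l1), ("quadratic", p_quad)]:
    left = (p(0.0) - p(-h)) / h    # slope just left of the boundary
    right = (p(h) - p(0.0)) / h    # slope just right of the boundary
    print(f"{name:9s}  left slope = {left:.6f}  right slope = {right:.6f}")
```

The $\ell_1$ slope jumps from 0 to 1 across the boundary, while the quadratic slope is (essentially) 0 on both sides: the kink is gone.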


Example 1: Quadratic penalty on a simple problem

Problem:

$$\min_{x} \; (x - 3)^2 \quad \text{subject to} \quad x \le 1$$

The unconstrained minimum is $x = 3$, but the constraint forces $x \le 1$. The true constrained optimum is $x^* = 1$.

Rewrite the constraint as $g(x) = x - 1 \le 0$.

Penalized objective:

$$P(x; \mu) = (x - 3)^2 + \frac{\mu}{2} [\max(0, x - 1)]^2$$

For $x > 1$, the penalty fires and we get:

$$P(x; \mu) = (x - 3)^2 + \frac{\mu}{2}(x - 1)^2$$

Take the derivative and set it to zero:

$$\frac{dP}{dx} = 2(x - 3) + \mu(x - 1) = 0$$
$$2x - 6 + \mu x - \mu = 0$$
$$x(2 + \mu) = 6 + \mu$$
$$x^*(\mu) = \frac{6 + \mu}{2 + \mu}$$

Step-by-step for different $\mu$ values:

$\mu = 1$:

$$x^*(1) = \frac{6 + 1}{2 + 1} = \frac{7}{3} \approx 2.333$$

Still far from the boundary. Constraint violation: $x - 1 = 1.333$.

$\mu = 10$:

$$x^*(10) = \frac{6 + 10}{2 + 10} = \frac{16}{12} \approx 1.333$$

Closer. Violation: $0.333$.

$\mu = 100$:

$$x^*(100) = \frac{6 + 100}{2 + 100} = \frac{106}{102} \approx 1.039$$

Almost there. Violation: $0.039$.

$\mu = 1000$:

$$x^*(1000) = \frac{1006}{1002} \approx 1.004$$

Converging to $x^* = 1$. Notice that you never exactly hit the constraint boundary at any finite $\mu$. You approach it from the infeasible side.

As the penalty parameter $\mu$ grows, the penalized objective increasingly discourages constraint violations, pulling the minimum toward the feasible region $x \le 1$.
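The closed form $x^*(\mu) = (6+\mu)/(2+\mu)$ is easy to sanity-check numerically. Since the penalized objective is convex, its derivative is increasing, so a simple bisection finds the minimizer (a plain-Python sketch; the helper name is my own):

```python
def penalized_min(mu, lo=0.0, hi=10.0, iters=60):
    """Minimize P(x; mu) = (x-3)^2 + mu/2 * max(0, x-1)^2 by
    bisection on its monotonically increasing derivative."""
    dP = lambda x: 2 * (x - 3) + mu * max(0.0, x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dP(mid) < 0:
            lo = mid          # minimizer lies to the right
        else:
            hi = mid          # minimizer lies to the left
    return 0.5 * (lo + hi)

for mu in [1, 10, 100, 1000]:
    print(f"mu={mu:5d}  numeric={penalized_min(mu):.4f}  "
          f"formula={(6 + mu) / (2 + mu):.4f}")
```

The numeric minimizers match the formula for every $\mu$ in the table above.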


Interior barrier methods

The opposite approach

Instead of allowing constraint violations and penalizing them, barrier methods stay strictly inside the feasible region and add a term that blows up as you approach the boundary.

The most common choice is the logarithmic barrier:

$$\min_{x} \; f(x) - t \sum_{i=1}^{m} \ln(-g_i(x))$$

Here $t > 0$ is the barrier parameter. The term $-\ln(-g_i(x))$ goes to $+\infty$ as $g_i(x) \to 0^-$ (approaching the boundary from inside). So the barrier repels you from the walls.

How it works

  1. Start with a feasible point $x_0$ (this is required, and sometimes finding one is its own problem).
  2. Pick a large initial $t_0$ (say $t_0 = 1$).
  3. Solve the barrier problem to get $x^*(t_0)$.
  4. Decrease $t$: set $t_1 = t_0 / 10$.
  5. Solve again, warm-starting from $x^*(t_0)$.
  6. Repeat until $t$ is small enough.

As $t \to 0^+$, the barrier term vanishes and $x^*(t)$ converges to the constrained optimum. The collection of solutions $\{x^*(t) : t > 0\}$ traces out the central path, a smooth curve through the interior that leads to the optimum.


Example 2: Log barrier on a bounded problem

Problem:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x \ge 1$$

Rewrite as $g(x) = 1 - x \le 0$ (so $-g(x) = x - 1 > 0$ in the interior).

Barrier objective:

$$B(x; t) = x^2 - t \ln(x - 1)$$

Take the derivative:

$$\frac{dB}{dx} = 2x - \frac{t}{x - 1} = 0$$
$$2x(x - 1) = t$$
$$2x^2 - 2x - t = 0$$

Using the quadratic formula (keeping the root with $x > 1$):

$$x^*(t) = \frac{2 + \sqrt{4 + 8t}}{4} = \frac{1 + \sqrt{1 + 2t}}{2}$$

Step-by-step for different $t$ values:

$t = 2$:

$$x^*(2) = \frac{1 + \sqrt{1 + 4}}{2} = \frac{1 + \sqrt{5}}{2} \approx \frac{1 + 2.236}{2} \approx 1.618$$

The barrier pushes us away from the boundary $x = 1$.

$t = 0.5$:

$$x^*(0.5) = \frac{1 + \sqrt{1 + 1}}{2} = \frac{1 + \sqrt{2}}{2} \approx \frac{1 + 1.414}{2} \approx 1.207$$

Getting closer to the boundary.

$t = 0.1$:

$$x^*(0.1) = \frac{1 + \sqrt{1 + 0.2}}{2} = \frac{1 + \sqrt{1.2}}{2} \approx \frac{1 + 1.095}{2} \approx 1.048$$

$t = 0.01$:

$$x^*(0.01) = \frac{1 + \sqrt{1.02}}{2} \approx \frac{1 + 1.010}{2} \approx 1.005$$

Converging to $x^* = 1$ from the feasible side. Unlike the penalty method, every iterate satisfies the constraint strictly.
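The same bisection idea checks the barrier solutions, with one pleasant property: every point the search touches lies in the strictly feasible interval $(1, 5]$ (a small sketch; the bracket and helper name are my own choices):

```python
import math

def barrier_min(t, lo=1.0 + 1e-9, hi=5.0, iters=60):
    """Minimize B(x; t) = x^2 - t*ln(x - 1) by bisection on its
    increasing derivative dB/dx = 2x - t/(x - 1), staying inside x > 1."""
    dB = lambda x: 2 * x - t / (x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dB(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for t in [2, 0.5, 0.1, 0.01]:
    print(f"t={t:5}  numeric={barrier_min(t):.4f}  "
          f"formula={(1 + math.sqrt(1 + 2 * t)) / 2:.4f}")
```

The numeric values match the closed form, and every intermediate point is strictly feasible, exactly as the method promises.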


Parameter scheduling

Both methods require a sequence of parameter values. How fast should you change $\mu$ or $t$?

| Strategy | Penalty $\mu$ | Barrier $t$ | Tradeoff |
| --- | --- | --- | --- |
| Conservative | Multiply by 2 each round | Divide by 2 | More subproblems, each is well-conditioned |
| Aggressive | Multiply by 10 each round | Divide by 10 | Fewer subproblems, each is harder to solve |
| Adaptive | Increase based on constraint violation | Decrease based on duality gap | Best of both, more complex to implement |

A common practical choice is to multiply $\mu$ by 10 (or divide $t$ by 10) each round. If the subproblem solver struggles, back off to a factor of 2.
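The adaptive row can be implemented with a simple rule; the particular heuristic below is an illustration in the spirit of augmented Lagrangian practice, not a universal standard. Raise $\mu$ only when the constraint violation fails to shrink enough between outer rounds:

```python
def update_mu(mu, prev_violation, curr_violation, shrink=0.25, factor=10.0):
    """If the violation did not drop to `shrink` times the previous one,
    the current penalty is too weak: increase it. Otherwise keep it."""
    if curr_violation > shrink * prev_violation:
        return mu * factor
    return mu

print(update_mu(1.0, 1.0, 0.5))   # slow progress: mu jumps to 10.0
print(update_mu(1.0, 1.0, 0.1))   # good progress: mu stays at 1.0
```

Keeping $\mu$ fixed while progress is good avoids needlessly ill-conditioned subproblems.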

Penalty parameter increasing sequence

graph LR
  A["mu = 1<br/>Solution far from feasible"] --> B["mu = 10<br/>Solution closer"]
  B --> C["mu = 100<br/>Nearly feasible"]
  C --> D["mu = 1000<br/>Approximately optimal"]
  style A fill:#ffcccc
  style B fill:#fff3cd
  style C fill:#ccffcc
  style D fill:#ccffcc

The full penalty method iteration

flowchart TD
  A["Choose initial penalty weight mu"] --> B["Solve unconstrained penalized problem"]
  B --> C["Check constraint violation"]
  C --> D{"Violation small enough?"}
  D -->|"Yes"| E["Return solution"]
  D -->|"No"| F["Increase mu by factor of 10"]
  F --> B

Example 3: Penalty method on a 2D problem

Problem:

$$\min_{x_1, x_2} \; (x_1 - 2)^2 + (x_2 - 2)^2 \quad \text{subject to} \quad x_1 + x_2 \le 2$$

The unconstrained optimum is $(2, 2)$, but the constraint $x_1 + x_2 \le 2$ makes that infeasible. By symmetry, the constrained optimum is $(1, 1)$.

Constraint: $g(x_1, x_2) = x_1 + x_2 - 2 \le 0$.

Penalized objective:

$$P(x_1, x_2; \mu) = (x_1 - 2)^2 + (x_2 - 2)^2 + \frac{\mu}{2}(x_1 + x_2 - 2)^2$$

where the penalty fires whenever $x_1 + x_2 > 2$.

Take partial derivatives and set to zero:

$$\frac{\partial P}{\partial x_1} = 2(x_1 - 2) + \mu(x_1 + x_2 - 2) = 0$$
$$\frac{\partial P}{\partial x_2} = 2(x_2 - 2) + \mu(x_1 + x_2 - 2) = 0$$

Subtracting the two equations gives $2(x_1 - 2) = 2(x_2 - 2)$, so $x_1 = x_2$. By symmetry, this makes sense. Substitute $x_1 = x_2 = x$:

$$2(x - 2) + \mu(2x - 2) = 0$$
$$2x - 4 + 2\mu x - 2\mu = 0$$
$$x(2 + 2\mu) = 4 + 2\mu$$
$$x^*(\mu) = \frac{4 + 2\mu}{2 + 2\mu} = \frac{2 + \mu}{1 + \mu}$$

$\mu = 1$: $x^* = \frac{3}{2} = 1.5$, so $(x_1, x_2) = (1.5, 1.5)$. Violation: $1.5 + 1.5 - 2 = 1.0$.

$\mu = 10$: $x^* = \frac{12}{11} \approx 1.091$. Violation: $0.182$.

$\mu = 100$: $x^* = \frac{102}{101} \approx 1.0099$. Violation: $0.0198$.

$\mu = 1000$: $x^* \approx 1.001$. Violation: $0.002$.

The sequence $(1.5, 1.5) \to (1.091, 1.091) \to (1.010, 1.010) \to (1.001, 1.001)$ converges to the optimum $(1, 1)$.

Contour plot of the penalized objective ($\mu = 10$) for $\min (x_1 - 2)^2 + (x_2 - 2)^2$ subject to $x_1 + x_2 \le 2$. The penalty warps the contours and shifts the minimum toward the constraint boundary at $(1, 1)$.
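You can reproduce the sequence above in a dozen lines of plain Python; the function name and the step-size rule (shrinking the step as $\mu$ grows, to keep the inner gradient descent stable against the growing penalized Hessian) are my own choices:

```python
def penalty_gd_2d(mu_schedule=(1, 10, 100, 1000), inner=500):
    """Quadratic penalty for min (x1-2)^2 + (x2-2)^2 s.t. x1 + x2 <= 2,
    with each subproblem solved by warm-started gradient descent."""
    x1, x2 = 2.0, 2.0              # start at the unconstrained optimum
    for mu in mu_schedule:
        lr = 0.5 / (1.0 + mu)      # penalized Hessian grows with mu
        for _ in range(inner):
            v = max(0.0, x1 + x2 - 2)        # constraint violation
            g1 = 2 * (x1 - 2) + mu * v
            g2 = 2 * (x2 - 2) + mu * v
            x1, x2 = x1 - lr * g1, x2 - lr * g2
        print(f"mu={mu:5d}  x=({x1:.4f}, {x2:.4f})")
    return x1, x2

x_final = penalty_gd_2d()
```

The printed iterates track the closed-form values $(2+\mu)/(1+\mu)$ round by round.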


Comparing penalty and barrier

Penalty vs barrier: two paths to the same answer

graph TD
  A["Constrained problem"] --> B["Penalty method"]
  A --> C["Barrier method"]
  B --> D["Start anywhere<br/>Allow violations<br/>Increase penalty weight"]
  C --> E["Start inside feasible region<br/>Stay feasible always<br/>Decrease barrier weight"]
  D --> F["Approach optimum from outside"]
  E --> G["Approach optimum from inside"]
  F --> H["Constrained optimum"]
  G --> H
  style D fill:#ffcccc
  style E fill:#ccffcc

| Aspect | Exterior penalty | Interior barrier |
| --- | --- | --- |
| Starting point | Any point (feasible or not) | Must be strictly feasible |
| Iterates | Infeasible, converge to feasible | Always feasible |
| Parameter direction | $\mu \to \infty$ | $t \to 0^+$ |
| Conditioning | Worsens as $\mu$ grows | Worsens as $t$ shrinks |
| Constraint types | Equality and inequality | Inequality only (need strict interior) |

When to use which

Use penalty methods when:

  • Finding a feasible starting point is hard.
  • You have equality constraints (barrier methods cannot handle these directly).
  • Approximate feasibility is acceptable early on.

Use barrier methods when:

  • You need every iterate to be feasible (important in some engineering applications).
  • You are solving a convex problem where the interior is easy to find.
  • You want to build toward interior point methods (the topic of the next article).

Augmented Lagrangian: the best of both worlds

Pure penalty methods converge slowly because you need $\mu \to \infty$ to get exact feasibility. The augmented Lagrangian method fixes this by combining a Lagrange multiplier estimate with a moderate penalty:

$$\mathcal{L}_A(x, \lambda; \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \frac{\mu}{2} \sum_{i=1}^{m} [g_i(x)]^2$$

The key insight: by updating $\lambda$ after each subproblem solve, you can achieve convergence with bounded $\mu$. No need to drive $\mu$ to infinity. This keeps the subproblems well-conditioned.

Update rule after solving with current $(\lambda^k, \mu^k)$:

$$\lambda_i^{k+1} = \lambda_i^k + \mu^k \, g_i(x^{k+1})$$

This is sometimes called the “method of multipliers.” (For inequality constraints, the update is clipped at zero, $\lambda_i^{k+1} = \max(0, \lambda_i^k + \mu^k g_i(x^{k+1}))$, since multipliers must stay nonnegative.) It converges much faster in practice than the plain quadratic penalty.
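As a minimal sketch, here is the method of multipliers on Example 1 ($\min (x-3)^2$ subject to $x \le 1$), whose constraint is active at the optimum with true multiplier $\lambda^* = 4$. The bisection subproblem solver, the fixed $\mu = 10$, and the zero-clipped update are my own simplifications for this one-dimensional case:

```python
def solve_subproblem(lam, mu, lo=0.0, hi=10.0, iters=60):
    """Minimize L_A(x) = (x-3)^2 + lam*(x-1) + mu/2 * max(0, x-1)^2
    by bisection on its increasing derivative."""
    dL = lambda x: 2 * (x - 3) + lam + mu * max(0.0, x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dL(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam, mu = 0.0, 10.0                     # mu stays bounded throughout
for _ in range(15):
    x = solve_subproblem(lam, mu)
    lam = max(0.0, lam + mu * (x - 1))  # clipped multiplier update
print(f"x = {x:.6f}, lambda = {lam:.4f}")
```

With $\mu$ fixed at 10, the iterates converge to $x = 1$ and $\lambda = 4$; the plain penalty method would need $\mu \to \infty$ for the same accuracy.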


Connection to KKT conditions

As the penalty parameter increases (or barrier parameter decreases), the solutions approach a point that satisfies the KKT conditions. The penalty/barrier terms effectively create approximate dual variables.

For the log barrier, the relationship is especially clean. At the barrier solution $x^*(t)$, the approximate dual variable for constraint $i$ is:

$$\lambda_i^*(t) = \frac{t}{-g_i(x^*(t))}$$

As $t \to 0$, these converge to the true KKT multipliers. This connection is the foundation of interior point methods.
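For Example 2 ($\min x^2$ subject to $x \ge 1$), Lagrangian stationarity at $x^* = 1$ gives $2x - \lambda = 0$, so the true multiplier is $\lambda^* = 2$. Plugging the closed-form barrier solution into the formula shows the convergence (a short check in plain Python):

```python
import math

# Example 2: min x^2 s.t. x >= 1, with g(x) = 1 - x.
# Barrier solution: x*(t) = (1 + sqrt(1 + 2t)) / 2.
for t in [1.0, 0.1, 0.01, 0.001]:
    x_t = (1 + math.sqrt(1 + 2 * t)) / 2
    lam_t = t / (x_t - 1)              # lambda(t) = t / (-g(x*(t)))
    print(f"t={t:6}  lambda(t)={lam_t:.4f}")
# lambda(t) approaches the true multiplier lambda* = 2 as t -> 0
```

A little algebra confirms it exactly: $\lambda(t) = 1 + \sqrt{1 + 2t}$, which tends to 2 as $t \to 0$.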


Practical tips

  1. Warm-starting matters. Always initialize each subproblem from the previous solution. This can cut total iterations dramatically.

  2. Don’t increase $\mu$ too fast. If you jump from $\mu = 1$ to $\mu = 10^6$, the first subproblem is easy but useless, and the second is extremely hard. Gradual increases with warm starts work better.

  3. Monitor both objective and feasibility. Plot the objective value and the maximum constraint violation across outer iterations. You want both to converge.

  4. For barrier methods, finding a feasible start can be done with a Phase I method: minimize a slack variable $s$ subject to $g_i(x) \le s$. If the optimal $s$ is negative, the corresponding $x$ is strictly feasible.
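A minimal Phase I sketch: minimizing the worst violation $m(x) = \max_i g_i(x)$ by subgradient descent amounts to minimizing the optimal slack, and you can stop as soon as the value goes negative. The constraint functions and step size below are illustrative:

```python
# Feasible set: g1(x) = 1 - x <= 0 and g2(x) = x - 3 <= 0, i.e. x in [1, 3].
def phase_one(x=10.0, lr=0.1, iters=1000):
    """Subgradient descent on m(x) = max(g1, g2); stop once m(x) < 0."""
    for _ in range(iters):
        g1, g2 = 1 - x, x - 3
        if max(g1, g2) < 0:
            return x                           # strictly feasible point
        x -= lr * (-1.0 if g1 >= g2 else 1.0)  # subgradient of active g
    return None                                # nothing found in budget

x_feasible = phase_one()
print(f"Strictly feasible start: x = {x_feasible}")
```

The returned point can then seed the barrier method's first subproblem.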


Python implementation sketch

def quadratic_penalty(f, grad_f, g, grad_g, x0, mu0=1.0, beta=10.0,
                      tol=1e-6, max_outer=20, max_inner=200, lr=0.5):
    """Solve min f(x) s.t. g(x) <= 0 using the quadratic penalty method."""
    x = float(x0)
    mu = mu0

    for _ in range(max_outer):
        # The penalized Hessian grows like mu, so shrink the step size
        # as mu grows; a fixed step would diverge once mu gets large.
        step = lr / (1.0 + mu)

        # Solve the penalized subproblem with gradient descent,
        # warm-started from the previous solution.
        for _ in range(max_inner):
            violation = max(0.0, g(x))
            grad = grad_f(x) + mu * violation * grad_g(x)
            x = x - step * grad

        if max(0.0, g(x)) < tol:
            break
        mu *= beta  # increase penalty

    return x

# Example: min (x-3)^2 s.t. x <= 1
f = lambda x: (x - 3)**2
grad_f = lambda x: 2 * (x - 3)
g = lambda x: x - 1          # x - 1 <= 0
grad_g = lambda x: 1.0

x_opt = quadratic_penalty(f, grad_f, g, grad_g, x0=3.0)
print(f"Solution: x = {x_opt:.4f}")  # should be close to 1.0

What comes next

Barrier methods with the log barrier are the foundation of interior point methods, which are among the most powerful algorithms in modern optimization. In the next article, we will see how the central path idea leads to practical algorithms that solve linear and convex programs in polynomial time.
