Penalty and barrier methods
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites
You should be comfortable with KKT conditions before reading this post. We will reference the Lagrangian frequently, so a quick refresher on that will help too.
The fence around the forbidden zone
Imagine you are searching for the lowest point in a landscape, but certain areas are off limits. A fence surrounds the forbidden zones. You cannot cross the fence.
One approach: ignore the fence, find the lowest point anywhere, then check whether you ended up in a forbidden zone. If you did, add an electric charge to the fence that shocks you proportionally to how far you trespass. Crank up the voltage and search again. Eventually, the shock becomes so painful that you stay on the legal side. That is the penalty method.
Another approach: stay inside the legal zone from the start. As you walk toward the fence, the ground slopes upward more and more steeply, repelling you from the boundary. The closer you get, the steeper the repulsion. That is the barrier method.
Both approaches convert a constrained problem into a sequence of unconstrained ones.
Penalty method: a concrete example
| | Value |
|---|---|
| Objective | Minimize $x^2$ |
| Constraint | $x \ge 1$ |
| Unconstrained minimum | $x = 0$ (violates constraint) |
| With penalty, $\mu = 1$ | $x \approx 0.33$ (still violates, less) |
| With penalty, $\mu = 10$ | $x \approx 0.83$ (almost feasible) |
| With penalty, $\mu = 100$ | $x \approx 0.98$ (converging to… wait, $1$?) |
Actually, the constrained optimum here is $x^* = 1$, the boundary. The penalty pushes the solution from the unconstrained minimum at $x = 0$ toward the constraint boundary at $x = 1$, but approaches it from the infeasible side.
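These numbers are easy to check with a brute-force grid search (a crude numeric sketch; the $\mu/2$ scaling of the penalty is the convention used throughout this post):

```python
import numpy as np

# Minimize x^2 + (mu/2) * max(0, 1 - x)^2 over a fine grid on [0, 2]
xs = np.linspace(0.0, 2.0, 200001)
for mu in (1.0, 10.0, 100.0):
    vals = xs**2 + 0.5 * mu * np.maximum(0.0, 1.0 - xs)**2
    print(mu, round(float(xs[np.argmin(vals)]), 4))  # 0.3333, 0.8333, 0.9804
```

Each minimizer sits strictly below the boundary $x = 1$, creeping toward it as $\mu$ grows.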
From constrained to unconstrained
```mermaid
flowchart LR
    A["Original problem:<br/>minimize f subject to constraints"] --> B["Replace constraints<br/>with penalty or barrier term"]
    B --> C["Solve unconstrained<br/>penalized problem"]
    C --> D{"Solution close enough<br/>to feasible?"}
    D -->|"No"| E["Increase penalty weight"]
    E --> C
    D -->|"Yes"| F["Done: approximate<br/>constrained solution"]
```
Now let’s make this precise.
The core idea
Constrained optimization is harder than unconstrained optimization. You have a perfectly good objective function, but then there are constraints that fence in where you can go. Penalty and barrier methods sidestep this by folding the constraints into the objective itself. You solve a sequence of unconstrained (or simpler) problems, and the solutions gradually converge to the true constrained optimum.
There are two main families:
- Exterior penalty methods add a cost for violating constraints. They work from outside the feasible region, pushing solutions inward.
- Interior barrier methods add a cost for approaching constraint boundaries. They work from inside the feasible region, keeping solutions away from the walls.
Both transform one hard problem into many easier ones.
Exterior penalty methods
Setup
Consider a generic constrained problem:

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m, \qquad h_j(x) = 0, \; j = 1, \dots, p$$

The quadratic penalty approach replaces this with:

$$\phi_\mu(x) = f(x) + \frac{\mu}{2} \left( \sum_{i=1}^m \max(0, g_i(x))^2 + \sum_{j=1}^p h_j(x)^2 \right)$$

Here $\mu > 0$ is the penalty parameter. When a constraint is satisfied, its penalty term is zero. When it is violated, you pay a cost proportional to $\mu$ times the squared violation.
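In code, the penalized objective is just a wrapper around $f$ (a sketch; `quad_penalty_obj` is a name I made up, not a library function):

```python
def quad_penalty_obj(f, gs, hs, mu):
    """Build phi(x) = f(x) + (mu/2)*(sum_i max(0, g_i(x))^2 + sum_j h_j(x)^2)."""
    def phi(x):
        ineq = sum(max(0.0, g(x)) ** 2 for g in gs)  # inequality violations
        eq = sum(h(x) ** 2 for h in hs)              # equality residuals
        return f(x) + 0.5 * mu * (ineq + eq)
    return phi

# min x^2 s.t. x >= 1: at x = 0 the constraint g(x) = 1 - x <= 0 is violated by 1
phi = quad_penalty_obj(lambda x: x**2, gs=[lambda x: 1.0 - x], hs=[], mu=2.0)
print(phi(0.0))  # 0 + (2/2) * 1^2 = 1.0
```

Note that `phi` is an ordinary unconstrained objective: any smooth solver can minimize it.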
How it works
- Pick an initial $\mu$ (say $\mu = 1$).
- Solve the unconstrained penalized problem to get $x^*(\mu)$.
- Increase $\mu$: set $\mu \leftarrow 10\mu$.
- Solve again, warm-starting from $x^*(\mu)$.
- Repeat until the constraint violations are small enough.
As $\mu \to \infty$, the solution converges to the true constrained optimum. The catch: large $\mu$ makes the penalized problem ill-conditioned. The Hessian of the penalty term grows with $\mu$, so the subproblem gets harder to solve numerically.
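Those five steps, sketched with SciPy's BFGS as the subproblem solver (a minimal single-inequality version, assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

def penalty_loop(f, g, x0, mu=1.0, beta=10.0, tol=1e-5, max_outer=10):
    """Quadratic penalty outer loop for min f(x) s.t. g(x) <= 0."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_outer):
        phi = lambda x, mu=mu: f(x) + 0.5 * mu * max(0.0, g(x)) ** 2
        x = minimize(phi, x).x         # warm start: x from the previous round
        if max(0.0, g(x)) < tol:
            break                      # feasible to tolerance
        mu *= beta                     # crank up the penalty
    return x

# min x^2  s.t.  x >= 1  (constraint written as g(x) = 1 - x <= 0)
print(penalty_loop(lambda x: x[0] ** 2, lambda x: 1.0 - x[0], x0=[0.0]))
```

Reusing the previous `x` as the starting point for `minimize` is exactly the warm start from step 4.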
Why quadratic?
You could penalize $\sum_i \max(0, g_i(x))$ directly (an $\ell_1$ penalty), but that function has a kink wherever a constraint becomes active. Squaring gives $\max(0, g_i(x))^2$, which is continuously differentiable everywhere; only its second derivative jumps at the boundary $g_i(x) = 0$. Smoothness lets you use gradient descent or Newton's method on the subproblems without worrying about kinks.
Example 1: Quadratic penalty on a simple problem
Problem:

$$\min_x \; x^2 \quad \text{subject to} \quad x \ge 1$$

The unconstrained minimum is $x = 0$, but the constraint forces $x \ge 1$. The true constrained optimum is $x^* = 1$.

Rewrite the constraint as $g(x) = 1 - x \le 0$.

Penalized objective:

$$\phi_\mu(x) = x^2 + \frac{\mu}{2} \max(0, 1 - x)^2$$

For $x < 1$, the penalty fires and we get:

$$\phi_\mu(x) = x^2 + \frac{\mu}{2} (1 - x)^2$$

Take the derivative and set it to zero:

$$2x - \mu(1 - x) = 0 \quad \Longrightarrow \quad x^*(\mu) = \frac{\mu}{\mu + 2}$$

Step-by-step for different $\mu$ values:

$\mu = 1$: $x^*(1) = 1/3 \approx 0.333$

Still far from the boundary. Constraint violation: $\approx 0.667$.

$\mu = 10$: $x^*(10) = 10/12 \approx 0.833$

Closer. Violation: $\approx 0.167$.

$\mu = 100$: $x^*(100) = 100/102 \approx 0.980$

Almost there. Violation: $\approx 0.020$.

$\mu = 1000$: $x^*(1000) = 1000/1002 \approx 0.998$

Converging to $x^* = 1$. Notice that you never exactly hit the constraint boundary at any finite $\mu$. You approach it from the infeasible side.
As the penalty parameter mu grows, the penalized objective increasingly discourages constraint violations, pushing the minimum toward the feasible region (x ≥ 1).
Interior barrier methods
The opposite approach
Instead of allowing constraint violations and penalizing them, barrier methods stay strictly inside the feasible region and add a term that blows up as you approach the boundary.
The most common choice is the logarithmic barrier:

$$B_\mu(x) = f(x) - \mu \sum_{i=1}^m \log(-g_i(x))$$

Here $\mu > 0$ is the barrier parameter. The term $-\log(-g_i(x))$ goes to $+\infty$ as $g_i(x) \to 0^-$ (approaching the boundary from inside). So the barrier repels you from the walls.
How it works
- Start with a strictly feasible point (this is required, and sometimes finding one is its own problem).
- Pick a large initial $\mu$ (say $\mu = 1$).
- Solve the barrier problem to get $x^*(\mu)$.
- Decrease $\mu$: set $\mu \leftarrow \mu/10$.
- Solve again, warm-starting from $x^*(\mu)$.
- Repeat until $\mu$ is small enough.
As $\mu \to 0^+$, the barrier term vanishes and $x^*(\mu)$ converges to the constrained optimum. The collection of solutions $\{x^*(\mu) : \mu > 0\}$ traces out the central path, a smooth curve through the interior that leads to the optimum.
Example 2: Log barrier on the same problem
Problem:

$$\min_x \; x^2 \quad \text{subject to} \quad x \ge 1$$

Rewrite as $g(x) = 1 - x \le 0$ (so $g(x) < 0$ in the interior, where $x > 1$).

Barrier objective:

$$B_\mu(x) = x^2 - \mu \log(x - 1)$$

Take the derivative:

$$2x - \frac{\mu}{x - 1} = 0 \quad \Longrightarrow \quad 2x^2 - 2x - \mu = 0$$

Using the quadratic formula (and taking the root with $x > 1$):

$$x^*(\mu) = \frac{1 + \sqrt{1 + 2\mu}}{2}$$

Step-by-step for different $\mu$ values:

$\mu = 1$: $x^*(1) = (1 + \sqrt{3})/2 \approx 1.366$

The barrier pushes us away from the boundary $x = 1$.

$\mu = 0.1$: $x^*(0.1) \approx 1.048$

Getting closer to the boundary.

$\mu = 0.01$: $x^*(0.01) \approx 1.005$

$\mu = 0.001$: $x^*(0.001) \approx 1.0005$

Converging to $x^* = 1$ from the feasible side. Unlike the penalty method, every iterate satisfies the constraint strictly.
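These iterates are easy to reproduce numerically, assuming SciPy's bounded scalar minimizer for the barrier subproblems:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Central path for: min x^2 s.t. x >= 1, barrier objective x^2 - mu*log(x - 1)
for mu in (1.0, 0.1, 0.01, 0.001):
    obj = lambda x, mu=mu: x**2 - mu * np.log(x - 1.0)
    # Search just inside the strict interior x > 1
    x = minimize_scalar(obj, bounds=(1.0 + 1e-12, 10.0), method='bounded').x
    print(f"mu={mu:<6} x={x:.4f}")
```

The upper bound 10.0 is an arbitrary box for the bounded method; any value comfortably above the solution works.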
Parameter scheduling
Both methods require a sequence of parameter values. How fast should you increase $\mu$ (penalty) or decrease it (barrier)?
| Strategy | Penalty | Barrier | Tradeoff |
|---|---|---|---|
| Conservative | Multiply by 2 each round | Divide by 2 | More subproblems, each is well-conditioned |
| Aggressive | Multiply by 10 each round | Divide by 10 | Fewer subproblems, each is harder to solve |
| Adaptive | Increase based on constraint violation | Decrease based on duality gap | Best of both, more complex to implement |
A common practical choice is to multiply by 10 (or divide by 10) each round. If the subproblem solver struggles, back off to a factor of 2.
Penalty parameter increasing sequence
```mermaid
graph LR
    A["mu = 1<br/>Solution far from feasible"] --> B["mu = 10<br/>Solution closer"]
    B --> C["mu = 100<br/>Nearly feasible"]
    C --> D["mu = 1000<br/>Approximately optimal"]
    style A fill:#ffcccc
    style B fill:#fff3cd
    style C fill:#ccffcc
    style D fill:#ccffcc
```
The full penalty method iteration
```mermaid
flowchart TD
    A["Choose initial penalty weight mu"] --> B["Solve unconstrained penalized problem"]
    B --> C["Check constraint violation"]
    C --> D{"Violation small enough?"}
    D -->|"Yes"| E["Return solution"]
    D -->|"No"| F["Increase mu by factor of 10"]
    F --> B
```
Example 3: Penalty method on a 2D problem
Problem:
The unconstrained optimum is , but the constraint makes that infeasible. By symmetry, the constrained optimum is .
Constraint: .
Penalized objective:
where the penalty fires whenever .
Take partial derivatives and set to zero:
Subtracting the two equations gives , so . By symmetry, this makes sense. Substitute :
: , so . Violation: .
: . Violation: .
: . Violation: .
: . Violation: .
The sequence converges to the optimum .
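A quick numeric check of this 2-D example with plain gradient descent on the penalized objective (the $\mu$-dependent step size is a stability heuristic of mine, not part of the method):

```python
import numpy as np

# min x^2 + y^2  s.t.  x + y >= 2, penalized with (mu/2) * max(0, 2 - x - y)^2
def solve_2d(mu, iters=5000):
    z = np.zeros(2)                # start at the unconstrained optimum (0, 0)
    lr = 0.5 / (1.0 + mu)          # heuristic: shrink the step as mu grows
    for _ in range(iters):
        v = max(0.0, 2.0 - z[0] - z[1])        # constraint violation
        grad = 2.0 * z - mu * v * np.ones(2)   # gradient of the penalized objective
        z = z - lr * grad
    return z

for mu in (1.0, 10.0, 100.0):
    x, y = solve_2d(mu)
    print(f"mu={mu:<6} (x, y) = ({x:.3f}, {y:.3f})")  # both -> mu/(1+mu)
```

A fixed step size would diverge here for large $\mu$, since the penalty term's curvature grows with $\mu$; shrinking the step with $\mu$ keeps the iteration stable.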
Contour plot of the penalized objective (mu = 10) for min x^2 + y^2 subject to x + y ≥ 2. The penalty warps the contours and shifts the minimum toward the constraint boundary at (1,1).
Comparing penalty and barrier
Penalty vs barrier: two paths to the same answer
```mermaid
graph TD
    A["Constrained problem"] --> B["Penalty method"]
    A --> C["Barrier method"]
    B --> D["Start anywhere<br/>Allow violations<br/>Increase penalty weight"]
    C --> E["Start inside feasible region<br/>Stay feasible always<br/>Decrease barrier weight"]
    D --> F["Approach optimum from outside"]
    E --> G["Approach optimum from inside"]
    F --> H["Constrained optimum"]
    G --> H
    style D fill:#ffcccc
    style E fill:#ccffcc
```
| Aspect | Exterior penalty | Interior barrier |
|---|---|---|
| Starting point | Any point (feasible or not) | Must be strictly feasible |
| Iterates | Infeasible, converge to feasible | Always feasible |
| Parameter direction | $\mu \to \infty$ | $\mu \to 0$ |
| Conditioning | Worsens as $\mu$ grows | Worsens as $\mu$ shrinks |
| Constraint types | Equality and inequality | Inequality only (need strict interior) |
When to use which
Use penalty methods when:
- Finding a feasible starting point is hard.
- You have equality constraints (barrier methods cannot handle these directly).
- Approximate feasibility is acceptable early on.
Use barrier methods when:
- You need every iterate to be feasible (important in some engineering applications).
- You are solving a convex problem where the interior is easy to find.
- You want to build toward interior point methods (the topic of the next article).
Augmented Lagrangian: the best of both worlds
Pure penalty methods converge slowly because exact feasibility is only reached in the limit $\mu \to \infty$. The augmented Lagrangian method fixes this by combining a Lagrange multiplier estimate $\lambda$ with a moderate penalty. For equality constraints $h_j(x) = 0$:

$$\mathcal{L}_A(x, \lambda; \mu) = f(x) + \sum_j \lambda_j h_j(x) + \frac{\mu}{2} \sum_j h_j(x)^2$$

The key insight: by updating $\lambda$ after each subproblem solve, you can achieve convergence with bounded $\mu$. No need to drive $\mu$ to infinity. This keeps the subproblems well-conditioned.

Update rule after solving for $x^*$ with the current $\lambda_k$:

$$\lambda_{k+1, j} = \lambda_{k, j} + \mu \, h_j(x^*)$$

This is sometimes called the "method of multipliers." It converges much faster in practice than the plain quadratic penalty.
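Here is the multiplier update in action on a toy equality-constrained problem (my own example, with the subproblem solved in closed form by symmetry):

```python
# Method of multipliers on:  min x^2 + y^2  s.t.  x + y = 2
# Optimum is (1, 1) with multiplier lambda* = -2 (from 2x + lambda = 0 at x = 1).
mu, lam = 1.0, 0.0
for _ in range(40):
    # Minimizer of the augmented Lagrangian; by symmetry x = y, so
    # 2x + lam + mu*(2x - 2) = 0  =>  x = (2*mu - lam) / (2 + 2*mu)
    x = (2.0 * mu - lam) / (2.0 + 2.0 * mu)
    h = 2.0 * x - 2.0              # equality residual h = x + y - 2
    lam = lam + mu * h             # multiplier update
print(x, lam)  # x -> 1, lam -> -2, with mu held fixed at 1 throughout
```

Note that $\mu$ never grows: the multiplier updates alone drive the residual to zero, which is exactly the conditioning advantage over the pure penalty.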
Connection to KKT conditions
As the penalty parameter increases (or the barrier parameter decreases), the solutions $x^*(\mu)$ approach a point that satisfies the KKT conditions. The penalty/barrier terms effectively create approximate dual variables.

For the log barrier, the relationship is especially clean. At the barrier solution $x^*(\mu)$, the approximate dual variable for constraint $i$ is:

$$\lambda_i(\mu) = \frac{\mu}{-g_i(x^*(\mu))}$$

As $\mu \to 0$, these converge to the true KKT multipliers. This connection is the foundation of interior point methods.
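As a concrete check, take $\min x^2$ s.t. $x \ge 1$ (so $g(x) = 1 - x$): the barrier solution has a closed form, and the recovered multipliers approach the true value $\lambda^* = 2$ (from stationarity $2x - \lambda = 0$ at $x = 1$):

```python
import numpy as np

# Barrier solution for min x^2 s.t. x >= 1: x*(mu) = (1 + sqrt(1 + 2*mu)) / 2
for mu in (1.0, 0.1, 0.01, 0.001):
    x = (1.0 + np.sqrt(1.0 + 2.0 * mu)) / 2.0
    lam = mu / (x - 1.0)           # approximate dual: mu / (-g(x*))
    print(f"mu={mu:<6} x={x:.4f} lambda={lam:.4f}")
# lambda -> 2, the true KKT multiplier
```

Algebraically, $\lambda(\mu) = 1 + \sqrt{1 + 2\mu}$ here, which tends to $2$ as $\mu \to 0$.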
Practical tips
- Warm-starting matters. Always initialize each subproblem from the previous solution. This can cut total iterations dramatically.
- Don't increase $\mu$ too fast. If you jump straight from $\mu = 1$ to $\mu = 10^6$, the first subproblem is easy but useless, and the second is extremely hard. Gradual increases with warm starts work better.
- Monitor both objective and feasibility. Plot the objective value and the maximum constraint violation across outer iterations. You want both to converge.
- For barrier methods, finding a feasible start can be done with a Phase I method: minimize a slack variable $s$ subject to $g_i(x) \le s$ for all $i$. If you can drive $s$ below zero, you have a strictly feasible point.
Python implementation sketch
```python
import numpy as np

def quadratic_penalty(f, grad_f, g, grad_g, x0, mu0=1.0, beta=10.0,
                      tol=1e-6, max_outer=20, max_inner=500, lr0=0.5):
    """Solve min f(x) s.t. g(x) <= 0 using the quadratic penalty
    phi(x) = f(x) + (mu/2) * max(0, g(x))**2."""
    x = np.asarray(x0, dtype=float)
    mu = mu0
    for _ in range(max_outer):
        # Shrink the step size as mu grows: the penalty term's curvature
        # scales with mu, and a fixed step would diverge for large mu.
        lr = lr0 / (1.0 + mu)
        # Solve the penalized subproblem with gradient descent,
        # warm-started from the previous solution.
        for _ in range(max_inner):
            violation = max(0.0, g(x))
            grad = grad_f(x) + mu * violation * grad_g(x)
            x = x - lr * grad
        if max(0.0, g(x)) < tol:
            break          # feasible to tolerance: done
        mu *= beta         # otherwise increase the penalty weight
    return x

# Example: min (x-3)^2 s.t. x <= 1
f = lambda x: (x - 3)**2
grad_f = lambda x: 2*(x - 3)
g = lambda x: x - 1          # constraint as g(x) = x - 1 <= 0
grad_g = lambda x: 1.0

x_opt = quadratic_penalty(f, grad_f, g, grad_g, x0=3.0)
print(f"Solution: x = {float(x_opt):.4f}")  # close to 1.0
```
What comes next
Barrier methods with the log barrier are the foundation of interior point methods, which are among the most powerful algorithms in modern optimization. In the next article, we will see how the central path idea leads to practical algorithms that solve linear and convex programs in polynomial time.