
Penalty and barrier methods

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should be comfortable with KKT conditions before reading this post. We will reference the Lagrangian frequently, so a quick refresher on that will help too.


The fence around the forbidden zone

Imagine you are searching for the lowest point in a landscape, but certain areas are off limits. A fence surrounds the forbidden zones. You cannot cross the fence.

One approach: ignore the fence, find the lowest point anywhere, then check whether you ended up in a forbidden zone. If you did, add an electric charge to the fence that shocks you proportionally to how far you trespass. Crank up the voltage and search again. Eventually, the shock becomes so painful that you stay on the legal side. That is the penalty method.

Another approach: stay inside the legal zone from the start. As you walk toward the fence, the ground slopes upward more and more steeply, repelling you from the boundary. The closer you get, the steeper the repulsion. That is the barrier method.

Both approaches convert a constrained problem into a sequence of unconstrained ones.

Penalty method: a concrete example

|  | Value |
| --- | --- |
| Objective | Minimize $(x - 3)^2$ |
| Constraint | $x \le 1$ |
| Unconstrained minimum | $x = 3$ (violates constraint) |
| With penalty, $\mu = 10$ | $x \approx 1.33$ (still violates, but less) |
| With penalty, $\mu = 100$ | $x \approx 1.04$ (almost feasible) |
| With penalty, $\mu = 1000$ | $x \approx 1.004$ (converging to $x = 1$) |

The constrained optimum here is $x = 1$, the boundary. The penalty pushes the solution from the unconstrained minimum at $x = 3$ toward the constraint boundary at $x = 1$, but approaches it from the infeasible side.

From constrained to unconstrained

flowchart LR
  A["Original problem:<br/>minimize f subject to constraints"] --> B["Replace constraints<br/>with penalty or barrier term"]
  B --> C["Solve unconstrained<br/>penalized problem"]
  C --> D{"Solution close enough<br/>to feasible?"}
  D -->|"No"| E["Increase penalty weight"]
  E --> C
  D -->|"Yes"| F["Done: approximate<br/>constrained solution"]

Now let’s make this precise.


The core idea

Constrained optimization is harder than unconstrained optimization. You have a perfectly good objective function, but then there are constraints that fence in where you can go. Penalty and barrier methods sidestep this by folding the constraints into the objective itself. You solve a sequence of unconstrained (or simpler) problems, and the solutions gradually converge to the true constrained optimum.

There are two main families:

  • Exterior penalty methods add a cost for violating constraints. They work from outside the feasible region, pushing solutions inward.
  • Interior barrier methods add a cost for approaching constraint boundaries. They work from inside the feasible region, keeping solutions away from the walls.

Both transform one hard problem into many easier ones.


Exterior penalty methods

Setup

Consider a generic constrained problem:

$$\min_{x} f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad i = 1, \dots, m$$

The quadratic penalty approach replaces this with:

$$\min_{x} \; f(x) + \frac{\mu}{2} \sum_{i=1}^{m} \bigl[\max(0, g_i(x))\bigr]^2$$

Here $\mu > 0$ is the penalty parameter. When a constraint $g_i(x) \le 0$ is satisfied, its penalty term is zero. When it is violated, you pay a cost proportional to $\mu$ times the squared violation.

How it works

  1. Pick an initial $\mu_0$ (say 1).
  2. Solve the unconstrained penalized problem to get $x^*(\mu_0)$.
  3. Increase $\mu$: set $\mu_1 = 10\mu_0$.
  4. Solve again, warm-starting from $x^*(\mu_0)$.
  5. Repeat until the constraint violations are small enough.

As $\mu \to \infty$, the solution $x^*(\mu)$ converges to the true constrained optimum. The catch: large $\mu$ makes the penalized problem ill-conditioned. The Hessian of the penalty term grows with $\mu$, so the subproblem gets harder to solve numerically.

Why quadratic?

You could use $\max(0, g_i(x))$ directly (an $\ell_1$ penalty), but it has a kink at the boundary $g_i(x) = 0$. Squaring makes the penalty continuously differentiable everywhere, including at the boundary (only its second derivative jumps there). Smoothness lets you use gradient descent or Newton's method on the subproblems without worrying about kinks.
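To see the difference concretely, here is a tiny check (plain Python; the helper names are just for illustration) comparing one-sided difference quotients of the two penalties at the boundary $g = 0$:

```python
# Compare the l1 penalty max(0, g) and the quadratic penalty max(0, g)^2
# at the constraint boundary g = 0 using one-sided difference quotients.
def p_l1(g): return max(0.0, g)
def p_quad(g): return max(0.0, g) ** 2

h = 1e-6
for name, p in [("l1", p_l1), ("quadratic", p_quad)]:
    left = (p(0.0) - p(-h)) / h    # slope just left of the boundary
    right = (p(h) - p(0.0)) / h    # slope just right of the boundary
    print(f"{name:9s}  left slope = {left:.6f}  right slope = {right:.6f}")
```

The $\ell_1$ slope jumps from 0 to 1 across the boundary, while the quadratic slope is (essentially) 0 on both sides: the kink is gone.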


Example 1: Quadratic penalty on a simple problem

Problem:

$$\min_{x} \; (x - 3)^2 \quad \text{subject to} \quad x \le 1$$

The unconstrained minimum is $x = 3$, but the constraint forces $x \le 1$. The true constrained optimum is $x^* = 1$.

Rewrite the constraint as $g(x) = x - 1 \le 0$.

Penalized objective:

$$P(x; \mu) = (x - 3)^2 + \frac{\mu}{2} [\max(0, x - 1)]^2$$

For $x > 1$, the penalty fires and we get:

$$P(x; \mu) = (x - 3)^2 + \frac{\mu}{2}(x - 1)^2$$

Take the derivative and set it to zero:

$$\frac{dP}{dx} = 2(x - 3) + \mu(x - 1) = 0$$
$$2x - 6 + \mu x - \mu = 0$$
$$x(2 + \mu) = 6 + \mu$$
$$x^*(\mu) = \frac{6 + \mu}{2 + \mu}$$

Step-by-step for different $\mu$ values:

$\mu = 1$:

$$x^*(1) = \frac{6 + 1}{2 + 1} = \frac{7}{3} \approx 2.333$$

Still far from the boundary. Constraint violation: $x - 1 = 1.333$.

$\mu = 10$:

$$x^*(10) = \frac{6 + 10}{2 + 10} = \frac{16}{12} \approx 1.333$$

Closer. Violation: $0.333$.

$\mu = 100$:

$$x^*(100) = \frac{6 + 100}{2 + 100} = \frac{106}{102} \approx 1.039$$

Almost there. Violation: $0.039$.

$\mu = 1000$:

$$x^*(1000) = \frac{1006}{1002} \approx 1.004$$

Converging to $x^* = 1$. Notice that you never exactly hit the constraint boundary at any finite $\mu$. You approach it from the infeasible side.

As the penalty parameter $\mu$ grows, the penalized objective increasingly discourages constraint violations, pulling the minimum toward the feasible region $x \le 1$.
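The closed form $x^*(\mu) = (6+\mu)/(2+\mu)$ is easy to sanity-check numerically. Since the penalized objective is convex, its derivative is increasing, so a simple bisection finds the minimizer (a plain-Python sketch; the helper name is my own):

```python
def penalized_min(mu, lo=0.0, hi=10.0, iters=60):
    """Minimize P(x; mu) = (x-3)^2 + mu/2 * max(0, x-1)^2 by
    bisection on its monotonically increasing derivative."""
    dP = lambda x: 2 * (x - 3) + mu * max(0.0, x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dP(mid) < 0:
            lo = mid          # minimizer lies to the right
        else:
            hi = mid          # minimizer lies to the left
    return 0.5 * (lo + hi)

for mu in [1, 10, 100, 1000]:
    print(f"mu={mu:5d}  numeric={penalized_min(mu):.4f}  "
          f"formula={(6 + mu) / (2 + mu):.4f}")
```

The numeric minimizers match the formula for every $\mu$ in the table above.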


Interior barrier methods

The opposite approach

Instead of allowing constraint violations and penalizing them, barrier methods stay strictly inside the feasible region and add a term that blows up as you approach the boundary.

The most common choice is the logarithmic barrier:

$$\min_{x} \; f(x) - t \sum_{i=1}^{m} \ln(-g_i(x))$$

Here $t > 0$ is the barrier parameter. The term $-\ln(-g_i(x))$ goes to $+\infty$ as $g_i(x) \to 0^-$ (approaching the boundary from inside). So the barrier repels you from the walls.

How it works

  1. Start with a feasible point $x_0$ (this is required, and sometimes finding one is its own problem).
  2. Pick a large initial $t_0$ (say $t_0 = 1$).
  3. Solve the barrier problem to get $x^*(t_0)$.
  4. Decrease $t$: set $t_1 = t_0 / 10$.
  5. Solve again, warm-starting from $x^*(t_0)$.
  6. Repeat until $t$ is small enough.

As $t \to 0^+$, the barrier term vanishes and $x^*(t)$ converges to the constrained optimum. The collection of solutions $\{x^*(t) : t > 0\}$ traces out the central path, a smooth curve through the interior that leads to the optimum.


Example 2: Log barrier on a bounded problem

Problem:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x \ge 1$$

Rewrite as $g(x) = 1 - x \le 0$ (so $-g(x) = x - 1 > 0$ in the interior).

Barrier objective:

$$B(x; t) = x^2 - t \ln(x - 1)$$

Take the derivative:

$$\frac{dB}{dx} = 2x - \frac{t}{x - 1} = 0$$
$$2x(x - 1) = t$$
$$2x^2 - 2x - t = 0$$

Using the quadratic formula (keeping the root with $x > 1$):

$$x^*(t) = \frac{2 + \sqrt{4 + 8t}}{4} = \frac{1 + \sqrt{1 + 2t}}{2}$$

Step-by-step for different $t$ values:

$t = 2$:

$$x^*(2) = \frac{1 + \sqrt{1 + 4}}{2} = \frac{1 + \sqrt{5}}{2} \approx \frac{1 + 2.236}{2} \approx 1.618$$

The barrier pushes us away from the boundary $x = 1$.

$t = 0.5$:

$$x^*(0.5) = \frac{1 + \sqrt{1 + 1}}{2} = \frac{1 + \sqrt{2}}{2} \approx \frac{1 + 1.414}{2} \approx 1.207$$

Getting closer to the boundary.

$t = 0.1$:

$$x^*(0.1) = \frac{1 + \sqrt{1 + 0.2}}{2} = \frac{1 + \sqrt{1.2}}{2} \approx \frac{1 + 1.095}{2} \approx 1.048$$

$t = 0.01$:

$$x^*(0.01) = \frac{1 + \sqrt{1.02}}{2} \approx \frac{1 + 1.010}{2} \approx 1.005$$

Converging to $x^* = 1$ from the feasible side. Unlike the penalty method, every iterate satisfies the constraint strictly.
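The same bisection idea checks the barrier solutions, with one pleasant property: every point the search touches lies in the strictly feasible interval $(1, 5]$ (a small sketch; the bracket and helper name are my own choices):

```python
import math

def barrier_min(t, lo=1.0 + 1e-9, hi=5.0, iters=60):
    """Minimize B(x; t) = x^2 - t*ln(x - 1) by bisection on its
    increasing derivative dB/dx = 2x - t/(x - 1), staying inside x > 1."""
    dB = lambda x: 2 * x - t / (x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dB(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for t in [2, 0.5, 0.1, 0.01]:
    print(f"t={t:5}  numeric={barrier_min(t):.4f}  "
          f"formula={(1 + math.sqrt(1 + 2 * t)) / 2:.4f}")
```

The numeric values match the closed form, and every intermediate point is strictly feasible, exactly as the method promises.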


Parameter scheduling

Both methods require a sequence of parameter values. How fast should you change $\mu$ or $t$?

| Strategy | Penalty $\mu$ | Barrier $t$ | Tradeoff |
| --- | --- | --- | --- |
| Conservative | Multiply by 2 each round | Divide by 2 | More subproblems, each is well-conditioned |
| Aggressive | Multiply by 10 each round | Divide by 10 | Fewer subproblems, each is harder to solve |
| Adaptive | Increase based on constraint violation | Decrease based on duality gap | Best of both, more complex to implement |

A common practical choice is to multiply $\mu$ by 10 (or divide $t$ by 10) each round. If the subproblem solver struggles, back off to a factor of 2.
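The adaptive row can be implemented with a simple rule; the particular heuristic below is an illustration in the spirit of augmented Lagrangian practice, not a universal standard. Raise $\mu$ only when the constraint violation fails to shrink enough between outer rounds:

```python
def update_mu(mu, prev_violation, curr_violation, shrink=0.25, factor=10.0):
    """If the violation did not drop to `shrink` times the previous one,
    the current penalty is too weak: increase it. Otherwise keep it."""
    if curr_violation > shrink * prev_violation:
        return mu * factor
    return mu

print(update_mu(1.0, 1.0, 0.5))   # slow progress: mu jumps to 10.0
print(update_mu(1.0, 1.0, 0.1))   # good progress: mu stays at 1.0
```

Keeping $\mu$ fixed while progress is good avoids needlessly ill-conditioned subproblems.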

Penalty parameter increasing sequence

graph LR
  A["mu = 1<br/>Solution far from feasible"] --> B["mu = 10<br/>Solution closer"]
  B --> C["mu = 100<br/>Nearly feasible"]
  C --> D["mu = 1000<br/>Approximately optimal"]
  style A fill:#ffcccc
  style B fill:#fff3cd
  style C fill:#ccffcc
  style D fill:#ccffcc

The full penalty method iteration

flowchart TD
  A["Choose initial penalty weight mu"] --> B["Solve unconstrained penalized problem"]
  B --> C["Check constraint violation"]
  C --> D{"Violation small enough?"}
  D -->|"Yes"| E["Return solution"]
  D -->|"No"| F["Increase mu by factor of 10"]
  F --> B

Example 3: Penalty method on a 2D problem

Problem:

$$\min_{x_1, x_2} \; (x_1 - 2)^2 + (x_2 - 2)^2 \quad \text{subject to} \quad x_1 + x_2 \le 2$$

The unconstrained optimum is $(2, 2)$, but the constraint $x_1 + x_2 \le 2$ makes that infeasible. By symmetry, the constrained optimum is $(1, 1)$.

Constraint: $g(x_1, x_2) = x_1 + x_2 - 2 \le 0$.

Penalized objective:

$$P(x_1, x_2; \mu) = (x_1 - 2)^2 + (x_2 - 2)^2 + \frac{\mu}{2}(x_1 + x_2 - 2)^2$$

where the penalty fires whenever $x_1 + x_2 > 2$.

Take partial derivatives and set to zero:

$$\frac{\partial P}{\partial x_1} = 2(x_1 - 2) + \mu(x_1 + x_2 - 2) = 0$$
$$\frac{\partial P}{\partial x_2} = 2(x_2 - 2) + \mu(x_1 + x_2 - 2) = 0$$

Subtracting the two equations gives $2(x_1 - 2) = 2(x_2 - 2)$, so $x_1 = x_2$. By symmetry, this makes sense. Substitute $x_1 = x_2 = x$:

$$2(x - 2) + \mu(2x - 2) = 0$$
$$2x - 4 + 2\mu x - 2\mu = 0$$
$$x(2 + 2\mu) = 4 + 2\mu$$
$$x^*(\mu) = \frac{4 + 2\mu}{2 + 2\mu} = \frac{2 + \mu}{1 + \mu}$$

$\mu = 1$: $x^* = \frac{3}{2} = 1.5$, so $(x_1, x_2) = (1.5, 1.5)$. Violation: $1.5 + 1.5 - 2 = 1.0$.

$\mu = 10$: $x^* = \frac{12}{11} \approx 1.091$. Violation: $0.182$.

$\mu = 100$: $x^* = \frac{102}{101} \approx 1.0099$. Violation: $0.0198$.

$\mu = 1000$: $x^* \approx 1.001$. Violation: $0.002$.

The sequence $(1.5, 1.5) \to (1.091, 1.091) \to (1.010, 1.010) \to (1.001, 1.001)$ converges to the optimum $(1, 1)$.

Contour plot of the penalized objective ($\mu = 10$) for $\min (x_1 - 2)^2 + (x_2 - 2)^2$ subject to $x_1 + x_2 \le 2$. The penalty warps the contours and shifts the minimum toward the constraint boundary at $(1, 1)$.
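You can reproduce the sequence above in a dozen lines of plain Python; the function name and the step-size rule (shrinking the step as $\mu$ grows, to keep the inner gradient descent stable against the growing penalized Hessian) are my own choices:

```python
def penalty_gd_2d(mu_schedule=(1, 10, 100, 1000), inner=500):
    """Quadratic penalty for min (x1-2)^2 + (x2-2)^2 s.t. x1 + x2 <= 2,
    with each subproblem solved by warm-started gradient descent."""
    x1, x2 = 2.0, 2.0              # start at the unconstrained optimum
    for mu in mu_schedule:
        lr = 0.5 / (1.0 + mu)      # penalized Hessian grows with mu
        for _ in range(inner):
            v = max(0.0, x1 + x2 - 2)        # constraint violation
            g1 = 2 * (x1 - 2) + mu * v
            g2 = 2 * (x2 - 2) + mu * v
            x1, x2 = x1 - lr * g1, x2 - lr * g2
        print(f"mu={mu:5d}  x=({x1:.4f}, {x2:.4f})")
    return x1, x2

x_final = penalty_gd_2d()
```

The printed iterates track the closed-form values $(2+\mu)/(1+\mu)$ round by round.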


Comparing penalty and barrier

Penalty vs barrier: two paths to the same answer

graph TD
  A["Constrained problem"] --> B["Penalty method"]
  A --> C["Barrier method"]
  B --> D["Start anywhere<br/>Allow violations<br/>Increase penalty weight"]
  C --> E["Start inside feasible region<br/>Stay feasible always<br/>Decrease barrier weight"]
  D --> F["Approach optimum from outside"]
  E --> G["Approach optimum from inside"]
  F --> H["Constrained optimum"]
  G --> H
  style D fill:#ffcccc
  style E fill:#ccffcc

| Aspect | Exterior penalty | Interior barrier |
| --- | --- | --- |
| Starting point | Any point (feasible or not) | Must be strictly feasible |
| Iterates | Infeasible, converge to feasible | Always feasible |
| Parameter direction | $\mu \to \infty$ | $t \to 0^+$ |
| Conditioning | Worsens as $\mu$ grows | Worsens as $t$ shrinks |
| Constraint types | Equality and inequality | Inequality only (need strict interior) |

When to use which

Use penalty methods when:

  • Finding a feasible starting point is hard.
  • You have equality constraints (barrier methods cannot handle these directly).
  • Approximate feasibility is acceptable early on.

Use barrier methods when:

  • You need every iterate to be feasible (important in some engineering applications).
  • You are solving a convex problem where the interior is easy to find.
  • You want to build toward interior point methods (the topic of the next article).

Augmented Lagrangian: the best of both worlds

Pure penalty methods converge slowly because you need $\mu \to \infty$ to get exact feasibility. The augmented Lagrangian method fixes this by combining a Lagrange multiplier estimate with a moderate penalty:

$$\mathcal{L}_A(x, \lambda; \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \frac{\mu}{2} \sum_{i=1}^{m} [g_i(x)]^2$$

The key insight: by updating $\lambda$ after each subproblem solve, you can achieve convergence with bounded $\mu$. No need to drive $\mu$ to infinity. This keeps the subproblems well-conditioned.

Update rule after solving with current $(\lambda^k, \mu^k)$:

$$\lambda_i^{k+1} = \lambda_i^k + \mu^k \, g_i(x^{k+1})$$

This is sometimes called the “method of multipliers.” (For inequality constraints, the update is clipped at zero, $\lambda_i^{k+1} = \max(0, \lambda_i^k + \mu^k g_i(x^{k+1}))$, since multipliers must stay nonnegative.) It converges much faster in practice than the plain quadratic penalty.
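As a minimal sketch, here is the method of multipliers on Example 1 ($\min (x-3)^2$ subject to $x \le 1$), whose constraint is active at the optimum with true multiplier $\lambda^* = 4$. The bisection subproblem solver, the fixed $\mu = 10$, and the zero-clipped update are my own simplifications for this one-dimensional case:

```python
def solve_subproblem(lam, mu, lo=0.0, hi=10.0, iters=60):
    """Minimize L_A(x) = (x-3)^2 + lam*(x-1) + mu/2 * max(0, x-1)^2
    by bisection on its increasing derivative."""
    dL = lambda x: 2 * (x - 3) + lam + mu * max(0.0, x - 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dL(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam, mu = 0.0, 10.0                     # mu stays bounded throughout
for _ in range(15):
    x = solve_subproblem(lam, mu)
    lam = max(0.0, lam + mu * (x - 1))  # clipped multiplier update
print(f"x = {x:.6f}, lambda = {lam:.4f}")
```

With $\mu$ fixed at 10, the iterates converge to $x = 1$ and $\lambda = 4$; the plain penalty method would need $\mu \to \infty$ for the same accuracy.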


Connection to KKT conditions

As the penalty parameter increases (or barrier parameter decreases), the solutions approach a point that satisfies the KKT conditions. The penalty/barrier terms effectively create approximate dual variables.

For the log barrier, the relationship is especially clean. At the barrier solution $x^*(t)$, the approximate dual variable for constraint $i$ is:

$$\lambda_i^*(t) = \frac{t}{-g_i(x^*(t))}$$

As $t \to 0$, these converge to the true KKT multipliers. This connection is the foundation of interior point methods.
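For Example 2 ($\min x^2$ subject to $x \ge 1$), Lagrangian stationarity at $x^* = 1$ gives $2x - \lambda = 0$, so the true multiplier is $\lambda^* = 2$. Plugging the closed-form barrier solution into the formula shows the convergence (a short check in plain Python):

```python
import math

# Example 2: min x^2 s.t. x >= 1, with g(x) = 1 - x.
# Barrier solution: x*(t) = (1 + sqrt(1 + 2t)) / 2.
for t in [1.0, 0.1, 0.01, 0.001]:
    x_t = (1 + math.sqrt(1 + 2 * t)) / 2
    lam_t = t / (x_t - 1)              # lambda(t) = t / (-g(x*(t)))
    print(f"t={t:6}  lambda(t)={lam_t:.4f}")
# lambda(t) approaches the true multiplier lambda* = 2 as t -> 0
```

A little algebra confirms it exactly: $\lambda(t) = 1 + \sqrt{1 + 2t}$, which tends to 2 as $t \to 0$.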


Practical tips

  1. Warm-starting matters. Always initialize each subproblem from the previous solution. This can cut total iterations dramatically.

  2. Don’t increase $\mu$ too fast. If you jump from $\mu = 1$ to $\mu = 10^6$, the first subproblem is easy but useless, and the second is extremely hard. Gradual increases with warm starts work better.

  3. Monitor both objective and feasibility. Plot the objective value and the maximum constraint violation across outer iterations. You want both to converge.

  4. For barrier methods, finding a feasible start can be done with a Phase I method: minimize a slack variable $s$ subject to $g_i(x) \le s$. If the optimal $s$ is negative, the corresponding $x$ is strictly feasible.
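A minimal Phase I sketch: minimizing the worst violation $m(x) = \max_i g_i(x)$ by subgradient descent amounts to minimizing the optimal slack, and you can stop as soon as the value goes negative. The constraint functions and step size below are illustrative:

```python
# Feasible set: g1(x) = 1 - x <= 0 and g2(x) = x - 3 <= 0, i.e. x in [1, 3].
def phase_one(x=10.0, lr=0.1, iters=1000):
    """Subgradient descent on m(x) = max(g1, g2); stop once m(x) < 0."""
    for _ in range(iters):
        g1, g2 = 1 - x, x - 3
        if max(g1, g2) < 0:
            return x                           # strictly feasible point
        x -= lr * (-1.0 if g1 >= g2 else 1.0)  # subgradient of active g
    return None                                # nothing found in budget

x_feasible = phase_one()
print(f"Strictly feasible start: x = {x_feasible}")
```

The returned point can then seed the barrier method's first subproblem.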


Python implementation sketch

def quadratic_penalty(f, grad_f, g, grad_g, x0, mu0=1.0, beta=10.0,
                      tol=1e-6, max_outer=20, max_inner=200, lr=0.5):
    """Solve min f(x) s.t. g(x) <= 0 using the quadratic penalty method."""
    x = float(x0)
    mu = mu0

    for _ in range(max_outer):
        # The penalized Hessian grows like mu, so shrink the step size
        # as mu grows; a fixed step would diverge once mu gets large.
        step = lr / (1.0 + mu)

        # Solve the penalized subproblem with gradient descent,
        # warm-started from the previous solution.
        for _ in range(max_inner):
            violation = max(0.0, g(x))
            grad = grad_f(x) + mu * violation * grad_g(x)
            x = x - step * grad

        if max(0.0, g(x)) < tol:
            break
        mu *= beta  # increase penalty

    return x

# Example: min (x-3)^2 s.t. x <= 1
f = lambda x: (x - 3)**2
grad_f = lambda x: 2 * (x - 3)
g = lambda x: x - 1          # x - 1 <= 0
grad_g = lambda x: 1.0

x_opt = quadratic_penalty(f, grad_f, g, grad_g, x0=3.0)
print(f"Solution: x = {x_opt:.4f}")  # should be close to 1.0

What comes next

Barrier methods with the log barrier are the foundation of interior point methods, which are among the most powerful algorithms in modern optimization. In the next article, we will see how the central path idea leads to practical algorithms that solve linear and convex programs in polynomial time.
