Convex sets and convex functions
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
If your optimization problem is convex, every local minimum is a global minimum. That single fact changes everything. You do not have to worry about getting stuck in a bad local minimum, and efficient algorithms with guarantees exist. If your problem is not convex, life gets harder.
The bowl vs the crumpled landscape
If your problem is convex, finding the answer is easy. If not, you might get stuck.
Picture a smooth bowl. Drop a marble anywhere on the rim and it rolls to the bottom. There is only one bottom, and every path leads there. That is a convex function.
Now picture a crumpled sheet of aluminum foil. Drop a marble and it settles into whichever dent is closest. That dent might be shallow while a deeper one exists elsewhere. The marble has no way to know. That is a non-convex function.
Convex vs non-convex landscape
graph TD
subgraph Convex["Convex (bowl shape)"]
A1["Any starting point"] --> A2["Rolls to the single global minimum"]
end
subgraph NonConvex["Non-convex (crumpled landscape)"]
B1["Start A"] --> B2["Trapped in local min 1"]
B3["Start B"] --> B4["Finds deeper local min 2"]
end
Convexity determines whether your optimizer can guarantee finding the best answer or might settle for a mediocre one.
| Problem | Convex? | Consequence |
|---|---|---|
| Linear regression (MSE) | Yes | Unique global solution, fast solvers |
| Logistic regression | Yes | Gradient descent finds the optimum |
| SVM (hinge loss) | Yes | Efficient quadratic programming |
| Ridge and Lasso regression | Yes | Regularization preserves convexity |
| Neural networks | No | Many local minima, saddle points |
| K-means clustering | No | Result depends on initialization |
A convex function curves upward like a bowl. Any point at the bottom is THE bottom. No restarts, no tricks.
Comparing a convex function (x^2) with a non-convex function (sin(3x) + x^2/3)
Now let’s define convexity precisely.
Prerequisites
You should be comfortable with:
- What optimization is and why ML needs it
- Jacobians and Hessians, specifically computing second-order derivatives and checking positive definiteness
Convex sets
A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $x, y \in C$ and any $\theta \in [0, 1]$:

$$\theta x + (1 - \theta) y \in C$$
In plain terms: pick any two points in the set, draw a straight line between them, and every point on that line is also in the set. If you can find even one pair of points where the connecting line segment leaves the set, the set is not convex.
The line segment test
graph LR
subgraph Convex["Convex set"]
A["Point x"] -- "Entire line stays inside" --> B["Point y"]
end
subgraph NotConvex["Non-convex set"]
C["Point x"] -- "Line exits the set" --> D["Point y"]
end
Common convex sets
- The real line $\mathbb{R}$: trivially convex.
- Hyperplanes: $\{x : a^\top x = b\}$. Any line between two points on a plane stays on that plane.
- Halfspaces: $\{x : a^\top x \le b\}$.
- Balls: $\{x : \|x - x_c\| \le r\}$.
- Polyhedra: intersections of halfspaces. The feasible region of a linear program is a polyhedron.
Non-convex set example
The set $\{x \in \mathbb{R}^2 : \|x\| \ge 1\}$ (everything outside the unit circle) is not convex. Take the points $(1, 0)$ and $(-1, 0)$. Their midpoint is $(0, 0)$, which has $\|(0, 0)\| = 0 < 1$, so it is outside the set.
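This counterexample takes a few lines to check numerically. A minimal sketch (the helper name `in_set` is illustrative):

```python
import numpy as np

# Line segment test for the set {x : ||x|| >= 1}, i.e. everything
# outside the unit circle. Both endpoints are in the set, but their
# midpoint is not, so the set fails the test.
in_set = lambda p: np.linalg.norm(p) >= 1

x = np.array([1.0, 0.0])
y = np.array([-1.0, 0.0])
midpoint = 0.5 * x + 0.5 * y   # (0, 0)

print(in_set(x), in_set(y))    # True True
print(in_set(midpoint))        # False -> the set is not convex
```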
Intersection preserves convexity
A useful property: the intersection of any number of convex sets is convex. If $C_1, \dots, C_k$ are all convex, then $C_1 \cap \dots \cap C_k$ is convex. This is why constraints in optimization work well when each constraint defines a convex set.
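A randomized spot-check of this property, using the intersection of two halfspaces (the coefficients and sample counts below are arbitrary illustrative choices):

```python
import numpy as np

# Sample points inside the intersection of two halfspaces
# {x : a1.x <= b1} and {x : a2.x <= b2}, then verify that random
# points on segments between members stay inside the intersection.
rng = np.random.default_rng(0)
a1, b1 = np.array([1.0, 1.0]), 1.0
a2, b2 = np.array([1.0, -2.0]), 2.0
in_both = lambda p: a1 @ p <= b1 and a2 @ p <= b2

points = [p for p in rng.uniform(-5, 5, size=(2000, 2)) if in_both(p)]

ok = True
for _ in range(500):
    x = points[rng.integers(len(points))]
    y = points[rng.integers(len(points))]
    t = rng.uniform()
    ok &= in_both(t * x + (1 - t) * y)   # line segment test
print(ok)  # True: every sampled segment point stayed inside
```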
Convex functions
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and for all $x, y$ in the domain and $\theta \in [0, 1]$:

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$$

Geometrically, this says that the line segment connecting $(x, f(x))$ to $(y, f(y))$ lies on or above the graph of $f$. The function “curves upward.” It never has a bump or a dip that would create a local minimum that is not also global.

A function is strictly convex if the inequality is strict ($<$ instead of $\le$) for all $x \ne y$ and $\theta \in (0, 1)$. Strictly convex functions have at most one global minimum.

A function $f$ is concave if $-f$ is convex.
Jensen’s inequality
The definition above is actually Jensen’s inequality for two points. The general form says: for a convex function $f$ and any convex combination with weights $\theta_i \ge 0$, $\sum_i \theta_i = 1$:

$$f\left(\sum_i \theta_i x_i\right) \le \sum_i \theta_i f(x_i)$$

In the probabilistic form: if $X$ is a random variable and $f$ is convex, then:

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$
Jensen’s inequality: the chord lies above the curve
graph TD
A["Pick two points x and y on the domain"] --> B["Evaluate f at x and f at y"]
B --> C["Draw a chord connecting f(x) to f(y)"]
C --> D["For a convex f, the chord always lies on or above the curve"]
D --> E["f(average of inputs) is at most the average of f(inputs)"]
This inequality shows up constantly in ML, especially in variational inference and information theory.
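The probabilistic form is easy to sanity-check with Monte Carlo sampling. A sketch assuming $f(x) = e^x$ and a standard normal $X$, both chosen purely for illustration:

```python
import numpy as np

# Jensen's inequality in probabilistic form: f(E[X]) <= E[f(X)]
# for convex f. Here f = exp and X ~ Normal(0, 1).
rng = np.random.default_rng(42)
samples = rng.normal(0.0, 1.0, size=100_000)

lhs = np.exp(samples.mean())    # f(E[X]), approximately e^0 = 1
rhs = np.exp(samples).mean()    # E[f(X)], approximately e^{1/2} ~ 1.65
print(lhs <= rhs)  # True
```

The gap between the two sides is exactly what makes the evidence lower bound (ELBO) in variational inference a *lower* bound.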
Three ways to check convexity
Method 1: Definition (line segment test)
Plug in the definition directly. This is often tedious but always works.
Method 2: First-order condition
If $f$ is differentiable, $f$ is convex if and only if:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y$$

This says the tangent line (or tangent hyperplane) at any point lies on or below the function. The gradient gives you a global underestimator.
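The global-underestimator property can be probed numerically. A sketch for the one-dimensional case $f(x) = x^2$, whose tangent at $x$ is $f(x) + 2x(y - x)$:

```python
import numpy as np

# First-order condition for f(x) = x^2 (gradient f'(x) = 2x):
# f(y) >= f(x) + f'(x)(y - x) must hold for every pair (x, y).
rng = np.random.default_rng(1)
f = lambda x: x**2
grad = lambda x: 2 * x

xs = rng.uniform(-10, 10, size=1000)
ys = rng.uniform(-10, 10, size=1000)
holds = np.all(f(ys) >= f(xs) + grad(xs) * (ys - xs))
print(holds)  # True: every sampled tangent lies on or below f
```

For this particular $f$ the gap is $(y - x)^2 \ge 0$, so the inequality holds exactly.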
Method 3: Second-order condition (Hessian test)
If $f$ is twice differentiable, $f$ is convex if and only if the Hessian is positive semidefinite for all $x$ in the domain:

$$\nabla^2 f(x) \succeq 0$$
This is usually the easiest test to apply. Compute the Hessian, then check that all its eigenvalues are non-negative.
Contour plot of the convex bowl f(x,y) = x^2 + y^2. Every contour is a circle, and there is a single global minimum at the origin.
Example 1: Verify that the L2 loss is convex
The L2 loss for linear regression with parameters $w$ is:

$$f(w) = \|Xw - y\|^2$$

Step 1. Expand:

$$f(w) = (Xw - y)^\top (Xw - y) = w^\top X^\top X w - 2 y^\top X w + y^\top y$$

Step 2. Compute the gradient:

$$\nabla f(w) = 2 X^\top X w - 2 X^\top y$$

Step 3. Compute the Hessian:

$$\nabla^2 f(w) = 2 X^\top X$$

Step 4. Check positive semidefiniteness. For any vector $v$:

$$v^\top (2 X^\top X) v = 2 (Xv)^\top (Xv) = 2 \|Xv\|^2 \ge 0$$

The squared norm is always non-negative, so $2 X^\top X$ is positive semidefinite. Therefore, the L2 loss is convex. ✓

Numerical check. Let $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$:

$$H = 2 X^\top X = 2 \begin{pmatrix} 10 & 14 \\ 14 & 20 \end{pmatrix} = \begin{pmatrix} 20 & 28 \\ 28 & 40 \end{pmatrix}$$

The eigenvalues of $H$ are found from:

$$\det(H - \lambda I) = \lambda^2 - 60\lambda + 16 = 0 \quad\Longrightarrow\quad \lambda_1 \approx 0.27, \quad \lambda_2 \approx 59.73$$

Both eigenvalues are positive, confirming the Hessian is positive definite. This means the L2 loss is actually strictly convex when $X$ has full column rank.
import numpy as np

# Hessian of the L2 loss for the 2x2 example: H = 2 X^T X
X = np.array([[1, 2], [3, 4]])
H = 2 * X.T @ X
eigenvalues = np.linalg.eigvalsh(H)
print(f"Eigenvalues of Hessian: {np.round(eigenvalues, 2)}")
print(f"All non-negative: {all(eigenvalues >= 0)}")
# Eigenvalues of Hessian: [ 0.27 59.73]
# All non-negative: True
Example 2: The log-sum-exp function is convex
The log-sum-exp (LSE) function is defined as:

$$\mathrm{LSE}(x) = \log \sum_{i=1}^{n} e^{x_i}$$
This function appears in softmax classifiers and in the cross-entropy loss.
Step 1. We verify convexity using the definition. For $\theta \in [0, 1]$ and vectors $x, y \in \mathbb{R}^n$, we need:

$$\mathrm{LSE}(\theta x + (1 - \theta) y) \le \theta\,\mathrm{LSE}(x) + (1 - \theta)\,\mathrm{LSE}(y)$$

Step 2. Apply the weighted AM-GM inequality (or Hölder’s inequality). For non-negative reals $a_i, b_i$ and $\theta \in [0, 1]$:

$$\sum_i a_i^{\theta} b_i^{1 - \theta} \le \left(\sum_i a_i\right)^{\theta} \left(\sum_i b_i\right)^{1 - \theta}$$

Setting $a_i = e^{x_i}$ and $b_i = e^{y_i}$:

$$\sum_i e^{\theta x_i + (1 - \theta) y_i} \le \left(\sum_i e^{x_i}\right)^{\theta} \left(\sum_i e^{y_i}\right)^{1 - \theta}$$

Step 3. Take the log of both sides (log is monotone increasing):

$$\mathrm{LSE}(\theta x + (1 - \theta) y) \le \theta\,\mathrm{LSE}(x) + (1 - \theta)\,\mathrm{LSE}(y)$$

This is exactly the definition of convexity. ✓

Numerical verification. Take $x = (1, 2)$, $y = (3, 0)$, $\theta = 0.5$:

$$\mathrm{LSE}(x) = \log(e^1 + e^2) \approx 2.313, \qquad \mathrm{LSE}(y) = \log(e^3 + e^0) \approx 3.049$$

The midpoint is $m = 0.5\,x + 0.5\,y = (2, 1)$, with $\mathrm{LSE}(m) = \log(e^2 + e^1) \approx 2.313$.

We confirm: $2.313 \le 0.5(2.313) + 0.5(3.049) = 2.681$. ✓
import numpy as np
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
theta = 0.5
m = theta * x + (1 - theta) * y
lse = lambda v: np.log(np.sum(np.exp(v)))
print(f"LSE(midpoint) = {lse(m):.3f}")
print(f"Convex combination = {theta * lse(x) + (1 - theta) * lse(y):.3f}")
# LSE(midpoint) = 2.313
# Convex combination = 2.681
Example 3: A non-convex function with a local minimum
The double-well potential is a classic non-convex function:

$$f(x) = (x^2 - 1)^2$$

Step 1. Find critical points. Set $f'(x) = 0$:

$$f'(x) = 2(x^2 - 1)(2x) = 4x(x^2 - 1) = 0$$

Critical points: $x = -1$, $x = 0$, $x = 1$.

Step 2. Evaluate $f$ and $f''(x) = 12x^2 - 4$ at each:

| $x$ | $f(x)$ | $f''(x)$ | Classification |
|---|---|---|---|
| $0$ | $1$ | $-4$ | Local maximum |
| $-1$ | $0$ | $8$ | Local (and global) minimum |
| $1$ | $0$ | $8$ | Local (and global) minimum |

Step 3. Check for convexity. The second derivative $f''(x) = 12x^2 - 4$ is negative when $|x| < 1/\sqrt{3}$. Since $f''$ is not non-negative everywhere, $f$ is not convex.

This is the kind of landscape you see in neural networks: multiple minima separated by regions where the curvature is negative. An optimizer starting near $x = 0$ could get pushed toward either $x = -1$ or $x = 1$ depending on the initial conditions.
f = lambda x: (x**2 - 1)**2     # double-well potential
f_pp = lambda x: 12*x**2 - 4    # second derivative
for x in [-1, 0, 1]:
    print(f"x={x:+d}: f={f(x)}, f''={f_pp(x):+d}")
# x=-1: f=0, f''=+8
# x=+0: f=1, f''=-4
# x=+1: f=0, f''=+8
Why convexity matters in ML
Every local minimum is global
This is the big one. If is convex and you find a point where the gradient is zero, you have found the global minimum. No need to restart with different initializations or worry about being trapped.
Efficient algorithms exist
For convex problems, gradient descent converges to the global solution with rate guarantees. Newton’s method converges quadratically near the solution. Interior point methods solve constrained convex problems in polynomial time.
Non-convex problems are NP-hard in general
Finding the global minimum of a general non-convex function is computationally intractable. In practice, we settle for local minima and use tricks (random restarts, good initialization, learning rate schedules) to find good ones.
Many ML models are convex
Linear regression, logistic regression, SVMs, and ridge/lasso regression all have convex objectives. This is not an accident. These models were designed so that training is tractable. The recent dominance of deep learning means we now routinely solve non-convex problems, but the theory of convex optimization remains the foundation.
Operations that preserve convexity
If you need to show a complex function is convex, you can often build it from simpler convex pieces using these rules:
| Operation | Result |
|---|---|
| $\alpha f$ where $\alpha \ge 0$ | Convex (non-negative scaling) |
| $f + g$ where both convex | Convex (sum) |
| $\max(f, g)$ where both convex | Convex (pointwise max) |
| $f(Ax + b)$ where $f$ convex | Convex (affine composition) |
| $h(g(x))$ where $h$ convex and increasing, $g$ convex | Convex (composition) |
These let you verify convexity of complicated loss functions by breaking them into simpler components.
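As an illustration of the pointwise-max rule, the following sketch samples the convexity inequality for $h(x) = \max(|x|,\, x^2 - 1)$, an arbitrary example built from two convex pieces:

```python
import numpy as np

# h is the pointwise max of two convex functions, |x| and x^2 - 1,
# so it should satisfy h(t x + (1-t) y) <= t h(x) + (1-t) h(y)
# for every pair (x, y) and every t in [0, 1].
rng = np.random.default_rng(7)
h = lambda x: np.maximum(np.abs(x), x**2 - 1)

x = rng.uniform(-5, 5, size=10_000)
y = rng.uniform(-5, 5, size=10_000)
t = rng.uniform(size=10_000)
violations = np.sum(h(t * x + (1 - t) * y) > t * h(x) + (1 - t) * h(y) + 1e-12)
print(violations)  # 0
```

A sampling check like this cannot prove convexity, but a single violation disproves it, which makes it a cheap sanity test for hand-derived claims.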
Strictly convex and strongly convex
Two stronger versions of convexity come up often:
Strictly convex: the inequality is strict, $f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$ for $x \ne y$ and $\theta \in (0, 1)$. This guarantees a unique minimum.

Strongly convex with parameter $\mu > 0$: the Hessian satisfies $\nabla^2 f(x) \succeq \mu I$ for all $x$. This is stronger than strict convexity. It guarantees not just a unique minimum but also bounds on how fast gradient descent converges. Adding L2 regularization to a convex loss always produces a strongly convex objective, which is one reason regularization helps optimization, not just generalization.
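The regularization effect can be seen directly on the L2 loss Hessian $2X^\top X$: adding $\lambda \|w\|^2$ shifts every eigenvalue up by $2\lambda$. A sketch using a deliberately rank-deficient $X$ (the matrix and $\lambda = 0.1$ are illustrative choices):

```python
import numpy as np

# With a rank-deficient X, the plain Hessian 2 X^T X has a zero
# eigenvalue (convex but not strongly convex). Adding the ridge term
# lam * ||w||^2 contributes 2 * lam * I, lifting every eigenvalue.
X = np.array([[1.0, 2.0], [2.0, 4.0]])   # rank 1: second row = 2 * first
lam = 0.1

H_plain = 2 * X.T @ X
H_ridge = H_plain + 2 * lam * np.eye(2)

print(np.linalg.eigvalsh(H_plain))   # smallest eigenvalue is (numerically) 0
print(np.linalg.eigvalsh(H_ridge))   # smallest eigenvalue is 2 * lam = 0.2
```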
Summary
| Concept | What it means |
|---|---|
| Convex set | Line segment between any two points stays inside |
| Convex function | Chord lies above the curve (Jensen’s inequality) |
| Hessian test | $\nabla^2 f(x) \succeq 0$ everywhere implies convexity |
| Strictly convex | Unique global minimum |
| Strongly convex | Fast convergence guarantees |
Convexity is the property that separates tractable optimization from intractable optimization. When you have it, exploit it. When you do not, understand what you lose and use algorithms designed for non-convex landscapes.
What comes next
With convexity in hand, we can now state precisely when a point is optimal. The next article covers first-order optimality conditions, where we use the gradient to identify candidate solutions.