
Convex sets and convex functions

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

If your optimization problem is convex, every local minimum is a global minimum. That single fact changes everything. You do not have to worry about getting stuck in a bad local minimum, and efficient algorithms with guarantees exist. If your problem is not convex, life gets harder.

The bowl vs the crumpled landscape

If your problem is convex, finding the answer is easy. If not, you might get stuck.

Picture a smooth bowl. Drop a marble anywhere on the rim and it rolls to the bottom. There is only one bottom, and every path leads there. That is a convex function.

Now picture a crumpled sheet of aluminum foil. Drop a marble and it settles into whichever dent is closest. That dent might be shallow while a deeper one exists elsewhere. The marble has no way to know. That is a non-convex function.

Convex vs non-convex landscape

graph TD
  subgraph Convex["Convex (bowl shape)"]
      A1["Any starting point"] --> A2["Rolls to the single global minimum"]
  end
  subgraph NonConvex["Non-convex (crumpled landscape)"]
      B1["Start A"] --> B2["Trapped in local min 1"]
      B3["Start B"] --> B4["Finds deeper local min 2"]
  end

Convexity determines whether your optimizer can guarantee finding the best answer or might settle for a mediocre one.

| Problem | Convex? | Consequence |
| --- | --- | --- |
| Linear regression (MSE) | Yes | Unique global solution, fast solvers |
| Logistic regression | Yes | Gradient descent finds the optimum |
| SVM (hinge loss) | Yes | Efficient quadratic programming |
| Ridge and Lasso regression | Yes | Regularization preserves convexity |
| Neural networks | No | Many local minima, saddle points |
| K-means clustering | No | Result depends on initialization |

A convex function curves upward like a bowl. Any point at the bottom is THE bottom. No restarts, no tricks.

Comparing a convex function (x^2) with a non-convex function (sin(3x) + x^2/3)

Now let’s define convexity precisely.

Prerequisites

You should be comfortable with:

  • Basic linear algebra: vectors, dot products, matrix multiplication, eigenvalues
  • Multivariable calculus: gradients and Hessians

Convex sets

A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $\mathbf{x}, \mathbf{y} \in C$ and any $\theta \in [0, 1]$:

$$\theta \mathbf{x} + (1 - \theta)\mathbf{y} \in C$$

In plain terms: pick any two points in the set, draw a straight line between them, and every point on that line is also in the set. If you can find even one pair of points where the connecting line segment leaves the set, the set is not convex.

The line segment test

graph LR
  subgraph Convex["Convex set"]
      A["Point x"] -- "Entire line stays inside" --> B["Point y"]
  end
  subgraph NotConvex["Non-convex set"]
      C["Point x"] -- "Line exits the set" --> D["Point y"]
  end

Common convex sets

  • The real line $\mathbb{R}$: trivially convex.
  • Hyperplanes: $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} = b\}$. Any line between two points on a plane stays on that plane.
  • Halfspaces: $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} \leq b\}$.
  • Balls: $\{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| \leq r\}$.
  • Polyhedra: intersections of halfspaces. The feasible region of a linear program is a polyhedron.

Non-convex set example

The set $\{(x, y) : x^2 + y^2 \geq 1\}$ (everything on or outside the unit circle) is not convex. Take the points $(1, 0)$ and $(-1, 0)$. Their midpoint is $(0, 0)$, which has $0^2 + 0^2 = 0 < 1$, so the midpoint violates the defining inequality and is not in the set.
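A quick numeric sketch of this counterexample (the membership-test helpers here are my own, not from any library):

```python
import numpy as np

# Indicator functions for the two sets under discussion
in_ball = lambda p: p @ p <= 1.0          # closed unit ball: convex
outside_circle = lambda p: p @ p >= 1.0   # region outside the circle: not convex

x, y = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
mid = 0.5 * x + 0.5 * y                   # midpoint is (0, 0)

print(in_ball(x), in_ball(y), in_ball(mid))                       # True True True
print(outside_circle(x), outside_circle(y), outside_circle(mid))  # True True False
```

The ball contains both endpoints and their midpoint; the outside region contains both endpoints but loses the midpoint, which is exactly the failure of the line segment test.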

Intersection preserves convexity

A useful property: the intersection of any number of convex sets is convex. If $C_1, C_2, \ldots, C_k$ are all convex, then $C_1 \cap C_2 \cap \cdots \cap C_k$ is convex. This is why constrained optimization works well when each constraint defines a convex set: the feasible region, being an intersection, is convex too.
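A small randomized check of this property, sketched with two halfspace constraints of my own choosing: sample points inside the intersection and confirm that midpoints never leave it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two halfspaces (each convex) and their intersection
h1 = lambda p: p[0] + p[1] <= 1.0    # a = (1, 1),  b = 1
h2 = lambda p: p[0] - p[1] <= 0.5    # a = (1, -1), b = 0.5
inter = lambda p: h1(p) and h2(p)

# Collect random points that land inside the intersection
pts = [p for p in rng.uniform(-3, 3, size=(2000, 2)) if inter(p)]

# Every midpoint of two such points must stay inside as well
ok = all(inter(0.5 * (a + b)) for a in pts[:50] for b in pts[:50])
print(f"All midpoints stay in the intersection: {ok}")  # True
```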

Convex functions

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and for all $\mathbf{x}, \mathbf{y}$ in the domain and $\theta \in [0, 1]$:

$$f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})$$

Geometrically, this says that the line segment connecting $(\mathbf{x}, f(\mathbf{x}))$ to $(\mathbf{y}, f(\mathbf{y}))$ lies on or above the graph of $f$. The function “curves upward.” It never has a bump or a dip that would create a local minimum that is not also global.

A function is strictly convex if the inequality is strict ($<$ instead of $\leq$) for all $\mathbf{x} \neq \mathbf{y}$ and $\theta \in (0, 1)$. Strictly convex functions have at most one global minimum.

A function is concave if $-f$ is convex.

Jensen’s inequality

The definition above is actually Jensen’s inequality for two points. The general form says: for a convex function $f$ and any weights $\theta_i \geq 0$ with $\sum_i \theta_i = 1$:

$$f\!\left(\sum_i \theta_i \mathbf{x}_i\right) \leq \sum_i \theta_i f(\mathbf{x}_i)$$

In the probabilistic form: if $X$ is a random variable and $f$ is convex, then:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$
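The probabilistic form is easy to check by simulation. Here is a sketch using $f(x) = e^x$, which is convex, on Gaussian samples (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)  # samples of a random variable X

# Jensen: f(E[X]) <= E[f(X)] for the convex function f(x) = e^x
lhs = np.exp(X.mean())
rhs = np.exp(X).mean()
print(f"f(E[X]) = {lhs:.3f} <= E[f(X)] = {rhs:.3f}: {lhs <= rhs}")
```

The final flag prints True for any sample, because Jensen’s inequality also holds for the empirical distribution of the draws.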

Jensen’s inequality: the chord lies above the curve

graph TD
  A["Pick two points x and y on the domain"] --> B["Evaluate f at x and f at y"]
  B --> C["Draw a chord connecting f(x) to f(y)"]
  C --> D["For a convex f, the chord always lies on or above the curve"]
  D --> E["f(average of inputs) is at most the average of f(inputs)"]

This inequality shows up constantly in ML, especially in variational inference and information theory.

Three ways to check convexity

Method 1: Definition (line segment test)

Plug in the definition directly. This is often tedious but always works.

Method 2: First-order condition

If $f$ is differentiable, $f$ is convex if and only if:

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) \quad \text{for all } \mathbf{x}, \mathbf{y}$$

This says the tangent line (or tangent hyperplane) at any point lies on or below the function. The gradient gives you a global underestimator.
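A randomized sanity check of the first-order condition for $f(\mathbf{x}) = \|\mathbf{x}\|^2$, where the gap $f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})$ works out to $\|\mathbf{y}-\mathbf{x}\|^2 \geq 0$ (the sample count and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda v: float(v @ v)   # f(x) = ||x||^2, convex
grad = lambda v: 2 * v       # its gradient

# First-order condition: f(y) >= f(x) + grad(x) . (y - x) for all x, y
ok = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-12   # tiny tolerance for float error
    for x, y in ((rng.normal(size=3), rng.normal(size=3)) for _ in range(1000))
)
print(f"Tangent planes underestimate f at every sampled pair: {ok}")  # True
```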

Method 3: Second-order condition (Hessian test)

If $f$ is twice differentiable, $f$ is convex if and only if the Hessian $\nabla^2 f(\mathbf{x})$ is positive semidefinite for all $\mathbf{x}$ in the domain:

$$\nabla^2 f(\mathbf{x}) \succeq 0 \quad \text{for all } \mathbf{x}$$

This is usually the easiest test to apply. Compute the Hessian, then check that all its eigenvalues are non-negative.
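When the Hessian is painful to derive by hand, you can approximate it with central finite differences and inspect the eigenvalues numerically. This is a rough sketch (the helper `numerical_hessian` is my own, and it only checks curvature at the sampled point, not everywhere):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h**2)
    return H

f = lambda v: v[0]**2 + v[1]**2                 # the convex bowl
H = numerical_hessian(f, np.array([0.7, -1.3]))
print(np.round(np.linalg.eigvalsh(H), 3))       # both eigenvalues ~ 2: PSD here
```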

Contour plot of the convex bowl f(x,y) = x^2 + y^2. Every contour is a circle, and there is a single global minimum at the origin.

Example 1: Verify that the L2 loss is convex

The L2 loss for linear regression with parameters $\mathbf{w} \in \mathbb{R}^d$ is:

$$f(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 = (\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$$

Step 1. Expand:

$$f(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{w}^TX^TX\mathbf{w}$$

Step 2. Compute the gradient:

$$\nabla f(\mathbf{w}) = -2X^T\mathbf{y} + 2X^TX\mathbf{w}$$

Step 3. Compute the Hessian:

$$\nabla^2 f(\mathbf{w}) = 2X^TX$$

Step 4. Check positive semidefiniteness. For any vector $\mathbf{v}$:

$$\mathbf{v}^T(2X^TX)\mathbf{v} = 2(X\mathbf{v})^T(X\mathbf{v}) = 2\|X\mathbf{v}\|^2 \geq 0$$

The squared norm is always non-negative, so $2X^TX$ is positive semidefinite. Therefore, the L2 loss is convex. ✓

Numerical check. Let $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$:

$$X^TX = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 10 & 14 \\ 14 & 20 \end{bmatrix}$$

The eigenvalues of $X^TX$ are found from:

$$\det(X^TX - \lambda I) = (10 - \lambda)(20 - \lambda) - 196 = \lambda^2 - 30\lambda + 4 = 0$$

$$\lambda = \frac{30 \pm \sqrt{900 - 16}}{2} = \frac{30 \pm 29.73}{2}$$

$$\lambda_1 \approx 29.87, \quad \lambda_2 \approx 0.13$$

Both eigenvalues are positive, confirming the Hessian $2X^TX$ is positive definite. This means the L2 loss is actually strictly convex when $X$ has full column rank.

import numpy as np

X = np.array([[1, 2], [3, 4]])
H = 2 * X.T @ X                      # Hessian of the L2 loss
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh handles symmetric matrices

print(f"Eigenvalues of Hessian: {np.round(eigenvalues, 2)}")
print(f"All non-negative: {bool(np.all(eigenvalues >= 0))}")
# Eigenvalues of Hessian: [ 0.27 59.73]
# All non-negative: True

Example 2: The log-sum-exp function is convex

The log-sum-exp (LSE) function is defined as:

$$\text{LSE}(\mathbf{x}) = \log\!\left(\sum_{i=1}^{n} e^{x_i}\right)$$

This function appears in softmax classifiers and in the cross-entropy loss.

Step 1. We verify convexity using the definition. For $\theta \in [0,1]$ and vectors $\mathbf{x}, \mathbf{y}$:

$$\text{LSE}(\theta \mathbf{x} + (1-\theta)\mathbf{y}) = \log\!\left(\sum_i e^{\theta x_i + (1-\theta)y_i}\right)$$

$$= \log\!\left(\sum_i (e^{x_i})^\theta (e^{y_i})^{1-\theta}\right)$$

Step 2. Apply the weighted AM-GM inequality (or Hölder’s inequality). For non-negative reals $a_i, b_i$:

$$\sum_i a_i^\theta b_i^{1-\theta} \leq \left(\sum_i a_i\right)^\theta \left(\sum_i b_i\right)^{1-\theta}$$

Setting $a_i = e^{x_i}$ and $b_i = e^{y_i}$:

$$\sum_i (e^{x_i})^\theta(e^{y_i})^{1-\theta} \leq \left(\sum_i e^{x_i}\right)^\theta \left(\sum_i e^{y_i}\right)^{1-\theta}$$

Step 3. Take the log of both sides (log is monotone increasing):

$$\text{LSE}(\theta\mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta \log\!\left(\sum_i e^{x_i}\right) + (1-\theta)\log\!\left(\sum_i e^{y_i}\right)$$

$$= \theta \, \text{LSE}(\mathbf{x}) + (1-\theta)\, \text{LSE}(\mathbf{y})$$

This is exactly the definition of convexity. ✓

Numerical verification. Take $\mathbf{x} = (1, 2)$, $\mathbf{y} = (3, 0)$, $\theta = 0.5$:

The midpoint is $\mathbf{m} = (2, 1)$.

$$\text{LSE}(\mathbf{m}) = \log(e^2 + e^1) = \log(7.389 + 2.718) = \log(10.107) \approx 2.313$$

$$\frac{1}{2}\text{LSE}(\mathbf{x}) + \frac{1}{2}\text{LSE}(\mathbf{y}) = \frac{1}{2}\log(e^1 + e^2) + \frac{1}{2}\log(e^3 + e^0)$$

$$= \frac{1}{2}\log(10.107) + \frac{1}{2}\log(21.086) = \frac{1}{2}(2.313 + 3.049) = 2.681$$

We confirm: $2.313 \leq 2.681$. ✓

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
theta = 0.5
m = theta * x + (1 - theta) * y

lse = lambda v: np.log(np.sum(np.exp(v)))

print(f"LSE(midpoint)       = {lse(m):.3f}")
print(f"Convex combination  = {theta * lse(x) + (1 - theta) * lse(y):.3f}")
# LSE(midpoint)       = 2.313
# Convex combination  = 2.681

Example 3: A non-convex function with multiple minima

The double-well potential is a classic non-convex function:

$$f(x) = (x^2 - 1)^2 = x^4 - 2x^2 + 1$$

Step 1. Find critical points. Set $f'(x) = 0$:

$$f'(x) = 4x^3 - 4x = 4x(x^2 - 1) = 0$$

Critical points: $x = 0$, $x = 1$, $x = -1$.

Step 2. Evaluate $f$ and $f''$ at each:

| $x$ | $f(x)$ | $f''(x) = 12x^2 - 4$ | Classification |
| --- | --- | --- | --- |
| $0$ | $1$ | $-4 < 0$ | Local maximum |
| $1$ | $0$ | $8 > 0$ | Local (and global) minimum |
| $-1$ | $0$ | $8 > 0$ | Local (and global) minimum |

Step 3. Check for convexity. The second derivative $f''(x) = 12x^2 - 4$ is negative when $|x| < \frac{1}{\sqrt{3}} \approx 0.577$. Since $f''$ is not non-negative everywhere, $f$ is not convex.

This is the kind of landscape you see in neural networks: multiple minima separated by regions where the curvature is negative. An optimizer starting near $x = 0$ could get pushed toward either $x = 1$ or $x = -1$ depending on the initial conditions.

import numpy as np

f = lambda x: (x**2 - 1)**2
f_pp = lambda x: 12*x**2 - 4

for x in [-1, 0, 1]:
    print(f"x={x:+d}: f={f(x)}, f''={f_pp(x):+d}")
# x=-1: f=0, f''=+8
# x=+0: f=1, f''=-4
# x=+1: f=0, f''=+8

Why convexity matters in ML

Every local minimum is global

This is the big one. If $f$ is convex and you find a point where the gradient is zero, you have found the global minimum. No need to restart with different initializations or worry about being trapped.

Efficient algorithms exist

For convex problems, gradient descent converges to the global solution with rate guarantees. Newton’s method converges quadratically near the solution. Interior point methods solve constrained convex problems in polynomial time.

Non-convex problems are NP-hard in general

Finding the global minimum of a general non-convex function is computationally intractable. In practice, we settle for local minima and use tricks (random restarts, good initialization, learning rate schedules) to find good ones.
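One of those tricks, random restarts, is easy to illustrate on the double-well function $f(x) = (x^2 - 1)^2$ from Example 3: plain gradient descent lands in whichever basin the starting point belongs to. The step size and iteration count below are illustrative choices, not tuned values.

```python
# Gradient descent on the double-well f(x) = (x^2 - 1)^2 from several starts
grad = lambda x: 4 * x**3 - 4 * x

for x0 in [-1.5, -0.1, 0.1, 1.5]:
    x = x0
    for _ in range(200):
        x -= 0.05 * grad(x)   # fixed step size, purely illustrative
    print(f"start {x0:+.1f} -> converged to {x:+.3f}")
# Starts left of zero reach -1, starts right of zero reach +1
```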

Many ML models are convex

Linear regression, logistic regression, SVMs, and ridge/lasso regression all have convex objectives. This is not an accident. These models were designed so that training is tractable. The recent dominance of deep learning means we now routinely solve non-convex problems, but the theory of convex optimization remains the foundation.

Operations that preserve convexity

If you need to show a complex function is convex, you can often build it from simpler convex pieces using these rules:

| Operation | Result |
| --- | --- |
| $\alpha f$ where $\alpha \geq 0$ | Convex (non-negative scaling) |
| $f + g$ where both convex | Convex (sum) |
| $\max(f, g)$ where both convex | Convex (pointwise max) |
| $g(A\mathbf{x} + \mathbf{b})$ where $g$ convex | Convex (affine composition) |
| $h(f(\mathbf{x}))$ where $h$ convex and increasing, $f$ convex | Convex (composition) |

These let you verify convexity of complicated loss functions by breaking them into simpler components.
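As a sketch, here is a randomized midpoint test for the pointwise-max rule, using two arbitrary convex functions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: (x - 1)**2              # convex
g = lambda x: np.abs(x + 2)           # convex
h = lambda x: np.maximum(f(x), g(x))  # pointwise max of the two

# Midpoint form of the convexity definition, checked on random pairs
xs, ys = rng.uniform(-5, 5, 1000), rng.uniform(-5, 5, 1000)
ok = bool(np.all(h(0.5 * (xs + ys)) <= 0.5 * h(xs) + 0.5 * h(ys) + 1e-12))
print(f"Midpoint inequality holds for max(f, g): {ok}")  # True
```

A passing check does not prove convexity, but a single failing pair would disprove it; the algebraic rules above are what give the guarantee.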

Strictly convex and strongly convex

Two stronger versions of convexity come up often:

Strictly convex: the inequality in the definition is strict: $f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) < \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})$ for $\theta \in (0,1)$ and $\mathbf{x} \neq \mathbf{y}$. This guarantees that a minimum, if one exists, is unique.

Strongly convex with parameter $m > 0$: the Hessian satisfies $\nabla^2 f(\mathbf{x}) \succeq mI$ for all $\mathbf{x}$. This is stronger than strict convexity. It guarantees not just a unique minimum but also bounds on how fast gradient descent converges. Adding L2 regularization to a convex loss always produces a strongly convex objective, which is one reason regularization helps optimization, not just generalization.
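A quick numeric illustration of that last claim (the rank-deficient $X$ below is a made-up example): the Hessian of the unregularized L2 loss can have a zero eigenvalue, and adding $\lambda\|\mathbf{w}\|^2$ contributes $2\lambda I$ to the Hessian, shifting every eigenvalue up by $2\lambda$.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0]])  # rank-deficient design matrix
H = 2 * X.T @ X                         # Hessian of ||y - Xw||^2
lam = 0.1                               # L2 regularization strength

print(np.round(np.linalg.eigvalsh(H), 3))                    # smallest eigenvalue is 0
print(np.round(np.linalg.eigvalsh(H + 2*lam*np.eye(2)), 3))  # smallest is now 2*lam = 0.2
```

The regularized Hessian satisfies $\nabla^2 f \succeq 2\lambda I$, so the objective is strongly convex with $m = 2\lambda$ even though the unregularized loss was not.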

Summary

| Concept | What it means |
| --- | --- |
| Convex set | Line segment between any two points stays inside |
| Convex function | Chord lies on or above the curve (Jensen’s inequality) |
| Hessian test | $\nabla^2 f \succeq 0$ everywhere implies convexity |
| Strictly convex | Unique global minimum (when one exists) |
| Strongly convex | Fast convergence guarantees |

Convexity is the property that separates tractable optimization from intractable optimization. When you have it, exploit it. When you do not, understand what you lose and use algorithms designed for non-convex landscapes.

What comes next

With convexity in hand, we can now state precisely when a point is optimal. The next article covers first-order optimality conditions, where we use the gradient to identify candidate solutions.
