
Convex sets and convex functions

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

If your optimization problem is convex, every local minimum is a global minimum. That single fact changes everything. You do not have to worry about getting stuck in a bad local minimum, and efficient algorithms with guarantees exist. If your problem is not convex, life gets harder.

The bowl vs the crumpled landscape

If your problem is convex, finding the answer is easy. If not, you might get stuck.

Picture a smooth bowl. Drop a marble anywhere on the rim and it rolls to the bottom. There is only one bottom, and every path leads there. That is a convex function.

Now picture a crumpled sheet of aluminum foil. Drop a marble and it settles into whichever dent is closest. That dent might be shallow while a deeper one exists elsewhere. The marble has no way to know. That is a non-convex function.

Convex vs non-convex landscape

graph TD
  subgraph Convex["Convex (bowl shape)"]
      A1["Any starting point"] --> A2["Rolls to the single global minimum"]
  end
  subgraph NonConvex["Non-convex (crumpled landscape)"]
      B1["Start A"] --> B2["Trapped in local min 1"]
      B3["Start B"] --> B4["Finds deeper local min 2"]
  end

Convexity determines whether your optimizer can guarantee finding the best answer or might settle for a mediocre one.

| Problem | Convex? | Consequence |
| --- | --- | --- |
| Linear regression (MSE) | Yes | Unique global solution, fast solvers |
| Logistic regression | Yes | Gradient descent finds the optimum |
| SVM (hinge loss) | Yes | Efficient quadratic programming |
| Ridge and Lasso regression | Yes | Regularization preserves convexity |
| Neural networks | No | Many local minima, saddle points |
| K-means clustering | No | Result depends on initialization |

A convex function curves upward like a bowl. Any point at the bottom is THE bottom. No restarts, no tricks.

Comparing a convex function (x^2) with a non-convex function (sin(3x) + x^2/3)

Now let’s define convexity precisely.

Prerequisites

You should be comfortable with:

  • Basic linear algebra: vectors, dot products, matrix multiplication, eigenvalues
  • Multivariable calculus: gradients and Hessians

Convex sets

A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $\mathbf{x}, \mathbf{y} \in C$ and any $\theta \in [0, 1]$:

$$\theta \mathbf{x} + (1 - \theta)\mathbf{y} \in C$$

In plain terms: pick any two points in the set, draw a straight line between them, and every point on that line is also in the set. If you can find even one pair of points where the connecting line segment leaves the set, the set is not convex.

The line segment test

graph LR
  subgraph Convex["Convex set"]
      A["Point x"] -- "Entire line stays inside" --> B["Point y"]
  end
  subgraph NotConvex["Non-convex set"]
      C["Point x"] -- "Line exits the set" --> D["Point y"]
  end

Common convex sets

  • The real line $\mathbb{R}$: trivially convex.
  • Hyperplanes: $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} = b\}$. Any line between two points on a plane stays on that plane.
  • Halfspaces: $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} \leq b\}$.
  • Balls: $\{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| \leq r\}$.
  • Polyhedra: intersections of halfspaces. The feasible region of a linear program is a polyhedron.

Non-convex set example

The set $\{(x, y) : x^2 + y^2 \geq 1\}$ (everything on or outside the unit circle) is not convex. Take the points $(1, 0)$ and $(-1, 0)$. Their midpoint is $(0, 0)$, which has $0^2 + 0^2 = 0 < 1$, so the midpoint violates the defining inequality and is not in the set.
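A quick numeric sketch of this counterexample (the membership-test helpers here are my own, not from any library):

```python
import numpy as np

# Indicator functions for the two sets under discussion
in_ball = lambda p: p @ p <= 1.0          # closed unit ball: convex
outside_circle = lambda p: p @ p >= 1.0   # region outside the circle: not convex

x, y = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
mid = 0.5 * x + 0.5 * y                   # midpoint is (0, 0)

print(in_ball(x), in_ball(y), in_ball(mid))                       # True True True
print(outside_circle(x), outside_circle(y), outside_circle(mid))  # True True False
```

The ball contains both endpoints and their midpoint; the outside region contains both endpoints but loses the midpoint, which is exactly the failure of the line segment test.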

Intersection preserves convexity

A useful property: the intersection of any number of convex sets is convex. If $C_1, C_2, \ldots, C_k$ are all convex, then $C_1 \cap C_2 \cap \cdots \cap C_k$ is convex. This is why constrained optimization works well when each constraint defines a convex set: the feasible region, being an intersection, is convex too.
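A small randomized check of this property, sketched with two halfspace constraints of my own choosing: sample points inside the intersection and confirm that midpoints never leave it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two halfspaces (each convex) and their intersection
h1 = lambda p: p[0] + p[1] <= 1.0    # a = (1, 1),  b = 1
h2 = lambda p: p[0] - p[1] <= 0.5    # a = (1, -1), b = 0.5
inter = lambda p: h1(p) and h2(p)

# Collect random points that land inside the intersection
pts = [p for p in rng.uniform(-3, 3, size=(2000, 2)) if inter(p)]

# Every midpoint of two such points must stay inside as well
ok = all(inter(0.5 * (a + b)) for a in pts[:50] for b in pts[:50])
print(f"All midpoints stay in the intersection: {ok}")  # True
```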

Convex functions

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and for all $\mathbf{x}, \mathbf{y}$ in the domain and $\theta \in [0, 1]$:

$$f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})$$

Geometrically, this says that the line segment connecting $(\mathbf{x}, f(\mathbf{x}))$ to $(\mathbf{y}, f(\mathbf{y}))$ lies on or above the graph of $f$. The function “curves upward.” It never has a bump or a dip that would create a local minimum that is not also global.

A function is strictly convex if the inequality is strict ($<$ instead of $\leq$) for all $\mathbf{x} \neq \mathbf{y}$ and $\theta \in (0, 1)$. Strictly convex functions have at most one global minimum.

A function is concave if $-f$ is convex.

Jensen’s inequality

The definition above is actually Jensen’s inequality for two points. The general form says: for a convex function $f$ and any weights $\theta_i \geq 0$ with $\sum_i \theta_i = 1$:

$$f\!\left(\sum_i \theta_i \mathbf{x}_i\right) \leq \sum_i \theta_i f(\mathbf{x}_i)$$

In the probabilistic form: if $X$ is a random variable and $f$ is convex, then:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$
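The probabilistic form is easy to check by simulation. Here is a sketch using $f(x) = e^x$, which is convex, on Gaussian samples (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)  # samples of a random variable X

# Jensen: f(E[X]) <= E[f(X)] for the convex function f(x) = e^x
lhs = np.exp(X.mean())
rhs = np.exp(X).mean()
print(f"f(E[X]) = {lhs:.3f} <= E[f(X)] = {rhs:.3f}: {lhs <= rhs}")
```

The final flag prints True for any sample, because Jensen’s inequality also holds for the empirical distribution of the draws.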

Jensen’s inequality: the chord lies above the curve

graph TD
  A["Pick two points x and y on the domain"] --> B["Evaluate f at x and f at y"]
  B --> C["Draw a chord connecting f(x) to f(y)"]
  C --> D["For a convex f, the chord always lies on or above the curve"]
  D --> E["f(average of inputs) is at most the average of f(inputs)"]

This inequality shows up constantly in ML, especially in variational inference and information theory.

Three ways to check convexity

Method 1: Definition (line segment test)

Plug in the definition directly. This is often tedious but always works.

Method 2: First-order condition

If $f$ is differentiable, $f$ is convex if and only if:

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) \quad \text{for all } \mathbf{x}, \mathbf{y}$$

This says the tangent line (or tangent hyperplane) at any point lies on or below the function. The gradient gives you a global underestimator.
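A randomized sanity check of the first-order condition for $f(\mathbf{x}) = \|\mathbf{x}\|^2$, where the gap $f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})$ works out to $\|\mathbf{y}-\mathbf{x}\|^2 \geq 0$ (the sample count and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda v: float(v @ v)   # f(x) = ||x||^2, convex
grad = lambda v: 2 * v       # its gradient

# First-order condition: f(y) >= f(x) + grad(x) . (y - x) for all x, y
ok = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-12   # tiny tolerance for float error
    for x, y in ((rng.normal(size=3), rng.normal(size=3)) for _ in range(1000))
)
print(f"Tangent planes underestimate f at every sampled pair: {ok}")  # True
```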

Method 3: Second-order condition (Hessian test)

If $f$ is twice differentiable, $f$ is convex if and only if the Hessian $\nabla^2 f(\mathbf{x})$ is positive semidefinite for all $\mathbf{x}$ in the domain:

$$\nabla^2 f(\mathbf{x}) \succeq 0 \quad \text{for all } \mathbf{x}$$

This is usually the easiest test to apply. Compute the Hessian, then check that all its eigenvalues are non-negative.
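When the Hessian is painful to derive by hand, you can approximate it with central finite differences and inspect the eigenvalues numerically. This is a rough sketch (the helper `numerical_hessian` is my own, and it only checks curvature at the sampled point, not everywhere):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h**2)
    return H

f = lambda v: v[0]**2 + v[1]**2                 # the convex bowl
H = numerical_hessian(f, np.array([0.7, -1.3]))
print(np.round(np.linalg.eigvalsh(H), 3))       # both eigenvalues ~ 2: PSD here
```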

Contour plot of the convex bowl f(x,y) = x^2 + y^2. Every contour is a circle, and there is a single global minimum at the origin.

Example 1: Verify that the L2 loss is convex

The L2 loss for linear regression with parameters $\mathbf{w} \in \mathbb{R}^d$ is:

$$f(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 = (\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$$

Step 1. Expand:

$$f(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{w}^TX^TX\mathbf{w}$$

Step 2. Compute the gradient:

$$\nabla f(\mathbf{w}) = -2X^T\mathbf{y} + 2X^TX\mathbf{w}$$

Step 3. Compute the Hessian:

$$\nabla^2 f(\mathbf{w}) = 2X^TX$$

Step 4. Check positive semidefiniteness. For any vector $\mathbf{v}$:

$$\mathbf{v}^T(2X^TX)\mathbf{v} = 2(X\mathbf{v})^T(X\mathbf{v}) = 2\|X\mathbf{v}\|^2 \geq 0$$

The squared norm is always non-negative, so $2X^TX$ is positive semidefinite. Therefore, the L2 loss is convex. ✓

Numerical check. Let $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$:

$$X^TX = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 10 & 14 \\ 14 & 20 \end{bmatrix}$$

The eigenvalues of $X^TX$ are found from:

$$\det(X^TX - \lambda I) = (10 - \lambda)(20 - \lambda) - 196 = \lambda^2 - 30\lambda + 4 = 0$$

$$\lambda = \frac{30 \pm \sqrt{900 - 16}}{2} = \frac{30 \pm 29.73}{2}$$

$$\lambda_1 \approx 29.87, \quad \lambda_2 \approx 0.13$$

Both eigenvalues are positive, confirming the Hessian $2X^TX$ is positive definite. This means the L2 loss is actually strictly convex when $X$ has full column rank.

import numpy as np

X = np.array([[1, 2], [3, 4]])
H = 2 * X.T @ X                      # Hessian of the L2 loss
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh handles symmetric matrices

print(f"Eigenvalues of Hessian: {np.round(eigenvalues, 2)}")
print(f"All non-negative: {bool(np.all(eigenvalues >= 0))}")
# Eigenvalues of Hessian: [ 0.27 59.73]
# All non-negative: True

Example 2: The log-sum-exp function is convex

The log-sum-exp (LSE) function is defined as:

$$\text{LSE}(\mathbf{x}) = \log\!\left(\sum_{i=1}^{n} e^{x_i}\right)$$

This function appears in softmax classifiers and in the cross-entropy loss.

Step 1. We verify convexity using the definition. For $\theta \in [0,1]$ and vectors $\mathbf{x}, \mathbf{y}$:

$$\text{LSE}(\theta \mathbf{x} + (1-\theta)\mathbf{y}) = \log\!\left(\sum_i e^{\theta x_i + (1-\theta)y_i}\right)$$

$$= \log\!\left(\sum_i (e^{x_i})^\theta (e^{y_i})^{1-\theta}\right)$$

Step 2. Apply the weighted AM-GM inequality (or Hölder’s inequality). For non-negative reals $a_i, b_i$:

$$\sum_i a_i^\theta b_i^{1-\theta} \leq \left(\sum_i a_i\right)^\theta \left(\sum_i b_i\right)^{1-\theta}$$

Setting $a_i = e^{x_i}$ and $b_i = e^{y_i}$:

$$\sum_i (e^{x_i})^\theta(e^{y_i})^{1-\theta} \leq \left(\sum_i e^{x_i}\right)^\theta \left(\sum_i e^{y_i}\right)^{1-\theta}$$

Step 3. Take the log of both sides (log is monotone increasing):

$$\text{LSE}(\theta\mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta \log\!\left(\sum_i e^{x_i}\right) + (1-\theta)\log\!\left(\sum_i e^{y_i}\right)$$

$$= \theta \, \text{LSE}(\mathbf{x}) + (1-\theta)\, \text{LSE}(\mathbf{y})$$

This is exactly the definition of convexity. ✓

Numerical verification. Take $\mathbf{x} = (1, 2)$, $\mathbf{y} = (3, 0)$, $\theta = 0.5$:

The midpoint is $\mathbf{m} = (2, 1)$.

$$\text{LSE}(\mathbf{m}) = \log(e^2 + e^1) = \log(7.389 + 2.718) = \log(10.107) \approx 2.313$$

$$\frac{1}{2}\text{LSE}(\mathbf{x}) + \frac{1}{2}\text{LSE}(\mathbf{y}) = \frac{1}{2}\log(e^1 + e^2) + \frac{1}{2}\log(e^3 + e^0)$$

$$= \frac{1}{2}\log(10.107) + \frac{1}{2}\log(21.086) = \frac{1}{2}(2.313 + 3.049) = 2.681$$

We confirm: $2.313 \leq 2.681$. ✓

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
theta = 0.5
m = theta * x + (1 - theta) * y

lse = lambda v: np.log(np.sum(np.exp(v)))

print(f"LSE(midpoint)       = {lse(m):.3f}")
print(f"Convex combination  = {theta * lse(x) + (1 - theta) * lse(y):.3f}")
# LSE(midpoint)       = 2.313
# Convex combination  = 2.681

Example 3: A non-convex function with multiple minima

The double-well potential is a classic non-convex function:

$$f(x) = (x^2 - 1)^2 = x^4 - 2x^2 + 1$$

Step 1. Find critical points. Set $f'(x) = 0$:

$$f'(x) = 4x^3 - 4x = 4x(x^2 - 1) = 0$$

Critical points: $x = 0$, $x = 1$, $x = -1$.

Step 2. Evaluate $f$ and $f''$ at each:

| $x$ | $f(x)$ | $f''(x) = 12x^2 - 4$ | Classification |
| --- | --- | --- | --- |
| $0$ | $1$ | $-4 < 0$ | Local maximum |
| $1$ | $0$ | $8 > 0$ | Local (and global) minimum |
| $-1$ | $0$ | $8 > 0$ | Local (and global) minimum |

Step 3. Check for convexity. The second derivative $f''(x) = 12x^2 - 4$ is negative when $|x| < \frac{1}{\sqrt{3}} \approx 0.577$. Since $f''$ is not non-negative everywhere, $f$ is not convex.

This is the kind of landscape you see in neural networks: multiple minima separated by regions where the curvature is negative. An optimizer starting near $x = 0$ could get pushed toward either $x = 1$ or $x = -1$ depending on the initial conditions.

import numpy as np

f = lambda x: (x**2 - 1)**2
f_pp = lambda x: 12*x**2 - 4

for x in [-1, 0, 1]:
    print(f"x={x:+d}: f={f(x)}, f''={f_pp(x):+d}")
# x=-1: f=0, f''=+8
# x=+0: f=1, f''=-4
# x=+1: f=0, f''=+8

Why convexity matters in ML

Every local minimum is global

This is the big one. If $f$ is convex and you find a point where the gradient is zero, you have found the global minimum. No need to restart with different initializations or worry about being trapped.

Efficient algorithms exist

For convex problems, gradient descent converges to the global solution with rate guarantees. Newton’s method converges quadratically near the solution. Interior point methods solve constrained convex problems in polynomial time.

Non-convex problems are NP-hard in general

Finding the global minimum of a general non-convex function is computationally intractable. In practice, we settle for local minima and use tricks (random restarts, good initialization, learning rate schedules) to find good ones.
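One of those tricks, random restarts, is easy to illustrate on the double-well function $f(x) = (x^2 - 1)^2$ from Example 3: plain gradient descent lands in whichever basin the starting point belongs to. The step size and iteration count below are illustrative choices, not tuned values.

```python
# Gradient descent on the double-well f(x) = (x^2 - 1)^2 from several starts
grad = lambda x: 4 * x**3 - 4 * x

for x0 in [-1.5, -0.1, 0.1, 1.5]:
    x = x0
    for _ in range(200):
        x -= 0.05 * grad(x)   # fixed step size, purely illustrative
    print(f"start {x0:+.1f} -> converged to {x:+.3f}")
# Starts left of zero reach -1, starts right of zero reach +1
```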

Many ML models are convex

Linear regression, logistic regression, SVMs, and ridge/lasso regression all have convex objectives. This is not an accident. These models were designed so that training is tractable. The recent dominance of deep learning means we now routinely solve non-convex problems, but the theory of convex optimization remains the foundation.

Operations that preserve convexity

If you need to show a complex function is convex, you can often build it from simpler convex pieces using these rules:

| Operation | Result |
| --- | --- |
| $\alpha f$ where $\alpha \geq 0$ | Convex (non-negative scaling) |
| $f + g$ where both convex | Convex (sum) |
| $\max(f, g)$ where both convex | Convex (pointwise max) |
| $g(A\mathbf{x} + \mathbf{b})$ where $g$ convex | Convex (affine composition) |
| $h(f(\mathbf{x}))$ where $h$ convex and increasing, $f$ convex | Convex (composition) |

These let you verify convexity of complicated loss functions by breaking them into simpler components.
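As a sketch, here is a randomized midpoint test for the pointwise-max rule, using two arbitrary convex functions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: (x - 1)**2              # convex
g = lambda x: np.abs(x + 2)           # convex
h = lambda x: np.maximum(f(x), g(x))  # pointwise max of the two

# Midpoint form of the convexity definition, checked on random pairs
xs, ys = rng.uniform(-5, 5, 1000), rng.uniform(-5, 5, 1000)
ok = bool(np.all(h(0.5 * (xs + ys)) <= 0.5 * h(xs) + 0.5 * h(ys) + 1e-12))
print(f"Midpoint inequality holds for max(f, g): {ok}")  # True
```

A passing check does not prove convexity, but a single failing pair would disprove it; the algebraic rules above are what give the guarantee.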

Strictly convex and strongly convex

Two stronger versions of convexity come up often:

Strictly convex: the inequality in the definition is strict: $f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) < \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})$ for $\theta \in (0,1)$ and $\mathbf{x} \neq \mathbf{y}$. This guarantees that a minimum, if one exists, is unique.

Strongly convex with parameter $m > 0$: the Hessian satisfies $\nabla^2 f(\mathbf{x}) \succeq mI$ for all $\mathbf{x}$. This is stronger than strict convexity. It guarantees not just a unique minimum but also bounds on how fast gradient descent converges. Adding L2 regularization to a convex loss always produces a strongly convex objective, which is one reason regularization helps optimization, not just generalization.
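A quick numeric illustration of that last claim (the rank-deficient $X$ below is a made-up example): the Hessian of the unregularized L2 loss can have a zero eigenvalue, and adding $\lambda\|\mathbf{w}\|^2$ contributes $2\lambda I$ to the Hessian, shifting every eigenvalue up by $2\lambda$.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0]])  # rank-deficient design matrix
H = 2 * X.T @ X                         # Hessian of ||y - Xw||^2
lam = 0.1                               # L2 regularization strength

print(np.round(np.linalg.eigvalsh(H), 3))                    # smallest eigenvalue is 0
print(np.round(np.linalg.eigvalsh(H + 2*lam*np.eye(2)), 3))  # smallest is now 2*lam = 0.2
```

The regularized Hessian satisfies $\nabla^2 f \succeq 2\lambda I$, so the objective is strongly convex with $m = 2\lambda$ even though the unregularized loss was not.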

Summary

| Concept | What it means |
| --- | --- |
| Convex set | Line segment between any two points stays inside |
| Convex function | Chord lies on or above the curve (Jensen’s inequality) |
| Hessian test | $\nabla^2 f \succeq 0$ everywhere implies convexity |
| Strictly convex | Unique global minimum (when one exists) |
| Strongly convex | Fast convergence guarantees |

Convexity is the property that separates tractable optimization from intractable optimization. When you have it, exploit it. When you do not, understand what you lose and use algorithms designed for non-convex landscapes.

What comes next

With convexity in hand, we can now state precisely when a point is optimal. The next article covers first-order optimality conditions, where we use the gradient to identify candidate solutions.
