
What is optimization and why ML needs it

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Every machine learning model learns by solving an optimization problem. You define a function that measures how wrong the model is, then you find the parameters that make that function as small as possible. That function is the objective function, and the process of minimizing it is optimization.

Why optimization is the engine of ML

Every time a model trains, it solves an optimization problem. The model has knobs (parameters). A scoring function measures how wrong the current settings are. Training turns those knobs until the score is as low as possible.

| ML Task | What you tune | What you minimize (or maximize) |
|---|---|---|
| Linear regression | Weights | Mean squared error |
| Logistic regression | Weights | Cross-entropy loss |
| SVM | Weights, bias | Hinge loss + regularization (maximize margin) |
| Neural network | All layer weights | Task loss (MSE, cross-entropy, etc.) |
| K-means | Cluster centers | Sum of squared distances |
| PCA | Projection axes | Reconstruction error (maximize variance) |

The pattern repeats everywhere. Define what “good” means with a function, then find the parameters that optimize it.

The optimization loop

graph LR
  A["Initialize parameters"] --> B["Evaluate objective function"]
  B --> C["Good enough?"]
  C -- No --> D["Adjust parameters"]
  D --> B
  C -- Yes --> E["Return best parameters"]

Every optimization algorithm follows this loop. They differ only in how they adjust parameters at each step. Gradient descent uses first derivatives. Newton’s method uses second derivatives. Stochastic methods add randomness. But the loop is always the same.
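
As a concrete sketch of this loop, here is gradient descent on a toy one-parameter objective. The function, step size, and stopping tolerance are illustrative choices, not from the article:

```python
# Generic optimization loop, with gradient descent as the "adjust" step.
f = lambda w: (w - 3.0) ** 2        # toy objective, minimized at w = 3
grad = lambda w: 2.0 * (w - 3.0)    # its derivative

w = 0.0                             # initialize parameters
for _ in range(100):                # iterate
    if abs(grad(w)) < 1e-8:         # good enough? (gradient nearly zero)
        break
    w -= 0.1 * grad(w)              # adjust parameters (step downhill)
print(round(w, 4))                  # converges to 3.0
```

Swapping the adjustment line is all it takes to turn this into Newton's method or a stochastic variant; the surrounding loop never changes.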

Optimization is finding the inputs that make a function as small (or as large) as possible. That single sentence captures the entire field. The rest of this article builds the vocabulary you need to make that statement precise.

Now let’s formalize these ideas.

Prerequisites

This article assumes you are comfortable with partial derivatives and gradients. If you can take the derivative of a multivariable function and interpret the gradient vector, you are ready.

Objective functions

An objective function (also called a loss function or cost function) maps model parameters to a single number that tells you how well the model is doing. Lower is better when we minimize; higher is better when we maximize.

Formally, we write:

$$\min_{\mathbf{w}} \; f(\mathbf{w})$$

where $\mathbf{w}$ is the vector of parameters and $f$ is the objective. The output of $f$ is a scalar. The entire point of training a model is to find the $\mathbf{w}^*$ that gives the smallest value of $f$.

Some common objective functions in ML:

| Model | Objective function |
|---|---|
| Linear regression | Mean squared error: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$ |
| Logistic regression | Cross-entropy loss: $-\frac{1}{n}\sum_{i=1}^{n}[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$ |
| SVM | Hinge loss + regularization |
| Neural network | Task-dependent (MSE, cross-entropy, etc.) |

The choice of objective function determines what “best” means for your model.
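
The first two losses in the table can be written as short functions. A minimal sketch with made-up toy values (the data here is purely illustrative):

```python
import numpy as np

# Toy targets and predictions; the numbers are made up for illustration.
y_true = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])

def mse(y, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y - y_pred) ** 2)

def cross_entropy(y, y_pred):
    """Binary cross-entropy for predicted probabilities in (0, 1)."""
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(f"MSE: {mse(y_true, y_hat):.4f}")
print(f"Cross-entropy: {cross_entropy(y_true, y_hat):.4f}")
```

Both map a full vector of predictions to a single scalar, which is exactly what makes them usable as objectives.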

Minimization vs maximization

We almost always frame ML problems as minimization. If you need to maximize some function $g(\mathbf{w})$, you can minimize $-g(\mathbf{w})$ instead. Every maximization problem has an equivalent minimization form, so there is no loss of generality.

$$\max_{\mathbf{w}} \; g(\mathbf{w}) \;\; \Longleftrightarrow \;\; \min_{\mathbf{w}} \; [-g(\mathbf{w})]$$

Flipping between minimization and maximization

graph TD
  A["Maximize g(w)"] -- "Negate: form -g(w)" --> B["Minimize -g(w)"]
  B -- "Negate back" --> A
  C["Same optimal w*"] -.-> A
  C -.-> B
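
A quick numerical check of this equivalence, using an arbitrary concave $g$ chosen for illustration:

```python
import numpy as np

# The maximizer of g is the same point as the minimizer of -g.
g = lambda w: -(w - 2.0) ** 2 + 5.0   # illustrative function, peak at w = 2

ws = np.linspace(-5, 5, 10001)        # dense grid search
w_max = ws[np.argmax(g(ws))]          # maximize g
w_min = ws[np.argmin(-g(ws))]         # minimize -g
print(w_max, w_min)                   # same optimal w, approximately 2.0
```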

Local vs global minima

A global minimum is the lowest value of $f$ over the entire domain. A local minimum is a point where $f$ is lower than at all nearby points, but not necessarily the lowest overall.

graph TD
  A["Loss Landscape"] --> B["Local min: lowest nearby"]
  A --> C["Local min: lowest nearby"]
  A --> D["Global min: lowest everywhere"]
  B --> E["Optimizer may stop here"]
  C --> E
  D --> F["This is the true best"]

Loss landscape cross-section showing local and global minima of f(x) = x^4 - 3x^2 + x

More precisely:

  • $\mathbf{w}^*$ is a global minimum if $f(\mathbf{w}^*) \leq f(\mathbf{w})$ for all $\mathbf{w}$ in the domain.
  • $\mathbf{w}^*$ is a local minimum if there exists some neighborhood around $\mathbf{w}^*$ where $f(\mathbf{w}^*) \leq f(\mathbf{w})$ for all $\mathbf{w}$ in that neighborhood.

Every global minimum is also a local minimum, but not the other way around.

Example 1: A 1D function with local and global minima

Consider:

$$f(x) = x^4 - 8x^2 + 5$$

Step 1. Find critical points by setting $f'(x) = 0$:

$$f'(x) = 4x^3 - 16x = 4x(x^2 - 4) = 0$$

This gives $x = 0$, $x = 2$, $x = -2$.

Step 2. Evaluate $f$ at each critical point:

$$f(0) = 0 - 0 + 5 = 5$$

$$f(2) = 16 - 32 + 5 = -11$$

$$f(-2) = 16 - 32 + 5 = -11$$

Step 3. Check the second derivative to classify:

$$f''(x) = 12x^2 - 16$$

$$f''(0) = -16 < 0 \quad \Rightarrow \quad \text{local maximum}$$

$$f''(2) = 48 - 16 = 32 > 0 \quad \Rightarrow \quad \text{local minimum}$$

$$f''(-2) = 48 - 16 = 32 > 0 \quad \Rightarrow \quad \text{local minimum}$$

Both $x = 2$ and $x = -2$ are local minima with the same value of $-11$. Since $f(x) \to +\infty$ as $x \to \pm\infty$, they are also global minima. The point $x = 0$ is a local maximum with $f(0) = 5$.

import numpy as np

f = lambda x: x**4 - 8*x**2 + 5
for x in [-2, 0, 2]:
    print(f"f({x}) = {f(x)}")
# f(-2) = -11
# f(0) = 5
# f(2) = -11
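
Running plain gradient descent on the same function shows why local minima matter in practice: which minimum you reach depends on where you start. The step size and starting points below are illustrative choices:

```python
# Gradient descent on f(x) = x^4 - 8x^2 + 5 from different starting points.
fp = lambda x: 4 * x**3 - 16 * x     # f'(x)

for x0 in [-3.0, 0.5, 3.0]:
    x = x0
    for _ in range(1000):
        x -= 0.01 * fp(x)            # small fixed step downhill
    print(f"start {x0:+.1f} -> converges to x = {x:+.4f}")
# Negative starts slide into x = -2; positive starts slide into x = +2.
```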

Why ML is an optimization problem

Example 2: Linear regression as optimization

Suppose you have three data points: $(1, 2)$, $(2, 3.5)$, $(3, 6)$. You want to fit a line $\hat{y} = w_0 + w_1 x$ by minimizing the mean squared error:

$$f(w_0, w_1) = \frac{1}{3}\sum_{i=1}^{3}(y_i - w_0 - w_1 x_i)^2$$

Step 1. Write out each squared error:

$$(2 - w_0 - w_1)^2 + (3.5 - w_0 - 2w_1)^2 + (6 - w_0 - 3w_1)^2$$

Step 2. Take partial derivatives and set them to zero. The gradient is:

$$\frac{\partial f}{\partial w_0} = \frac{2}{3}[(-1)(2 - w_0 - w_1) + (-1)(3.5 - w_0 - 2w_1) + (-1)(6 - w_0 - 3w_1)] = 0$$

$$\frac{\partial f}{\partial w_1} = \frac{2}{3}[(-1)(2 - w_0 - w_1) + (-2)(3.5 - w_0 - 2w_1) + (-3)(6 - w_0 - 3w_1)] = 0$$

From the first equation:

$$3w_0 + 6w_1 = 11.5$$

From the second equation:

$$6w_0 + 14w_1 = 27$$

Step 3. Solve the system. From the first equation, $w_0 = (11.5 - 6w_1)/3$. Substituting:

$$6 \cdot \frac{11.5 - 6w_1}{3} + 14w_1 = 27$$

$$23 - 12w_1 + 14w_1 = 27$$

$$2w_1 = 4 \quad \Rightarrow \quad w_1 = 2$$

$$w_0 = \frac{11.5 - 12}{3} = \frac{-0.5}{3} \approx -0.167$$

The best-fit line is $\hat{y} \approx -0.167 + 2x$. This is the global minimum of the MSE because squared error is a convex function, which means any local minimum is also the global minimum.

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 3.5, 6])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")
# w0 = -0.167, w1 = 2.000

Why closed-form solutions rarely exist

Linear regression is special. You can write the gradient, set it to zero, and solve the resulting system of equations directly. This gives you the normal equations, which have a closed-form solution.
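
For Example 2, the normal equations can be solved directly. A sketch using the same data as above:

```python
import numpy as np

# Normal equations for Example 2: solve (X^T X) w = X^T y.
X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)  # ones column + x values
y = np.array([2.0, 3.5, 6.0])

w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")
# w0 = -0.167, w1 = 2.000
```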

Most ML models are not this cooperative.

Example 3: Why you cannot “solve” a neural network

Consider a tiny neural network with one hidden layer, one hidden unit, and sigmoid activation. For a single data point $(x, y)$ with parameters $w_1, w_2, b_1, b_2$:

$$\hat{y} = w_2 \cdot \sigma(w_1 x + b_1) + b_2$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$.

The loss for one data point is:

$$f = (y - w_2 \cdot \sigma(w_1 x + b_1) - b_2)^2$$

Step 1. Take the partial derivative with respect to $w_1$ using the chain rule:

$$\frac{\partial f}{\partial w_1} = -2(y - \hat{y}) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \cdot x$$

where $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Step 2. Set this to zero:

$$-2(y - \hat{y}) \cdot w_2 \cdot \sigma(w_1 x + b_1)(1 - \sigma(w_1 x + b_1)) \cdot x = 0$$

Step 3. Notice the problem. The sigmoid function creates a nonlinear, transcendental equation. You cannot isolate $w_1$ algebraically. The parameters are tangled inside nested nonlinear functions.

With real networks containing millions of parameters and thousands of data points, there is zero hope of finding a closed-form solution. This is why we need iterative optimization algorithms like gradient descent: start with a guess, compute the gradient, take a step downhill, repeat.
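
To make the iterative alternative concrete, here is a sketch of gradient descent on the tiny network from Example 3, fitting a single data point. The data point, initial values, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

# Gradient descent on y_hat = w2 * sigmoid(w1*x + b1) + b2 for one data point.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.8                          # one toy data point
w1, w2, b1, b2 = 0.5, 0.5, 0.0, 0.0      # initial guess
lr = 0.1                                 # learning rate

for _ in range(2000):
    h = sigmoid(w1 * x + b1)             # hidden activation
    y_hat = w2 * h + b2                  # prediction
    err = -2.0 * (y - y_hat)             # d(loss)/d(y_hat)
    g_w1 = err * w2 * h * (1 - h) * x    # chain rule, as in Step 1
    g_b1 = err * w2 * h * (1 - h)
    g_w2, g_b2 = err * h, err
    w1 -= lr * g_w1; b1 -= lr * g_b1     # step downhill
    w2 -= lr * g_w2; b2 -= lr * g_b2

print(round(w2 * sigmoid(w1 * x + b1) + b2, 4))  # prediction converges to 0.8
```

No algebra ever isolates $w_1$; the loop simply follows the gradient until the loss stops shrinking.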

The landscape of ML optimization

The loss surface of a neural network is typically:

  • Non-convex: many local minima and saddle points
  • High-dimensional: thousands to billions of parameters
  • Noisy: we estimate the gradient from mini-batches of data

This makes optimization hard but not hopeless. Modern techniques like SGD with momentum, Adam, and RMSProp work remarkably well in practice, even without guarantees of finding the global minimum.

The key insight is that for neural networks, most local minima are “good enough.” They give similar loss values and generalize well. The real danger is saddle points, where the gradient is zero but the point is neither a minimum nor a maximum.
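
The classic saddle example is $f(x, y) = x^2 - y^2$: the gradient vanishes at the origin, yet the origin is neither a minimum nor a maximum. A quick check:

```python
# Saddle point: gradient is zero at (0, 0), but f rises along x, falls along y.
f = lambda x, y: x**2 - y**2
grad = lambda x, y: (2 * x, -2 * y)

print(grad(0.0, 0.0))                 # both components are zero: critical point
print(f(0.1, 0.0) > 0 > f(0.0, 0.1))  # True: uphill in x, downhill in y
```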

Constrained vs unconstrained optimization

So far we have talked about unconstrained optimization, where w\mathbf{w} can take any value. Many real problems have constraints:

  • Equality constraints: $h(\mathbf{w}) = 0$ (e.g., probabilities must sum to 1)
  • Inequality constraints: $g(\mathbf{w}) \leq 0$ (e.g., parameters must be non-negative)

The general constrained optimization problem looks like:

$$\min_{\mathbf{w}} \; f(\mathbf{w}) \quad \text{subject to} \quad h_i(\mathbf{w}) = 0, \;\; g_j(\mathbf{w}) \leq 0$$

Constrained problems require different tools. The Lagrangian and KKT conditions handle these cases. We will cover them later in the series.
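
As a tiny preview of how constraints change the problem, here is an illustrative equality-constrained problem of my own choosing, solved by substitution: minimize $x^2 + y^2$ subject to $x + y = 1$.

```python
import numpy as np

# Substituting y = 1 - x turns the constrained problem into a 1D search.
obj = lambda x: x**2 + (1 - x) ** 2

xs = np.linspace(-2.0, 3.0, 50001)   # dense grid over a plausible range
x_star = xs[np.argmin(obj(xs))]
print(x_star, 1 - x_star)            # optimum at x = y = 0.5
```

Substitution works here only because the constraint is simple; the Lagrangian machinery covered later handles the general case.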

Summary

| Concept | Key point |
|---|---|
| Objective function | Maps parameters to a scalar measuring model quality |
| Local minimum | Best in a neighborhood, not necessarily overall |
| Global minimum | Best everywhere |
| Closed-form solutions | Only work for simple models like linear regression |
| Iterative methods | Required for neural networks and most modern models |

Optimization is the engine of machine learning. You define what “good” means through an objective function, then use algorithms to find the parameters that achieve it. The rest of this series builds up the tools you need to understand and use these algorithms.

What comes next

Now that you know what optimization is and why it matters, the next step is understanding convex sets and convex functions. Convexity is the single most important property that makes an optimization problem easy or hard.
