
What is optimization and why ML needs it

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Every machine learning model learns by solving an optimization problem. You define a function that measures how wrong the model is, then you find the parameters that make that function as small as possible. That function is the objective function, and the process of minimizing it is optimization.

Why optimization is the engine of ML

Every time a model trains, it solves an optimization problem. The model has knobs (parameters). A scoring function measures how wrong the current settings are. Training turns those knobs until the score is as low as possible.

| ML Task | What you tune | What you minimize (or maximize) |
|---|---|---|
| Linear regression | Weights | Mean squared error |
| Logistic regression | Weights | Cross-entropy loss |
| SVM | Weights, bias | Hinge loss + regularization (maximize margin) |
| Neural network | All layer weights | Task loss (MSE, cross-entropy, etc.) |
| K-means | Cluster centers | Sum of squared distances |
| PCA | Projection axes | Reconstruction error (maximize variance) |

The pattern repeats everywhere. Define what “good” means with a function, then find the parameters that optimize it.

The optimization loop

graph LR
  A["Initialize parameters"] --> B["Evaluate objective function"]
  B --> C["Good enough?"]
  C -- No --> D["Adjust parameters"]
  D --> B
  C -- Yes --> E["Return best parameters"]

Every optimization algorithm follows this loop. They differ only in how they adjust parameters at each step. Gradient descent uses first derivatives. Newton’s method uses second derivatives. Stochastic methods add randomness. But the loop is always the same.
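
As a concrete sketch of this loop, here is gradient descent on a toy one-parameter objective. The function, step size, and stopping tolerance are illustrative choices, not from the article:

```python
# Generic optimization loop, with gradient descent as the "adjust" step.
f = lambda w: (w - 3.0) ** 2        # toy objective, minimized at w = 3
grad = lambda w: 2.0 * (w - 3.0)    # its derivative

w = 0.0                             # initialize parameters
for _ in range(100):                # iterate
    if abs(grad(w)) < 1e-8:         # good enough? (gradient nearly zero)
        break
    w -= 0.1 * grad(w)              # adjust parameters (step downhill)
print(round(w, 4))                  # converges to 3.0
```

Swapping the adjustment line is all it takes to turn this into Newton's method or a stochastic variant; the surrounding loop never changes.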

Optimization is finding the inputs that make a function as small (or as large) as possible. That single sentence captures the entire field. The rest of this article builds the vocabulary you need to make that statement precise.

Now let’s formalize these ideas.

Prerequisites

This article assumes you are comfortable with partial derivatives and gradients. If you can take the derivative of a multivariable function and interpret the gradient vector, you are ready.

Objective functions

An objective function (also called a loss function or cost function) maps model parameters to a single number that tells you how well the model is doing. Lower is better when we minimize; higher is better when we maximize.

Formally, we write:

$$\min_{\mathbf{w}} \; f(\mathbf{w})$$

where $\mathbf{w}$ is the vector of parameters and $f$ is the objective. The output of $f$ is a scalar. The entire point of training a model is to find the $\mathbf{w}^*$ that gives the smallest value of $f$.

Some common objective functions in ML:

| Model | Objective function |
|---|---|
| Linear regression | Mean squared error: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$ |
| Logistic regression | Cross-entropy loss: $-\frac{1}{n}\sum_{i=1}^{n}[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$ |
| SVM | Hinge loss + regularization |
| Neural network | Task-dependent (MSE, cross-entropy, etc.) |

The choice of objective function determines what “best” means for your model.
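
The first two losses in the table can be written as short functions. A minimal sketch with made-up toy values (the data here is purely illustrative):

```python
import numpy as np

# Toy targets and predictions; the numbers are made up for illustration.
y_true = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])

def mse(y, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y - y_pred) ** 2)

def cross_entropy(y, y_pred):
    """Binary cross-entropy for predicted probabilities in (0, 1)."""
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(f"MSE: {mse(y_true, y_hat):.4f}")
print(f"Cross-entropy: {cross_entropy(y_true, y_hat):.4f}")
```

Both map a full vector of predictions to a single scalar, which is exactly what makes them usable as objectives.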

Minimization vs maximization

We almost always frame ML problems as minimization. If you need to maximize some function $g(\mathbf{w})$, you can minimize $-g(\mathbf{w})$ instead. Every maximization problem has an equivalent minimization form, so there is no loss of generality.

$$\max_{\mathbf{w}} \; g(\mathbf{w}) \;\; \Longleftrightarrow \;\; \min_{\mathbf{w}} \; [-g(\mathbf{w})]$$

Flipping between minimization and maximization

graph TD
  A["Maximize g(w)"] -- "Negate: form -g(w)" --> B["Minimize -g(w)"]
  B -- "Negate back" --> A
  C["Same optimal w*"] -.-> A
  C -.-> B
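
A quick numerical check of this equivalence, using an arbitrary concave $g$ chosen for illustration:

```python
import numpy as np

# The maximizer of g is the same point as the minimizer of -g.
g = lambda w: -(w - 2.0) ** 2 + 5.0   # illustrative function, peak at w = 2

ws = np.linspace(-5, 5, 10001)        # dense grid search
w_max = ws[np.argmax(g(ws))]          # maximize g
w_min = ws[np.argmin(-g(ws))]         # minimize -g
print(w_max, w_min)                   # same optimal w, approximately 2.0
```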

Local vs global minima

A global minimum is the lowest value of $f$ over the entire domain. A local minimum is a point where $f$ is lower than at all nearby points, but not necessarily the lowest overall.

graph TD
  A["Loss Landscape"] --> B["Local min: lowest nearby"]
  A --> C["Local min: lowest nearby"]
  A --> D["Global min: lowest everywhere"]
  B --> E["Optimizer may stop here"]
  C --> E
  D --> F["This is the true best"]

Loss landscape cross-section showing local and global minima of f(x) = x^4 - 3x^2 + x

More precisely:

  • $\mathbf{w}^*$ is a global minimum if $f(\mathbf{w}^*) \leq f(\mathbf{w})$ for all $\mathbf{w}$ in the domain.
  • $\mathbf{w}^*$ is a local minimum if there exists some neighborhood around $\mathbf{w}^*$ where $f(\mathbf{w}^*) \leq f(\mathbf{w})$ for all $\mathbf{w}$ in that neighborhood.

Every global minimum is also a local minimum, but not the other way around.

Example 1: A 1D function with local and global minima

Consider:

$$f(x) = x^4 - 8x^2 + 5$$

Step 1. Find critical points by setting $f'(x) = 0$:

$$f'(x) = 4x^3 - 16x = 4x(x^2 - 4) = 0$$

This gives $x = 0$, $x = 2$, $x = -2$.

Step 2. Evaluate $f$ at each critical point:

$$f(0) = 0 - 0 + 5 = 5$$

$$f(2) = 16 - 32 + 5 = -11$$

$$f(-2) = 16 - 32 + 5 = -11$$

Step 3. Check the second derivative to classify:

$$f''(x) = 12x^2 - 16$$

$$f''(0) = -16 < 0 \quad \Rightarrow \quad \text{local maximum}$$

$$f''(2) = 48 - 16 = 32 > 0 \quad \Rightarrow \quad \text{local minimum}$$

$$f''(-2) = 48 - 16 = 32 > 0 \quad \Rightarrow \quad \text{local minimum}$$

Both $x = 2$ and $x = -2$ are local minima with the same value of $-11$. Since $f(x) \to +\infty$ as $x \to \pm\infty$, they are also global minima. The point $x = 0$ is a local maximum with $f(0) = 5$.

import numpy as np

f = lambda x: x**4 - 8*x**2 + 5
for x in [-2, 0, 2]:
    print(f"f({x}) = {f(x)}")
# f(-2) = -11
# f(0) = 5
# f(2) = -11
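
Running plain gradient descent on the same function shows why local minima matter in practice: which minimum you reach depends on where you start. The step size and starting points below are illustrative choices:

```python
# Gradient descent on f(x) = x^4 - 8x^2 + 5 from different starting points.
fp = lambda x: 4 * x**3 - 16 * x     # f'(x)

for x0 in [-3.0, 0.5, 3.0]:
    x = x0
    for _ in range(1000):
        x -= 0.01 * fp(x)            # small fixed step downhill
    print(f"start {x0:+.1f} -> converges to x = {x:+.4f}")
# Negative starts slide into x = -2; positive starts slide into x = +2.
```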

Why ML is an optimization problem

Example 2: Linear regression as optimization

Suppose you have three data points: $(1, 2)$, $(2, 3.5)$, $(3, 6)$. You want to fit a line $\hat{y} = w_0 + w_1 x$ by minimizing the mean squared error:

$$f(w_0, w_1) = \frac{1}{3}\sum_{i=1}^{3}(y_i - w_0 - w_1 x_i)^2$$

Step 1. Write out each squared error:

$$(2 - w_0 - w_1)^2 + (3.5 - w_0 - 2w_1)^2 + (6 - w_0 - 3w_1)^2$$

Step 2. Take partial derivatives and set them to zero. The gradient is:

$$\frac{\partial f}{\partial w_0} = \frac{2}{3}[(-1)(2 - w_0 - w_1) + (-1)(3.5 - w_0 - 2w_1) + (-1)(6 - w_0 - 3w_1)] = 0$$

$$\frac{\partial f}{\partial w_1} = \frac{2}{3}[(-1)(2 - w_0 - w_1) + (-2)(3.5 - w_0 - 2w_1) + (-3)(6 - w_0 - 3w_1)] = 0$$

From the first equation:

$$3w_0 + 6w_1 = 11.5$$

From the second equation:

$$6w_0 + 14w_1 = 27$$

Step 3. Solve the system. From the first equation, $w_0 = (11.5 - 6w_1)/3$. Substituting:

$$6 \cdot \frac{11.5 - 6w_1}{3} + 14w_1 = 27$$

$$23 - 12w_1 + 14w_1 = 27$$

$$2w_1 = 4 \quad \Rightarrow \quad w_1 = 2$$

$$w_0 = \frac{11.5 - 12}{3} = \frac{-0.5}{3} \approx -0.167$$

The best-fit line is $\hat{y} \approx -0.167 + 2x$. This is the global minimum of the MSE because squared error is a convex function, which means any local minimum is also the global minimum.

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 3.5, 6])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")
# w0 = -0.167, w1 = 2.000

Why closed-form solutions rarely exist

Linear regression is special. You can write the gradient, set it to zero, and solve the resulting system of equations directly. This gives you the normal equations, which have a closed-form solution.
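
For Example 2, the normal equations can be solved directly. A sketch using the same data as above:

```python
import numpy as np

# Normal equations for Example 2: solve (X^T X) w = X^T y.
X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)  # ones column + x values
y = np.array([2.0, 3.5, 6.0])

w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")
# w0 = -0.167, w1 = 2.000
```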

Most ML models are not this cooperative.

Example 3: Why you cannot “solve” a neural network

Consider a tiny neural network with one hidden layer, one hidden unit, and sigmoid activation. For a single data point $(x, y)$ with parameters $w_1, w_2, b_1, b_2$:

$$\hat{y} = w_2 \cdot \sigma(w_1 x + b_1) + b_2$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$.

The loss for one data point is:

$$f = (y - w_2 \cdot \sigma(w_1 x + b_1) - b_2)^2$$

Step 1. Take the partial derivative with respect to $w_1$ using the chain rule:

$$\frac{\partial f}{\partial w_1} = -2(y - \hat{y}) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \cdot x$$

where $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Step 2. Set this to zero:

$$-2(y - \hat{y}) \cdot w_2 \cdot \sigma(w_1 x + b_1)(1 - \sigma(w_1 x + b_1)) \cdot x = 0$$

Step 3. Notice the problem. The sigmoid function creates a nonlinear, transcendental equation. You cannot isolate $w_1$ algebraically. The parameters are tangled inside nested nonlinear functions.

With real networks containing millions of parameters and thousands of data points, there is zero hope of finding a closed-form solution. This is why we need iterative optimization algorithms like gradient descent: start with a guess, compute the gradient, take a step downhill, repeat.
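
To make the iterative alternative concrete, here is a sketch of gradient descent on the tiny network from Example 3, fitting a single data point. The data point, initial values, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

# Gradient descent on y_hat = w2 * sigmoid(w1*x + b1) + b2 for one data point.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.8                          # one toy data point
w1, w2, b1, b2 = 0.5, 0.5, 0.0, 0.0      # initial guess
lr = 0.1                                 # learning rate

for _ in range(2000):
    h = sigmoid(w1 * x + b1)             # hidden activation
    y_hat = w2 * h + b2                  # prediction
    err = -2.0 * (y - y_hat)             # d(loss)/d(y_hat)
    g_w1 = err * w2 * h * (1 - h) * x    # chain rule, as in Step 1
    g_b1 = err * w2 * h * (1 - h)
    g_w2, g_b2 = err * h, err
    w1 -= lr * g_w1; b1 -= lr * g_b1     # step downhill
    w2 -= lr * g_w2; b2 -= lr * g_b2

print(round(w2 * sigmoid(w1 * x + b1) + b2, 4))  # prediction converges to 0.8
```

No algebra ever isolates $w_1$; the loop simply follows the gradient until the loss stops shrinking.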

The landscape of ML optimization

The loss surface of a neural network is typically:

  • Non-convex: many local minima and saddle points
  • High-dimensional: thousands to billions of parameters
  • Noisy: we estimate the gradient from mini-batches of data

This makes optimization hard but not hopeless. Modern techniques like SGD with momentum, Adam, and RMSProp work remarkably well in practice, even without guarantees of finding the global minimum.

The key insight is that for neural networks, most local minima are “good enough.” They give similar loss values and generalize well. The real danger is saddle points, where the gradient is zero but the point is neither a minimum nor a maximum.
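
The classic saddle example is $f(x, y) = x^2 - y^2$: the gradient vanishes at the origin, yet the origin is neither a minimum nor a maximum. A quick check:

```python
# Saddle point: gradient is zero at (0, 0), but f rises along x, falls along y.
f = lambda x, y: x**2 - y**2
grad = lambda x, y: (2 * x, -2 * y)

print(grad(0.0, 0.0))                 # both components are zero: critical point
print(f(0.1, 0.0) > 0 > f(0.0, 0.1))  # True: uphill in x, downhill in y
```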

Constrained vs unconstrained optimization

So far we have talked about unconstrained optimization, where w\mathbf{w} can take any value. Many real problems have constraints:

  • Equality constraints: $h(\mathbf{w}) = 0$ (e.g., probabilities must sum to 1)
  • Inequality constraints: $g(\mathbf{w}) \leq 0$ (e.g., parameters must be non-negative)

The general constrained optimization problem looks like:

$$\min_{\mathbf{w}} \; f(\mathbf{w}) \quad \text{subject to} \quad h_i(\mathbf{w}) = 0, \;\; g_j(\mathbf{w}) \leq 0$$

Constrained problems require different tools. The Lagrangian and KKT conditions handle these cases. We will cover them later in the series.
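
As a tiny preview of how constraints change the problem, here is an illustrative equality-constrained problem of my own choosing, solved by substitution: minimize $x^2 + y^2$ subject to $x + y = 1$.

```python
import numpy as np

# Substituting y = 1 - x turns the constrained problem into a 1D search.
obj = lambda x: x**2 + (1 - x) ** 2

xs = np.linspace(-2.0, 3.0, 50001)   # dense grid over a plausible range
x_star = xs[np.argmin(obj(xs))]
print(x_star, 1 - x_star)            # optimum at x = y = 0.5
```

Substitution works here only because the constraint is simple; the Lagrangian machinery covered later handles the general case.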

Summary

| Concept | Key point |
|---|---|
| Objective function | Maps parameters to a scalar measuring model quality |
| Local minimum | Best in a neighborhood, not necessarily overall |
| Global minimum | Best everywhere |
| Closed-form solutions | Only work for simple models like linear regression |
| Iterative methods | Required for neural networks and most modern models |

Optimization is the engine of machine learning. You define what “good” means through an objective function, then use algorithms to find the parameters that achieve it. The rest of this series builds up the tools you need to understand and use these algorithms.

What comes next

Now that you know what optimization is and why it matters, the next step is understanding convex sets and convex functions. Convexity is the single most important property that makes an optimization problem easy or hard.
