What is optimization and why ML needs it
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Every machine learning model learns by solving an optimization problem. You define a function that measures how wrong the model is, then you find the parameters that make that function as small as possible. That function is the objective function, and the process of minimizing it is optimization.
Why optimization is the engine of ML
Every time a model trains, it solves an optimization problem. The model has knobs (parameters). A scoring function measures how wrong the current settings are. Training turns those knobs until the score is as low as possible.
| ML Task | What you tune | What you minimize (or maximize) |
|---|---|---|
| Linear regression | Weights | Mean squared error |
| Logistic regression | Weights | Cross-entropy loss |
| SVM | Weights, bias | Hinge loss + regularization (maximize margin) |
| Neural network | All layer weights | Task loss (MSE, cross-entropy, etc.) |
| K-means | Cluster centers | Sum of squared distances |
| PCA | Projection axes | Reconstruction error (maximize variance) |
The pattern repeats everywhere. Define what “good” means with a function, then find the parameters that optimize it.
The optimization loop
```mermaid
graph LR
    A["Initialize parameters"] --> B["Evaluate objective function"]
    B --> C["Good enough?"]
    C -- No --> D["Adjust parameters"]
    D --> B
    C -- Yes --> E["Return best parameters"]
```
Every optimization algorithm follows this loop. They differ only in how they adjust parameters at each step. Gradient descent uses first derivatives. Newton’s method uses second derivatives. Stochastic methods add randomness. But the loop is always the same.
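The loop above can be sketched in a few lines of code. This is a generic gradient-descent instance of it, applied to a simple quadratic; the learning rate, tolerance, and iteration cap are illustrative choices, not values from the text.

```python
import numpy as np

def optimize(grad, w0, lr=0.1, tol=1e-8, max_iter=1000):
    """Generic optimization loop: adjust parameters until good enough."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)                    # evaluate how to adjust
        if np.linalg.norm(g) < tol:    # good enough?
            break
        w = w - lr * g                 # adjust parameters (downhill step)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = optimize(lambda w: 2 * (w - 3), w0=[0.0])
print(w_star)  # close to [3.]
```

Swapping out the `w - lr * g` line is, in essence, what distinguishes the algorithms in the rest of this series.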
Optimization is finding the inputs that make a function as small (or as large) as possible. That single sentence captures the entire field. The rest of this article builds the vocabulary you need to make that statement precise.
Now let’s formalize these ideas.
Prerequisites
This article assumes you are comfortable with partial derivatives and gradients. If you can take the derivative of a multivariable function and interpret the gradient vector, you are ready.
Objective functions
An objective function (also called a loss function or cost function) maps model parameters to a single number that tells you how well the model is doing. Lower is better when we minimize; higher is better when we maximize.
Formally, we write:

$$f: \mathbb{R}^n \to \mathbb{R}, \qquad w \mapsto f(w)$$

where $w \in \mathbb{R}^n$ is the vector of parameters and $f$ is the objective. The output of $f$ is a scalar. The entire point of training a model is to find the $w$ that gives the smallest value of $f(w)$.
Some common objective functions in ML:
| Model | Objective function |
|---|---|
| Linear regression | Mean squared error: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Logistic regression | Cross-entropy loss: $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$ |
| SVM | Hinge loss + regularization |
| Neural network | Task-dependent (MSE, cross-entropy, etc.) |
The choice of objective function determines what “best” means for your model.
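As a concrete illustration, here is mean squared error written explicitly as a map from a parameter vector to a single scalar. The data values below are made up for the example:

```python
import numpy as np

# Toy data: y roughly follows 2x (illustrative values, not from the article)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

def mse(w):
    """Objective function: maps parameters w = (w0, w1) to one scalar."""
    y_hat = w[0] + w[1] * x          # model predictions for a line
    return np.mean((y_hat - y) ** 2)  # single number: lower is better

print(mse(np.array([0.0, 2.0])))  # one scalar measuring how wrong the line is
```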
Minimization vs maximization
We almost always frame ML problems as minimization. If you need to maximize some function $g(w)$, you can minimize $-g(w)$ instead. Every maximization problem has an equivalent minimization form, so there is no loss of generality.
Flipping between minimization and maximization
```mermaid
graph TD
    A["Maximize g(w)"] -- "Negate: form -g(w)" --> B["Minimize -g(w)"]
    B -- "Negate back" --> A
    C["Same optimal w*"] -.-> A
    C -.-> B
```
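A quick numeric check of the flip, using a toy concave function (the specific function here is illustrative): the $w$ that maximizes $g$ is exactly the $w$ that minimizes $-g$.

```python
import numpy as np

g = lambda w: -(w - 1.0) ** 2 + 4.0   # concave parabola, maximized at w = 1

ws = np.linspace(-3, 3, 601)           # grid search over candidate w values
w_max = ws[np.argmax(g(ws))]           # maximize g directly
w_min = ws[np.argmin(-g(ws))]          # minimize the negated function

print(w_max, w_min)  # both give the same w, near 1.0
```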
Local vs global minima
A global minimum is the lowest value of $f$ over the entire domain. A local minimum is a point where $f$ is no higher than at all nearby points, but not necessarily the lowest overall.
```mermaid
graph TD
    A["Loss Landscape"] --> B["Local min: lowest nearby"]
    A --> C["Local min: lowest nearby"]
    A --> D["Global min: lowest everywhere"]
    B --> E["Optimizer may stop here"]
    C --> E
    D --> F["This is the true best"]
```
*Figure: loss landscape cross-section showing local and global minima of $f(x) = x^4 - 3x^2 + x$.*
More precisely:
- $w^*$ is a global minimum if $f(w^*) \le f(w)$ for all $w$ in the domain.
- $w^*$ is a local minimum if there exists some neighborhood around $w^*$ where $f(w^*) \le f(w)$ for all $w$ in that neighborhood.
Every global minimum is also a local minimum, but not the other way around.
Example 1: A 1D function with local and global minima
Consider:

$$f(x) = x^4 - 8x^2 + 5$$

Step 1. Find critical points by setting $f'(x) = 0$:

$$f'(x) = 4x^3 - 16x = 4x(x^2 - 4) = 0$$

This gives $x = -2$, $x = 0$, $x = 2$.

Step 2. Evaluate $f$ at each critical point:

$$f(-2) = 16 - 32 + 5 = -11, \qquad f(0) = 5, \qquad f(2) = 16 - 32 + 5 = -11$$

Step 3. Check the second derivative to classify:

$$f''(x) = 12x^2 - 16, \qquad f''(\pm 2) = 32 > 0, \qquad f''(0) = -16 < 0$$

Both $x = -2$ and $x = 2$ are local minima with the same value of $f = -11$. Since $f(x) \to \infty$ as $|x| \to \infty$, these are also global minima. The point $x = 0$ is a local maximum at $f(0) = 5$.
```python
import numpy as np

f = lambda x: x**4 - 8*x**2 + 5

# Evaluate f at the three critical points found above
for x in [-2, 0, 2]:
    print(f"f({x}) = {f(x)}")
# f(-2) = -11
# f(0) = 5
# f(2) = -11
```
Why ML is an optimization problem
Example 2: Linear regression as optimization
Suppose you have three data points: $(1, 2)$, $(2, 3.5)$, $(3, 6)$. You want to fit a line $\hat{y} = w_0 + w_1 x$ by minimizing the mean squared error:

$$\text{MSE}(w_0, w_1) = \frac{1}{3} \sum_{i=1}^{3} (w_0 + w_1 x_i - y_i)^2$$

Step 1. Write out each squared error:

$$(w_0 + w_1 - 2)^2, \qquad (w_0 + 2w_1 - 3.5)^2, \qquad (w_0 + 3w_1 - 6)^2$$

Step 2. Take partial derivatives and set them to zero. The gradient is:

$$\frac{\partial\,\text{MSE}}{\partial w_0} = \frac{2}{3} \sum_{i=1}^{3} (w_0 + w_1 x_i - y_i) = 0, \qquad \frac{\partial\,\text{MSE}}{\partial w_1} = \frac{2}{3} \sum_{i=1}^{3} x_i (w_0 + w_1 x_i - y_i) = 0$$

From the first equation:

$$3w_0 + 6w_1 = 11.5$$

From the second equation:

$$6w_0 + 14w_1 = 27$$

Step 3. Solve the system. From the first equation, $w_0 = \frac{11.5 - 6w_1}{3}$. Substituting into the second:

$$2(11.5 - 6w_1) + 14w_1 = 27 \implies 23 + 2w_1 = 27 \implies w_1 = 2, \quad w_0 = -\frac{1}{6}$$

The best-fit line is $\hat{y} = 2x - \frac{1}{6}$. This is the global minimum of the MSE because squared error is a convex function, which means any local minimum is also the global minimum.
```python
import numpy as np

# Design matrix with a column of ones for the intercept w0
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 3.5, 6])

# Least-squares solution to X w = y
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")
# w0 = -0.167, w1 = 2.000
```
Why closed-form solutions rarely exist
Linear regression is special. You can write the gradient, set it to zero, and solve the resulting system of equations directly. This gives you the normal equations, which have a closed-form solution.
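For reference, the normal equations can be solved directly with a linear solve. This sketch reuses the three data points from Example 2 and solves $X^\top X \, w = X^\top y$, reproducing the hand-derived answer:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)  # column of ones + x values
y = np.array([2, 3.5, 6])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [-0.167, 2.0]
```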
Most ML models are not this cooperative.
Example 3: Why you cannot “solve” a neural network
Consider a tiny neural network with one hidden layer, one hidden unit, and sigmoid activation. For a single data point $(x, y)$ with parameters $w_1, w_2$:

$$\hat{y} = w_2 \, \sigma(w_1 x)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$.

The loss for one data point is:

$$L = \frac{1}{2}(\hat{y} - y)^2$$

Step 1. Take the partial derivative with respect to $w_1$ using the chain rule:

$$\frac{\partial L}{\partial w_1} = (\hat{y} - y) \, w_2 \, \sigma'(w_1 x) \, x$$

where $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Step 2. Set this to zero:

$$\left(w_2 \, \sigma(w_1 x) - y\right) w_2 \, \sigma(w_1 x)\left(1 - \sigma(w_1 x)\right) x = 0$$

Step 3. Notice the problem. The sigmoid function creates a nonlinear, transcendental equation. You cannot isolate $w_1$ algebraically. The parameters are tangled inside nested nonlinear functions.
With real networks containing millions of parameters and thousands of data points, there is zero hope of finding a closed-form solution. This is why we need iterative optimization algorithms like gradient descent: start with a guess, compute the gradient, take a step downhill, repeat.
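To make this concrete, here is a sketch of that iterative recipe applied to the tiny network from Example 3. The data point, initial guess, and learning rate are illustrative choices:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.8          # one data point (illustrative)
w1, w2 = 0.5, 0.5        # initial guess
lr = 0.5                 # learning rate

for _ in range(2000):
    h = sigmoid(w1 * x)                  # hidden unit activation
    y_hat = w2 * h                       # network output
    err = y_hat - y
    g1 = err * w2 * h * (1 - h) * x      # chain-rule gradient for w1 (Step 1)
    g2 = err * h                         # analogous gradient for w2
    w1, w2 = w1 - lr * g1, w2 - lr * g2  # step downhill

print(0.5 * (w2 * sigmoid(w1 * x) - y) ** 2)  # loss driven near zero
```

No algebra was needed: guess, compute the gradient, step downhill, repeat.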
The landscape of ML optimization
The loss surface of a neural network is typically:
- Non-convex: many local minima and saddle points
- High-dimensional: thousands to billions of parameters
- Noisy: we estimate the gradient from mini-batches of data
This makes optimization hard but not hopeless. Modern techniques like SGD with momentum, Adam, and RMSProp work remarkably well in practice, even without guarantees of finding the global minimum.
The key insight is that for neural networks, most local minima are “good enough.” They give similar loss values and generalize well. The real danger is saddle points, where the gradient is zero but the point is neither a minimum nor a maximum.
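A saddle point can be seen in two dimensions already. For $f(x, y) = x^2 - y^2$ (a standard textbook example, not from the article), the gradient vanishes at the origin, yet the origin is a minimum along $x$ and a maximum along $y$:

```python
import numpy as np

f = lambda x, y: x**2 - y**2
grad = lambda x, y: np.array([2 * x, -2 * y])

print(grad(0.0, 0.0))             # gradient is zero at the origin
print(f(0.1, 0.0), f(0.0, 0.1))   # positive one way, negative the other: a saddle
```

A naive gradient check would declare the origin optimal, which is exactly why saddle points are dangerous.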
Constrained vs unconstrained optimization
So far we have talked about unconstrained optimization, where $w$ can take any value. Many real problems have constraints:
- Equality constraints: $h(w) = 0$ (e.g., probabilities must sum to 1)
- Inequality constraints: $g(w) \le 0$ (e.g., parameters must be non-negative)

The general constrained optimization problem looks like:

$$\min_{w} f(w) \quad \text{subject to} \quad g_i(w) \le 0, \; i = 1, \dots, m, \qquad h_j(w) = 0, \; j = 1, \dots, p$$
Constrained problems require different tools. The Lagrangian and KKT conditions handle these cases. We will cover them later in the series.
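As a small preview of those tools, one simple way to handle a non-negativity constraint is projected gradient descent: take an ordinary gradient step, then project back onto the feasible set. This sketch (with an illustrative objective) clips negative entries to zero after each step:

```python
import numpy as np

# Minimize f(w) = (w + 1)^2 subject to w >= 0.
# Unconstrained minimum is w = -1; the constrained optimum sits at w = 0.
grad = lambda w: 2 * (w + 1)

w = np.array([2.0])
for _ in range(200):
    w = w - 0.1 * grad(w)      # unconstrained gradient step
    w = np.maximum(w, 0.0)     # project onto the feasible set w >= 0

print(w)  # pinned at the constraint boundary: [0.]
```

The solution ends up on the boundary of the feasible set, which is exactly the situation the KKT conditions later in the series characterize.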
Summary
| Concept | Key point |
|---|---|
| Objective function | Maps parameters to a scalar measuring model quality |
| Local minimum | Best in a neighborhood, not necessarily overall |
| Global minimum | Best everywhere |
| Closed-form | Only works for simple models like linear regression |
| Iterative methods | Required for neural networks and most modern models |
Optimization is the engine of machine learning. You define what “good” means through an objective function, then use algorithms to find the parameters that achieve it. The rest of this series builds up the tools you need to understand and use these algorithms.
What comes next
Now that you know what optimization is and why it matters, the next step is understanding convex sets and convex functions. Convexity is the single most important property that makes an optimization problem easy or hard.