Newton's method for optimization
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites: This article assumes you’re comfortable with gradient descent and the Hessian matrix. You should also know what eigenvalues are, though you won’t need to compute them here.
The key difference: slope vs curvature
Gradient descent sees the slope and walks downhill. It knows which way is down but not how far to go. Newton’s method sees both the slope and the curvature. It builds a local model of the landscape, a parabola that matches the function’s shape at the current point, and jumps directly to that parabola’s minimum.
On a simple quadratic, one jump is enough. On more complex functions, each jump lands much closer to the answer than a gradient step would.
Gradient descent vs Newton’s method on the same function
| | Gradient descent | Newton's method |
|---|---|---|
| Information used | Slope only (gradient) | Slope and curvature (gradient + Hessian) |
| $f(x) = x^2 - 6x + 10$, start at $x_0 = 0$, one step | $x_1 = 0.6$ (with $\eta = 0.1$) | $x_1 = 3$ (exact minimum) |
| Same function, 5 steps | $x_5 \approx 2.017$, still approaching | $x = 3$, done in 1 step |
| Step size choice | You pick $\eta$ (tricky) | Determined automatically by Hessian |
| Convergence near minimum | Linear (fixed error ratio) | Quadratic (digits of accuracy double each step) |
The quadratic approximation concept: at any point, fit a parabola to the function. The parabola matches the function’s value, slope, and curvature. Then jump to the parabola’s minimum.
Quadratic approximation: fit a parabola, jump to its minimum
```mermaid
graph LR
    A["Current point xk"] --> B["Build quadratic model<br/>matching value, gradient, Hessian"]
    B --> C["Find minimum of quadratic"]
    C --> D["Jump to that minimum: xk+1"]
    D -->|"Repeat"| A
    style C fill:#ccffcc
```
If the function is actually quadratic, the model is exact and one jump finds the minimum. If not, the model is approximate, but each jump still makes fast progress.
Contour plot comparing gradient descent (red, zigzag, 12 steps) vs Newton’s method (green, direct, 1 step) on f(x,y) = x^2 + 10y^2
Now let’s formalize this idea.
The core idea: use curvature, not just slope
Gradient descent looks at the gradient and walks downhill. That’s first-order information. It knows the slope but nothing about how the slope changes. Newton’s method goes further. It uses the Hessian, which captures curvature, to build a local quadratic model of the function. Then it jumps straight to the minimum of that model.
If the function actually is quadratic, Newton’s method finds the exact minimum in one step. If the function is approximately quadratic near the minimum (which most smooth functions are), Newton’s method converges extremely fast.
Taylor expansion to second order
The idea starts with the second-order Taylor expansion. For a function $f$ around a point $x_k$:

$$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{1}{2} (x - x_k)^\top H(x_k) \, (x - x_k)$$

where $H(x_k)$ is the Hessian matrix evaluated at $x_k$. Call this approximation $m_k(x)$. It's a quadratic function that matches $f$ in value, gradient, and curvature at $x_k$.

To minimize $m_k$, take its gradient and set it to zero:

$$\nabla m_k(x) = \nabla f(x_k) + H(x_k)(x - x_k) = 0$$

Solve for $x$:

$$x = x_k - H(x_k)^{-1} \nabla f(x_k)$$

That's the Newton step.
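We can check this numerically: the gradient of the quadratic model should vanish exactly at the Newton point. A minimal sketch, using a hypothetical test function $f(x, y) = x^4 + 3y^2$ (chosen here for illustration; it is not quadratic, so the model only approximates $f$, but the step still minimizes the model exactly):

```python
import numpy as np

# Hypothetical test function: f(x, y) = x^4 + 3y^2
def grad(x):
    return np.array([4 * x[0]**3, 6 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2, 0.0], [0.0, 6.0]])

xk = np.array([1.0, 2.0])
g, H = grad(xk), hess(xk)

# Newton point: x = xk - H^{-1} g (computed via a linear solve)
x_newton = xk - np.linalg.solve(H, g)

# Gradient of the quadratic model m_k at x: g + H (x - xk)
model_grad_at_newton = g + H @ (x_newton - xk)

print(x_newton)              # the model's minimizer
print(model_grad_at_newton)  # [0. 0.]: the model's gradient vanishes here
```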
Newton’s method step by step
```mermaid
flowchart TD
    A["Start at current point xk"] --> B["Compute gradient nabla f at xk"]
    B --> C["Compute Hessian H at xk"]
    C --> D["Solve H * step = negative gradient"]
    D --> E["Update: xk+1 = xk + step"]
    E --> F{"Converged?"}
    F -->|"No"| A
    F -->|"Yes"| G["Done: xk+1 is the minimum"]
```
The Newton update rule
The update is:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

Compare this to gradient descent:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$

In gradient descent, the step size $\eta$ is a scalar you choose. In Newton's method, $H^{-1}$ plays the role of the step size, but it's a matrix. It rescales the gradient direction based on the local curvature.
What the Hessian inverse does geometrically
Think of it this way. If the function curves steeply in one direction, the Hessian has a large eigenvalue in that direction. The inverse Hessian shrinks the step in that direction, because you’re already close to the bottom of the valley. If the function curves gently in another direction, the inverse Hessian stretches the step, because you need to go further.
Gradient descent doesn’t do this. It takes the same-sized step regardless of curvature, which is why it zigzags in elongated valleys. Newton’s method accounts for the shape of the landscape and steps directly toward the minimum.
```mermaid
graph LR
    A["Gradient descent"] -->|"uses"| B["First-order info<br/>(gradient only)"]
    C["Newton's method"] -->|"uses"| D["First + second-order info<br/>(gradient + Hessian)"]
    B -->|"linear convergence"| E["Slow near minimum"]
    D -->|"quadratic convergence"| F["Fast near minimum"]
```
Example 1: Newton is exact for quadratics
Consider $f(x) = x^2 - 6x + 10$. The minimum is at $x^* = 3$ with $f(x^*) = 1$.

First, compute the derivatives:

$$f'(x) = 2x - 6, \qquad f''(x) = 2$$

Start from $x_0 = 0$. Apply the Newton step:

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{-6}{2} = 3$$

Done. One step. We hit the exact minimum because $f$ is quadratic, and Newton's method minimizes a quadratic model exactly.
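The one-step claim is easy to verify in code. A minimal sketch of the single Newton step on this function:

```python
# One Newton step on f(x) = x^2 - 6x + 10 from x0 = 0
f_prime = lambda x: 2 * x - 6
f_double_prime = lambda x: 2.0

x0 = 0.0
x1 = x0 - f_prime(x0) / f_double_prime(x0)
print(x1)  # 3.0, the exact minimizer
```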
Compare to gradient descent on the same function
Run gradient descent with step size $\eta = 0.1$, starting from $x_0 = 0$:

| Step | $x_k$ | $x_{k+1}$ | $f(x_{k+1})$ |
|---|---|---|---|
| 1 | 0.0000 | 0.6000 | 6.7600 |
| 2 | 0.6000 | 1.0800 | 4.6864 |
| 3 | 1.0800 | 1.4640 | 3.3593 |
| 4 | 1.4640 | 1.7712 | 2.5099 |
| 5 | 1.7712 | 2.0170 | 1.9664 |

After 5 steps of gradient descent, we're at $x \approx 2.017$ with $f \approx 1.97$. Still not close to the minimum at $x^* = 3$, where $f = 1$. Newton got there in 1 step.
The contrast is stark. For a simple quadratic, gradient descent crawls while Newton teleports.
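The table above can be reproduced with a few lines of code, a minimal sketch of the gradient descent loop:

```python
# Gradient descent on f(x) = x^2 - 6x + 10, step size eta = 0.1, from x0 = 0
f = lambda x: x**2 - 6 * x + 10
f_prime = lambda x: 2 * x - 6

x, eta = 0.0, 0.1
for step in range(1, 6):
    x_new = x - eta * f_prime(x)  # x_{k+1} = x_k - eta * f'(x_k)
    print(f"Step {step}: x = {x_new:.4f}, f = {f(x_new):.4f}")
    x = x_new
```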
Example 2: Newton in two dimensions
Now a 2D problem. Let $f(x, y) = x^2 + 4y^2$. This is an elliptical bowl with minimum at $(0, 0)$.

Start from $\mathbf{x}_0 = (4, 2)$.

Step 1: Compute the gradient.

$$\nabla f(x, y) = \begin{bmatrix} 2x \\ 8y \end{bmatrix}$$

At $(4, 2)$:

$$\nabla f(4, 2) = \begin{bmatrix} 8 \\ 16 \end{bmatrix}$$

Step 2: Compute the Hessian.

$$H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}$$

The Hessian is constant here (the function is quadratic), so it's the same everywhere.

Step 3: Compute the inverse Hessian.

$$H^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix}$$

Step 4: Apply the Newton update.

$$\mathbf{x}_1 = \mathbf{x}_0 - H^{-1} \nabla f(\mathbf{x}_0) = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix} \begin{bmatrix} 8 \\ 16 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Again, one step to the exact minimum. Newton's method nailed it because $f$ is quadratic.

Notice what the inverse Hessian did. The gradient pointed in the direction $(8, 16)$, which is biased toward the $y$-direction because the function curves more steeply there. The inverse Hessian corrected this by shrinking the $y$-component (dividing by 8) and the $x$-component (dividing by 2), producing the step $(4, 2)$, which points directly at the origin.
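The same computation in NumPy, sketched with a linear solve rather than an explicit inverse:

```python
import numpy as np

# One Newton step on f(x, y) = x^2 + 4y^2 from (4, 2)
x0 = np.array([4.0, 2.0])
g = np.array([2 * x0[0], 8 * x0[1]])    # gradient (8, 16)
H = np.array([[2.0, 0.0], [0.0, 8.0]])  # constant Hessian

step = np.linalg.solve(H, g)            # curvature-rescaled gradient: (4, 2)
x1 = x0 - step
print(step, x1)  # [4. 2.] [0. 0.]: the step points straight at the origin
```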
Gradient descent on this same function would zigzag because the curvature differs along and . It needs many iterations to converge, as we covered in the steepest descent article.
Convergence: quadratic vs. linear
Near a minimum where the Hessian is positive definite, Newton's method has quadratic convergence. This means the error roughly squares each iteration:

$$\|x_{k+1} - x^*\| \le C \, \|x_k - x^*\|^2$$

for some constant $C$. If you're within $10^{-2}$ of the answer, the next step puts you within roughly $10^{-4}$, then $10^{-8}$, then $10^{-16}$. The digits of accuracy roughly double each step.
Convergence speed: linear vs quadratic
```mermaid
graph TD
    A["Gradient descent: linear convergence"] --> B["Error: 0.1 -> 0.08 -> 0.064 -> 0.051 -> ..."]
    C["Newton's method: quadratic convergence"] --> D["Error: 0.1 -> 0.01 -> 0.0001 -> 0.00000001"]
    B --> E["Many steps to high accuracy"]
    D --> F["A few steps to machine precision"]
    style E fill:#fff3cd
    style F fill:#ccffcc
```
Gradient descent has linear convergence:

$$\|x_{k+1} - x^*\| \le r \, \|x_k - x^*\|$$

where $r \in (0, 1)$ is a fixed ratio. If $r = 0.8$, you reduce the error by 20% each step. Getting from error $10^{-1}$ to error $10^{-8}$ takes about 80 steps with $r = 0.8$, while Newton does it in about 4.
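A quick sketch of the two error sequences side by side, taking the linear ratio $r = 0.8$ and (for illustration) the quadratic constant $C = 1$:

```python
# Error after each step: linear shrinks by a fixed ratio, quadratic squares.
e_lin, e_quad = 0.1, 0.1
for k in range(1, 5):
    e_lin *= 0.8        # linear convergence, r = 0.8
    e_quad = e_quad**2  # quadratic convergence, C = 1
    print(f"step {k}: linear error = {e_lin:.5f}, quadratic error = {e_quad:.0e}")
```

After four steps the linear sequence has only reached about 0.04, while the quadratic one is at $10^{-16}$, essentially machine precision.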
The catch? Each Newton step is much more expensive.
Convergence comparison: Newton reaches the minimum in 1 step while gradient descent takes many iterations
The cost of a Newton step
Computing the Newton step requires:
- The gradient $\nabla f$: costs $O(n)$ for $n$ variables.
- The Hessian $H$: costs $O(n^2)$ to compute and store.
- Solving $H \, \Delta x = -\nabla f$ for the direction: costs $O(n^3)$ using direct methods.

That $O(n^3)$ solve is the bottleneck. For a problem with 1,000 variables, each Newton step involves solving a $1{,}000 \times 1{,}000$ linear system. For 1,000,000 variables (common in deep learning), it's completely impractical.
This is exactly why quasi-Newton methods exist. They approximate the Hessian (or its inverse) cheaply, trading some convergence speed for dramatically lower cost per step.
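Even before the solve, just storing a dense Hessian is prohibitive at scale. A back-of-envelope sketch, assuming 8-byte float64 entries:

```python
# Memory needed for a dense n x n Hessian of float64 values (8 bytes each).
for n in [1_000, 100_000, 1_000_000]:
    bytes_needed = 8 * n * n
    print(f"n = {n:>9,}: dense Hessian needs {bytes_needed / 1e9:,.1f} GB")
```

At a million variables the Hessian alone would take 8,000 GB, before any factorization work begins.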
Example 3: when the function is not quadratic
Real functions aren’t quadratic, so Newton’s method needs multiple iterations. Let’s see what happens on a non-convex function.
Consider $f(x) = x^4 - x^2$. The derivatives are:

$$f'(x) = 4x^3 - 2x, \qquad f''(x) = 12x^2 - 2$$

This function has critical points where $f'(x) = 0$:

$$4x^3 - 2x = 2x(2x^2 - 1) = 0 \quad \Longrightarrow \quad x = 0, \;\; x = \pm \tfrac{1}{\sqrt{2}} \approx \pm 0.707$$

At $x = 0$: $f''(0) = -2 < 0$, so this is a local maximum.

At $x = \pm 1/\sqrt{2}$: $f''(\pm 1/\sqrt{2}) = 4 > 0$, so these are local minima.

Start from $x_0 = 2$ and run 4 Newton iterations:

Iteration 1:

$$x_1 = 2 - \frac{f'(2)}{f''(2)} = 2 - \frac{28}{46} \approx 1.3913$$

Iteration 2:

$$x_2 = 1.3913 - \frac{7.990}{21.229} \approx 1.0149$$

Iteration 3:

$$x_3 = 1.0149 - \frac{2.152}{10.360} \approx 0.8072$$

Iteration 4:

$$x_4 = 0.8072 - \frac{0.4895}{5.819} \approx 0.7231$$

After 4 iterations we're at $x \approx 0.7231$, closing in on the local minimum at $1/\sqrt{2} \approx 0.7071$. The convergence accelerates as we get closer because the quadratic approximation becomes more accurate near the minimum.

Notice something important: this function has a local maximum at $x = 0$. If we had started at a point where $f''(x) < 0$, that is $|x| < 1/\sqrt{6} \approx 0.41$, the Newton step would move us toward the maximum, not the minimum. Newton's method doesn't distinguish between minima, maxima, and saddle points on its own.
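You can watch this failure happen. Starting inside the negative-curvature region, the iterates converge to the local maximum at $x = 0$:

```python
# Newton on f(x) = x^4 - x^2 starting inside the region where f''(x) < 0.
f_prime = lambda x: 4 * x**3 - 2 * x
f_double_prime = lambda x: 12 * x**2 - 2

x = 0.2  # |x| < 1/sqrt(6), so the curvature here is negative
for _ in range(6):
    x = x - f_prime(x) / f_double_prime(x)
print(x)  # ~ 0.0: Newton converged to the local MAXIMUM, not a minimum
```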
When Newton’s method fails
Newton’s method is powerful, but it has real failure modes.
Singular or near-singular Hessian
If the Hessian is singular (determinant zero), you can’t invert it. This happens at inflection points or degenerate critical points. Even if the Hessian is technically invertible but has very small eigenvalues, the Newton step can be enormous and overshoot wildly.
Non-convex functions and saddle points
Newton’s method finds points where the gradient is zero. It does not care whether that point is a minimum, maximum, or saddle point. On non-convex functions (which include most neural network loss surfaces), saddle points are everywhere. Pure Newton’s method can converge to them.
No guaranteed descent
Unlike gradient descent with a small enough step size, Newton’s method does not guarantee that . The step might overshoot or head toward a maximum. In practice, people fix this by adding a line search or a trust region that limits the step size.
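Here is a minimal sketch of that fix: a damped Newton step with backtracking (Armijo) line search, which guarantees descent whenever the Hessian is positive definite. The test function `f(x, y) = x^4 + y^2` is a hypothetical example chosen for this illustration:

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x0, tol=1e-8, max_iter=100):
    """Newton's method with a backtracking line search on the Newton direction."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess_f(x), g)  # full Newton direction
        t = 1.0
        # Shrink the step until f decreases enough (Armijo condition).
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        x = x + t * d
    return x

# Hypothetical example: f(x, y) = x^4 + y^2, minimum at the origin.
x_min = damped_newton(
    f=lambda x: x[0]**4 + x[1]**2,
    grad_f=lambda x: np.array([4 * x[0]**3, 2 * x[1]]),
    hess_f=lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]]),
    x0=[2.0, 1.0],
)
print(x_min)  # close to (0, 0)
```

The line search never lengthens the step beyond the pure Newton step, so near the minimum (where the full step is good) the method behaves exactly like undamped Newton.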
Expensive in high dimensions
As discussed above, the cost per step makes Newton’s method impractical for large-scale problems. This is the main reason you rarely see pure Newton’s method in machine learning.
```mermaid
graph TD
    A["Newton's method"] --> B{"Is Hessian<br/>positive definite?"}
    B -->|"Yes"| C["Step is a descent direction<br/>Converges to minimum"]
    B -->|"No - indefinite"| D["Might converge to<br/>saddle point or maximum"]
    B -->|"No - singular"| E["Step is undefined<br/>Method breaks down"]
    C --> F{"Near minimum?"}
    F -->|"Yes"| G["Quadratic convergence ✓"]
    F -->|"No"| H["May overshoot ⚠"]
```
Python implementation
Here’s a simple Newton’s method implementation for 1D and multidimensional problems:
```python
import numpy as np

def newton_1d(f_prime, f_double_prime, x0, tol=1e-8, max_iter=50):
    """Newton's method for 1D optimization."""
    x = x0
    for i in range(max_iter):
        g = f_prime(x)
        h = f_double_prime(x)
        if abs(h) < 1e-12:
            print(f"Hessian near zero at iteration {i}. Stopping.")
            break
        x_new = x - g / h
        print(f"Iter {i+1}: x = {x_new:.6f}")
        if abs(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Example: f(x) = x^4 - x^2
x_star = newton_1d(
    f_prime=lambda x: 4*x**3 - 2*x,
    f_double_prime=lambda x: 12*x**2 - 2,
    x0=2.0,
)

def newton_nd(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Newton's method for multidimensional optimization."""
    x = np.array(x0, dtype=float)
    for i in range(max_iter):
        g = grad_f(x)
        H = hess_f(x)
        step = np.linalg.solve(H, g)  # solve H * step = gradient
        x_new = x - step
        print(f"Iter {i+1}: x = {x_new}, ||grad|| = {np.linalg.norm(g):.6f}")
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Example: f(x,y) = x^2 + 4y^2
x_star = newton_nd(
    grad_f=lambda x: np.array([2*x[0], 8*x[1]]),
    hess_f=lambda x: np.array([[2.0, 0.0], [0.0, 8.0]]),
    x0=[4.0, 2.0],
)
```
Notice we use `np.linalg.solve(H, g)` instead of computing $H^{-1}$ explicitly with `np.linalg.inv`. Solving the linear system is numerically more stable and often faster.
Summary
| Property | Gradient descent | Newton's method |
|---|---|---|
| Info used | Gradient (first-order) | Gradient + Hessian (second-order) |
| Step size | Fixed $\eta$ (you choose) | Determined by $H^{-1}$ (automatic) |
| Convergence | Linear | Quadratic (near minimum) |
| Cost per step | $O(n)$ | $O(n^3)$ |
| Works on non-convex? | Descends, but slowly | Can converge to saddle points |
| Exact for quadratics? | No | Yes, in 1 step |
Newton’s method is the gold standard for convergence speed on smooth, well-behaved functions. It uses the Hessian to correct the gradient direction, accounting for curvature in every direction simultaneously. For quadratic functions, it’s exact in one step. Near the minimum of any smooth function, it converges quadratically.
The cost is high: $O(n^3)$ per step and the need for the full $O(n^2)$ Hessian matrix. For problems with hundreds or thousands of variables, this is manageable. For millions of variables, it's not.
What comes next
The high cost of Newton’s method leads to a natural question: can we get most of the benefit of second-order information without paying the full price? The answer is yes, and that’s exactly what quasi-Newton methods do. Methods like BFGS and L-BFGS build up an approximation of the inverse Hessian using only gradient information, achieving superlinear convergence at a fraction of the cost. That’s what we cover next.