
Newton's method for optimization

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites: This article assumes you’re comfortable with gradient descent and the Hessian matrix. You should also know what eigenvalues are, though you won’t need to compute them here.

The key difference: slope vs curvature

Gradient descent sees the slope and walks downhill. It knows which way is down but not how far to go. Newton’s method sees both the slope and the curvature. It builds a local model of the landscape, a parabola that matches the function’s shape at the current point, and jumps directly to that parabola’s minimum.

On a simple quadratic, one jump is enough. On more complex functions, each jump lands much closer to the answer than a gradient step would.

Gradient descent vs Newton’s method on the same function

| | Gradient descent | Newton's method |
| --- | --- | --- |
| Information used | Slope only (gradient) | Slope and curvature (gradient + Hessian) |
| f(x) = (x - 3)^2, start at x = 0, one step | x_1 = 0.6 (with α = 0.1) | x_1 = 3.0 (exact minimum) |
| Same function, 5 steps | x_5 ≈ 2.0, still approaching | x_1 = 3.0, done in 1 step |
| Step size choice | You pick α (tricky) | Determined automatically by the Hessian |
| Convergence near minimum | Linear (fixed error ratio) | Quadratic (digits of accuracy double each step) |

The quadratic approximation concept: at any point, fit a parabola to the function. The parabola matches the function’s value, slope, and curvature. Then jump to the parabola’s minimum.

Quadratic approximation: fit a parabola, jump to its minimum

graph LR
  A["Current point xk"] --> B["Build quadratic model<br/>matching value, gradient, Hessian"]
  B --> C["Find minimum of quadratic"]
  C --> D["Jump to that minimum: xk+1"]
  D -->|"Repeat"| A
  style C fill:#ccffcc

If the function is actually quadratic, the model is exact and one jump finds the minimum. If not, the model is approximate, but each jump still makes fast progress.

Contour plot comparing gradient descent (red, zigzag, 12 steps) vs Newton’s method (green, direct, 1 step) on f(x,y) = x^2 + 10y^2

Now let’s formalize this idea.

The core idea: use curvature, not just slope

Gradient descent looks at the gradient and walks downhill. That’s first-order information. It knows the slope but nothing about how the slope changes. Newton’s method goes further. It uses the Hessian, which captures curvature, to build a local quadratic model of the function. Then it jumps straight to the minimum of that model.

If the function actually is quadratic, Newton’s method finds the exact minimum in one step. If the function is approximately quadratic near the minimum (which most smooth functions are), Newton’s method converges extremely fast.

Taylor expansion to second order

The idea starts with the second-order Taylor expansion. For a function f around a point x_k:

f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T H(x_k) (x - x_k)

where H(x_k) is the Hessian matrix evaluated at x_k. Call this approximation q(x). It's a quadratic function that matches f in value, gradient, and curvature at x_k.

To minimize q(x), take its gradient and set it to zero:

\nabla q(x) = \nabla f(x_k) + H(x_k)(x - x_k) = 0

Solve for x:

x = x_k - H(x_k)^{-1} \nabla f(x_k)

That’s the Newton step.
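You can sanity-check the derivation numerically. A quick sketch (with an arbitrary made-up gradient and positive definite Hessian, not tied to any example in this article): the Newton step should zero out the gradient of the quadratic model.

```python
import numpy as np

# Hypothetical gradient and positive definite Hessian at a point x_k
g = np.array([3.0, -1.0])
H = np.array([[4.0, 1.0], [1.0, 2.0]])
x_k = np.array([1.0, 2.0])

# Newton step: x = x_k - H^{-1} g (solve the system rather than inverting)
x = x_k - np.linalg.solve(H, g)

# Gradient of the quadratic model at x: grad f(x_k) + H (x - x_k)
model_grad = g + H @ (x - x_k)
print(model_grad)  # approximately [0, 0]
```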

Newton’s method step by step

flowchart TD
  A["Start at current point xk"] --> B["Compute gradient nabla f at xk"]
  B --> C["Compute Hessian H at xk"]
  C --> D["Solve H * step = negative gradient"]
  D --> E["Update: xk+1 = xk + step"]
  E --> F{"Converged?"}
  F -->|"No"| A
  F -->|"Yes"| G["Done: xk+1 is the minimum"]

The Newton update rule

The update is:

x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)

Compare this to gradient descent:

x_{k+1} = x_k - \alpha \nabla f(x_k)

In gradient descent, the step size α is a scalar you choose. In Newton's method, H^{-1} plays the role of the step size, but it's a matrix. It rescales the gradient direction based on the local curvature.

What the Hessian inverse does geometrically

Think of it this way. If the function curves steeply in one direction, the Hessian has a large eigenvalue in that direction. The inverse Hessian shrinks the step in that direction, because you’re already close to the bottom of the valley. If the function curves gently in another direction, the inverse Hessian stretches the step, because you need to go further.

Gradient descent doesn’t do this. It takes the same-sized step regardless of curvature, which is why it zigzags in elongated valleys. Newton’s method accounts for the shape of the landscape and steps directly toward the minimum.

graph LR
  A["Gradient descent"] -->|"uses"| B["First-order info<br/>(gradient only)"]
  C["Newton's method"] -->|"uses"| D["First + second-order info<br/>(gradient + Hessian)"]
  B -->|"linear convergence"| E["Slow near minimum"]
  D -->|"quadratic convergence"| F["Fast near minimum"]

Example 1: Newton is exact for quadratics

Consider f(x) = (x - 3)^2 + 1. The minimum is at x = 3 with f(3) = 1.

First, compute the derivatives:

f'(x) = 2(x - 3), \quad f''(x) = 2

Start from x_0 = 0. Apply the Newton step:

x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{2(0 - 3)}{2} = 0 - \frac{-6}{2} = 0 + 3 = 3

Done. One step. We hit the exact minimum because f is quadratic, and Newton's method minimizes a quadratic model exactly.
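In code, that single step is one line (a minimal sketch, hard-coding this example's derivatives):

```python
# f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
f_prime = lambda x: 2 * (x - 3)
f_double_prime = lambda x: 2.0

x0 = 0.0
x1 = x0 - f_prime(x0) / f_double_prime(x0)
print(x1)  # 3.0, the exact minimum in one step
```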

Compare to gradient descent on the same function

Run gradient descent with step size α = 0.1, starting from x_0 = 0:

| Step | x_k | ∇f(x_k) | x_{k+1} = x_k − 0.1·∇f | f(x_{k+1}) |
| --- | --- | --- | --- | --- |
| 1 | 0.0000 | −6.0000 | 0.6000 | 6.7600 |
| 2 | 0.6000 | −4.8000 | 1.0800 | 4.6864 |
| 3 | 1.0800 | −3.8400 | 1.4640 | 3.3593 |
| 4 | 1.4640 | −3.0720 | 1.7712 | 2.5099 |
| 5 | 1.7712 | −2.4576 | 2.0170 | 1.9664 |

After 5 steps of gradient descent, we're at x ≈ 2.02 with f ≈ 1.97. Still not close to the minimum. Newton got there in 1 step.

The contrast is stark. For a simple quadratic, gradient descent crawls while Newton teleports.
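The gradient descent table above takes only a few lines to reproduce (a sketch using the same f(x) = (x - 3)^2 + 1 and α = 0.1):

```python
# Gradient descent on f(x) = (x - 3)^2 + 1 with fixed step size alpha = 0.1
alpha = 0.1
x = 0.0
for k in range(1, 6):
    grad = 2 * (x - 3)       # f'(x)
    x = x - alpha * grad     # one gradient step
    fx = (x - 3) ** 2 + 1
    print(f"step {k}: x = {x:.4f}, f(x) = {fx:.4f}")
# final line: step 5: x = 2.0170, f(x) = 1.9664
```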

Example 2: Newton in two dimensions

Now a 2D problem. Let f(x, y) = x^2 + 4y^2. This is an elliptical bowl with minimum at (0, 0).

Start from (x_0, y_0) = (4, 2).

Step 1: Compute the gradient.

\nabla f = \begin{bmatrix} 2x \\ 8y \end{bmatrix}

At (4, 2):

\nabla f(4, 2) = \begin{bmatrix} 8 \\ 16 \end{bmatrix}

Step 2: Compute the Hessian.

H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}

The Hessian is constant here (the function is quadratic), so it’s the same everywhere.

Step 3: Compute the inverse Hessian.

H^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix}

Step 4: Apply the Newton update.

\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix} \begin{bmatrix} 8 \\ 16 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

Again, one step to the exact minimum. Newton's method nailed it because f is quadratic.

Notice what the inverse Hessian did. The gradient pointed in the direction (8, 16), which is biased toward the y-direction because the function curves more steeply there. The inverse Hessian corrected this by shrinking the y-component (dividing by 8) and the x-component (dividing by 2), producing the step (4, 2), which points directly at the origin.
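Here is that correction in a couple of lines of NumPy (a sketch of this example's single step):

```python
import numpy as np

# f(x, y) = x^2 + 4y^2 at the point (4, 2)
g = np.array([8.0, 16.0])                # gradient (2x, 8y)
H = np.array([[2.0, 0.0], [0.0, 8.0]])   # constant Hessian

step = np.linalg.solve(H, g)             # H^{-1} g
print(step)  # [4. 2.], pointing straight at the origin
```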

Gradient descent on this same function would zigzag because the curvature differs along x and y. It needs many iterations to converge, as we covered in the steepest descent article.

Convergence: quadratic vs. linear

Near a minimum where the Hessian is positive definite, Newton’s method has quadratic convergence. This means the error roughly squares each iteration:

\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2

for some constant C. If you're within 0.1 of the answer, the next step puts you within 0.01, then 0.0001, then 0.00000001. The digits of accuracy roughly double each step.
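You can watch the squaring happen. A sketch running Newton on the non-quadratic f(x) = x^4 - x^2 (the function from Example 3 below), starting near its minimum at 1/√2:

```python
import math

f_prime = lambda x: 4 * x**3 - 2 * x
f_double_prime = lambda x: 12 * x**2 - 2
x_star = 1 / math.sqrt(2)   # the nearby local minimum

x = 0.8
errors = []
for _ in range(4):
    x = x - f_prime(x) / f_double_prime(x)
    errors.append(abs(x - x_star))
print(errors)  # roughly 1e-2, 4e-4, 3e-7, 2e-13: each error ~ the previous one squared
```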

Convergence speed: linear vs quadratic

graph TD
  A["Gradient descent: linear convergence"] --> B["Error: 0.1 -> 0.08 -> 0.064 -> 0.051 -> ..."]
  C["Newton's method: quadratic convergence"] --> D["Error: 0.1 -> 0.01 -> 0.0001 -> 0.00000001"]
  B --> E["Many steps to high accuracy"]
  D --> F["A few steps to machine precision"]
  style E fill:#fff3cd
  style F fill:#ccffcc

Gradient descent has linear convergence:

\|x_{k+1} - x^*\| \leq r \|x_k - x^*\|

where r < 1 is a fixed ratio. If r = 0.8, you reduce the error by 20% each step. Getting from 0.1 error to 10^{-8} error takes more than 70 steps with r = 0.8, while Newton does it in 3.
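Those step counts are easy to check (a sketch of the arithmetic, not a benchmark):

```python
import math

# Linear convergence with ratio r = 0.8: smallest n with 0.1 * 0.8^n <= 1e-8
n_linear = math.ceil(math.log(1e-8 / 0.1) / math.log(0.8))
print(n_linear)  # 73

# Quadratic convergence squares the error: 0.1 -> 0.01 -> 1e-4 -> 1e-8
n_newton = 3
print(n_newton)  # 3
```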

The catch? Each Newton step is much more expensive.

Convergence comparison: Newton reaches the minimum in 1 step while gradient descent takes many iterations

The cost of a Newton step

Computing the Newton step requires:

  1. The gradient ∇f(x_k): costs O(n) for n variables.
  2. The Hessian H(x_k): costs O(n^2) to store and compute.
  3. Solving H(x_k) d = −∇f(x_k) for the direction d: costs O(n^3) using direct methods.

That O(n^3) is the bottleneck. For a problem with 1,000 variables, each Newton step involves solving a 1000 × 1000 linear system. For 1,000,000 variables (common in deep learning), it's completely impractical.
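A back-of-the-envelope check of the storage cost alone (a sketch assuming a dense Hessian of float64 entries):

```python
# Dense n x n Hessian in float64: 8 bytes per entry
for n in (1_000, 1_000_000):
    gigabytes = n * n * 8 / 1e9
    print(f"n = {n:>9,}: Hessian storage = {gigabytes:,.3f} GB")
# n = 1,000 needs 0.008 GB; n = 1,000,000 needs 8,000 GB (8 TB), before any O(n^3) solve
```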

This is exactly why quasi-Newton methods exist. They approximate the Hessian (or its inverse) cheaply, trading some convergence speed for dramatically lower cost per step.

Example 3: when the function is not quadratic

Real functions aren’t quadratic, so Newton’s method needs multiple iterations. Let’s see what happens on a non-convex function.

Consider f(x) = x^4 - x^2. The derivatives are:

f'(x) = 4x^3 - 2x, \quad f''(x) = 12x^2 - 2

This function has critical points where f'(x) = 0:

2x(2x^2 - 1) = 0 \implies x = 0 \text{ or } x = \pm \frac{1}{\sqrt{2}} \approx \pm 0.7071

At x = 0: f''(0) = -2 < 0, so this is a local maximum.

At x = ±1/√2: f''(±1/√2) = 12 · (1/2) - 2 = 4 > 0, so these are local minima.
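The second-derivative test above is a one-liner to check (a sketch):

```python
# f(x) = x^4 - x^2 has f''(x) = 12x^2 - 2
f_double_prime = lambda x: 12 * x**2 - 2

for x in (-0.7071, 0.0, 0.7071):
    kind = "local min" if f_double_prime(x) > 0 else "local max"
    print(f"x = {x:+.4f}: f''(x) = {f_double_prime(x):+.4f} -> {kind}")
# the two points near ±1/sqrt(2) are minima; x = 0 is the maximum
```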

Start from x_0 = 2 and run 4 Newton iterations:

Iteration 1: x_0 = 2.0000

f'(2) = 4(8) - 2(2) = 28.0000
f''(2) = 12(4) - 2 = 46.0000
x_1 = 2.0000 - \frac{28.0000}{46.0000} = 2.0000 - 0.6087 = 1.3913

Iteration 2: x_1 = 1.3913

f'(1.3913) = 4(1.3913)^3 - 2(1.3913) = 10.7726 - 2.7826 = 7.9900
f''(1.3913) = 12(1.3913)^2 - 2 = 23.2286 - 2 = 21.2286
x_2 = 1.3913 - \frac{7.9900}{21.2286} = 1.3913 - 0.3764 = 1.0149

Iteration 3: x_2 = 1.0149

f'(1.0149) = 4(1.0149)^3 - 2(1.0149) = 4.1815 - 2.0298 = 2.1517
f''(1.0149) = 12(1.0149)^2 - 2 = 12.3603 - 2 = 10.3603
x_3 = 1.0149 - \frac{2.1517}{10.3603} = 1.0149 - 0.2077 = 0.8072

Iteration 4: x_3 = 0.8072

f'(0.8072) = 4(0.8072)^3 - 2(0.8072) = 2.1038 - 1.6144 = 0.4894
f''(0.8072) = 12(0.8072)^2 - 2 = 7.8189 - 2 = 5.8189
x_4 = 0.8072 - \frac{0.4894}{5.8189} = 0.8072 - 0.0841 = 0.7231

After 4 iterations we're at x ≈ 0.7231, closing in on the local minimum at x = 1/√2 ≈ 0.7071. The convergence accelerates as we get closer because the quadratic approximation becomes more accurate near the minimum.

Notice something important: this function has a local maximum at x = 0. If we had started at a point where f''(x) < 0, the Newton step would move us toward the maximum, not the minimum. Newton's method doesn't distinguish between minima, maxima, and saddle points on its own.

When Newton’s method fails

Newton’s method is powerful, but it has real failure modes.

Singular or near-singular Hessian

If the Hessian is singular (determinant zero), you can’t invert it. This happens at inflection points or degenerate critical points. Even if the Hessian is technically invertible but has very small eigenvalues, the Newton step can be enormous and overshoot wildly.
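You can see this on Example 3's function: f''(x) = 12x^2 - 2 vanishes at the inflection point x = 1/√6 ≈ 0.4082, and starting just next to it produces a wild step (a sketch):

```python
# f(x) = x^4 - x^2
f_prime = lambda x: 4 * x**3 - 2 * x
f_double_prime = lambda x: 12 * x**2 - 2

x = 0.41                      # just past the inflection point at 1/sqrt(6)
step = -f_prime(x) / f_double_prime(x)
print(step)  # a huge step (around +31.6) from a tiny second derivative
```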

Non-convex functions and saddle points

Newton’s method finds points where the gradient is zero. It does not care whether that point is a minimum, maximum, or saddle point. On non-convex functions (which include most neural network loss surfaces), saddle points are everywhere. Pure Newton’s method can converge to them.

No guaranteed descent

Unlike gradient descent with a small enough step size, Newton's method does not guarantee that f(x_{k+1}) < f(x_k). The step might overshoot or head toward a maximum. In practice, people fix this by adding a line search or a trust region that limits the step size.
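Here is a minimal sketch of that fix in 1D (my own illustration, not a standard library routine): take the Newton direction when the curvature is positive, fall back to the plain gradient direction otherwise, and halve the step until the function value actually decreases.

```python
def damped_newton_1d(f, f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Newton's method with a crude backtracking line search (sketch)."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:
            break
        h = f_double_prime(x)
        d = -g / h if h > 0 else -g   # fall back to steepest descent if f'' <= 0
        t = 1.0
        while f(x + t * d) >= f(x) and t > 1e-12:
            t *= 0.5                  # backtrack until we actually descend
        x = x + t * d
    return x

# Starting where f'' < 0, pure Newton would walk toward the maximum at x = 0;
# the damped version still finds the minimum near 0.7071
x_min = damped_newton_1d(lambda x: x**4 - x**2,
                         lambda x: 4*x**3 - 2*x,
                         lambda x: 12*x**2 - 2,
                         x0=0.3)
print(x_min)  # about 0.7071
```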

Expensive in high dimensions

As discussed above, the O(n^3) cost per step makes Newton's method impractical for large-scale problems. This is the main reason you rarely see pure Newton's method in machine learning.

graph TD
  A["Newton's method"] --> B{"Is Hessian<br/>positive definite?"}
  B -->|"Yes"| C["Step is a descent direction<br/>Converges to minimum"]
  B -->|"No - indefinite"| D["Might converge to<br/>saddle point or maximum"]
  B -->|"No - singular"| E["Step is undefined<br/>Method breaks down"]
  C --> F{"Near minimum?"}
  F -->|"Yes"| G["Quadratic convergence ✓"]
  F -->|"No"| H["May overshoot ⚠"]

Python implementation

Here’s a simple Newton’s method implementation for 1D and multidimensional problems:

import numpy as np

def newton_1d(f_prime, f_double_prime, x0, tol=1e-8, max_iter=50):
    """Newton's method for 1D optimization."""
    x = x0
    for i in range(max_iter):
        g = f_prime(x)
        h = f_double_prime(x)
        if abs(h) < 1e-12:
            print(f"Hessian near zero at iteration {i}. Stopping.")
            break
        x_new = x - g / h
        print(f"Iter {i+1}: x = {x_new:.6f}")
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

# Example: f(x) = x^4 - x^2
x_star = newton_1d(
    f_prime=lambda x: 4*x**3 - 2*x,
    f_double_prime=lambda x: 12*x**2 - 2,
    x0=2.0
)

def newton_nd(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Newton's method for multidimensional optimization."""
    x = np.array(x0, dtype=float)
    for i in range(max_iter):
        g = grad_f(x)
        H = hess_f(x)
        step = np.linalg.solve(H, g)
        x_new = x - step
        print(f"Iter {i+1}: x = {x_new}, ||grad|| = {np.linalg.norm(g):.6f}")
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Example: f(x,y) = x^2 + 4y^2
x_star = newton_nd(
    grad_f=lambda x: np.array([2*x[0], 8*x[1]]),
    hess_f=lambda x: np.array([[2, 0], [0, 8]]),
    x0=[4.0, 2.0]
)

Notice we use np.linalg.solve(H, g) instead of computing H^{-1} directly. Solving the linear system is numerically more stable and often faster.

Summary

| Property | Gradient descent | Newton's method |
| --- | --- | --- |
| Info used | Gradient (first-order) | Gradient + Hessian (second-order) |
| Step size | Fixed α (you choose) | Determined by H^{-1} (automatic) |
| Convergence | Linear | Quadratic (near minimum) |
| Cost per step | O(n) | O(n^3) |
| Works on non-convex? | Descends, but slowly | Can converge to saddle points |
| Exact for quadratics? | No | Yes, in 1 step |

Newton’s method is the gold standard for convergence speed on smooth, well-behaved functions. It uses the Hessian to correct the gradient direction, accounting for curvature in every direction simultaneously. For quadratic functions, it’s exact in one step. Near the minimum of any smooth function, it converges quadratically.

The cost is high: O(n^3) per step and the need for the full Hessian matrix. For problems with hundreds or thousands of variables, this is manageable. For millions of variables, it's not.

What comes next

The high cost of Newton’s method leads to a natural question: can we get most of the benefit of second-order information without paying the full price? The answer is yes, and that’s exactly what quasi-Newton methods do. Methods like BFGS and L-BFGS build up an approximation of the inverse Hessian using only gradient information, achieving superlinear convergence at a fraction of the cost. That’s what we cover next.
