Newton's method for optimization
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites: This article assumes you’re comfortable with gradient descent and the Hessian matrix. You should also know what eigenvalues are, though you won’t need to compute them here.
The key difference: slope vs curvature
Gradient descent sees the slope and walks downhill. It knows which way is down but not how far to go. Newton’s method sees both the slope and the curvature. It builds a local model of the landscape, a parabola that matches the function’s shape at the current point, and jumps directly to that parabola’s minimum.
On a simple quadratic, one jump is enough. On more complex functions, each jump lands much closer to the answer than a gradient step would.
Gradient descent vs Newton’s method on the same function
| | Gradient descent | Newton's method |
|---|---|---|
| Information used | Slope only (gradient) | Slope and curvature (gradient + Hessian) |
| $f(x) = x^2 - 6x + 10$, start at $x_0 = 0$, one step | $x_1 = 0.6$ (with $\eta = 0.1$) | $x_1 = 3$ (exact minimum) |
| Same function, 5 steps | $x_5 \approx 2.017$, still approaching | $x = 3$, done in 1 step |
| Step size choice | You pick $\eta$ (tricky) | Determined automatically by Hessian |
| Convergence near minimum | Linear (fixed error ratio) | Quadratic (digits of accuracy double each step) |
The quadratic approximation concept: at any point, fit a parabola to the function. The parabola matches the function’s value, slope, and curvature. Then jump to the parabola’s minimum.
Quadratic approximation: fit a parabola, jump to its minimum
```mermaid
graph LR
    A["Current point xk"] --> B["Build quadratic model<br/>matching value, gradient, Hessian"]
    B --> C["Find minimum of quadratic"]
    C --> D["Jump to that minimum: xk+1"]
    D -->|"Repeat"| A
    style C fill:#ccffcc
```
If the function is actually quadratic, the model is exact and one jump finds the minimum. If not, the model is approximate, but each jump still makes fast progress.
Contour plot comparing gradient descent (red, zigzag, 12 steps) vs Newton’s method (green, direct, 1 step) on f(x,y) = x^2 + 10y^2
Now let’s formalize this idea.
The core idea: use curvature, not just slope
Gradient descent looks at the gradient and walks downhill. That’s first-order information. It knows the slope but nothing about how the slope changes. Newton’s method goes further. It uses the Hessian, which captures curvature, to build a local quadratic model of the function. Then it jumps straight to the minimum of that model.
If the function actually is quadratic, Newton’s method finds the exact minimum in one step. If the function is approximately quadratic near the minimum (which most smooth functions are), Newton’s method converges extremely fast.
Taylor expansion to second order
The idea starts with the second-order Taylor expansion. For a function $f$ around a point $x_k$:

$$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{1}{2} (x - x_k)^\top H(x_k) \, (x - x_k)$$

where $H(x_k)$ is the Hessian matrix evaluated at $x_k$. Call this approximation $m_k(x)$. It's a quadratic function that matches $f$ in value, gradient, and curvature at $x_k$.

To minimize $m_k$, take its gradient and set it to zero:

$$\nabla m_k(x) = \nabla f(x_k) + H(x_k)(x - x_k) = 0$$

Solve for $x$:

$$x = x_k - H(x_k)^{-1} \nabla f(x_k)$$

That's the Newton step.
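We can check this numerically: the gradient of the quadratic model should vanish exactly at the Newton point. A minimal sketch, using a hypothetical test function $f(x, y) = x^4 + 3y^2$ (chosen here for illustration; it is not quadratic, so the model only approximates $f$, but the step still minimizes the model exactly):

```python
import numpy as np

# Hypothetical test function: f(x, y) = x^4 + 3y^2
def grad(x):
    return np.array([4 * x[0]**3, 6 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2, 0.0], [0.0, 6.0]])

xk = np.array([1.0, 2.0])
g, H = grad(xk), hess(xk)

# Newton point: x = xk - H^{-1} g (computed via a linear solve)
x_newton = xk - np.linalg.solve(H, g)

# Gradient of the quadratic model m_k at x: g + H (x - xk)
model_grad_at_newton = g + H @ (x_newton - xk)

print(x_newton)              # the model's minimizer
print(model_grad_at_newton)  # [0. 0.]: the model's gradient vanishes here
```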
Newton’s method step by step
```mermaid
flowchart TD
    A["Start at current point xk"] --> B["Compute gradient nabla f at xk"]
    B --> C["Compute Hessian H at xk"]
    C --> D["Solve H * step = negative gradient"]
    D --> E["Update: xk+1 = xk + step"]
    E --> F{"Converged?"}
    F -->|"No"| A
    F -->|"Yes"| G["Done: xk+1 is the minimum"]
```
The Newton update rule
The update is:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

Compare this to gradient descent:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$

In gradient descent, the step size $\eta$ is a scalar you choose. In Newton's method, $H^{-1}$ plays the role of the step size, but it's a matrix. It rescales the gradient direction based on the local curvature.
What the Hessian inverse does geometrically
Think of it this way. If the function curves steeply in one direction, the Hessian has a large eigenvalue in that direction. The inverse Hessian shrinks the step in that direction, because you’re already close to the bottom of the valley. If the function curves gently in another direction, the inverse Hessian stretches the step, because you need to go further.
Gradient descent doesn’t do this. It takes the same-sized step regardless of curvature, which is why it zigzags in elongated valleys. Newton’s method accounts for the shape of the landscape and steps directly toward the minimum.
```mermaid
graph LR
    A["Gradient descent"] -->|"uses"| B["First-order info<br/>(gradient only)"]
    C["Newton's method"] -->|"uses"| D["First + second-order info<br/>(gradient + Hessian)"]
    B -->|"linear convergence"| E["Slow near minimum"]
    D -->|"quadratic convergence"| F["Fast near minimum"]
```
Example 1: Newton is exact for quadratics
Consider $f(x) = x^2 - 6x + 10$. The minimum is at $x^* = 3$ with $f(x^*) = 1$.

First, compute the derivatives:

$$f'(x) = 2x - 6, \qquad f''(x) = 2$$

Start from $x_0 = 0$. Apply the Newton step:

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{-6}{2} = 3$$

Done. One step. We hit the exact minimum because $f$ is quadratic, and Newton's method minimizes a quadratic model exactly.
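The one-step claim is easy to verify in code. A minimal sketch of the single Newton step on this function:

```python
# One Newton step on f(x) = x^2 - 6x + 10 from x0 = 0
f_prime = lambda x: 2 * x - 6
f_double_prime = lambda x: 2.0

x0 = 0.0
x1 = x0 - f_prime(x0) / f_double_prime(x0)
print(x1)  # 3.0, the exact minimizer
```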
Compare to gradient descent on the same function
Run gradient descent with step size $\eta = 0.1$, starting from $x_0 = 0$:

| Step | $x_k$ | $x_{k+1}$ | $f(x_{k+1})$ |
|---|---|---|---|
| 1 | 0.0000 | 0.6000 | 6.7600 |
| 2 | 0.6000 | 1.0800 | 4.6864 |
| 3 | 1.0800 | 1.4640 | 3.3593 |
| 4 | 1.4640 | 1.7712 | 2.5099 |
| 5 | 1.7712 | 2.0170 | 1.9664 |

After 5 steps of gradient descent, we're at $x \approx 2.017$ with $f \approx 1.97$. Still not close to the minimum at $x^* = 3$, where $f = 1$. Newton got there in 1 step.
The contrast is stark. For a simple quadratic, gradient descent crawls while Newton teleports.
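The table above can be reproduced with a few lines of code, a minimal sketch of the gradient descent loop:

```python
# Gradient descent on f(x) = x^2 - 6x + 10, step size eta = 0.1, from x0 = 0
f = lambda x: x**2 - 6 * x + 10
f_prime = lambda x: 2 * x - 6

x, eta = 0.0, 0.1
for step in range(1, 6):
    x_new = x - eta * f_prime(x)  # x_{k+1} = x_k - eta * f'(x_k)
    print(f"Step {step}: x = {x_new:.4f}, f = {f(x_new):.4f}")
    x = x_new
```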
Example 2: Newton in two dimensions
Now a 2D problem. Let $f(x, y) = x^2 + 4y^2$. This is an elliptical bowl with minimum at $(0, 0)$.

Start from $\mathbf{x}_0 = (4, 2)$.

Step 1: Compute the gradient.

$$\nabla f(x, y) = \begin{bmatrix} 2x \\ 8y \end{bmatrix}$$

At $(4, 2)$:

$$\nabla f(4, 2) = \begin{bmatrix} 8 \\ 16 \end{bmatrix}$$

Step 2: Compute the Hessian.

$$H = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}$$

The Hessian is constant here (the function is quadratic), so it's the same everywhere.

Step 3: Compute the inverse Hessian.

$$H^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix}$$

Step 4: Apply the Newton update.

$$\mathbf{x}_1 = \mathbf{x}_0 - H^{-1} \nabla f(\mathbf{x}_0) = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 1/2 & 0 \\ 0 & 1/8 \end{bmatrix} \begin{bmatrix} 8 \\ 16 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Again, one step to the exact minimum. Newton's method nailed it because $f$ is quadratic.

Notice what the inverse Hessian did. The gradient pointed in the direction $(8, 16)$, which is biased toward the $y$-direction because the function curves more steeply there. The inverse Hessian corrected this by shrinking the $y$-component (dividing by 8) and the $x$-component (dividing by 2), producing the step $(4, 2)$, which points directly at the origin.
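The same computation in NumPy, sketched with a linear solve rather than an explicit inverse:

```python
import numpy as np

# One Newton step on f(x, y) = x^2 + 4y^2 from (4, 2)
x0 = np.array([4.0, 2.0])
g = np.array([2 * x0[0], 8 * x0[1]])    # gradient (8, 16)
H = np.array([[2.0, 0.0], [0.0, 8.0]])  # constant Hessian

step = np.linalg.solve(H, g)            # curvature-rescaled gradient: (4, 2)
x1 = x0 - step
print(step, x1)  # [4. 2.] [0. 0.]: the step points straight at the origin
```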
Gradient descent on this same function would zigzag because the curvature differs along and . It needs many iterations to converge, as we covered in the steepest descent article.
Convergence: quadratic vs. linear
Near a minimum where the Hessian is positive definite, Newton's method has quadratic convergence. This means the error roughly squares each iteration:

$$\|x_{k+1} - x^*\| \le C \, \|x_k - x^*\|^2$$

for some constant $C$. If you're within $10^{-2}$ of the answer, the next step puts you within roughly $10^{-4}$, then $10^{-8}$, then $10^{-16}$. The digits of accuracy roughly double each step.
Convergence speed: linear vs quadratic
```mermaid
graph TD
    A["Gradient descent: linear convergence"] --> B["Error: 0.1 -> 0.08 -> 0.064 -> 0.051 -> ..."]
    C["Newton's method: quadratic convergence"] --> D["Error: 0.1 -> 0.01 -> 0.0001 -> 0.00000001"]
    B --> E["Many steps to high accuracy"]
    D --> F["A few steps to machine precision"]
    style E fill:#fff3cd
    style F fill:#ccffcc
```
Gradient descent has linear convergence:

$$\|x_{k+1} - x^*\| \le r \, \|x_k - x^*\|$$

where $r \in (0, 1)$ is a fixed ratio. If $r = 0.8$, you reduce the error by 20% each step. Getting from error $10^{-1}$ to error $10^{-8}$ takes about 80 steps with $r = 0.8$, while Newton does it in about 4.
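A quick sketch of the two error sequences side by side, taking the linear ratio $r = 0.8$ and (for illustration) the quadratic constant $C = 1$:

```python
# Error after each step: linear shrinks by a fixed ratio, quadratic squares.
e_lin, e_quad = 0.1, 0.1
for k in range(1, 5):
    e_lin *= 0.8        # linear convergence, r = 0.8
    e_quad = e_quad**2  # quadratic convergence, C = 1
    print(f"step {k}: linear error = {e_lin:.5f}, quadratic error = {e_quad:.0e}")
```

After four steps the linear sequence has only reached about 0.04, while the quadratic one is at $10^{-16}$, essentially machine precision.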
The catch? Each Newton step is much more expensive.
Convergence comparison: Newton reaches the minimum in 1 step while gradient descent takes many iterations
The cost of a Newton step
Computing the Newton step requires:
- The gradient $\nabla f$: costs $O(n)$ for $n$ variables.
- The Hessian $H$: costs $O(n^2)$ to compute and store.
- Solving $H \, \Delta x = -\nabla f$ for the direction: costs $O(n^3)$ using direct methods.

That $O(n^3)$ solve is the bottleneck. For a problem with 1,000 variables, each Newton step involves solving a $1{,}000 \times 1{,}000$ linear system. For 1,000,000 variables (common in deep learning), it's completely impractical.
This is exactly why quasi-Newton methods exist. They approximate the Hessian (or its inverse) cheaply, trading some convergence speed for dramatically lower cost per step.
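Even before the solve, just storing a dense Hessian is prohibitive at scale. A back-of-envelope sketch, assuming 8-byte float64 entries:

```python
# Memory needed for a dense n x n Hessian of float64 values (8 bytes each).
for n in [1_000, 100_000, 1_000_000]:
    bytes_needed = 8 * n * n
    print(f"n = {n:>9,}: dense Hessian needs {bytes_needed / 1e9:,.1f} GB")
```

At a million variables the Hessian alone would take 8,000 GB, before any factorization work begins.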
Example 3: when the function is not quadratic
Real functions aren’t quadratic, so Newton’s method needs multiple iterations. Let’s see what happens on a non-convex function.
Consider $f(x) = x^4 - x^2$. The derivatives are:

$$f'(x) = 4x^3 - 2x, \qquad f''(x) = 12x^2 - 2$$

This function has critical points where $f'(x) = 0$:

$$4x^3 - 2x = 2x(2x^2 - 1) = 0 \quad \Longrightarrow \quad x = 0, \;\; x = \pm \tfrac{1}{\sqrt{2}} \approx \pm 0.707$$

At $x = 0$: $f''(0) = -2 < 0$, so this is a local maximum.

At $x = \pm 1/\sqrt{2}$: $f''(\pm 1/\sqrt{2}) = 4 > 0$, so these are local minima.

Start from $x_0 = 2$ and run 4 Newton iterations:

Iteration 1:

$$x_1 = 2 - \frac{f'(2)}{f''(2)} = 2 - \frac{28}{46} \approx 1.3913$$

Iteration 2:

$$x_2 = 1.3913 - \frac{7.990}{21.229} \approx 1.0149$$

Iteration 3:

$$x_3 = 1.0149 - \frac{2.152}{10.360} \approx 0.8072$$

Iteration 4:

$$x_4 = 0.8072 - \frac{0.4895}{5.819} \approx 0.7231$$

After 4 iterations we're at $x \approx 0.7231$, closing in on the local minimum at $1/\sqrt{2} \approx 0.7071$. The convergence accelerates as we get closer because the quadratic approximation becomes more accurate near the minimum.

Notice something important: this function has a local maximum at $x = 0$. If we had started at a point where $f''(x) < 0$, that is $|x| < 1/\sqrt{6} \approx 0.41$, the Newton step would move us toward the maximum, not the minimum. Newton's method doesn't distinguish between minima, maxima, and saddle points on its own.
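You can watch this failure happen. Starting inside the negative-curvature region, the iterates converge to the local maximum at $x = 0$:

```python
# Newton on f(x) = x^4 - x^2 starting inside the region where f''(x) < 0.
f_prime = lambda x: 4 * x**3 - 2 * x
f_double_prime = lambda x: 12 * x**2 - 2

x = 0.2  # |x| < 1/sqrt(6), so the curvature here is negative
for _ in range(6):
    x = x - f_prime(x) / f_double_prime(x)
print(x)  # ~ 0.0: Newton converged to the local MAXIMUM, not a minimum
```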
When Newton’s method fails
Newton’s method is powerful, but it has real failure modes.
Singular or near-singular Hessian
If the Hessian is singular (determinant zero), you can’t invert it. This happens at inflection points or degenerate critical points. Even if the Hessian is technically invertible but has very small eigenvalues, the Newton step can be enormous and overshoot wildly.
Non-convex functions and saddle points
Newton’s method finds points where the gradient is zero. It does not care whether that point is a minimum, maximum, or saddle point. On non-convex functions (which include most neural network loss surfaces), saddle points are everywhere. Pure Newton’s method can converge to them.
No guaranteed descent
Unlike gradient descent with a small enough step size, Newton’s method does not guarantee that . The step might overshoot or head toward a maximum. In practice, people fix this by adding a line search or a trust region that limits the step size.
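Here is a minimal sketch of that fix: a damped Newton step with backtracking (Armijo) line search, which guarantees descent whenever the Hessian is positive definite. The test function `f(x, y) = x^4 + y^2` is a hypothetical example chosen for this illustration:

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x0, tol=1e-8, max_iter=100):
    """Newton's method with a backtracking line search on the Newton direction."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess_f(x), g)  # full Newton direction
        t = 1.0
        # Shrink the step until f decreases enough (Armijo condition).
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        x = x + t * d
    return x

# Hypothetical example: f(x, y) = x^4 + y^2, minimum at the origin.
x_min = damped_newton(
    f=lambda x: x[0]**4 + x[1]**2,
    grad_f=lambda x: np.array([4 * x[0]**3, 2 * x[1]]),
    hess_f=lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]]),
    x0=[2.0, 1.0],
)
print(x_min)  # close to (0, 0)
```

The line search never lengthens the step beyond the pure Newton step, so near the minimum (where the full step is good) the method behaves exactly like undamped Newton.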
Expensive in high dimensions
As discussed above, the cost per step makes Newton’s method impractical for large-scale problems. This is the main reason you rarely see pure Newton’s method in machine learning.
```mermaid
graph TD
    A["Newton's method"] --> B{"Is Hessian<br/>positive definite?"}
    B -->|"Yes"| C["Step is a descent direction<br/>Converges to minimum"]
    B -->|"No - indefinite"| D["Might converge to<br/>saddle point or maximum"]
    B -->|"No - singular"| E["Step is undefined<br/>Method breaks down"]
    C --> F{"Near minimum?"}
    F -->|"Yes"| G["Quadratic convergence ✓"]
    F -->|"No"| H["May overshoot ⚠"]
```
Python implementation
Here’s a simple Newton’s method implementation for 1D and multidimensional problems:
```python
import numpy as np

def newton_1d(f_prime, f_double_prime, x0, tol=1e-8, max_iter=50):
    """Newton's method for 1D optimization."""
    x = x0
    for i in range(max_iter):
        g = f_prime(x)
        h = f_double_prime(x)
        if abs(h) < 1e-12:
            print(f"Hessian near zero at iteration {i}. Stopping.")
            break
        x_new = x - g / h
        print(f"Iter {i+1}: x = {x_new:.6f}")
        if abs(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Example: f(x) = x^4 - x^2
x_star = newton_1d(
    f_prime=lambda x: 4*x**3 - 2*x,
    f_double_prime=lambda x: 12*x**2 - 2,
    x0=2.0,
)

def newton_nd(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Newton's method for multidimensional optimization."""
    x = np.array(x0, dtype=float)
    for i in range(max_iter):
        g = grad_f(x)
        H = hess_f(x)
        step = np.linalg.solve(H, g)  # solve H * step = gradient
        x_new = x - step
        print(f"Iter {i+1}: x = {x_new}, ||grad|| = {np.linalg.norm(g):.6f}")
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Example: f(x,y) = x^2 + 4y^2
x_star = newton_nd(
    grad_f=lambda x: np.array([2*x[0], 8*x[1]]),
    hess_f=lambda x: np.array([[2.0, 0.0], [0.0, 8.0]]),
    x0=[4.0, 2.0],
)
```
Notice we use `np.linalg.solve(H, g)` instead of computing $H^{-1}$ explicitly with `np.linalg.inv`. Solving the linear system is numerically more stable and often faster.
Summary
| Property | Gradient descent | Newton's method |
|---|---|---|
| Info used | Gradient (first-order) | Gradient + Hessian (second-order) |
| Step size | Fixed $\eta$ (you choose) | Determined by $H^{-1}$ (automatic) |
| Convergence | Linear | Quadratic (near minimum) |
| Cost per step | $O(n)$ | $O(n^3)$ |
| Works on non-convex? | Descends, but slowly | Can converge to saddle points |
| Exact for quadratics? | No | Yes, in 1 step |
Newton’s method is the gold standard for convergence speed on smooth, well-behaved functions. It uses the Hessian to correct the gradient direction, accounting for curvature in every direction simultaneously. For quadratic functions, it’s exact in one step. Near the minimum of any smooth function, it converges quadratically.
The cost is high: $O(n^3)$ per step and the need for the full $O(n^2)$ Hessian matrix. For problems with hundreds or thousands of variables, this is manageable. For millions of variables, it's not.
What comes next
The high cost of Newton’s method leads to a natural question: can we get most of the benefit of second-order information without paying the full price? The answer is yes, and that’s exactly what quasi-Newton methods do. Methods like BFGS and L-BFGS build up an approximation of the inverse Hessian using only gradient information, achieving superlinear convergence at a fraction of the cost. That’s what we cover next.