Taylor series and local approximations
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Every optimization algorithm in machine learning relies on one idea: if you zoom in close enough to any smooth function, it looks like a polynomial. Taylor series make that idea precise. They give you a recipe for building polynomial approximations of any order, and the higher the order, the better the fit near your chosen point.
Prerequisites
Before reading this, you should be comfortable with derivatives, the chain rule, and the Jacobian and Hessian matrices.
The core idea
Pick a function $f(x)$ and a point $a$. You want a polynomial that matches the function’s value, slope, curvature, and higher-order behavior at $a$. Taylor’s theorem says:

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots$$

Each term captures one more “layer” of the function’s shape at $a$. The zeroth-order term is just the function value. The first-order term adds the slope. The second-order term adds curvature. And so on.
Successive Taylor approximations get closer to the true function:
```mermaid
graph LR
    T0["0th order: constant"] --> T1["1st order: tangent line"]
    T1 --> T2["2nd order: parabola"]
    T2 --> T3["3rd order: cubic"]
    T3 --> TN["Higher orders..."]
    TN --> TF["True function"]
```
When $a = 0$, this is called a Maclaurin series, but the idea is identical.
First-order approximation (linear)
Drop everything after the first derivative term:

$$f(x) \approx f(a) + f'(a)(x - a)$$

This is a tangent line. It is the best linear approximation to $f$ near $a$. In optimization, this is what gradient descent uses: it treats the loss surface as locally flat and steps downhill along the gradient.
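To make this concrete, here is a small sketch (the helper name `linear_approx` is invented for this example) comparing the tangent line of $e^x$ at $a = 0$ with the true value at a nearby point:

```python
import math

def linear_approx(f, df, a, x):
    """First-order Taylor approximation of f around a, evaluated at x."""
    return f(a) + df(a) * (x - a)

# Tangent line to e^x at a = 0 is 1 + x; evaluate it near the expansion point
approx = linear_approx(math.exp, math.exp, a=0.0, x=0.1)
true_value = math.exp(0.1)
print(approx, true_value)  # the tangent line slightly undershoots, since e^x curves upward
```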
Second-order approximation (quadratic)
Keep one more term:

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2$$

Now you have a parabola that matches the function’s value, slope, and curvature at $a$. This is significantly more accurate for points near $a$. In multiple dimensions, the second derivative becomes the Hessian, and the quadratic approximation becomes:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top H(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$

where $H(\mathbf{a})$ is the Hessian matrix evaluated at $\mathbf{a}$. This quadratic form is the backbone of Newton’s method.
Why Newton’s method uses the quadratic approximation
Newton’s method finds the minimum of the quadratic approximation, then moves there. If you take the derivative of the quadratic approximation with respect to $x$ and set it to zero:

$$f'(a) + f''(a)(x - a) = 0 \quad \Longrightarrow \quad x = a - \frac{f'(a)}{f''(a)}$$

That is Newton’s update rule. In multiple dimensions:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k)$$
The key insight: gradient descent uses the first-order approximation (linear), while Newton’s method uses the second-order approximation (quadratic). Newton’s method converges much faster near a minimum because the quadratic captures curvature information that the linear approximation misses.
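The trade-off shows up even on tiny problems. A minimal sketch (the test function, learning rate, and iteration count here are chosen purely for illustration) comparing five gradient-descent steps against five Newton steps from the same starting point:

```python
import math

def f(x): return x**4 - 3 * x**2 + 2          # smooth test function
def df(x): return 4 * x**3 - 6 * x            # first derivative
def d2f(x): return 12 * x**2 - 6              # second derivative

x_gd = x_newton = 2.0
for _ in range(5):
    x_gd -= 0.02 * df(x_gd)                   # first-order: fixed-step gradient descent
    x_newton -= df(x_newton) / d2f(x_newton)  # second-order: Newton step

x_star = math.sqrt(1.5)                       # the minimum of f on the positive axis
print(abs(x_gd - x_star), abs(x_newton - x_star))  # Newton is far closer after 5 steps
```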
When to use 1st-order vs 2nd-order Taylor approximations:
```mermaid
graph TD
    A["Choose optimization method"] --> B{"Hessian affordable?"}
    B -->|No| C["1st-order Taylor"]
    B -->|Yes| D["2nd-order Taylor"]
    C --> E["Gradient descent"]
    E --> F1["Cheap per step, slow convergence"]
    D --> G["Newton's method"]
    G --> F2["Expensive per step, fast convergence"]
```
Example 1: Taylor expansion of $e^x$ around $a = 0$
The function $f(x) = e^x$ has a nice property: all its derivatives are $e^x$, and $e^0 = 1$. So every derivative evaluated at $a = 0$ equals 1.
Plugging into the Taylor formula:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

Let’s check how well this works for $x = 0.5$. The true value is $e^{0.5} \approx 1.6487$.
First-order (linear): $1 + 0.5 = 1.5$
Error: $1.6487 - 1.5 = 0.1487$
Second-order (quadratic): $1 + 0.5 + \frac{0.5^2}{2} = 1.625$
Error: $1.6487 - 1.625 = 0.0237$
Third-order (cubic): $1.625 + \frac{0.5^3}{6} \approx 1.6458$
Error: $1.6487 - 1.6458 = 0.0029$
Each additional term cuts the error by roughly a factor of 6. The approximation gets dramatically better with each order.
Taylor approximations of e^x around x = 0:
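The shrinking errors are easy to verify in code. A quick sketch (the helper `taylor_exp` is a name invented for this example) that evaluates partial sums of the series for $e^x$ around 0:

```python
import math

def taylor_exp(x, order):
    """Partial sum of the Maclaurin series of e^x, up to the given order."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

x = 0.5
true_value = math.exp(x)
for order in (1, 2, 3):
    approx = taylor_exp(x, order)
    print(order, approx, true_value - approx)  # error drops with each added term
```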
Example 2: Taylor expansion of $\sin(x)$ around $a = 0$
The derivatives of $\sin(x)$ cycle:

$$\sin(x) \;\to\; \cos(x) \;\to\; -\sin(x) \;\to\; -\cos(x) \;\to\; \sin(x) \;\to\; \cdots$$

Evaluated at $a = 0$: $0, \; 1, \; 0, \; -1, \; 0, \; 1, \; \ldots$
Only the odd-power terms survive:

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

Let’s approximate $\sin(0.3)$. The true value is $\sin(0.3) \approx 0.29552$.
First-order: $\sin(0.3) \approx 0.3$
Error: $0.3 - 0.29552 = 0.00448$
Third-order: $\sin(0.3) \approx 0.3 - \frac{0.3^3}{6} = 0.2955$
Error: $\approx 0.00002$
That is remarkably accurate. For small $x$, the first-order approximation $\sin(x) \approx x$ already works well, and adding the cubic term makes it nearly exact.
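The same check works in code. A short sketch (the helper `taylor_sin` is a name invented for this example) summing the surviving odd-power terms:

```python
import math

def taylor_sin(x, terms):
    """Partial sum of sin's Maclaurin series, using the given number of odd-power terms."""
    return sum((-1)**k * x**(2 * k + 1) / math.factorial(2 * k + 1) for k in range(terms))

x = 0.3
print(taylor_sin(x, 1), taylor_sin(x, 2), math.sin(x))  # one term, two terms, true value
```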
Example 3: Comparing first-order and second-order error
Consider $f(x) = \sqrt{x}$ near $a = 1$. We have:

$$f(1) = 1, \qquad f'(x) = \frac{1}{2\sqrt{x}} \;\Rightarrow\; f'(1) = \frac{1}{2}, \qquad f''(x) = -\frac{1}{4 x^{3/2}} \;\Rightarrow\; f''(1) = -\frac{1}{4}$$

First-order approximation: $\sqrt{x} \approx 1 + \frac{1}{2}(x - 1)$
Second-order approximation: $\sqrt{x} \approx 1 + \frac{1}{2}(x - 1) - \frac{1}{8}(x - 1)^2$
Let’s test at $x = 1.5$. The true value is $\sqrt{1.5} \approx 1.2247$.
First-order: $1 + \frac{1}{2}(0.5) = 1.25$
Error: $1.25 - 1.2247 = 0.0253$
Second-order: $1.25 - \frac{1}{8}(0.5)^2 = 1.2188$
Error: $1.2247 - 1.2188 \approx 0.0060$
The second-order approximation cuts the error by roughly a factor of 4. Now test at $x = 1.1$. True value: $\sqrt{1.1} \approx 1.0488$.
First-order: $1 + \frac{1}{2}(0.1) = 1.05$
Error: $\approx 0.0012$
Second-order: $1.05 - \frac{1}{8}(0.1)^2 = 1.04875$
Error: $\approx 0.00006$
Closer to the expansion point, the improvement is even more dramatic. This is a general pattern: the error of an $n$-th order Taylor approximation scales as $|x - a|^{n+1}$.
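This scaling can be checked empirically: halving the distance to the expansion point should shrink an $n$-th order error by roughly $2^{n+1}$. A sketch using $e^x$ around 0:

```python
import math

def taylor_error(x, order):
    """Absolute error of the order-n Maclaurin approximation of e^x, evaluated at x."""
    approx = sum(x**k / math.factorial(k) for k in range(order + 1))
    return abs(math.exp(x) - approx)

for order in (1, 2, 3):
    # Compare errors at x = 0.2 and x = 0.1 (half the distance from a = 0)
    ratio = taylor_error(0.2, order) / taylor_error(0.1, order)
    print(order, ratio)  # ratio lands close to 2 ** (order + 1)
```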
Example 4: One step of Newton’s method
Let’s find the minimum of $f(x) = e^x - 2x$, starting from $x_0 = 0$.
First, compute the derivatives:

$$f'(x) = e^x - 2, \qquad f''(x) = e^x$$

At $x_0 = 0$: $f'(0) = 1 - 2 = -1$ and $f''(0) = 1$.
Newton’s update:

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{-1}{1} = 1$$

Second step, at $x_1 = 1$: $f'(1) = e - 2 \approx 0.718$ and $f''(1) = e \approx 2.718$, so

$$x_2 = 1 - \frac{0.718}{2.718} \approx 0.736$$

The true minimum is near $0.693$ (which is $\ln 2$). Newton’s method is converging toward it. Compare this to gradient descent, which would use only $f'(x)$ and a fixed step size, converging more slowly. Newton’s method uses curvature to take smarter steps.
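The step-by-step calculation generalizes to a loop. A minimal sketch of one-dimensional Newton’s method, using $f(x) = e^x - 2x$ (whose minimum sits at $\ln 2$) as an illustrative test function:

```python
import math

def newton_minimize(df, d2f, x0, steps=6):
    """Minimize by repeatedly jumping to the minimum of the local quadratic model."""
    x = x0
    for _ in range(steps):
        x -= df(x) / d2f(x)   # Newton update: x - f'(x) / f''(x)
    return x

# f(x) = e^x - 2x, so f'(x) = e^x - 2 and f''(x) = e^x
x_min = newton_minimize(lambda x: math.exp(x) - 2, math.exp, x0=0.0)
print(x_min, math.log(2))  # converges to ln 2 in a handful of steps
```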
The remainder term
The error of a Taylor approximation has a precise bound. For an $n$-th order approximation around $a$, the remainder is

$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!} (x - a)^{n+1}$$

for some $c$ between $a$ and $x$. Two things control the error:
- Distance from the expansion point: $(x - a)^{n+1}$ shrinks fast when $x$ is close to $a$.
- Order of approximation: $(n + 1)!$ in the denominator grows extremely fast, so higher-order terms contribute less.
This is why Taylor approximations are called “local” approximations. They work great near $a$ but can break down far away.
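For $e^x$ around 0, every derivative is again $e^x$, so on $[0, x]$ the remainder is at most $e^x \, x^{n+1} / (n+1)!$. A quick sketch confirming the actual error stays under this bound:

```python
import math

x, order = 0.5, 3
approx = sum(x**k / math.factorial(k) for k in range(order + 1))
actual_error = math.exp(x) - approx
# Lagrange bound: max of f^{(n+1)} on [0, x] is e^x, since all derivatives are e^x
bound = math.exp(x) * x**(order + 1) / math.factorial(order + 1)
print(actual_error, bound)  # actual error sits below the remainder bound
```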
Approximation quality depends on distance from the expansion point:
```mermaid
graph TD
    A["Expansion point a"] --> B["Near a: error is small"]
    A --> C["Far from a: error grows fast"]
    B --> D["Higher-order terms shrink quickly"]
    C --> E["Higher-order terms blow up"]
    E --> F["Solution: re-expand around a closer point"]
```
Multivariate Taylor expansion
For a function $f(x, y)$ of two variables expanded around $(a, b)$:
First order:

$$f(x, y) \approx f(a, b) + f_x \,(x - a) + f_y \,(y - b)$$

Second order:

$$f(x, y) \approx f(a, b) + f_x \,(x - a) + f_y \,(y - b) + \frac{1}{2}\left[ f_{xx} (x - a)^2 + 2 f_{xy} (x - a)(y - b) + f_{yy} (y - b)^2 \right]$$

where all partial derivatives are evaluated at $(a, b)$. The second-order terms form a quadratic form built from the Hessian matrix:

$$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$$
The eigenvalues of this Hessian tell you the curvature in each principal direction. If all eigenvalues are positive, the point is a local minimum. If all are negative, it is a local maximum. Mixed signs mean a saddle point.
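This eigenvalue test is straightforward to run with NumPy. A sketch (the function $f(x, y) = x^2 - y^2$, a classic saddle, is chosen here purely for illustration):

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

# Hessian of f(x, y) = x^2 - y^2 at the origin: one positive, one negative eigenvalue
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_critical_point(H))  # saddle point
```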
Connection to convexity
A function is convex if and only if it always lies above its first-order Taylor approximation:

$$f(x) \ge f(a) + f'(a)(x - a) \quad \text{for all } x \text{ and } a$$

Equivalently, the second-order term is always non-negative, which means $f''(x) \ge 0$ everywhere. For multivariable functions, this requires the Hessian to be positive semidefinite everywhere. Convex functions are “easy” to optimize because every local minimum is a global minimum.
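The tangent-line characterization can be spot-checked numerically. A sketch verifying that $e^x$, which is convex, never dips below any of its tangent lines on a grid of expansion points:

```python
import numpy as np

xs = np.linspace(-3, 3, 61)
f = np.exp

# For each expansion point a, the tangent line is f(a) + f'(a)(x - a);
# for e^x the derivative equals the function itself, so f'(a) = e^a.
above_tangent = all(
    np.all(f(xs) >= f(a) + f(a) * (xs - a) - 1e-12)  # small tolerance for float error
    for a in xs
)
print(above_tangent)  # True: e^x lies above every one of its tangent lines
```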
Python: visualizing Taylor approximations
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 300)
f = np.exp(x)

# Taylor approximations of e^x around x=0
order1 = 1 + x
order2 = 1 + x + x**2 / 2
order3 = 1 + x + x**2 / 2 + x**3 / 6

plt.plot(x, f, label="e^x (true)", linewidth=2)
plt.plot(x, order1, "--", label="1st order")
plt.plot(x, order2, "--", label="2nd order")
plt.plot(x, order3, "--", label="3rd order")
plt.legend()
plt.ylim(-1, 7)
plt.title("Taylor approximations of e^x")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.grid(True)
plt.show()
```
Summary
| Concept | What it gives you | Used in |
|---|---|---|
| 1st-order Taylor | Tangent line (linear approx) | Gradient descent |
| 2nd-order Taylor | Quadratic approximation | Newton’s method |
| Remainder | Error bound | Convergence analysis |
| Hessian | Curvature matrix | Second-order optimization |
The key takeaway: Taylor series turn hard nonlinear problems into tractable polynomial problems, at the cost of accuracy far from the expansion point. First-order methods are cheap but slow; second-order methods are more expensive per step but converge faster. This trade-off shows up everywhere in ML optimization.
What comes next
The next article covers probability fundamentals, where we shift from calculus to the language of uncertainty, the other essential foundation for machine learning.