Maths for ML · Part 11

Taylor series and local approximations

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Every optimization algorithm in machine learning relies on one idea: if you zoom in close enough to any smooth function, it looks like a polynomial. Taylor series make that idea precise. They give you a recipe for building polynomial approximations of any order, and the higher the order, the better the fit near your chosen point.

Prerequisites

Before reading this, you should be comfortable with derivatives and the chain rule (Part 8) and with Jacobians and Hessians (Part 10).

The core idea

Pick a function f(x) and a point a. You want a polynomial that matches the function’s value, slope, curvature, and higher-order behavior at a. Taylor’s theorem says:

f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots

Each term captures one more “layer” of the function’s shape at a. The zeroth-order term f(a) is just the function value. The first-order term f'(a)(x - a) adds the slope. The second-order term adds curvature. And so on.

Successive Taylor approximations get closer to the true function:

graph LR
  T0["0th order: constant"] --> T1["1st order: tangent line"]
  T1 --> T2["2nd order: parabola"]
  T2 --> T3["3rd order: cubic"]
  T3 --> TN["Higher orders..."]
  TN --> TF["True function"]

When a = 0, this is called a Maclaurin series, but the idea is identical.

First-order approximation (linear)

Drop everything after the first derivative term:

f(x) \approx f(a) + f'(a)(x - a)

This is a tangent line. It is the best linear approximation to f near a. In optimization, this is what gradient descent uses: it treats the loss surface as locally flat and steps downhill along the gradient.
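To make the tangent-line idea concrete, here is a minimal sketch using f(x) = x^2 (an illustrative choice, not from the text): the linear model is accurate near the expansion point and degrades as you move away.

```python
# First-order Taylor (tangent line) model of f at a:
#   f(x) ≈ f(a) + f'(a) (x - a)
# Illustrative choice: f(x) = x^2, so f'(x) = 2x.

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

a = 1.0

def tangent(x):
    """Best linear approximation to f near a."""
    return f(a) + f_prime(a) * (x - a)

# Accurate close to a, poor far from a:
print(abs(f(1.1) - tangent(1.1)))  # ≈ 0.01 (near a)
print(abs(f(2.0) - tangent(2.0)))  # 1.0 (far from a)
```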

Second-order approximation (quadratic)

Keep one more term:

f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2

Now you have a parabola that matches the function’s value, slope, and curvature at a. This is significantly more accurate for points near a. In multiple dimensions, the second derivative becomes the Hessian, and the quadratic approximation becomes:

f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \mathbf{H}(\mathbf{a}) (\mathbf{x} - \mathbf{a})

where \mathbf{H} is the Hessian matrix. This quadratic form is the backbone of Newton’s method.
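A quick numerical check of this quadratic model, with an illustrative function f(x1, x2) = x1² + 3·x2² (my choice, not from the text). Because f is itself quadratic, its second-order Taylor model is exact everywhere, which makes the formula easy to verify:

```python
import numpy as np

# Second-order Taylor model around a, using the gradient and Hessian:
#   f(x) ≈ f(a) + ∇f(a)·(x - a) + ½ (x - a)ᵀ H(a) (x - a)
# Illustrative choice: f(x) = x1² + 3·x2².

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

def grad(x):
    return np.array([2 * x[0], 6 * x[1]])

H = np.array([[2.0, 0.0],
              [0.0, 6.0]])  # constant Hessian, since f is quadratic

a = np.array([1.0, 1.0])

def quad_model(x):
    d = x - a
    return f(a) + grad(a) @ d + 0.5 * d @ H @ d

x = np.array([3.0, -2.0])
print(f(x), quad_model(x))  # 21.0 21.0 — exact, because f is quadratic
```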

Why Newton’s method uses the quadratic approximation

Newton’s method finds the minimum of the quadratic approximation, then moves there. If you take the derivative of the quadratic approximation and set it to zero:

f'(a) + f''(a)(x - a) = 0

x = a - \frac{f'(a)}{f''(a)}

That is Newton’s update rule. In multiple dimensions:

\mathbf{x}_{\text{new}} = \mathbf{x}_{\text{old}} - \mathbf{H}^{-1} \nabla f(\mathbf{x}_{\text{old}})

The key insight: gradient descent uses the first-order approximation (linear), while Newton’s method uses the second-order approximation (quadratic). Newton’s method converges much faster near a minimum because the quadratic captures curvature information that the linear approximation misses.

When to use 1st-order vs 2nd-order Taylor approximations:

graph TD
  A["Choose optimization method"] --> B{"Hessian affordable?"}
  B -->|No| C["1st-order Taylor"]
  B -->|Yes| D["2nd-order Taylor"]
  C --> E["Gradient descent"]
  E --> F1["Cheap per step, slow convergence"]
  D --> G["Newton's method"]
  G --> F2["Expensive per step, fast convergence"]

Example 1: Taylor expansion of e^x around x = 0

The function f(x) = e^x has a nice property: all its derivatives are e^x, and e^0 = 1. So every derivative evaluated at a = 0 equals 1.

f(0) = 1, \quad f'(0) = 1, \quad f''(0) = 1, \quad f'''(0) = 1

Plugging into the Taylor formula:

e^x \approx 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots

Let’s check how well this works for x = 0.5. The true value is e^{0.5} \approx 1.6487.

First-order (linear):

1 + 0.5 = 1.5

Error: |1.6487 - 1.5| = 0.1487

Second-order (quadratic):

1 + 0.5 + \frac{(0.5)^2}{2} = 1 + 0.5 + 0.125 = 1.625

Error: |1.6487 - 1.625| = 0.0237

Third-order (cubic):

1 + 0.5 + 0.125 + \frac{(0.5)^3}{6} = 1.625 + 0.0208 = 1.6458

Error: |1.6487 - 1.6458| = 0.0029

Each additional term cuts the error by a factor of six or more. The approximation gets dramatically better with each order.
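The error table above is easy to reproduce by summing the series term by term (a small sketch using only the standard library):

```python
import math

# Partial sums of e^x = Σ xⁿ/n! at x = 0.5, truncated at orders 1–3.
x = 0.5
true_value = math.exp(x)  # ≈ 1.6487

partial = 0.0
for n in range(4):
    partial += x ** n / math.factorial(n)
    if n >= 1:
        print(f"order {n}: approx = {partial:.4f}, error = {abs(true_value - partial):.4f}")
```

The printed errors match the hand computation: 0.1487, 0.0237, 0.0029.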

Figure: Taylor approximations of e^x around x = 0.

Example 2: Taylor expansion of sin(x) around x = 0

The derivatives of \sin(x) cycle: \sin, \cos, -\sin, -\cos, \sin, \ldots

Evaluated at x = 0: 0, 1, 0, -1, 0, 1, \ldots

Only the odd-power terms survive:

\sin(x) \approx x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots

Let’s approximate \sin(0.3). The true value is \sin(0.3) \approx 0.29552.

First-order:

\sin(0.3) \approx 0.3

Error: |0.29552 - 0.3| = 0.00448

Third-order:

0.3 - \frac{(0.3)^3}{6} = 0.3 - \frac{0.027}{6} = 0.3 - 0.0045 = 0.2955

Error: |0.29552 - 0.2955| = 0.00002

That is remarkably accurate. For small x, the approximation \sin(x) \approx x works well, and adding the cubic term makes it nearly exact.

Example 3: Comparing first-order and second-order error

Consider f(x) = \ln(1 + x) near a = 0. We have:

f(0) = 0, \quad f'(0) = 1, \quad f''(0) = -1

First-order approximation:

\ln(1 + x) \approx x

Second-order approximation:

\ln(1 + x) \approx x - \frac{x^2}{2}

Let’s test at x = 0.4. The true value is \ln(1.4) \approx 0.3365.

First-order:

0.4

Error: |0.3365 - 0.4| = 0.0635

Second-order:

0.4 - \frac{(0.4)^2}{2} = 0.4 - 0.08 = 0.32

Error: |0.3365 - 0.32| = 0.0165

The second-order approximation cuts the error by roughly a factor of 4. Now test at x = 0.1. True value: \ln(1.1) \approx 0.09531.

First-order:

0.1

Error: |0.09531 - 0.1| = 0.00469

Second-order:

0.1 - \frac{(0.1)^2}{2} = 0.1 - 0.005 = 0.095

Error: |0.09531 - 0.095| = 0.00031

Closer to the expansion point, the improvement is even more dramatic. This is a general pattern: the error of an n-th order Taylor approximation scales as O((x - a)^{n+1}).
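That scaling law can be checked empirically. A small sketch for ln(1 + x) around a = 0: halving x should cut the first-order error (which behaves like x²/2) by roughly 4× and the second-order error (which behaves like x³/3) by roughly 8×.

```python
import math

# Error of low-order Taylor approximations of ln(1 + x) around a = 0.
for x in [0.4, 0.2, 0.1]:
    true_value = math.log1p(x)
    err1 = abs(true_value - x)               # 1st order: ln(1+x) ≈ x
    err2 = abs(true_value - (x - x ** 2 / 2))  # 2nd order: ln(1+x) ≈ x - x²/2
    print(f"x = {x}: 1st-order error = {err1:.5f}, 2nd-order error = {err2:.6f}")
# Each halving of x divides err1 by ~4 (err1 ~ x^2) and err2 by ~8 (err2 ~ x^3).
```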

Example 4: One step of Newton’s method

Let’s find the minimum of f(x) = x^4 - 4x^2 + 2, starting from x_0 = 3.

First, compute the derivatives:

f'(x) = 4x^3 - 8x, \quad f''(x) = 12x^2 - 8

At x_0 = 3:

f'(3) = 4(27) - 8(3) = 108 - 24 = 84

f''(3) = 12(9) - 8 = 108 - 8 = 100

Newton’s update:

x_1 = 3 - \frac{84}{100} = 3 - 0.84 = 2.16

Second step, at x_1 = 2.16:

f'(2.16) = 4(2.16)^3 - 8(2.16) = 4(10.078) - 17.28 = 40.31 - 17.28 = 23.03

f''(2.16) = 12(2.16)^2 - 8 = 12(4.666) - 8 = 55.99 - 8 = 47.99

x_2 = 2.16 - \frac{23.03}{47.99} = 2.16 - 0.48 = 1.68

The true minimum is near x \approx 1.414 (which is \sqrt{2}). Newton’s method is converging toward it. Compare this to gradient descent, which would use only f'(x) and a fixed step size, converging more slowly. Newton’s method uses curvature to take smarter steps.
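Continuing the iteration above in code (same f, same starting point, same derivatives), the fast convergence is visible within a handful of steps:

```python
# Newton's method on f(x) = x^4 - 4x^2 + 2, starting from x_0 = 3.

def f_prime(x):
    return 4 * x ** 3 - 8 * x

def f_second(x):
    return 12 * x ** 2 - 8

x = 3.0
for step in range(1, 7):
    x -= f_prime(x) / f_second(x)
    print(f"step {step}: x = {x:.6f}")
# The first two steps reproduce 2.16 and ~1.68; the iterates then
# converge rapidly to sqrt(2) ≈ 1.414214.
```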

The remainder term

The error of a Taylor approximation has a precise bound. For an n-th order approximation around a:

R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}

for some c between a and x. Two things control the error:

  1. Distance from the expansion point: (x - a)^{n+1} shrinks fast when x is close to a.
  2. Order of approximation: (n+1)! in the denominator grows extremely fast, so higher-order terms contribute less.

This is why Taylor approximations are called “local” approximations. They work great near a but can break down far away.

Approximation quality depends on distance from the expansion point:

graph TD
  A["Expansion point a"] --> B["Near a: error is small"]
  A --> C["Far from a: error grows fast"]
  B --> D["Higher-order terms shrink quickly"]
  C --> E["Higher-order terms blow up"]
  E --> F["Solution: re-expand around a closer point"]

Multivariate Taylor expansion

For a function of two variables f(x, y) expanded around (a, b):

First order:

f(x, y) \approx f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b)

Second order:

f(x, y) \approx f(a, b) + f_x(x - a) + f_y(y - b) + \frac{1}{2}\left[f_{xx}(x - a)^2 + 2f_{xy}(x - a)(y - b) + f_{yy}(y - b)^2\right]

where all partial derivatives are evaluated at (a, b). The second-order terms form a quadratic form using the Hessian matrix:

\mathbf{H} = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{bmatrix}

The eigenvalues of this Hessian tell you the curvature in each principal direction. If all eigenvalues are positive, the point is a local minimum. If all are negative, it is a local maximum. Mixed signs mean a saddle point.
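This classification rule is mechanical, so it is easy to script. A sketch with two illustrative Hessians (my examples, not from the text):

```python
import numpy as np

def classify(H):
    """Classify a critical point from the signs of the Hessian's eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric matrix -> real eigenvalues
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

print(classify(np.array([[2.0, 0.0], [0.0, 6.0]])))   # local minimum
print(classify(np.array([[2.0, 0.0], [0.0, -1.0]])))  # saddle point
```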

Connection to convexity

A function is convex if and only if it always lies above its first-order Taylor approximation:

f(x) \geq f(a) + f'(a)(x - a) \quad \text{for all } x, a

Equivalently, the second-order term is always non-negative, which means f''(x) \geq 0 everywhere. For multivariable functions, this requires the Hessian to be positive semidefinite everywhere. Convex functions are “easy” to optimize because every local minimum is a global minimum.
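The first-order condition can be spot-checked numerically. A sketch for f(x) = e^x, which is convex since f''(x) = e^x > 0 everywhere: the function should lie above every tangent line.

```python
import numpy as np

# Check f(x) >= f(a) + f'(a)(x - a) for f(x) = e^x on a grid of points.
# For e^x, both f and f' are np.exp.
xs = np.linspace(-3, 3, 61)
for a in xs:
    tangent = np.exp(a) + np.exp(a) * (xs - a)
    assert np.all(np.exp(xs) >= tangent - 1e-12)  # small float tolerance
print("e^x lies above all of its tangent lines on [-3, 3]")
```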

Python: visualizing Taylor approximations

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 300)
f = np.exp(x)

# Taylor approximations of e^x around x=0
order1 = 1 + x
order2 = 1 + x + x**2 / 2
order3 = 1 + x + x**2 / 2 + x**3 / 6

plt.plot(x, f, label="e^x (true)", linewidth=2)
plt.plot(x, order1, "--", label="1st order")
plt.plot(x, order2, "--", label="2nd order")
plt.plot(x, order3, "--", label="3rd order")
plt.legend()
plt.ylim(-1, 7)
plt.title("Taylor approximations of e^x")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.grid(True)
plt.show()

Summary

| Concept | What it gives you | Used in |
| --- | --- | --- |
| 1st-order Taylor | Tangent line (linear approx) | Gradient descent |
| 2nd-order Taylor | Quadratic approximation | Newton’s method |
| Remainder R_n | Error bound | Convergence analysis |
| Hessian | Curvature matrix | Second-order optimization |

The key takeaway: Taylor series turn hard nonlinear problems into tractable polynomial problems, at the cost of accuracy far from the expansion point. First-order methods are cheap but slow; second-order methods are more expensive per step but converge faster. This trade-off shows up everywhere in ML optimization.

What comes next

The next article covers probability fundamentals, where we shift from calculus to the language of uncertainty, the other essential foundation for machine learning.
