Taylor series and local approximations
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Every optimization algorithm in machine learning relies on one idea: if you zoom in close enough to any smooth function, it looks like a polynomial. Taylor series make that idea precise. They give you a recipe for building polynomial approximations of any order, and the higher the order, the better the fit near your chosen point.
Prerequisites
Before reading this, you should be comfortable with derivatives, the chain rule, and the Jacobian and Hessian matrices.
The core idea
Pick a function $f(x)$ and a point $a$. You want a polynomial that matches the function’s value, slope, curvature, and higher-order behavior at $a$. Taylor’s theorem says:

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots$$

Each term captures one more “layer” of the function’s shape at $a$. The zeroth-order term is just the function value. The first-order term adds the slope. The second-order term adds curvature. And so on.
Successive Taylor approximations get closer to the true function:
```mermaid
graph LR
    T0["0th order: constant"] --> T1["1st order: tangent line"]
    T1 --> T2["2nd order: parabola"]
    T2 --> T3["3rd order: cubic"]
    T3 --> TN["Higher orders..."]
    TN --> TF["True function"]
```
When $a = 0$, this is called a Maclaurin series, but the idea is identical.
First-order approximation (linear)
Drop everything after the first derivative term:

$$f(x) \approx f(a) + f'(a)(x - a)$$

This is a tangent line. It is the best linear approximation to $f$ near $a$. In optimization, this is what gradient descent uses: it treats the loss surface as locally flat and steps downhill along the gradient.
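To make this concrete, here is a small sketch (the helper name `linear_approx` is invented for this example) comparing the tangent line of $e^x$ at $a = 0$ with the true value at a nearby point:

```python
import math

def linear_approx(f, df, a, x):
    """First-order Taylor approximation of f around a, evaluated at x."""
    return f(a) + df(a) * (x - a)

# Tangent line to e^x at a = 0 is 1 + x; evaluate it near the expansion point
approx = linear_approx(math.exp, math.exp, a=0.0, x=0.1)
true_value = math.exp(0.1)
print(approx, true_value)  # the tangent line slightly undershoots, since e^x curves upward
```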
Second-order approximation (quadratic)
Keep one more term:

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2$$

Now you have a parabola that matches the function’s value, slope, and curvature at $a$. This is significantly more accurate for points near $a$. In multiple dimensions, the second derivative becomes the Hessian, and the quadratic approximation becomes:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top H(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$

where $H(\mathbf{a})$ is the Hessian matrix evaluated at $\mathbf{a}$. This quadratic form is the backbone of Newton’s method.
Why Newton’s method uses the quadratic approximation
Newton’s method finds the minimum of the quadratic approximation, then moves there. If you take the derivative of the quadratic approximation with respect to $x$ and set it to zero:

$$f'(a) + f''(a)(x - a) = 0 \quad \Longrightarrow \quad x = a - \frac{f'(a)}{f''(a)}$$

That is Newton’s update rule. In multiple dimensions:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k)$$
The key insight: gradient descent uses the first-order approximation (linear), while Newton’s method uses the second-order approximation (quadratic). Newton’s method converges much faster near a minimum because the quadratic captures curvature information that the linear approximation misses.
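The trade-off shows up even on tiny problems. A minimal sketch (the test function, learning rate, and iteration count here are chosen purely for illustration) comparing five gradient-descent steps against five Newton steps from the same starting point:

```python
import math

def f(x): return x**4 - 3 * x**2 + 2          # smooth test function
def df(x): return 4 * x**3 - 6 * x            # first derivative
def d2f(x): return 12 * x**2 - 6              # second derivative

x_gd = x_newton = 2.0
for _ in range(5):
    x_gd -= 0.02 * df(x_gd)                   # first-order: fixed-step gradient descent
    x_newton -= df(x_newton) / d2f(x_newton)  # second-order: Newton step

x_star = math.sqrt(1.5)                       # the minimum of f on the positive axis
print(abs(x_gd - x_star), abs(x_newton - x_star))  # Newton is far closer after 5 steps
```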
When to use 1st-order vs 2nd-order Taylor approximations:
```mermaid
graph TD
    A["Choose optimization method"] --> B{"Hessian affordable?"}
    B -->|No| C["1st-order Taylor"]
    B -->|Yes| D["2nd-order Taylor"]
    C --> E["Gradient descent"]
    E --> F1["Cheap per step, slow convergence"]
    D --> G["Newton's method"]
    G --> F2["Expensive per step, fast convergence"]
```
Example 1: Taylor expansion of $e^x$ around $a = 0$
The function $f(x) = e^x$ has a nice property: all its derivatives are $e^x$, and $e^0 = 1$. So every derivative evaluated at $a = 0$ equals 1.
Plugging into the Taylor formula:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

Let’s check how well this works for $x = 0.5$. The true value is $e^{0.5} \approx 1.6487$.
First-order (linear): $1 + 0.5 = 1.5$
Error: $1.6487 - 1.5 = 0.1487$
Second-order (quadratic): $1 + 0.5 + \frac{0.5^2}{2} = 1.625$
Error: $1.6487 - 1.625 = 0.0237$
Third-order (cubic): $1.625 + \frac{0.5^3}{6} \approx 1.6458$
Error: $1.6487 - 1.6458 = 0.0029$
Each additional term cuts the error by roughly a factor of 6. The approximation gets dramatically better with each order.
Taylor approximations of e^x around x = 0:
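The shrinking errors are easy to verify in code. A quick sketch (the helper `taylor_exp` is a name invented for this example) that evaluates partial sums of the series for $e^x$ around 0:

```python
import math

def taylor_exp(x, order):
    """Partial sum of the Maclaurin series of e^x, up to the given order."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

x = 0.5
true_value = math.exp(x)
for order in (1, 2, 3):
    approx = taylor_exp(x, order)
    print(order, approx, true_value - approx)  # error drops with each added term
```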
Example 2: Taylor expansion of $\sin(x)$ around $a = 0$
The derivatives of $\sin(x)$ cycle:

$$\sin(x) \;\to\; \cos(x) \;\to\; -\sin(x) \;\to\; -\cos(x) \;\to\; \sin(x) \;\to\; \cdots$$

Evaluated at $a = 0$: $0, \; 1, \; 0, \; -1, \; 0, \; 1, \; \ldots$
Only the odd-power terms survive:

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

Let’s approximate $\sin(0.3)$. The true value is $\sin(0.3) \approx 0.29552$.
First-order: $\sin(0.3) \approx 0.3$
Error: $0.3 - 0.29552 = 0.00448$
Third-order: $\sin(0.3) \approx 0.3 - \frac{0.3^3}{6} = 0.2955$
Error: $\approx 0.00002$
That is remarkably accurate. For small $x$, the first-order approximation $\sin(x) \approx x$ already works well, and adding the cubic term makes it nearly exact.
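The same check works in code. A short sketch (the helper `taylor_sin` is a name invented for this example) summing the surviving odd-power terms:

```python
import math

def taylor_sin(x, terms):
    """Partial sum of sin's Maclaurin series, using the given number of odd-power terms."""
    return sum((-1)**k * x**(2 * k + 1) / math.factorial(2 * k + 1) for k in range(terms))

x = 0.3
print(taylor_sin(x, 1), taylor_sin(x, 2), math.sin(x))  # one term, two terms, true value
```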
Example 3: Comparing first-order and second-order error
Consider $f(x) = \sqrt{x}$ near $a = 1$. We have:

$$f(1) = 1, \qquad f'(x) = \frac{1}{2\sqrt{x}} \;\Rightarrow\; f'(1) = \frac{1}{2}, \qquad f''(x) = -\frac{1}{4 x^{3/2}} \;\Rightarrow\; f''(1) = -\frac{1}{4}$$

First-order approximation: $\sqrt{x} \approx 1 + \frac{1}{2}(x - 1)$
Second-order approximation: $\sqrt{x} \approx 1 + \frac{1}{2}(x - 1) - \frac{1}{8}(x - 1)^2$
Let’s test at $x = 1.5$. The true value is $\sqrt{1.5} \approx 1.2247$.
First-order: $1 + \frac{1}{2}(0.5) = 1.25$
Error: $1.25 - 1.2247 = 0.0253$
Second-order: $1.25 - \frac{1}{8}(0.5)^2 = 1.2188$
Error: $1.2247 - 1.2188 \approx 0.0060$
The second-order approximation cuts the error by roughly a factor of 4. Now test at $x = 1.1$. True value: $\sqrt{1.1} \approx 1.0488$.
First-order: $1 + \frac{1}{2}(0.1) = 1.05$
Error: $\approx 0.0012$
Second-order: $1.05 - \frac{1}{8}(0.1)^2 = 1.04875$
Error: $\approx 0.00006$
Closer to the expansion point, the improvement is even more dramatic. This is a general pattern: the error of an $n$-th order Taylor approximation scales as $|x - a|^{n+1}$.
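This scaling can be checked empirically: halving the distance to the expansion point should shrink an $n$-th order error by roughly $2^{n+1}$. A sketch using $e^x$ around 0:

```python
import math

def taylor_error(x, order):
    """Absolute error of the order-n Maclaurin approximation of e^x, evaluated at x."""
    approx = sum(x**k / math.factorial(k) for k in range(order + 1))
    return abs(math.exp(x) - approx)

for order in (1, 2, 3):
    # Compare errors at x = 0.2 and x = 0.1 (half the distance from a = 0)
    ratio = taylor_error(0.2, order) / taylor_error(0.1, order)
    print(order, ratio)  # ratio lands close to 2 ** (order + 1)
```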
Example 4: One step of Newton’s method
Let’s find the minimum of $f(x) = e^x - 2x$, starting from $x_0 = 0$.
First, compute the derivatives:

$$f'(x) = e^x - 2, \qquad f''(x) = e^x$$

At $x_0 = 0$: $f'(0) = 1 - 2 = -1$ and $f''(0) = 1$.
Newton’s update:

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{-1}{1} = 1$$

Second step, at $x_1 = 1$: $f'(1) = e - 2 \approx 0.718$ and $f''(1) = e \approx 2.718$, so

$$x_2 = 1 - \frac{0.718}{2.718} \approx 0.736$$

The true minimum is near $0.693$ (which is $\ln 2$). Newton’s method is converging toward it. Compare this to gradient descent, which would use only $f'(x)$ and a fixed step size, converging more slowly. Newton’s method uses curvature to take smarter steps.
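The step-by-step calculation generalizes to a loop. A minimal sketch of one-dimensional Newton’s method, using $f(x) = e^x - 2x$ (whose minimum sits at $\ln 2$) as an illustrative test function:

```python
import math

def newton_minimize(df, d2f, x0, steps=6):
    """Minimize by repeatedly jumping to the minimum of the local quadratic model."""
    x = x0
    for _ in range(steps):
        x -= df(x) / d2f(x)   # Newton update: x - f'(x) / f''(x)
    return x

# f(x) = e^x - 2x, so f'(x) = e^x - 2 and f''(x) = e^x
x_min = newton_minimize(lambda x: math.exp(x) - 2, math.exp, x0=0.0)
print(x_min, math.log(2))  # converges to ln 2 in a handful of steps
```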
The remainder term
The error of a Taylor approximation has a precise bound. For an $n$-th order approximation around $a$, the remainder is

$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!} (x - a)^{n+1}$$

for some $c$ between $a$ and $x$. Two things control the error:
- Distance from the expansion point: $(x - a)^{n+1}$ shrinks fast when $x$ is close to $a$.
- Order of approximation: $(n + 1)!$ in the denominator grows extremely fast, so higher-order terms contribute less.
This is why Taylor approximations are called “local” approximations. They work great near $a$ but can break down far away.
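For $e^x$ around 0, every derivative is again $e^x$, so on $[0, x]$ the remainder is at most $e^x \, x^{n+1} / (n+1)!$. A quick sketch confirming the actual error stays under this bound:

```python
import math

x, order = 0.5, 3
approx = sum(x**k / math.factorial(k) for k in range(order + 1))
actual_error = math.exp(x) - approx
# Lagrange bound: max of f^{(n+1)} on [0, x] is e^x, since all derivatives are e^x
bound = math.exp(x) * x**(order + 1) / math.factorial(order + 1)
print(actual_error, bound)  # actual error sits below the remainder bound
```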
Approximation quality depends on distance from the expansion point:
```mermaid
graph TD
    A["Expansion point a"] --> B["Near a: error is small"]
    A --> C["Far from a: error grows fast"]
    B --> D["Higher-order terms shrink quickly"]
    C --> E["Higher-order terms blow up"]
    E --> F["Solution: re-expand around a closer point"]
```
Multivariate Taylor expansion
For a function $f(x, y)$ of two variables expanded around $(a, b)$:
First order:

$$f(x, y) \approx f(a, b) + f_x \,(x - a) + f_y \,(y - b)$$

Second order:

$$f(x, y) \approx f(a, b) + f_x \,(x - a) + f_y \,(y - b) + \frac{1}{2}\left[ f_{xx} (x - a)^2 + 2 f_{xy} (x - a)(y - b) + f_{yy} (y - b)^2 \right]$$

where all partial derivatives are evaluated at $(a, b)$. The second-order terms form a quadratic form built from the Hessian matrix:

$$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$$
The eigenvalues of this Hessian tell you the curvature in each principal direction. If all eigenvalues are positive, the point is a local minimum. If all are negative, it is a local maximum. Mixed signs mean a saddle point.
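This eigenvalue test is straightforward to run with NumPy. A sketch (the function $f(x, y) = x^2 - y^2$, a classic saddle, is chosen here purely for illustration):

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

# Hessian of f(x, y) = x^2 - y^2 at the origin: one positive, one negative eigenvalue
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_critical_point(H))  # saddle point
```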
Connection to convexity
A function is convex if and only if it always lies above its first-order Taylor approximation:

$$f(x) \ge f(a) + f'(a)(x - a) \quad \text{for all } x \text{ and } a$$

Equivalently, the second-order term is always non-negative, which means $f''(x) \ge 0$ everywhere. For multivariable functions, this requires the Hessian to be positive semidefinite everywhere. Convex functions are “easy” to optimize because every local minimum is a global minimum.
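The tangent-line characterization can be spot-checked numerically. A sketch verifying that $e^x$, which is convex, never dips below any of its tangent lines on a grid of expansion points:

```python
import numpy as np

xs = np.linspace(-3, 3, 61)
f = np.exp

# For each expansion point a, the tangent line is f(a) + f'(a)(x - a);
# for e^x the derivative equals the function itself, so f'(a) = e^a.
above_tangent = all(
    np.all(f(xs) >= f(a) + f(a) * (xs - a) - 1e-12)  # small tolerance for float error
    for a in xs
)
print(above_tangent)  # True: e^x lies above every one of its tangent lines
```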
Python: visualizing Taylor approximations
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 300)
f = np.exp(x)

# Taylor approximations of e^x around x=0
order1 = 1 + x
order2 = 1 + x + x**2 / 2
order3 = 1 + x + x**2 / 2 + x**3 / 6

plt.plot(x, f, label="e^x (true)", linewidth=2)
plt.plot(x, order1, "--", label="1st order")
plt.plot(x, order2, "--", label="2nd order")
plt.plot(x, order3, "--", label="3rd order")
plt.legend()
plt.ylim(-1, 7)
plt.title("Taylor approximations of e^x")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.grid(True)
plt.show()
```
Summary
| Concept | What it gives you | Used in |
|---|---|---|
| 1st-order Taylor | Tangent line (linear approx) | Gradient descent |
| 2nd-order Taylor | Quadratic approximation | Newton’s method |
| Remainder | Error bound | Convergence analysis |
| Hessian | Curvature matrix | Second-order optimization |
The key takeaway: Taylor series turn hard nonlinear problems into tractable polynomial problems, at the cost of accuracy far from the expansion point. First-order methods are cheap but slow; second-order methods are more expensive per step but converge faster. This trade-off shows up everywhere in ML optimization.
What comes next
The next article covers probability fundamentals, where we shift from calculus to the language of uncertainty, the other essential foundation for machine learning.