Calculus Review: Derivatives and the Chain Rule
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor Series and Local Approximations
- Probability Fundamentals
- Random Variables and Distributions
- Bayes' Theorem and Its Role in ML
- Information Theory: Entropy, KL Divergence, Cross-Entropy
Training a neural network means adjusting millions of parameters to reduce a loss function. The tool that makes this possible is the derivative. It tells you which direction to push each parameter, and by how much.
If you remember only one thing from calculus, make it the chain rule. It is the mathematical backbone of backpropagation.
What is a derivative?
The derivative of a function $f$ at a point $x$ measures the instantaneous rate of change of $f$ at that point. Formally:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

Geometrically, it is the slope of the tangent line to $f$ at $x$.
Function f(x) = x^2 with its tangent line at x = 2
- If $f'(x) > 0$, the function is increasing at $x$.
- If $f'(x) < 0$, the function is decreasing at $x$.
- If $f'(x) = 0$, you are at a flat spot (possibly a minimum, maximum, or saddle point).
This is exactly the information gradient descent uses. To minimize $f$, move $x$ in the direction where $f$ decreases, i.e. opposite the sign of $f'(x)$.
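The sign-following recipe above can be sketched in a few lines. This is a minimal gradient descent loop on $f(x) = x^2$ (the same function as in the tangent-line figure); the starting point and step size are illustrative choices:

```python
# Gradient descent on f(x) = x^2, using its derivative f'(x) = 2x.
# Each step moves x opposite the sign of the derivative,
# scaled by a small step size.

def f_prime(x):
    return 2 * x

x = 5.0    # start away from the minimum at x = 0
lr = 0.1   # step size (learning rate)
for _ in range(100):
    x -= lr * f_prime(x)

print(x)   # a value extremely close to 0, the minimizer
```

Each update shrinks $x$ by a constant factor here, so the iterate converges geometrically to the flat spot at $x = 0$.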
Basic differentiation rules
You rarely compute derivatives from the limit definition. Instead, you use rules.
Power rule

$$\frac{d}{dx} x^n = n x^{n-1}$$

Works for any real $n$, not just integers. So $\frac{d}{dx} \sqrt{x} = \frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}$.

Constant multiple rule

$$\frac{d}{dx} \left[ c \cdot f(x) \right] = c \cdot f'(x)$$

Sum rule

$$\frac{d}{dx} \left[ f(x) + g(x) \right] = f'(x) + g'(x)$$

Product rule

$$\frac{d}{dx} \left[ f(x)\, g(x) \right] = f'(x)\, g(x) + f(x)\, g'(x)$$

Quotient rule

$$\frac{d}{dx} \left[ \frac{f(x)}{g(x)} \right] = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$$
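A quick way to build trust in these rules is to check them numerically. Here is a sketch for the product rule, using $f(x) = x^2$ and $g(x) = \sin x$ as an assumed example pair; a central finite difference supplies the "ground truth" slope:

```python
import math

# Numerical check of the product rule (fg)' = f'g + fg'
# for f(x) = x^2 and g(x) = sin(x).

def f(x):  return x ** 2
def g(x):  return math.sin(x)
def fp(x): return 2 * x          # f'
def gp(x): return math.cos(x)    # g'

x, h = 1.3, 1e-6
# central finite difference: slope of the product f*g at x
numeric = (f(x + h) * g(x + h) - f(x - h) * g(x - h)) / (2 * h)
# product rule
analytic = fp(x) * g(x) + f(x) * gp(x)
print(abs(numeric - analytic) < 1e-6)  # True
```

The same pattern works for the sum and quotient rules: replace the analytic line with the rule you want to test.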
Common derivatives
| Function | Derivative |
|---|---|
| $c$ (constant) | $0$ |
| $x^n$ | $n x^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |
| $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma(x)\,(1 - \sigma(x))$ |
That last one, the sigmoid, shows up constantly in ML. Its derivative has a beautiful form that makes computation efficient.
Derivative computation process:

```mermaid
graph TD
    F["Start with function f"] --> ID["Identify structure<br/>Sum, product, composition?"]
    ID --> RULE["Apply the matching rule<br/>Power, product, chain, etc."]
    RULE --> SIMP["Simplify the expression<br/>Factor common terms"]
    SIMP --> EVAL["Evaluate at a point if needed"]
```
Example 1: Differentiating a polynomial
Find $f'(x)$ for $f(x) = 3x^4 - 5x^2 + 7x - 2$.

Apply the power rule and sum rule term by term:

$$f'(x) = 3 \cdot 4x^3 - 5 \cdot 2x + 7 - 0$$

Result:

$$f'(x) = 12x^3 - 10x + 7$$

Evaluate at $x = 1$:

$$f'(1) = 12 - 10 + 7 = 9$$

The function is increasing at $x = 1$, with a slope of 9.
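A finite-difference sanity check of that slope, using the polynomial $3x^4 - 5x^2 + 7x - 2$ (the same one the SymPy snippet at the end of this post differentiates symbolically):

```python
# Sanity-check f'(1) = 9 for f(x) = 3x^4 - 5x^2 + 7x - 2
# with a central finite difference.

def f(x):
    return 3 * x**4 - 5 * x**2 + 7 * x - 2

h = 1e-6
slope = (f(1 + h) - f(1 - h)) / (2 * h)
print(round(slope, 4))  # 9.0
```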
Example 2: Product rule
Find $h'(x)$ for $h(x) = x^2 e^x$.

Let $f(x) = x^2$ and $g(x) = e^x$. Then $f'(x) = 2x$ and $g'(x) = e^x$.

$$h'(x) = f'(x)\, g(x) + f(x)\, g'(x) = 2x\, e^x + x^2 e^x = x(x + 2)\, e^x$$

Evaluate at $x = 1$:

$$h'(1) = 1 \cdot 3 \cdot e = 3e \approx 8.15$$
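A numerical check of the product-rule result, taking $h(x) = x^2 e^x$ as the worked function (so $h'(x) = x(x+2)e^x$):

```python
import math

# Check h'(x) = x(x + 2)e^x for h(x) = x^2 * e^x at x = 1.

def h(x):
    return x**2 * math.exp(x)

x, step = 1.0, 1e-6
numeric = (h(x + step) - h(x - step)) / (2 * step)  # finite difference
analytic = x * (x + 2) * math.exp(x)                # product rule: 3e at x = 1
print(abs(numeric - analytic) < 1e-6)  # True
```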
The chain rule
Here is the rule that makes deep learning work.
If $h(x) = f(g(x))$, that is, $f$ composed with $g$, then:

$$h'(x) = f'(g(x)) \cdot g'(x)$$

In Leibniz notation, if $y = f(u)$ and $u = g(x)$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
The chain rule says: differentiate the outer function (leaving the inner function alone), then multiply by the derivative of the inner function.
Why the chain rule matters for ML
A neural network is a composition of functions. Layer 1 feeds into layer 2, which feeds into layer 3, and so on. The loss function sits at the end. To compute how the loss changes with respect to a weight in layer 1, you need the chain rule. Repeatedly.
This repeated application of the chain rule is called backpropagation. It is not a separate algorithm. It is just the chain rule applied efficiently to a computation graph.
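Backpropagation in miniature: take the toy graph $y = \sin(x^2)$, store intermediates on the forward pass, then multiply local derivatives in reverse order on the backward pass. That reverse sweep is the chain rule, applied mechanically:

```python
import math

# A two-node computation graph for y = sin(x^2).

x = 0.5

# forward pass: store intermediates
u = x ** 2          # inner node: squaring
y = math.sin(u)     # outer node: sine

# backward pass: multiply local derivatives in reverse order
dy_du = math.cos(u)   # local derivative of sin at u
du_dx = 2 * x         # local derivative of squaring at x
dy_dx = dy_du * du_dx

# check against a finite-difference slope
h = 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)
print(abs(dy_dx - numeric) < 1e-8)  # True
```

Autodiff frameworks do exactly this, just over graphs with millions of nodes.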
Chain rule as nested function boxes:

```mermaid
graph LR
    X["x"] --> G["g: inner function"]
    G -->|"g of x"| F["f: outer function"]
    F -->|"f of g of x"| Y["y"]
    Y -.->|"df/dg"| F
    F -.->|"dg/dx"| G
    G -.->|"dy/dx = df/dg * dg/dx"| X
```
Example 3: Chain rule, step by step
Differentiate $h(x) = (3x^2 + 1)^5$.

Identify the layers:

- Outer function: $f(u) = u^5$
- Inner function: $g(x) = 3x^2 + 1$

Step 1: Differentiate the outer function.

$$f'(u) = 5u^4$$

Step 2: Differentiate the inner function.

$$g'(x) = 6x$$

Step 3: Multiply.

$$h'(x) = 5(3x^2 + 1)^4 \cdot 6x = 30x\,(3x^2 + 1)^4$$

Evaluate at $x = 2$:

$$h'(2) = 30 \cdot 2 \cdot 13^4 = 60 \cdot 28561 = 1{,}713{,}660$$
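That large number is easy to verify numerically for $h(x) = (3x^2 + 1)^5$, the same function the PyTorch snippet at the end of this post differentiates:

```python
# Verify h'(2) = 30 * 2 * 13^4 = 1,713,660 for h(x) = (3x^2 + 1)^5.

def h(x):
    return (3 * x**2 + 1) ** 5

analytic = 30 * 2 * (3 * 2**2 + 1) ** 4
step = 1e-6
numeric = (h(2 + step) - h(2 - step)) / (2 * step)
print(analytic)                                   # 1713660
print(abs(numeric - analytic) / analytic < 1e-6)  # True
```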
Example 4: Chain rule with exponentials
Differentiate $h(x) = e^{-x^2}$.

Outer function: $f(u) = e^u$, so $f'(u) = e^u$.

Inner function: $g(x) = -x^2$, so $g'(x) = -2x$.

$$h'(x) = e^{-x^2} \cdot (-2x) = -2x\, e^{-x^2}$$

Evaluate at $x = 1$:

$$h'(1) = -2e^{-1} \approx -0.74$$

The function is decreasing at $x = 1$. This makes sense: $e^{-x^2}$ is a bell curve centered at zero, and it slopes downward for $x > 0$.
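A short numerical confirmation, taking $h(x) = e^{-x^2}$ with $h'(x) = -2x\,e^{-x^2}$ as above:

```python
import math

# The bell curve h(x) = e^(-x^2) and its derivative h'(x) = -2x * e^(-x^2).

def h(x):
    return math.exp(-x ** 2)

def h_prime(x):
    return -2 * x * math.exp(-x ** 2)

step = 1e-6
numeric = (h(1 + step) - h(1 - step)) / (2 * step)
print(h_prime(1) < 0)                    # True: decreasing at x = 1
print(abs(numeric - h_prime(1)) < 1e-8)  # True: matches the finite difference
```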
Example 5: Nested chain rule (three layers)
Differentiate $h(x) = \sin\!\left(e^{2x}\right)$.

This is a three-layer composition: $\sin$ wrapping $e^{v}$ wrapping $2x$.

Layer 1 (outermost): $f(u) = \sin u$, so $f'(u) = \cos u$.

Layer 2: $g(v) = e^v$, so $g'(v) = e^v$.

Layer 3 (innermost): $k(x) = 2x$, so $k'(x) = 2$.

Chain them together:

$$h'(x) = \cos\!\left(e^{2x}\right) \cdot e^{2x} \cdot 2 = 2e^{2x} \cos\!\left(e^{2x}\right)$$

Evaluate at $x = 0$:

$$h'(0) = 2 e^{0} \cos\!\left(e^{0}\right) = 2\cos(1) \approx 1.08$$
This nested pattern is exactly what happens in deep networks. Each layer is a link in the chain, and the chain rule connects them all.
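The three local derivatives for $h(x) = \sin(e^{2x})$ (the same function the SymPy snippet below differentiates symbolically) can be multiplied out by hand and checked against a finite-difference slope:

```python
import math

# Chain three local derivatives for h(x) = sin(e^(2x)):
# cos(u) * e^v * 2, then compare with a finite difference at x = 0.

def h(x):
    return math.sin(math.exp(2 * x))

x, step = 0.0, 1e-6
analytic = math.cos(math.exp(2 * x)) * math.exp(2 * x) * 2
numeric = (h(x + step) - h(x - step)) / (2 * step)
print(abs(numeric - analytic) < 1e-8)  # True
```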
Neural network as a chain of function compositions:

```mermaid
graph LR
    X["Input x"] --> L1["Layer 1<br/>z1 = W1 x + b1<br/>a1 = activation of z1"]
    L1 --> L2["Layer 2<br/>z2 = W2 a1 + b2<br/>a2 = activation of z2"]
    L2 --> LN["...<br/>More layers"]
    LN --> OUT["Output<br/>prediction y-hat"]
    OUT --> LOSS["Loss L"]
    LOSS -.->|"dL/da2"| LN
    LN -.->|"dL/da1"| L2
    L2 -.->|"dL/dx"| L1
```
The chain rule in a neural network
Consider a tiny network: input $x$, a single weight and bias producing $z = wx + b$, a sigmoid output $\hat{y} = \sigma(z)$, and the cross-entropy loss $L = -\left[ y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \right]$.

To update $w$, we need $\frac{\partial L}{\partial w}$. The chain rule gives:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Compute each piece:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y}), \qquad \frac{\partial z}{\partial w} = x$$

Put it together:

$$\frac{\partial L}{\partial w} = \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}\,(1 - \hat{y})\, x$$

After simplification (the sigmoid and cross-entropy loss are designed to simplify nicely):

$$\frac{\partial L}{\partial w} = (\hat{y} - y)\, x$$

The prediction error $(\hat{y} - y)$ times the input $x$. Clean and efficient. This simplification is not a coincidence; the sigmoid and log-loss were chosen partly because their derivatives interact so well.
Numerical check: Let $x = 2$, $y = 1$, $w = 0.5$, $b = 0$. Then $z = 1$, $\hat{y} = \sigma(1) \approx 0.731$, and

$$\frac{\partial L}{\partial w} = (\hat{y} - y)\, x = (0.731 - 1) \cdot 2 \approx -0.54$$

The negative gradient tells us: increase $w$ to reduce the loss. That matches intuition, since the model is underpredicting ($\hat{y} \approx 0.73 < y = 1$).
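The shortcut $(\hat{y} - y)\,x$ can be verified against a finite difference of the full cross-entropy loss, using the example values $x = 2$, $y = 1$, $w = 0.5$, $b = 0$ (chosen here for concreteness):

```python
import math

# Check dL/dw = (y_hat - y) * x against a finite difference of the
# full cross-entropy loss for z = w*x + b, y_hat = sigmoid(z).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, b, y):
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

x, y, w, b = 2.0, 1.0, 0.5, 0.0
y_hat = sigmoid(w * x + b)
shortcut = (y_hat - y) * x   # the simplified gradient

step = 1e-6
numeric = (loss(w + step, x, b, y) - loss(w - step, x, b, y)) / (2 * step)
print(abs(shortcut - numeric) < 1e-6)  # True
print(shortcut < 0)                    # True: increase w to reduce the loss
```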
Common pitfalls
Forgetting to multiply by the inner derivative. The most common chain rule mistake. If you differentiate $\sin(3x)$ and get $\cos(3x)$ instead of $3\cos(3x)$, you have missed the inner derivative.
Confusing $\sin^2(x)$ with $\sin(x^2)$. These are different compositions:
- $\sin^2(x) = (\sin x)^2$: outer is squaring, inner is $\sin x$. Derivative: $2 \sin x \cos x$.
- $\sin(x^2)$: outer is $\sin$, inner is squaring. Derivative: $2x \cos(x^2)$.
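A quick numerical demonstration that the two compositions really do have different derivatives, using $\sin^2(x)$ and $\sin(x^2)$ as the pair:

```python
import math

# (sin x)^2  ->  2 sin(x) cos(x),   sin(x^2)  ->  2x cos(x^2).
# Check both by finite differences at the same point.

x, h = 0.9, 1e-6

def sq_of_sin(t): return math.sin(t) ** 2
def sin_of_sq(t): return math.sin(t ** 2)

d1 = (sq_of_sin(x + h) - sq_of_sin(x - h)) / (2 * h)
d2 = (sin_of_sq(x + h) - sin_of_sq(x - h)) / (2 * h)

print(abs(d1 - 2 * math.sin(x) * math.cos(x)) < 1e-8)  # True
print(abs(d2 - 2 * x * math.cos(x ** 2)) < 1e-8)       # True
print(abs(d1 - d2) > 0.1)                              # True: they differ
```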
Not simplifying. Chain rule expressions get messy fast. Factor common terms and simplify before evaluating. In backpropagation, frameworks handle this automatically, but understanding it helps you debug.
Derivatives of common activation functions
ML uses several activation functions beyond the sigmoid. Knowing their derivatives helps you understand training dynamics.
Comparing common activation functions: ReLU, sigmoid, and tanh
ReLU: $\text{ReLU}(x) = \max(0, x)$, with derivative

$$\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

The derivative is undefined at $x = 0$, but in practice we set it to 0 or 1. ReLU's constant gradient of 1 for positive inputs is why it trains faster than sigmoid, which saturates (gradient goes to 0 for large inputs).
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with derivative

$$\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)$$

Similar to sigmoid but centered at zero. The derivative is largest at $x = 0$ (where $\tanh'(0) = 1$) and shrinks toward 0 as $|x|$ grows.
Softplus: $\text{softplus}(x) = \ln(1 + e^x)$, with derivative

$$\frac{d}{dx} \text{softplus}(x) = \frac{1}{1 + e^{-x}} = \sigma(x)$$
The derivative of softplus is the sigmoid. Softplus is a smooth approximation of ReLU.
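All three derivative formulas can be verified in one loop against central finite differences; the evaluation point $x = 0.5$ is an arbitrary choice on the positive side, where ReLU is differentiable:

```python
import math

# Analytic derivatives of common activations vs. finite differences.

def relu(x):     return max(0.0, x)
def softplus(x): return math.log(1 + math.exp(x))
def sigmoid(x):  return 1 / (1 + math.exp(-x))

x, h = 0.5, 1e-6
pairs = [
    (relu,      1.0),                    # ReLU' = 1 for x > 0
    (math.tanh, 1 - math.tanh(x) ** 2),  # tanh' = 1 - tanh^2
    (softplus,  sigmoid(x)),             # softplus' = sigmoid
]
results = []
for fn, analytic in pairs:
    numeric = (fn(x + h) - fn(x - h)) / (2 * h)
    results.append(abs(numeric - analytic) < 1e-6)
print(results)  # [True, True, True]
```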
Derivatives in Python
For symbolic derivatives (exact):
```python
from sympy import symbols, diff, exp, sin

x = symbols('x')

f = 3*x**4 - 5*x**2 + 7*x - 2
print(diff(f, x))  # 12*x**3 - 10*x + 7

g = sin(exp(2*x))
print(diff(g, x))  # 2*exp(2*x)*cos(exp(2*x))
```
For numerical derivatives (automatic differentiation), PyTorch and JAX compute gradients through the chain rule automatically:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3*x**2 + 1)**5
y.backward()
print(x.grad)  # dy/dx evaluated at x=2
```
What comes next
Derivatives of single-variable functions are the starting point. In ML, functions almost always have multiple inputs (weights, biases, features). Head to Partial Derivatives and Gradients to learn how derivatives generalize to multiple dimensions, and why the gradient vector is central to optimization.