Maths for ML · Part 8

Calculus Review: Derivatives and the Chain Rule

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Training a neural network means adjusting millions of parameters to reduce a loss function. The tool that makes this possible is the derivative. It tells you which direction to push each parameter, and by how much.

If you remember only one thing from calculus, make it the chain rule. It is the mathematical backbone of backpropagation.


What is a derivative?

The derivative of a function f(x) at a point x measures the instantaneous rate of change. Formally:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

Geometrically, it is the slope of the tangent line to f at x.

Function f(x) = x^2 with its tangent line at x = 2

  • If f'(x) > 0, the function is increasing at x.
  • If f'(x) < 0, the function is decreasing at x.
  • If f'(x) = 0, you are at a flat spot (possibly a minimum, maximum, or saddle point).

This is exactly the information gradient descent uses. To minimize f, step x against the sign of f'(x): decrease x where f'(x) > 0, increase it where f'(x) < 0.
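A minimal sketch of that update rule in code, using f(x) = x^2 (so f'(x) = 2x); the starting point and learning rate below are arbitrary choices for illustration:

```python
# Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
def f_prime(x):
    return 2 * x

x = 3.0      # arbitrary starting point
lr = 0.1     # arbitrary learning rate (step size)
for _ in range(100):
    x -= lr * f_prime(x)   # step against the sign of the derivative

print(round(x, 6))  # converges toward the minimum at x = 0
```

Each step moves x toward the flat spot at zero; the step shrinks as the slope does.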


Basic differentiation rules

You rarely compute derivatives from the limit definition. Instead, you use rules.

Power rule

\frac{d}{dx} x^n = n x^{n-1}

Works for any real n, not just integers. So \frac{d}{dx} \sqrt{x} = \frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2}.

Constant multiple rule

\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)

Sum rule

\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)

Product rule

\frac{d}{dx}[f(x) \cdot g(x)] = f'(x) g(x) + f(x) g'(x)

Quotient rule

\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}

Common derivatives

Function                          Derivative
e^x                               e^x
\ln(x)                            1/x
\sin(x)                           \cos(x)
\cos(x)                           -\sin(x)
\sigma(x) = \frac{1}{1+e^{-x}}    \sigma(x)(1 - \sigma(x))

That last one, the sigmoid, shows up constantly in ML. Its derivative has a beautiful form that makes computation efficient.
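The sigmoid identity is easy to sanity-check numerically with a central difference; the test point x = 0.7 and step h = 1e-6 below are arbitrary choices:

```python
import math

# Check that sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) at an arbitrary point.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))

print(abs(numeric - analytic) < 1e-8)  # True: the two agree to high precision
```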

Derivative computation process:

graph TD
  F["Start with function f"] --> ID["Identify structure<br/>Sum, product, composition?"]
  ID --> RULE["Apply the matching rule<br/>Power, product, chain, etc."]
  RULE --> SIMP["Simplify the expression<br/>Factor common terms"]
  SIMP --> EVAL["Evaluate at a point if needed"]

Example 1: Differentiating a polynomial

Find \frac{d}{dx}(3x^4 - 5x^2 + 7x - 2).

Apply the power rule and sum rule term by term:

\frac{d}{dx}(3x^4) = 3 \cdot 4x^3 = 12x^3
\frac{d}{dx}(-5x^2) = -5 \cdot 2x = -10x
\frac{d}{dx}(7x) = 7
\frac{d}{dx}(-2) = 0

Result:

f'(x) = 12x^3 - 10x + 7

Evaluate at x = 1:

f'(1) = 12(1) - 10(1) + 7 = 12 - 10 + 7 = 9

The function is increasing at x = 1, with a slope of 9.
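You can confirm the result with a central finite difference (the step size h below is an arbitrary small value):

```python
# Numerical check of Example 1: f'(1) should be 9.
def f(x):
    return 3 * x**4 - 5 * x**2 + 7 * x - 2

h = 1e-6
approx = (f(1 + h) - f(1 - h)) / (2 * h)  # central difference at x = 1
print(round(approx, 3))  # 9.0
```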


Example 2: Product rule

Find \frac{d}{dx}[x^2 \cdot e^x].

Let f(x) = x^2 and g(x) = e^x. Then f'(x) = 2x and g'(x) = e^x.

\frac{d}{dx}[x^2 e^x] = 2x \cdot e^x + x^2 \cdot e^x = e^x(2x + x^2)

Evaluate at x = 2:

\frac{d}{dx}[x^2 e^x]\bigg|_{x=2} = e^2(4 + 4) = 8e^2 \approx 8 \times 7.389 \approx 59.11

The chain rule

Here is the rule that makes deep learning work.

If y = f(g(x)), that is, f composed with g, then:

\frac{dy}{dx} = f'(g(x)) \cdot g'(x)

In Leibniz notation, if y = f(u) and u = g(x):

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

The chain rule says: differentiate the outer function (leaving the inner function alone), then multiply by the derivative of the inner function.

Why the chain rule matters for ML

A neural network is a composition of functions. Layer 1 feeds into layer 2, which feeds into layer 3, and so on. The loss function sits at the end. To compute how the loss changes with respect to a weight in layer 1, you need the chain rule. Repeatedly.

This repeated application of the chain rule is called backpropagation. It is not a separate algorithm. It is just the chain rule applied efficiently to a computation graph.

Chain rule as nested function boxes:

graph LR
  X["x"] --> G["g: inner function"]
  G -->|"g of x"| F["f: outer function"]
  F -->|"f of g of x"| Y["y"]
  Y -.->|"df/dg"| F
  F -.->|"dg/dx"| G
  G -.->|"dy/dx = df/dg * dg/dx"| X

Example 3: Chain rule, step by step

Differentiate y = (3x^2 + 1)^5.

Identify the layers:

  • Outer function: f(u) = u^5
  • Inner function: u = g(x) = 3x^2 + 1

Step 1: Differentiate the outer function.

\frac{dy}{du} = 5u^4

Step 2: Differentiate the inner function.

\frac{du}{dx} = 6x

Step 3: Multiply.

\frac{dy}{dx} = 5u^4 \cdot 6x = 5(3x^2 + 1)^4 \cdot 6x = 30x(3x^2 + 1)^4

Evaluate at x = 1:

\frac{dy}{dx}\bigg|_{x=1} = 30(1)(3 + 1)^4 = 30 \cdot 256 = 7680
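The same computation, sketched as the two-step Leibniz form in code:

```python
# Example 3 via the Leibniz form: (dy/du at u = g(1)) times (du/dx at x = 1).
x = 1.0
u = 3 * x**2 + 1     # inner function g(x); u = 4 at x = 1
dy_du = 5 * u**4     # outer derivative f'(u) = 5u^4
du_dx = 6 * x        # inner derivative g'(x) = 6x

print(dy_du * du_dx)  # 7680.0
```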

Example 4: Chain rule with exponentials

Differentiate y = e^{-x^2}.

Outer function: f(u) = e^u, so f'(u) = e^u.

Inner function: u = -x^2, so u' = -2x.

\frac{dy}{dx} = e^{-x^2} \cdot (-2x) = -2x \, e^{-x^2}

Evaluate at x = 1:

\frac{dy}{dx}\bigg|_{x=1} = -2(1) \cdot e^{-1} = -2e^{-1} \approx -2 \times 0.368 = -0.736

The function is decreasing at x = 1. This makes sense: e^{-x^2} is a bell curve centered at zero, and it slopes downward for x > 0.


Example 5: Nested chain rule (three layers)

Differentiate y = \sin(e^{2x}).

This is a three-layer composition: \sin(\cdot) wrapping e^{(\cdot)} wrapping 2x.

Layer 1 (outermost): f(u) = \sin(u), so f'(u) = \cos(u).

Layer 2: u = e^v, so \frac{du}{dv} = e^v.

Layer 3 (innermost): v = 2x, so \frac{dv}{dx} = 2.

Chain them together:

\frac{dy}{dx} = \cos(e^{2x}) \cdot e^{2x} \cdot 2 = 2e^{2x}\cos(e^{2x})

Evaluate at x = 0:

\frac{dy}{dx}\bigg|_{x=0} = 2 \cdot e^0 \cdot \cos(e^0) = 2 \cdot 1 \cdot \cos(1) \approx 2 \times 0.540 = 1.080

This nested pattern is exactly what happens in deep networks. Each layer is a link in the chain, and the chain rule connects them all.
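The three-factor product can be sketched numerically, using the values from the example above:

```python
import math

# Example 5 at x = 0: multiply the three layer derivatives cos(e^{2x}) * e^{2x} * 2.
x = 0.0
v = 2 * x                      # innermost layer
u = math.exp(v)                # middle layer; e^0 = 1
dy_dx = math.cos(u) * u * 2    # product of the three layer derivatives

print(round(dy_dx, 2))         # 1.08
```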

Neural network as a chain of function compositions:

graph LR
  X["Input x"] --> L1["Layer 1<br/>z1 = W1 x + b1<br/>a1 = activation of z1"]
  L1 --> L2["Layer 2<br/>z2 = W2 a1 + b2<br/>a2 = activation of z2"]
  L2 --> LN["...<br/>More layers"]
  LN --> OUT["Output<br/>prediction y-hat"]
  OUT --> LOSS["Loss L"]
  LOSS -.->|"dL/da2"| LN
  LN -.->|"dL/da1"| L2
  L2 -.->|"dL/dx"| L1

The chain rule in a neural network

Consider a tiny network: input x, one hidden layer, one output, and a loss.

z = wx + b \quad \text{(linear layer)}
a = \sigma(z) \quad \text{(sigmoid activation)}
L = -[y \ln(a) + (1-y)\ln(1-a)] \quad \text{(binary cross-entropy loss)}

To update w, we need \frac{\partial L}{\partial w}. The chain rule gives:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Compute each piece:

\frac{\partial z}{\partial w} = x
\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z))
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}

Put it together:

\frac{\partial L}{\partial w} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \cdot \sigma(z)(1-\sigma(z)) \cdot x

After simplification (the sigmoid and cross-entropy loss are designed to simplify nicely):

\frac{\partial L}{\partial w} = (a - y) \cdot x

The prediction error (a - y) times the input x. Clean and efficient. This simplification is not a coincidence; the sigmoid and log-loss were chosen partly because their derivatives interact so well.

Numerical check: Let w = 0.5, b = 0, x = 2, y = 1.

z = 0.5 \times 2 + 0 = 1
a = \sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.731
\frac{\partial L}{\partial w} = (0.731 - 1) \times 2 = -0.269 \times 2 = -0.538

The negative gradient tells us: increase w to reduce the loss. That matches intuition, since the model is underpredicting (a < y).
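The whole worked example fits in a few lines of code. The sketch below assumes the same values (w = 0.5, b = 0, x = 2, y = 1) and checks that the full three-factor chain rule product matches the simplified form (a - y) x:

```python
import math

# Forward pass, then the gradient two ways: full chain rule vs. simplified form.
w, b, x, y = 0.5, 0.0, 2.0, 1.0

z = w * x + b                          # linear layer
a = 1.0 / (1.0 + math.exp(-z))         # sigmoid activation

dL_da = -y / a + (1 - y) / (1 - a)     # loss w.r.t. activation
da_dz = a * (1 - a)                    # sigmoid derivative
dz_dw = x                              # linear layer w.r.t. weight

full = dL_da * da_dz * dz_dw
simplified = (a - y) * x

print(round(simplified, 3))            # -0.538
print(abs(full - simplified) < 1e-9)   # True: the two forms agree
```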


Common pitfalls

Forgetting to multiply by the inner derivative. The most common chain rule mistake. If you differentiate \sin(3x) and get \cos(3x) instead of 3\cos(3x), you have missed the inner derivative.

Confusing (f(x))^2 with f(x^2). These are different compositions:

  • (f(x))^2: outer is squaring, inner is f. Derivative: 2f(x) \cdot f'(x).
  • f(x^2): outer is f, inner is squaring. Derivative: f'(x^2) \cdot 2x.
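A small numerical illustration of this pitfall, using f = sin as the example function and an arbitrary test point x = 0.5:

```python
import math

# (sin(x))^2 and sin(x^2) have different derivatives; check both numerically.
x, h = 0.5, 1e-6

# d/dx (sin x)^2 should equal 2 sin(x) cos(x)
d_sq_of_f = (math.sin(x + h)**2 - math.sin(x - h)**2) / (2 * h)
assert abs(d_sq_of_f - 2 * math.sin(x) * math.cos(x)) < 1e-8

# d/dx sin(x^2) should equal cos(x^2) * 2x
d_f_of_sq = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
assert abs(d_f_of_sq - math.cos(x**2) * 2 * x) < 1e-8

print(round(d_sq_of_f, 3), round(d_f_of_sq, 3))  # two different values
```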

Not simplifying. Chain rule expressions get messy fast. Factor common terms and simplify before evaluating. In backpropagation, frameworks handle this automatically, but understanding it helps you debug.


Derivatives of common activation functions

ML uses several activation functions beyond the sigmoid. Knowing their derivatives helps you understand training dynamics.

Comparing common activation functions: ReLU, sigmoid, and tanh

ReLU: f(x) = \max(0, x)

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}

The derivative is undefined at x = 0, but in practice we set it to 0 or 1. ReLU's constant gradient of 1 for positive inputs is why it trains faster than sigmoid, whose gradient saturates (goes to 0) for inputs of large magnitude.

Tanh: f(x) = \tanh(x)

f'(x) = 1 - \tanh^2(x)

Similar to sigmoid but centered at zero. The derivative is largest at x = 0 (where f'(0) = 1) and shrinks toward 0 as |x| grows.

Softplus: f(x) = \ln(1 + e^x)

f'(x) = \frac{e^x}{1 + e^x} = \sigma(x)

The derivative of softplus is the sigmoid. Softplus is a smooth approximation of ReLU.
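Both identities above (tanh' = 1 - tanh^2, softplus' = sigmoid) can be sanity-checked with a central difference; the test point x = 0.3 below is an arbitrary choice:

```python
import math

# Verify the tanh and softplus derivative identities numerically.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x, h = 0.3, 1e-6

tanh_numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
assert abs(tanh_numeric - (1 - math.tanh(x)**2)) < 1e-8

softplus = lambda t: math.log(1 + math.exp(t))
softplus_numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
assert abs(softplus_numeric - sigmoid(x)) < 1e-8

print("identities verified")
```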


Derivatives in Python

For symbolic derivatives (exact):

from sympy import symbols, diff, exp, sin

x = symbols('x')

f = 3*x**4 - 5*x**2 + 7*x - 2
print(diff(f, x))  # 12*x**3 - 10*x + 7

g = sin(exp(2*x))
print(diff(g, x))  # 2*exp(2*x)*cos(exp(2*x))

For numerical derivatives (automatic differentiation), PyTorch and JAX compute gradients through the chain rule automatically:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3*x**2 + 1)**5

y.backward()
print(x.grad)  # dy/dx evaluated at x=2

What comes next

Derivatives of single-variable functions are the starting point. In ML, functions almost always have multiple inputs (weights, biases, features). Head to Partial Derivatives and Gradients to learn how derivatives generalize to multiple dimensions, and why the gradient vector is central to optimization.
