Maths for ML · Part 8

Calculus Review: Derivatives and the Chain Rule

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Training a neural network means adjusting millions of parameters to reduce a loss function. The tool that makes this possible is the derivative. It tells you which direction to push each parameter, and by how much.

If you remember only one thing from calculus, make it the chain rule. It is the mathematical backbone of backpropagation.


What is a derivative?

The derivative of a function f(x) at a point x measures the instantaneous rate of change. Formally:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

Geometrically, it is the slope of the tangent line to f at x.

Function f(x) = x^2 with its tangent line at x = 2

  • If f'(x) > 0, the function is increasing at x.
  • If f'(x) < 0, the function is decreasing at x.
  • If f'(x) = 0, you are at a flat spot (possibly a minimum, maximum, or saddle point).

This is exactly the information gradient descent uses. To minimize f, step x against the sign of f'(x): decrease x where f'(x) > 0, increase it where f'(x) < 0.
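A minimal sketch of that update rule in code, using f(x) = x^2 (so f'(x) = 2x); the starting point and learning rate below are arbitrary choices for illustration:

```python
# Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
def f_prime(x):
    return 2 * x

x = 3.0      # arbitrary starting point
lr = 0.1     # arbitrary learning rate (step size)
for _ in range(100):
    x -= lr * f_prime(x)   # step against the sign of the derivative

print(round(x, 6))  # converges toward the minimum at x = 0
```

Each step moves x toward the flat spot at zero; the step shrinks as the slope does.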


Basic differentiation rules

You rarely compute derivatives from the limit definition. Instead, you use rules.

Power rule

\frac{d}{dx} x^n = n x^{n-1}

Works for any real n, not just integers. So \frac{d}{dx} \sqrt{x} = \frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2}.

Constant multiple rule

\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)

Sum rule

\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)

Product rule

\frac{d}{dx}[f(x) \cdot g(x)] = f'(x) g(x) + f(x) g'(x)

Quotient rule

\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}

Common derivatives

Function                          Derivative
e^x                               e^x
\ln(x)                            1/x
\sin(x)                           \cos(x)
\cos(x)                           -\sin(x)
\sigma(x) = \frac{1}{1+e^{-x}}    \sigma(x)(1 - \sigma(x))

That last one, the sigmoid, shows up constantly in ML. Its derivative has a beautiful form that makes computation efficient.
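The sigmoid identity is easy to sanity-check numerically with a central difference; the test point x = 0.7 and step h = 1e-6 below are arbitrary choices:

```python
import math

# Check that sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) at an arbitrary point.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))

print(abs(numeric - analytic) < 1e-8)  # True: the two agree to high precision
```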

Derivative computation process:

graph TD
  F["Start with function f"] --> ID["Identify structure<br/>Sum, product, composition?"]
  ID --> RULE["Apply the matching rule<br/>Power, product, chain, etc."]
  RULE --> SIMP["Simplify the expression<br/>Factor common terms"]
  SIMP --> EVAL["Evaluate at a point if needed"]

Example 1: Differentiating a polynomial

Find \frac{d}{dx}(3x^4 - 5x^2 + 7x - 2).

Apply the power rule and sum rule term by term:

\frac{d}{dx}(3x^4) = 3 \cdot 4x^3 = 12x^3
\frac{d}{dx}(-5x^2) = -5 \cdot 2x = -10x
\frac{d}{dx}(7x) = 7
\frac{d}{dx}(-2) = 0

Result:

f'(x) = 12x^3 - 10x + 7

Evaluate at x = 1:

f'(1) = 12(1) - 10(1) + 7 = 12 - 10 + 7 = 9

The function is increasing at x = 1, with a slope of 9.
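You can confirm the result with a central finite difference (the step size h below is an arbitrary small value):

```python
# Numerical check of Example 1: f'(1) should be 9.
def f(x):
    return 3 * x**4 - 5 * x**2 + 7 * x - 2

h = 1e-6
approx = (f(1 + h) - f(1 - h)) / (2 * h)  # central difference at x = 1
print(round(approx, 3))  # 9.0
```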


Example 2: Product rule

Find \frac{d}{dx}[x^2 \cdot e^x].

Let f(x) = x^2 and g(x) = e^x. Then f'(x) = 2x and g'(x) = e^x.

\frac{d}{dx}[x^2 e^x] = 2x \cdot e^x + x^2 \cdot e^x = e^x(2x + x^2)

Evaluate at x = 2:

\frac{d}{dx}[x^2 e^x]\bigg|_{x=2} = e^2(4 + 4) = 8e^2 \approx 8 \times 7.389 \approx 59.11

The chain rule

Here is the rule that makes deep learning work.

If y = f(g(x)), that is, f composed with g, then:

\frac{dy}{dx} = f'(g(x)) \cdot g'(x)

In Leibniz notation, if y = f(u) and u = g(x):

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

The chain rule says: differentiate the outer function (leaving the inner function alone), then multiply by the derivative of the inner function.

Why the chain rule matters for ML

A neural network is a composition of functions. Layer 1 feeds into layer 2, which feeds into layer 3, and so on. The loss function sits at the end. To compute how the loss changes with respect to a weight in layer 1, you need the chain rule. Repeatedly.

This repeated application of the chain rule is called backpropagation. It is not a separate algorithm. It is just the chain rule applied efficiently to a computation graph.

Chain rule as nested function boxes:

graph LR
  X["x"] --> G["g: inner function"]
  G -->|"g of x"| F["f: outer function"]
  F -->|"f of g of x"| Y["y"]
  Y -.->|"df/dg"| F
  F -.->|"dg/dx"| G
  G -.->|"dy/dx = df/dg * dg/dx"| X

Example 3: Chain rule, step by step

Differentiate y = (3x^2 + 1)^5.

Identify the layers:

  • Outer function: f(u) = u^5
  • Inner function: u = g(x) = 3x^2 + 1

Step 1: Differentiate the outer function.

\frac{dy}{du} = 5u^4

Step 2: Differentiate the inner function.

\frac{du}{dx} = 6x

Step 3: Multiply.

\frac{dy}{dx} = 5u^4 \cdot 6x = 5(3x^2 + 1)^4 \cdot 6x = 30x(3x^2 + 1)^4

Evaluate at x = 1:

\frac{dy}{dx}\bigg|_{x=1} = 30(1)(3 + 1)^4 = 30 \cdot 256 = 7680
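The same computation, sketched as the two-step Leibniz form in code:

```python
# Example 3 via the Leibniz form: (dy/du at u = g(1)) times (du/dx at x = 1).
x = 1.0
u = 3 * x**2 + 1     # inner function g(x); u = 4 at x = 1
dy_du = 5 * u**4     # outer derivative f'(u) = 5u^4
du_dx = 6 * x        # inner derivative g'(x) = 6x

print(dy_du * du_dx)  # 7680.0
```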

Example 4: Chain rule with exponentials

Differentiate y = e^{-x^2}.

Outer function: f(u) = e^u, so f'(u) = e^u.

Inner function: u = -x^2, so u' = -2x.

\frac{dy}{dx} = e^{-x^2} \cdot (-2x) = -2x \, e^{-x^2}

Evaluate at x = 1:

\frac{dy}{dx}\bigg|_{x=1} = -2(1) \cdot e^{-1} = -2e^{-1} \approx -2 \times 0.368 = -0.736

The function is decreasing at x = 1. This makes sense: e^{-x^2} is a bell curve centered at zero, and it slopes downward for x > 0.


Example 5: Nested chain rule (three layers)

Differentiate y = \sin(e^{2x}).

This is a three-layer composition: \sin(\cdot) wrapping e^{(\cdot)} wrapping 2x.

Layer 1 (outermost): f(u) = \sin(u), so f'(u) = \cos(u).

Layer 2: u = e^v, so \frac{du}{dv} = e^v.

Layer 3 (innermost): v = 2x, so \frac{dv}{dx} = 2.

Chain them together:

\frac{dy}{dx} = \cos(e^{2x}) \cdot e^{2x} \cdot 2 = 2e^{2x}\cos(e^{2x})

Evaluate at x = 0:

\frac{dy}{dx}\bigg|_{x=0} = 2 \cdot e^0 \cdot \cos(e^0) = 2 \cdot 1 \cdot \cos(1) \approx 2 \times 0.540 = 1.080

This nested pattern is exactly what happens in deep networks. Each layer is a link in the chain, and the chain rule connects them all.
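The three-factor product can be sketched numerically, using the values from the example above:

```python
import math

# Example 5 at x = 0: multiply the three layer derivatives cos(e^{2x}) * e^{2x} * 2.
x = 0.0
v = 2 * x                      # innermost layer
u = math.exp(v)                # middle layer; e^0 = 1
dy_dx = math.cos(u) * u * 2    # product of the three layer derivatives

print(round(dy_dx, 2))         # 1.08
```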

Neural network as a chain of function compositions:

graph LR
  X["Input x"] --> L1["Layer 1<br/>z1 = W1 x + b1<br/>a1 = activation of z1"]
  L1 --> L2["Layer 2<br/>z2 = W2 a1 + b2<br/>a2 = activation of z2"]
  L2 --> LN["...<br/>More layers"]
  LN --> OUT["Output<br/>prediction y-hat"]
  OUT --> LOSS["Loss L"]
  LOSS -.->|"dL/da2"| LN
  LN -.->|"dL/da1"| L2
  L2 -.->|"dL/dx"| L1

The chain rule in a neural network

Consider a tiny network: input x, one hidden layer, one output, and a loss.

z = wx + b \quad \text{(linear layer)}
a = \sigma(z) \quad \text{(sigmoid activation)}
L = -[y \ln(a) + (1-y)\ln(1-a)] \quad \text{(binary cross-entropy loss)}

To update w, we need \frac{\partial L}{\partial w}. The chain rule gives:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Compute each piece:

\frac{\partial z}{\partial w} = x
\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z))
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}

Put it together:

\frac{\partial L}{\partial w} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \cdot \sigma(z)(1-\sigma(z)) \cdot x

After simplification (the sigmoid and cross-entropy loss are designed to simplify nicely):

\frac{\partial L}{\partial w} = (a - y) \cdot x

The prediction error (a - y) times the input x. Clean and efficient. This simplification is not a coincidence; the sigmoid and log-loss were chosen partly because their derivatives interact so well.

Numerical check: Let w = 0.5, b = 0, x = 2, y = 1.

z = 0.5 \times 2 + 0 = 1
a = \sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.731
\frac{\partial L}{\partial w} = (0.731 - 1) \times 2 = -0.269 \times 2 = -0.538

The negative gradient tells us: increase w to reduce the loss. That matches intuition, since the model is underpredicting (a < y).
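The whole worked example fits in a few lines of code. The sketch below assumes the same values (w = 0.5, b = 0, x = 2, y = 1) and checks that the full three-factor chain rule product matches the simplified form (a - y) x:

```python
import math

# Forward pass, then the gradient two ways: full chain rule vs. simplified form.
w, b, x, y = 0.5, 0.0, 2.0, 1.0

z = w * x + b                          # linear layer
a = 1.0 / (1.0 + math.exp(-z))         # sigmoid activation

dL_da = -y / a + (1 - y) / (1 - a)     # loss w.r.t. activation
da_dz = a * (1 - a)                    # sigmoid derivative
dz_dw = x                              # linear layer w.r.t. weight

full = dL_da * da_dz * dz_dw
simplified = (a - y) * x

print(round(simplified, 3))            # -0.538
print(abs(full - simplified) < 1e-9)   # True: the two forms agree
```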


Common pitfalls

Forgetting to multiply by the inner derivative. The most common chain rule mistake. If you differentiate \sin(3x) and get \cos(3x) instead of 3\cos(3x), you have missed the inner derivative.

Confusing (f(x))^2 with f(x^2). These are different compositions:

  • (f(x))^2: outer is squaring, inner is f. Derivative: 2f(x) \cdot f'(x).
  • f(x^2): outer is f, inner is squaring. Derivative: f'(x^2) \cdot 2x.
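A small numerical illustration of this pitfall, using f = sin as the example function and an arbitrary test point x = 0.5:

```python
import math

# (sin(x))^2 and sin(x^2) have different derivatives; check both numerically.
x, h = 0.5, 1e-6

# d/dx (sin x)^2 should equal 2 sin(x) cos(x)
d_sq_of_f = (math.sin(x + h)**2 - math.sin(x - h)**2) / (2 * h)
assert abs(d_sq_of_f - 2 * math.sin(x) * math.cos(x)) < 1e-8

# d/dx sin(x^2) should equal cos(x^2) * 2x
d_f_of_sq = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
assert abs(d_f_of_sq - math.cos(x**2) * 2 * x) < 1e-8

print(round(d_sq_of_f, 3), round(d_f_of_sq, 3))  # two different values
```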

Not simplifying. Chain rule expressions get messy fast. Factor common terms and simplify before evaluating. In backpropagation, frameworks handle this automatically, but understanding it helps you debug.


Derivatives of common activation functions

ML uses several activation functions beyond the sigmoid. Knowing their derivatives helps you understand training dynamics.

Comparing common activation functions: ReLU, sigmoid, and tanh

ReLU: f(x) = \max(0, x)

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}

The derivative is undefined at x = 0, but in practice we set it to 0 or 1. ReLU's constant gradient of 1 for positive inputs is why it trains faster than sigmoid, whose gradient saturates (goes to 0) for inputs of large magnitude.

Tanh: f(x) = \tanh(x)

f'(x) = 1 - \tanh^2(x)

Similar to sigmoid but centered at zero. The derivative is largest at x = 0 (where f'(0) = 1) and shrinks toward 0 as |x| grows.

Softplus: f(x) = \ln(1 + e^x)

f'(x) = \frac{e^x}{1 + e^x} = \sigma(x)

The derivative of softplus is the sigmoid. Softplus is a smooth approximation of ReLU.
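Both identities above (tanh' = 1 - tanh^2, softplus' = sigmoid) can be sanity-checked with a central difference; the test point x = 0.3 below is an arbitrary choice:

```python
import math

# Verify the tanh and softplus derivative identities numerically.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x, h = 0.3, 1e-6

tanh_numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
assert abs(tanh_numeric - (1 - math.tanh(x)**2)) < 1e-8

softplus = lambda t: math.log(1 + math.exp(t))
softplus_numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
assert abs(softplus_numeric - sigmoid(x)) < 1e-8

print("identities verified")
```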


Derivatives in Python

For symbolic derivatives (exact):

from sympy import symbols, diff, exp, sin

x = symbols('x')

f = 3*x**4 - 5*x**2 + 7*x - 2
print(diff(f, x))  # 12*x**3 - 10*x + 7

g = sin(exp(2*x))
print(diff(g, x))  # 2*exp(2*x)*cos(exp(2*x))

For numerical derivatives (automatic differentiation), PyTorch and JAX compute gradients through the chain rule automatically:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3*x**2 + 1)**5

y.backward()
print(x.grad)  # dy/dx evaluated at x=2

What comes next

Derivatives of single-variable functions are the starting point. In ML, functions almost always have multiple inputs (weights, biases, features). Head to Partial Derivatives and Gradients to learn how derivatives generalize to multiple dimensions, and why the gradient vector is central to optimization.
