Maths for ML · Part 9

Partial Derivatives and Gradients

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Most functions in machine learning take many inputs. A loss function might depend on thousands of weights simultaneously. Partial derivatives let you isolate the effect of each input, one at a time. The gradient bundles all those partial derivatives into a single vector, giving you the direction of steepest increase.

Prerequisites: You should be comfortable with single-variable derivatives and the chain rule.


Partial derivatives

Given a function f(x, y) of two variables, the partial derivative with respect to x is:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}

The key idea: treat every other variable as a constant and differentiate normally. If you can take a regular derivative, you can take a partial derivative.

Notation varies. You will see \frac{\partial f}{\partial x}, f_x, and \partial_x f. They all mean the same thing.


Example 1: Partial derivatives of a two-variable function

Let f(x, y) = x^2 y + 3xy^2 - 5y + 2.

Partial derivative with respect to x (treat y as a constant):

\frac{\partial f}{\partial x} = 2xy + 3y^2

Term by term: x^2 y gives 2xy (power rule on x; y is constant). 3xy^2 gives 3y^2 (y^2 is constant, derivative of x is 1). -5y gives 0. The constant 2 gives 0.

Partial derivative with respect to y (treat x as a constant):

\frac{\partial f}{\partial y} = x^2 + 6xy - 5

Evaluate at the point (2, 1):

\frac{\partial f}{\partial x}\bigg|_{(2,1)} = 2(2)(1) + 3(1)^2 = 4 + 3 = 7

\frac{\partial f}{\partial y}\bigg|_{(2,1)} = (2)^2 + 6(2)(1) - 5 = 4 + 12 - 5 = 11

At the point (2, 1): if you nudge x up a tiny bit (holding y fixed), f increases at rate 7. If you nudge y up instead, f increases at rate 11. The function is more sensitive to y at this point.

Partial derivatives as separate slices through the surface:

graph LR
  subgraph sx["Slice along x"]
      A1["Hold y = 1 fixed"] --> A2["Differentiate in x"] --> A3["Slope = 7"]
  end
  subgraph sy["Slice along y"]
      B1["Hold x = 2 fixed"] --> B2["Differentiate in y"] --> B3["Slope = 11"]
  end
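A quick way to sanity-check these values is a central finite difference: nudge one input while holding the other fixed. A minimal sketch (the helper names are illustrative, not from any library):

```python
# Verify the Example 1 partials at (2, 1) numerically.
def f(x, y):
    return x**2 * y + 3 * x * y**2 - 5 * y + 2

def partial_x(x, y, h=1e-6):
    # Nudge x, hold y fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-6):
    # Nudge y, hold x fixed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_x(2.0, 1.0))  # ≈ 7, matching 2xy + 3y^2
print(partial_y(2.0, 1.0))  # ≈ 11, matching x^2 + 6xy - 5
```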

The gradient vector

The gradient of a scalar function f(x_1, x_2, \ldots, x_n) is the vector of all its partial derivatives:

\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The symbol \nabla is called “nabla” or “del.”

Key property: direction of steepest ascent

The gradient \nabla f at a point points in the direction where f increases the fastest. Its magnitude \|\nabla f\| tells you the rate of increase in that direction.

This is why gradient descent moves in the direction -\nabla f. You want to go downhill (decrease the loss), so you move opposite to the gradient.

The gradient descent loop:

graph LR
  A["Current point x"] --> B["Compute gradient"]
  B --> C["Negate direction"]
  C --> D["Step: x = x - lr * gradient"]
  D --> E{"Converged?"}
  E -->|No| A
  E -->|Yes| F["Local minimum found"]
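The loop above can be sketched in a few lines of NumPy. The quadratic f(x, y) = x^2 + 4y^2, whose gradient is (2x, 8y), is an assumed toy objective, and the learning rate is an arbitrary choice:

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = x^2 + 4y^2
    x, y = p
    return np.array([2 * x, 8 * y])

p = np.array([2.0, 1.0])   # current point
lr = 0.1                   # learning rate
for _ in range(100):
    g = grad(p)
    if np.linalg.norm(g) < 1e-8:   # converged?
        break
    p = p - lr * g                  # step opposite the gradient

print(p)  # very close to [0, 0], the minimum
```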

Gradient from Example 1

Using f(x, y) = x^2 y + 3xy^2 - 5y + 2:

\nabla f = \begin{bmatrix} 2xy + 3y^2 \\ x^2 + 6xy - 5 \end{bmatrix}

At (2, 1):

\nabla f(2, 1) = \begin{bmatrix} 7 \\ 11 \end{bmatrix}

The magnitude of the gradient:

\|\nabla f(2, 1)\| = \sqrt{7^2 + 11^2} = \sqrt{49 + 121} = \sqrt{170} \approx 13.04

The direction of steepest ascent at (2, 1) is along [7, 11]^T. To decrease f fastest, you would move in the direction [-7, -11]^T.


Example 2: Gradient of a three-variable function

Let f(x, y, z) = x^2 z + e^{yz} - 3xz.

Partial derivatives:

\frac{\partial f}{\partial x} = 2xz - 3z

\frac{\partial f}{\partial y} = ze^{yz}

\frac{\partial f}{\partial z} = x^2 + ye^{yz} - 3x

The gradient:

\nabla f = \begin{bmatrix} 2xz - 3z \\ ze^{yz} \\ x^2 + ye^{yz} - 3x \end{bmatrix}

Evaluate at (1, 0, 2):

\frac{\partial f}{\partial x}\bigg|_{(1,0,2)} = 2(1)(2) - 3(2) = 4 - 6 = -2

\frac{\partial f}{\partial y}\bigg|_{(1,0,2)} = 2 \cdot e^{0} = 2

\frac{\partial f}{\partial z}\bigg|_{(1,0,2)} = 1 + 0 \cdot e^{0} - 3 = 1 + 0 - 3 = -2

\nabla f(1, 0, 2) = \begin{bmatrix} -2 \\ 2 \\ -2 \end{bmatrix}

The function is decreasing in the x and z directions but increasing in the y direction at this point.
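As in Example 1, a central finite difference confirms these values numerically (the helper names are illustrative):

```python
import math

# Numerical gradient of f(x, y, z) = x^2 z + e^(yz) - 3xz at (1, 0, 2).
def f(x, y, z):
    return x**2 * z + math.exp(y * z) - 3 * x * z

def num_grad(x, y, z, h=1e-6):
    # Central differences in each coordinate, holding the others fixed
    return [
        (f(x + h, y, z) - f(x - h, y, z)) / (2 * h),
        (f(x, y + h, z) - f(x, y - h, z)) / (2 * h),
        (f(x, y, z + h) - f(x, y, z - h)) / (2 * h),
    ]

print(num_grad(1.0, 0.0, 2.0))  # ≈ [-2, 2, -2]
```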


Example 3: Gradient of MSE loss

Here is where gradients meet machine learning directly. Consider a linear model with parameters w and b, predicting \hat{y}_i = wx_i + b for data points (x_i, y_i). The mean squared error loss is:

L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2

Partial derivative with respect to w:

\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} 2(wx_i + b - y_i) \cdot x_i = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)x_i

Partial derivative with respect to b:

\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} 2(wx_i + b - y_i) = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)

Numerical example: Suppose we have 3 data points and current parameters w = 2, b = 1.

x_i    y_i    \hat{y}_i = 2x_i + 1    error = \hat{y}_i - y_i
 1      4      3                       -1
 2      5      5                        0
 3      7      7                        0

\frac{\partial L}{\partial w} = \frac{2}{3}[(-1)(1) + (0)(2) + (0)(3)] = \frac{2}{3}(-1) = -\frac{2}{3} \approx -0.667

\frac{\partial L}{\partial b} = \frac{2}{3}[(-1) + 0 + 0] = -\frac{2}{3} \approx -0.667

\nabla L = \begin{bmatrix} -0.667 \\ -0.667 \end{bmatrix}

Both partial derivatives are negative, meaning we should increase both w and b slightly to reduce the loss. Gradient descent would update:

w_{\text{new}} = 2 - \alpha(-0.667) = 2 + 0.667\alpha

b_{\text{new}} = 1 - \alpha(-0.667) = 1 + 0.667\alpha

where \alpha is the learning rate.
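The closed-form gradients can be checked directly in NumPy on the same three data points:

```python
import numpy as np

# MSE gradient at w = 2, b = 1 for the 3-point example above.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 7.0])
w, b = 2.0, 1.0

y_hat = w * x + b
err = y_hat - y                            # [-1, 0, 0]
dL_dw = (2 / len(x)) * np.sum(err * x)     # ≈ -0.667
dL_db = (2 / len(x)) * np.sum(err)         # ≈ -0.667
print(dL_dw, dL_db)
```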


Geometric interpretation

Think of f(x, y) as a surface in 3D. At any point on the surface:

  • \frac{\partial f}{\partial x} is the slope if you walk in the x direction
  • \frac{\partial f}{\partial y} is the slope if you walk in the y direction
  • \nabla f points in the direction of steepest climb on the surface
  • -\nabla f points in the direction of steepest descent

The gradient is perpendicular to the level curves (contour lines) of f. If you have seen a topographic map, the gradient at any point would be perpendicular to the contour line passing through that point, pointing uphill.

(Figure: cross-sections of f(x, y) = x^2 + y^2 along each axis.)

Directional derivative

The gradient also lets you compute how fast f changes in any arbitrary direction \mathbf{u} (a unit vector):

D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}

This is the directional derivative. It is maximized when \mathbf{u} points in the same direction as \nabla f, which is why the gradient is the direction of steepest ascent.
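A short check using the Example 1 gradient [7, 11] at (2, 1): along each axis the directional derivative recovers the corresponding partial, and along the gradient itself it equals \|\nabla f\|:

```python
import numpy as np

# Gradient of Example 1 at (2, 1), taken from the worked example above.
g = np.array([7.0, 11.0])

def directional(u):
    u = u / np.linalg.norm(u)   # normalize to a unit vector
    return g @ u                # D_u f = grad f . u

print(directional(np.array([1.0, 0.0])))  # 7.0, recovers df/dx
print(directional(np.array([0.0, 1.0])))  # 11.0, recovers df/dy
print(directional(g))  # sqrt(170) ≈ 13.04, the maximum over all unit vectors
```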


Higher dimensions

In ML, functions often have thousands or millions of inputs (one per weight). The gradient is still the same idea, just a much longer vector:

\nabla L(\mathbf{w}) = \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}

Computing this efficiently for deep networks is what automatic differentiation and backpropagation handle. You never compute partial derivatives by hand for a neural network, but understanding what the gradient means helps you debug training.

When the gradient is zero (\nabla f = \mathbf{0}), you are at a critical point. It could be a minimum, maximum, or saddle point. To distinguish these, you need second-order information: the Hessian matrix.


The gradient and level curves

A level curve of f(x, y) is the set of points where f has the same value: f(x, y) = c for some constant c. Think of contour lines on a map.

The gradient is always perpendicular to the level curve at any point. If you stand on a contour line and look in the direction of \nabla f, you are looking straight uphill. This is not just a mathematical curiosity. It explains why gradient descent trajectories cross contour lines at right angles (at least locally).

Example: For f(x, y) = x^2 + 4y^2, the level curves are ellipses centered at the origin. At the point (2, 1):

\nabla f(2, 1) = \begin{bmatrix} 4 \\ 8 \end{bmatrix}

The level curve through (2, 1) is x^2 + 4y^2 = 8, which is an ellipse. The gradient [4, 8]^T points outward, perpendicular to the ellipse at that point. A gradient descent step would move in the direction [-4, -8]^T, heading toward the minimum at the origin.
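A numerical check, parametrizing the ellipse so the tangent direction comes from the curve itself rather than from the gradient:

```python
import numpy as np

# The ellipse x^2 + 4y^2 = 8 can be written (sqrt(8) cos t, sqrt(2) sin t);
# the point (2, 1) corresponds to t = pi/4.
t = np.pi / 4
point = np.array([np.sqrt(8) * np.cos(t), np.sqrt(2) * np.sin(t)])      # ≈ (2, 1)
tangent = np.array([-np.sqrt(8) * np.sin(t), np.sqrt(2) * np.cos(t)])   # curve velocity

grad_f = np.array([2 * point[0], 8 * point[1]])   # grad f = (2x, 8y) = (4, 8)

print(grad_f @ tangent)  # ≈ 0: gradient is perpendicular to the level curve
```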

Gradient vectors point perpendicular to level curves, toward higher values:

graph LR
  L1["Level curve f = 4"] -->|"grad f (perpendicular, uphill)"| L2["Level curve f = 8"]
  L2 -->|"grad f (perpendicular, uphill)"| L3["Level curve f = 12"]
  L3 -.->|"-grad f (downhill, descent)"| L1

Gradient of vector-valued functions

When a function outputs a vector instead of a scalar, the gradient generalizes to the Jacobian matrix. You will encounter this when you study neural network layers, where the function maps input activations to output activations. We cover this in the next article on the Jacobian and Hessian.


Gradients in Python

import torch

# Define parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Data
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 7.0])

# MSE loss
y_hat = w * x + b
loss = ((y_hat - y) ** 2).mean()

# Compute gradient
loss.backward()
print(f"dL/dw = {w.grad:.4f}")  # -0.6667
print(f"dL/db = {b.grad:.4f}")  # -0.6667

PyTorch applies the chain rule automatically through the computation graph. Every operation records how to differentiate itself, and .backward() walks the graph in reverse.
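Continuing the snippet above, a few hand-rolled descent steps drive the loss down (a sketch; the learning rate and number of steps are arbitrary choices):

```python
import torch

# Same model and data as above, now updated over several descent steps.
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 7.0])
lr = 0.05

for _ in range(50):
    loss = ((w * x + b - y) ** 2).mean()
    loss.backward()              # fills w.grad and b.grad via the graph
    with torch.no_grad():        # update without tracking these ops
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()               # gradients accumulate, so reset each step
    b.grad.zero_()

print(loss.item())  # well below the starting loss of 1/3
```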


Summary

  • Partial derivative: derivative with respect to one variable, holding others constant
  • Gradient: vector of all partial derivatives. Points in the direction of steepest ascent.
  • Gradient descent: move in the direction -\nabla f to minimize f
  • At a critical point: \nabla f = \mathbf{0}

The gradient tells you the direction. It does not tell you about curvature (how the surface bends). For that, you need second derivatives.


What comes next

The gradient gives you first-order information about a function. To understand curvature, classify critical points, and build better optimizers, you need the Jacobian and Hessian matrices. That is the next article in this series.
