Maths for ML · Part 9

Partial Derivatives and Gradients

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Most functions in machine learning take many inputs. A loss function might depend on thousands of weights simultaneously. Partial derivatives let you isolate the effect of each input, one at a time. The gradient bundles all those partial derivatives into a single vector, giving you the direction of steepest increase.

Prerequisites: You should be comfortable with single-variable derivatives and the chain rule.


Partial derivatives

Given a function f(x, y) of two variables, the partial derivative with respect to x is:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}

The key idea: treat every other variable as a constant and differentiate normally. If you can take a regular derivative, you can take a partial derivative.

Notation varies. You will see \frac{\partial f}{\partial x}, f_x, and \partial_x f. They all mean the same thing.


Example 1: Partial derivatives of a two-variable function

Let f(x, y) = x^2 y + 3xy^2 - 5y + 2.

Partial derivative with respect to x (treat y as a constant):

\frac{\partial f}{\partial x} = 2xy + 3y^2

Term by term: x^2 y gives 2xy (power rule on x; y is constant). 3xy^2 gives 3y^2 (y^2 is constant, derivative of x is 1). -5y gives 0. The constant 2 gives 0.

Partial derivative with respect to y (treat x as a constant):

\frac{\partial f}{\partial y} = x^2 + 6xy - 5

Evaluate at the point (2, 1):

\frac{\partial f}{\partial x}\bigg|_{(2,1)} = 2(2)(1) + 3(1)^2 = 4 + 3 = 7

\frac{\partial f}{\partial y}\bigg|_{(2,1)} = (2)^2 + 6(2)(1) - 5 = 4 + 12 - 5 = 11

At the point (2, 1): if you nudge x up a tiny bit (holding y fixed), f increases at rate 7. If you nudge y up instead, f increases at rate 11. The function is more sensitive to y at this point.

Partial derivatives as separate slices through the surface:

graph LR
  subgraph sx["Slice along x"]
      A1["Hold y = 1 fixed"] --> A2["Differentiate in x"] --> A3["Slope = 7"]
  end
  subgraph sy["Slice along y"]
      B1["Hold x = 2 fixed"] --> B2["Differentiate in y"] --> B3["Slope = 11"]
  end
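A quick way to sanity-check these values is a central finite difference: nudge one input while holding the other fixed. A minimal sketch (the helper names are illustrative, not from any library):

```python
# Verify the Example 1 partials at (2, 1) numerically.
def f(x, y):
    return x**2 * y + 3 * x * y**2 - 5 * y + 2

def partial_x(x, y, h=1e-6):
    # Nudge x, hold y fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-6):
    # Nudge y, hold x fixed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_x(2.0, 1.0))  # ≈ 7, matching 2xy + 3y^2
print(partial_y(2.0, 1.0))  # ≈ 11, matching x^2 + 6xy - 5
```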

The gradient vector

The gradient of a scalar function f(x_1, x_2, \ldots, x_n) is the vector of all its partial derivatives:

\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The symbol \nabla is called “nabla” or “del.”

Key property: direction of steepest ascent

The gradient \nabla f at a point points in the direction where f increases the fastest. Its magnitude \|\nabla f\| tells you the rate of increase in that direction.

This is why gradient descent moves in the direction -\nabla f. You want to go downhill (decrease the loss), so you move opposite to the gradient.

The gradient descent loop:

graph LR
  A["Current point x"] --> B["Compute gradient"]
  B --> C["Negate direction"]
  C --> D["Step: x = x - lr * gradient"]
  D --> E{"Converged?"}
  E -->|No| A
  E -->|Yes| F["Local minimum found"]
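The loop above can be sketched in a few lines of NumPy. The quadratic f(x, y) = x^2 + 4y^2, whose gradient is (2x, 8y), is an assumed toy objective, and the learning rate is an arbitrary choice:

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = x^2 + 4y^2
    x, y = p
    return np.array([2 * x, 8 * y])

p = np.array([2.0, 1.0])   # current point
lr = 0.1                   # learning rate
for _ in range(100):
    g = grad(p)
    if np.linalg.norm(g) < 1e-8:   # converged?
        break
    p = p - lr * g                  # step opposite the gradient

print(p)  # very close to [0, 0], the minimum
```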

Gradient from Example 1

Using f(x, y) = x^2 y + 3xy^2 - 5y + 2:

\nabla f = \begin{bmatrix} 2xy + 3y^2 \\ x^2 + 6xy - 5 \end{bmatrix}

At (2, 1):

\nabla f(2, 1) = \begin{bmatrix} 7 \\ 11 \end{bmatrix}

The magnitude of the gradient:

\|\nabla f(2, 1)\| = \sqrt{7^2 + 11^2} = \sqrt{49 + 121} = \sqrt{170} \approx 13.04

The direction of steepest ascent at (2, 1) is along [7, 11]^T. To decrease f fastest, you would move in the direction [-7, -11]^T.


Example 2: Gradient of a three-variable function

Let f(x, y, z) = x^2 z + e^{yz} - 3xz.

Partial derivatives:

\frac{\partial f}{\partial x} = 2xz - 3z

\frac{\partial f}{\partial y} = ze^{yz}

\frac{\partial f}{\partial z} = x^2 + ye^{yz} - 3x

The gradient:

\nabla f = \begin{bmatrix} 2xz - 3z \\ ze^{yz} \\ x^2 + ye^{yz} - 3x \end{bmatrix}

Evaluate at (1, 0, 2):

\frac{\partial f}{\partial x}\bigg|_{(1,0,2)} = 2(1)(2) - 3(2) = 4 - 6 = -2

\frac{\partial f}{\partial y}\bigg|_{(1,0,2)} = 2 \cdot e^{0} = 2

\frac{\partial f}{\partial z}\bigg|_{(1,0,2)} = 1 + 0 \cdot e^{0} - 3 = 1 + 0 - 3 = -2

\nabla f(1, 0, 2) = \begin{bmatrix} -2 \\ 2 \\ -2 \end{bmatrix}

The function is decreasing in the x and z directions but increasing in the y direction at this point.
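As in Example 1, a central finite difference confirms these values numerically (the helper names are illustrative):

```python
import math

# Numerical gradient of f(x, y, z) = x^2 z + e^(yz) - 3xz at (1, 0, 2).
def f(x, y, z):
    return x**2 * z + math.exp(y * z) - 3 * x * z

def num_grad(x, y, z, h=1e-6):
    # Central differences in each coordinate, holding the others fixed
    return [
        (f(x + h, y, z) - f(x - h, y, z)) / (2 * h),
        (f(x, y + h, z) - f(x, y - h, z)) / (2 * h),
        (f(x, y, z + h) - f(x, y, z - h)) / (2 * h),
    ]

print(num_grad(1.0, 0.0, 2.0))  # ≈ [-2, 2, -2]
```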


Example 3: Gradient of MSE loss

Here is where gradients meet machine learning directly. Consider a linear model with parameters w and b, predicting \hat{y}_i = wx_i + b for data points (x_i, y_i). The mean squared error loss is:

L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2

Partial derivative with respect to w:

\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} 2(wx_i + b - y_i) \cdot x_i = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)x_i

Partial derivative with respect to b:

\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} 2(wx_i + b - y_i) = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)

Numerical example: Suppose we have 3 data points and current parameters w = 2, b = 1.

x_i    y_i    \hat{y}_i = 2x_i + 1    error = \hat{y}_i - y_i
 1      4      3                       -1
 2      5      5                        0
 3      7      7                        0

\frac{\partial L}{\partial w} = \frac{2}{3}[(-1)(1) + (0)(2) + (0)(3)] = \frac{2}{3}(-1) = -\frac{2}{3} \approx -0.667

\frac{\partial L}{\partial b} = \frac{2}{3}[(-1) + 0 + 0] = -\frac{2}{3} \approx -0.667

\nabla L = \begin{bmatrix} -0.667 \\ -0.667 \end{bmatrix}

Both partial derivatives are negative, meaning we should increase both w and b slightly to reduce the loss. Gradient descent would update:

w_{\text{new}} = 2 - \alpha(-0.667) = 2 + 0.667\alpha

b_{\text{new}} = 1 - \alpha(-0.667) = 1 + 0.667\alpha

where \alpha is the learning rate.
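The closed-form gradients can be checked directly in NumPy on the same three data points:

```python
import numpy as np

# MSE gradient at w = 2, b = 1 for the 3-point example above.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 7.0])
w, b = 2.0, 1.0

y_hat = w * x + b
err = y_hat - y                            # [-1, 0, 0]
dL_dw = (2 / len(x)) * np.sum(err * x)     # ≈ -0.667
dL_db = (2 / len(x)) * np.sum(err)         # ≈ -0.667
print(dL_dw, dL_db)
```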


Geometric interpretation

Think of f(x, y) as a surface in 3D. At any point on the surface:

  • \frac{\partial f}{\partial x} is the slope if you walk in the x direction
  • \frac{\partial f}{\partial y} is the slope if you walk in the y direction
  • \nabla f points in the direction of steepest climb on the surface
  • -\nabla f points in the direction of steepest descent

The gradient is perpendicular to the level curves (contour lines) of f. If you have seen a topographic map, the gradient at any point would be perpendicular to the contour line passing through that point, pointing uphill.

(Figure: cross-sections of f(x, y) = x^2 + y^2 along each axis.)

Directional derivative

The gradient also lets you compute how fast f changes in any arbitrary direction \mathbf{u} (a unit vector):

D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}

This is the directional derivative. It is maximized when \mathbf{u} points in the same direction as \nabla f, which is why the gradient is the direction of steepest ascent.
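A short check using the Example 1 gradient [7, 11] at (2, 1): along each axis the directional derivative recovers the corresponding partial, and along the gradient itself it equals \|\nabla f\|:

```python
import numpy as np

# Gradient of Example 1 at (2, 1), taken from the worked example above.
g = np.array([7.0, 11.0])

def directional(u):
    u = u / np.linalg.norm(u)   # normalize to a unit vector
    return g @ u                # D_u f = grad f . u

print(directional(np.array([1.0, 0.0])))  # 7.0, recovers df/dx
print(directional(np.array([0.0, 1.0])))  # 11.0, recovers df/dy
print(directional(g))  # sqrt(170) ≈ 13.04, the maximum over all unit vectors
```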


Higher dimensions

In ML, functions often have thousands or millions of inputs (one per weight). The gradient is still the same idea, just a much longer vector:

\nabla L(\mathbf{w}) = \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}

Computing this efficiently for deep networks is what automatic differentiation and backpropagation handle. You never compute partial derivatives by hand for a neural network, but understanding what the gradient means helps you debug training.

When the gradient is zero (\nabla f = \mathbf{0}), you are at a critical point. It could be a minimum, maximum, or saddle point. To distinguish these, you need second-order information: the Hessian matrix.


The gradient and level curves

A level curve of f(x, y) is the set of points where f has the same value: f(x, y) = c for some constant c. Think of contour lines on a map.

The gradient is always perpendicular to the level curve at any point. If you stand on a contour line and look in the direction of \nabla f, you are looking straight uphill. This is not just a mathematical curiosity. It explains why gradient descent trajectories cross contour lines at right angles (at least locally).

Example: For f(x, y) = x^2 + 4y^2, the level curves are ellipses centered at the origin. At the point (2, 1):

\nabla f(2, 1) = \begin{bmatrix} 4 \\ 8 \end{bmatrix}

The level curve through (2, 1) is x^2 + 4y^2 = 8, which is an ellipse. The gradient [4, 8]^T points outward, perpendicular to the ellipse at that point. A gradient descent step would move in the direction [-4, -8]^T, heading toward the minimum at the origin.
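A numerical check, parametrizing the ellipse so the tangent direction comes from the curve itself rather than from the gradient:

```python
import numpy as np

# The ellipse x^2 + 4y^2 = 8 can be written (sqrt(8) cos t, sqrt(2) sin t);
# the point (2, 1) corresponds to t = pi/4.
t = np.pi / 4
point = np.array([np.sqrt(8) * np.cos(t), np.sqrt(2) * np.sin(t)])      # ≈ (2, 1)
tangent = np.array([-np.sqrt(8) * np.sin(t), np.sqrt(2) * np.cos(t)])   # curve velocity

grad_f = np.array([2 * point[0], 8 * point[1]])   # grad f = (2x, 8y) = (4, 8)

print(grad_f @ tangent)  # ≈ 0: gradient is perpendicular to the level curve
```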

Gradient vectors point perpendicular to level curves, toward higher values:

graph LR
  L1["Level curve f = 4"] -->|"grad f (perpendicular, uphill)"| L2["Level curve f = 8"]
  L2 -->|"grad f (perpendicular, uphill)"| L3["Level curve f = 12"]
  L3 -.->|"-grad f (downhill, descent)"| L1

Gradient of vector-valued functions

When a function outputs a vector instead of a scalar, the gradient generalizes to the Jacobian matrix. You will encounter this when you study neural network layers, where the function maps input activations to output activations. We cover this in the next article on the Jacobian and Hessian.


Gradients in Python

import torch

# Define parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Data
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 7.0])

# MSE loss
y_hat = w * x + b
loss = ((y_hat - y) ** 2).mean()

# Compute gradient
loss.backward()
print(f"dL/dw = {w.grad:.4f}")  # -0.6667
print(f"dL/db = {b.grad:.4f}")  # -0.6667

PyTorch applies the chain rule automatically through the computation graph. Every operation records how to differentiate itself, and .backward() walks the graph in reverse.
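Continuing the snippet above, a few hand-rolled descent steps drive the loss down (a sketch; the learning rate and number of steps are arbitrary choices):

```python
import torch

# Same model and data as above, now updated over several descent steps.
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 7.0])
lr = 0.05

for _ in range(50):
    loss = ((w * x + b - y) ** 2).mean()
    loss.backward()              # fills w.grad and b.grad via the graph
    with torch.no_grad():        # update without tracking these ops
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()               # gradients accumulate, so reset each step
    b.grad.zero_()

print(loss.item())  # well below the starting loss of 1/3
```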


Summary

  • Partial derivative: derivative with respect to one variable, holding others constant
  • Gradient: vector of all partial derivatives. Points in the direction of steepest ascent.
  • Gradient descent: move in the direction -\nabla f to minimize f
  • At a critical point: \nabla f = \mathbf{0}

The gradient tells you the direction. It does not tell you about curvature (how the surface bends). For that, you need second derivatives.


What comes next

The gradient gives you first-order information about a function. To understand curvature, classify critical points, and build better optimizers, you need the Jacobian and Hessian matrices. That is the next article in this series.
