Partial Derivatives and Gradients
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Most functions in machine learning take many inputs. A loss function might depend on thousands of weights simultaneously. Partial derivatives let you isolate the effect of each input, one at a time. The gradient bundles all those partial derivatives into a single vector, giving you the direction of steepest increase.
Prerequisites: You should be comfortable with single-variable derivatives and the chain rule.
Partial derivatives
Given a function $f(x, y)$ of two variables, the partial derivative with respect to $x$ is:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$
The key idea: treat every other variable as a constant and differentiate normally. If you can take a regular derivative, you can take a partial derivative.
Notation varies. You will see $\frac{\partial f}{\partial x}$, $f_x$, and $\partial_x f$. They all mean the same thing.
Example 1: Partial derivatives of a two-variable function
Let $f(x, y) = x^2 + 3xy + 5y + 2$.

Partial derivative with respect to $x$ (treat $y$ as a constant):

$$\frac{\partial f}{\partial x} = 2x + 3y$$

Term by term: $x^2$ gives $2x$ (power rule on $x^2$; $y$ plays no role). $3xy$ gives $3y$ ($3y$ is constant, and the derivative of $x$ is 1). $5y$ gives $0$. The constant $2$ gives $0$.

Partial derivative with respect to $y$ (treat $x$ as a constant):

$$\frac{\partial f}{\partial y} = 3x + 5$$

Evaluate at the point $(2, 1)$:

$$\frac{\partial f}{\partial x}\bigg|_{(2,1)} = 2(2) + 3(1) = 7, \qquad \frac{\partial f}{\partial y}\bigg|_{(2,1)} = 3(2) + 5 = 11$$

At the point $(2, 1)$: if you nudge $x$ up a tiny bit (holding $y$ fixed), $f$ increases at rate 7. If you nudge $y$ up instead, $f$ increases at rate 11. The function is more sensitive to $y$ at this point.
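The hand computation can be sanity-checked numerically. This is a minimal sketch using central finite differences, assuming the function $f(x, y) = x^2 + 3xy + 5y + 2$ from Example 1 (the helper `partial` and the step size `h` are illustrative choices, not a library API):

```python
# Finite-difference check of the partials of f(x, y) = x^2 + 3xy + 5y + 2 at (2, 1).
def f(x, y):
    return x**2 + 3*x*y + 5*y + 2

def partial(func, point, i, h=1e-6):
    """Central-difference estimate of the partial derivative along axis i."""
    lo, hi = list(point), list(point)
    lo[i] -= h
    hi[i] += h
    return (func(*hi) - func(*lo)) / (2 * h)

fx = partial(f, (2.0, 1.0), 0)  # analytic answer: 2x + 3y = 7
fy = partial(f, (2.0, 1.0), 1)  # analytic answer: 3x + 5  = 11
print(round(fx, 4), round(fy, 4))  # 7.0 11.0
```

Nudging one coordinate while freezing the others is exactly the "treat everything else as a constant" rule, made concrete.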
Partial derivatives as separate slices through the surface:
```mermaid
graph LR
  subgraph sx["Slice along x"]
    A1["Hold y = 1 fixed"] --> A2["Differentiate in x"] --> A3["Slope = 7"]
  end
  subgraph sy["Slice along y"]
    B1["Hold x = 2 fixed"] --> B2["Differentiate in y"] --> B3["Slope = 11"]
  end
```
The gradient vector
The gradient of a scalar function $f(x_1, \dots, x_n)$ is the vector of all its partial derivatives:

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

The symbol $\nabla$ is called "nabla" or "del."
Key property: direction of steepest ascent
The gradient at a point points in the direction in which $f$ increases the fastest. Its magnitude tells you the rate of increase in that direction.

This is why gradient descent moves in the direction $-\nabla f$. You want to go downhill (decrease the loss), so you move opposite to the gradient.
The gradient descent loop:
```mermaid
graph LR
  A["Current point x"] --> B["Compute gradient"]
  B --> C["Negate direction"]
  C --> D["Step: x = x - lr * gradient"]
  D --> E{"Converged?"}
  E -->|No| A
  E -->|Yes| F["Local minimum found"]
```
Gradient from Example 1
Using $f(x, y) = x^2 + 3xy + 5y + 2$:

$$\nabla f = \begin{bmatrix} 2x + 3y \\ 3x + 5 \end{bmatrix}$$

At $(2, 1)$:

$$\nabla f(2, 1) = \begin{bmatrix} 7 \\ 11 \end{bmatrix}$$

The magnitude of the gradient:

$$\|\nabla f(2, 1)\| = \sqrt{7^2 + 11^2} = \sqrt{170} \approx 13.04$$

The direction of steepest ascent at $(2, 1)$ is roughly $(0.54, 0.84)$, the unit vector $\nabla f / \|\nabla f\|$. To decrease $f$ fastest, you would move in the direction $(-7, -11)$.
Example 2: Gradient of a three-variable function
Let $f(x, y, z) = x^2 + y^2 + z^2$.

Partial derivatives:

$$\frac{\partial f}{\partial x} = 2x, \qquad \frac{\partial f}{\partial y} = 2y, \qquad \frac{\partial f}{\partial z} = 2z$$

The gradient:

$$\nabla f = \begin{bmatrix} 2x \\ 2y \\ 2z \end{bmatrix}$$

Evaluate at $(-1, -2, 3)$:

$$\nabla f(-1, -2, 3) = \begin{bmatrix} -2 \\ -4 \\ 6 \end{bmatrix}$$

The function is decreasing in the $x$ and $y$ directions but increasing in the $z$ direction at this point.
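The same finite-difference idea extends to any number of variables, one axis at a time. A small sketch, assuming $f(x, y, z) = x^2 + y^2 + z^2$ as in Example 2 (the `gradient` helper is illustrative, not a library function):

```python
# Numerical gradient of f(x, y, z) = x^2 + y^2 + z^2 at (-1, -2, 3).
def f(x, y, z):
    return x**2 + y**2 + z**2

def gradient(func, point, h=1e-6):
    """Central-difference gradient: perturb one coordinate, hold the rest fixed."""
    g = []
    for i in range(len(point)):
        hi, lo = list(point), list(point)
        hi[i] += h
        lo[i] -= h
        g.append((func(*hi) - func(*lo)) / (2 * h))
    return g

g = gradient(f, (-1.0, -2.0, 3.0))
print([round(v, 4) for v in g])  # [-2.0, -4.0, 6.0]
```

The negative components along $x$ and $y$ and the positive component along $z$ match the sign pattern read off above.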
Example 3: Gradient of MSE loss
Here is where gradients meet machine learning directly. Consider a linear model with parameters $w$ and $b$, predicting $\hat{y}_i = w x_i + b$ for data points $(x_i, y_i)$. The mean squared error loss is:

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)^2$$

Partial derivative with respect to $w$:

$$\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)\, x_i$$

Partial derivative with respect to $b$:

$$\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)$$
Numerical example: Suppose we have 3 data points $(1, 4)$, $(2, 5)$, $(3, 7)$ and current parameters $w = 2$, $b = 1$.

| $x_i$ | $y_i$ | $\hat{y}_i = 2x_i + 1$ | error $\hat{y}_i - y_i$ |
|---|---|---|---|
| 1 | 4 | 3 | $-1$ |
| 2 | 5 | 5 | $0$ |
| 3 | 7 | 7 | $0$ |

Plugging the errors into the formulas:

$$\frac{\partial L}{\partial w} = \frac{2}{3}\big[(-1)(1) + (0)(2) + (0)(3)\big] = -\frac{2}{3} \approx -0.667$$

$$\frac{\partial L}{\partial b} = \frac{2}{3}\big[(-1) + 0 + 0\big] = -\frac{2}{3} \approx -0.667$$

Both gradients are negative, meaning we should increase both $w$ and $b$ slightly to reduce the loss. Gradient descent would update:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.
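The two analytic formulas translate directly into NumPy. A short sketch using the three data points and parameters from the table above:

```python
import numpy as np

# MSE gradients for y_hat = w*x + b on the table's three points, at w = 2, b = 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 7.0])
w, b = 2.0, 1.0

err = w * x + b - y                       # errors: [-1, 0, 0]
dL_dw = (2 / len(x)) * (err * x).sum()    # (2/n) * sum(err_i * x_i)
dL_db = (2 / len(x)) * err.sum()          # (2/n) * sum(err_i)
print(round(dL_dw, 4), round(dL_db, 4))   # -0.6667 -0.6667
```

These match the hand computation: only the first point contributes, since the other two errors are zero.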
Geometric interpretation
Think of $z = f(x, y)$ as a surface in 3D. At any point on the surface:
- $\partial f/\partial x$ is the slope if you walk in the $x$ direction
- $\partial f/\partial y$ is the slope if you walk in the $y$ direction
- $\nabla f$ points in the direction of steepest climb on the surface
- $-\nabla f$ points in the direction of steepest descent
The gradient is perpendicular to the level curves (contour lines) of $f$. If you have seen a topographic map, the gradient at any point would be perpendicular to the contour line passing through that point, pointing uphill.
Cross-sections of $f(x, y) = x^2 + y^2$ along each axis illustrate this: each slice is a parabola whose slope at a point is the corresponding partial derivative.
Directional derivative
The gradient also lets you compute how fast $f$ changes in any arbitrary direction $\mathbf{u}$ (a unit vector):

$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$$

This is the directional derivative. It is maximized when $\mathbf{u}$ points in the same direction as $\nabla f$, which is why the gradient is the direction of steepest ascent.
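A quick sketch of this formula, using the gradient $(7, 11)$ from Example 1 at the point $(2, 1)$ (the `directional` helper is illustrative and normalizes any input direction to a unit vector):

```python
import math

# Directional derivative D_u f = grad f . u, with grad f(2, 1) = (7, 11).
gx, gy = 7.0, 11.0

def directional(ux, uy):
    norm = math.hypot(ux, uy)           # make (ux, uy) a unit vector
    return (gx * ux + gy * uy) / norm

along_x = directional(1.0, 0.0)         # recovers the x-partial: 7
along_grad = directional(gx, gy)        # the maximum: sqrt(170) ~ 13.04
print(round(along_x, 2), round(along_grad, 2))  # 7.0 13.04
```

Moving along the gradient itself gives the largest possible rate of change, equal to the gradient's magnitude; moving along a coordinate axis recovers the corresponding partial derivative.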
Higher dimensions
In ML, functions often have thousands or millions of inputs (one per weight). The gradient is still the same idea, just a much longer vector:

$$\nabla L(\mathbf{w}) = \begin{bmatrix} \dfrac{\partial L}{\partial w_1} \\ \vdots \\ \dfrac{\partial L}{\partial w_n} \end{bmatrix}$$
Computing this efficiently for deep networks is what automatic differentiation and backpropagation handle. You never compute partial derivatives by hand for a neural network, but understanding what the gradient means helps you debug training.
When the gradient is zero ($\nabla f = \mathbf{0}$), you are at a critical point. It could be a minimum, maximum, or saddle point. To distinguish these, you need second-order information: the Hessian matrix.
The gradient and level curves
A level curve of $f$ is the set of points where $f$ has the same value: $f(x, y) = c$ for some constant $c$. Think of contour lines on a map.
The gradient is always perpendicular to the level curve at any point. If you stand on a contour line and look in the direction of $\nabla f$, you are looking straight uphill. This is not just a mathematical curiosity. It explains why gradient descent trajectories cross contour lines at right angles (at least locally).
Example: For $f(x, y) = x^2 + 2y^2$, the level curves are ellipses centered at the origin. At the point $(2, 1)$:

$$\nabla f(2, 1) = \begin{bmatrix} 2x \\ 4y \end{bmatrix}\Bigg|_{(2,1)} = \begin{bmatrix} 4 \\ 4 \end{bmatrix}$$

The level curve through $(2, 1)$ is $x^2 + 2y^2 = 6$, which is an ellipse. The gradient $(4, 4)$ points outward, perpendicular to the ellipse at that point. A gradient descent step would move in the direction $(-4, -4)$, heading toward the minimum at the origin.
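Perpendicularity is easy to verify numerically. For a level curve $f(x, y) = c$, a tangent direction at a point is $(-\partial f/\partial y,\ \partial f/\partial x)$, so its dot product with the gradient should be zero. A sketch, assuming $f(x, y) = x^2 + 2y^2$ at the point $(2, 1)$ as in the example:

```python
# grad f(2, 1) for f(x, y) = x^2 + 2y^2 is (2x, 4y) = (4, 4).
gx, gy = 2 * 2.0, 4 * 1.0
# A tangent direction to the level curve is the gradient rotated 90 degrees.
tx, ty = -gy, gx
dot = gx * tx + gy * ty
print(dot)  # 0.0: the gradient is perpendicular to the level curve
```

Rotating the gradient by 90 degrees always yields a direction of zero change, which is precisely a direction along the contour line.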
Gradient vectors point perpendicular to level curves, toward higher values:
```mermaid
graph LR
  L1["Level curve f = 4"] -->|"grad f (perpendicular, uphill)"| L2["Level curve f = 8"]
  L2 -->|"grad f (perpendicular, uphill)"| L3["Level curve f = 12"]
  L3 -.->|"-grad f (downhill, descent)"| L1
```
Gradient of vector-valued functions
When a function outputs a vector instead of a scalar, the gradient generalizes to the Jacobian matrix. You will encounter this when you study neural network layers, where the function maps input activations to output activations. We cover this in the next article on the Jacobian and Hessian.
Gradients in Python
```python
import torch

# Define parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Data
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 7.0])

# MSE loss
y_hat = w * x + b
loss = ((y_hat - y) ** 2).mean()

# Compute gradients
loss.backward()
print(f"dL/dw = {w.grad:.4f}")  # -0.6667
print(f"dL/db = {b.grad:.4f}")  # -0.6667
```
PyTorch applies the chain rule automatically through the computation graph. Every operation records how to differentiate itself, and .backward() walks the graph in reverse.
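As a cross-check on the autograd result, the same two partials can be estimated with central finite differences in plain Python (no torch); the step size `h` is an arbitrary small choice:

```python
# Finite-difference the MSE loss L(w, b) on the same three data points, at w = 2, b = 1.
xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 7.0]

def loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

h = 1e-6
dw = (loss(2 + h, 1) - loss(2 - h, 1)) / (2 * h)
db = (loss(2, 1 + h) - loss(2, 1 - h)) / (2 * h)
print(round(dw, 4), round(db, 4))  # -0.6667 -0.6667
```

Agreement between the two methods is a useful debugging technique: gradient checking compares autograd output against finite differences on a few parameters.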
Summary
- Partial derivative: derivative with respect to one variable, holding others constant
- Gradient: vector of all partial derivatives. Points in the direction of steepest ascent.
- Gradient descent: move in the direction $-\nabla f$ to minimize $f$
- At a critical point: $\nabla f = \mathbf{0}$
The gradient tells you the direction. It does not tell you about curvature (how the surface bends). For that, you need second derivatives.
What comes next
The gradient gives you first-order information about a function. To understand curvature, classify critical points, and build better optimizers, you need the Jacobian and Hessian matrices. That is the next article in this series.