
Stochastic gradient descent and variants

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should understand gradient descent and how the gradient tells you the direction of steepest ascent. Everything in this article builds on that foundation.


Why not just use gradient descent?

Standard (batch) gradient descent computes the gradient using the entire dataset:

$$\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k) = \theta_k - \alpha \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(\theta_k)$$

where $f(\theta) = \frac{1}{N} \sum_{i=1}^{N} f_i(\theta)$ is the average loss over $N$ training examples.

For a dataset with $N = 10$ million images, computing the full gradient requires a forward and backward pass through all 10 million examples. That is one gradient step. You might need hundreds of steps to converge. This is too slow.

Stochastic gradient descent (SGD) fixes this by using a single example (or a small batch) to estimate the gradient. Each step is noisy, but you get to take many more steps in the same wall-clock time.


Stochastic gradient descent

At each step, sample a random index $i$ uniformly from $\{1, \dots, N\}$ and update:

$$\theta_{k+1} = \theta_k - \alpha_k \nabla f_i(\theta_k)$$

The stochastic gradient $\nabla f_i(\theta_k)$ is an unbiased estimate of the true gradient:

$$\mathbb{E}_i[\nabla f_i(\theta)] = \nabla f(\theta)$$

On average, you move in the right direction. But any single step might be off. The noise is both a blessing (helps escape shallow local minima, provides implicit regularization) and a curse (prevents exact convergence without decreasing the learning rate).
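To make this concrete, here is a minimal sketch of single-sample SGD on a toy least-squares problem (the data and all variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example losses: f_i(theta) = 0.5 * (x_i * theta - y_i)^2,
# so grad f_i(theta) = (x_i * theta - y_i) * x_i
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0
for k in range(200):
    i = rng.integers(100)               # sample one index uniformly
    g = (x[i] * theta - y[i]) * x[i]    # stochastic gradient
    theta -= 0.05 * g                   # noisy but cheap step
print(theta)  # close to the true slope 3
```

Each step uses one example instead of all 100, yet the iterates still home in on the minimizer because the stochastic gradients are right on average.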

Learning rate schedule

For SGD to converge, you need:

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$

The classic choice is $\alpha_k = \alpha_0 / k$. In practice, people use step decay (halve the learning rate every few epochs), cosine annealing, or warmup followed by decay.


Mini-batch SGD

Pure SGD uses one sample. Mini-batch SGD uses a batch of $B$ samples:

$$\theta_{k+1} = \theta_k - \alpha_k \frac{1}{B} \sum_{j=1}^{B} \nabla f_{i_j}(\theta_k)$$

The mini-batch gradient has lower variance than the single-sample gradient by a factor of $B$:

$$\text{Var}\left[\frac{1}{B}\sum_{j=1}^B \nabla f_{i_j}\right] = \frac{1}{B} \text{Var}[\nabla f_i]$$

Typical batch sizes: 32, 64, 128, 256. Larger batches give smoother gradients but less frequent updates. There is also a practical reason: GPU parallelism. A batch of 256 processes almost as fast as a batch of 1 on modern hardware, so you get 256x variance reduction nearly for free.
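The $1/B$ variance reduction is easy to check empirically. A small sketch with synthetic per-example "gradients" (the numbers here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
g = rng.normal(loc=2.0, scale=5.0, size=N)  # fake per-example gradients

def minibatch_variance(B, trials=4000):
    # Empirical variance of the mean of B uniformly sampled gradients
    estimates = [g[rng.integers(N, size=B)].mean() for _ in range(trials)]
    return np.var(estimates)

ratio = minibatch_variance(1) / minibatch_variance(64)
print(round(ratio))  # roughly 64
```

The single-sample estimator has about 64x the variance of a batch-64 estimator, matching the formula up to sampling noise.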


Example 1: SGD vs batch GD on a quadratic

Problem: Minimize the average loss $f(\theta)$ over $N = 3$ “data points” with individual losses:

$$f_1(\theta) = \frac{1}{2}(1.5\theta)^2 = 1.125\theta^2, \quad f_2(\theta) = \frac{1}{2}(0.5\theta)^2 = 0.125\theta^2, \quad f_3(\theta) = \frac{1}{2}(1.0\theta)^2 = 0.5\theta^2$$

Average: $f(\theta) = \frac{1}{3}(1.125 + 0.125 + 0.5)\theta^2 = \frac{1.75}{3}\theta^2 \approx 0.583\theta^2$.

True gradient: $\nabla f(\theta) = 1.167\theta$.

Stochastic gradients: $\nabla f_1 = 2.25\theta$, $\nabla f_2 = 0.25\theta$, $\nabla f_3 = \theta$.

Batch GD with $\alpha = 0.5$, starting at $\theta_0 = 4$:

$$\theta_1 = 4 - 0.5(1.167)(4) = 4 - 2.333 = 1.667$$
$$\theta_2 = 1.667 - 0.5(1.167)(1.667) = 1.667 - 0.972 = 0.695$$
$$\theta_3 = 0.695 - 0.5(1.167)(0.695) = 0.695 - 0.406 = 0.289$$

Smooth, steady convergence: $4.0 \to 1.667 \to 0.695 \to 0.289$.

SGD with $\alpha = 0.5$, starting at $\theta_0 = 4$. Random sample order: $i = 1, 3, 2$.

$$\theta_1 = 4 - 0.5(2.25)(4) = 4 - 4.5 = -0.5$$
$$\theta_2 = -0.5 - 0.5(1.0)(-0.5) = -0.5 + 0.25 = -0.25$$
$$\theta_3 = -0.25 - 0.5(0.25)(-0.25) = -0.25 + 0.03125 = -0.219$$

Noisy path: $4.0 \to -0.5 \to -0.25 \to -0.219$. The first step overshoots dramatically (the stochastic gradient was about twice the true gradient), but we still make progress. After three steps, $|\theta|$ went from 4 to 0.219, which is actually better than batch GD’s 0.289.

This is SGD in a nutshell: noisy individual steps, but fast overall progress because each step is cheap.
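The arithmetic above is easy to replay in a few lines (coefficients taken straight from the example):

```python
# Verify Example 1: batch GD vs SGD on the three per-example losses.
grads = {1: 2.25, 2: 0.25, 3: 1.0}    # grad f_i(theta) = c_i * theta
alpha = 0.5
avg = sum(grads.values()) / 3          # true gradient coefficient, 1.1667

# Batch GD: three steps with the averaged gradient
theta_batch = 4.0
for _ in range(3):
    theta_batch -= alpha * avg * theta_batch
print(round(theta_batch, 3))  # 0.289

# SGD: three steps with sample order 1, 3, 2
theta_sgd = 4.0
for i in [1, 3, 2]:
    theta_sgd -= alpha * grads[i] * theta_sgd
print(round(theta_sgd, 3))  # -0.219
```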


Momentum

SGD oscillates, especially in narrow valleys where the gradient direction swings back and forth. Momentum smooths this out by maintaining a running average of past gradients:

$$m_{k+1} = \beta m_k + \nabla f_{i_k}(\theta_k)$$
$$\theta_{k+1} = \theta_k - \alpha \, m_{k+1}$$

Here $\beta \in [0, 1)$ is the momentum coefficient, typically $0.9$. The velocity $m$ accumulates gradient history: consistent gradient directions build up speed, while oscillating directions cancel out.

Think of a ball rolling down a hill. Without momentum, it changes direction instantly based on the local slope. With momentum, it has inertia. It rolls faster down consistent slopes and resists sudden direction changes.

Nesterov momentum

A slight but important twist: compute the gradient at the “lookahead” position instead of the current position:

$$m_{k+1} = \beta m_k + \nabla f_{i_k}(\theta_k - \alpha \beta m_k)$$
$$\theta_{k+1} = \theta_k - \alpha \, m_{k+1}$$

Nesterov momentum has better theoretical convergence for convex problems and tends to work slightly better in practice.
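The lookahead step translates directly into code. A sketch of one common formulation (libraries differ slightly in where they place $\alpha$ and $\beta$):

```python
import numpy as np

def nesterov_step(theta, m, grad_fn, alpha=0.05, beta=0.9):
    # Evaluate the gradient at the lookahead point theta - alpha*beta*m,
    # then update the velocity and the parameters
    g = grad_fn(theta - alpha * beta * m)
    m_new = beta * m + g
    return theta - alpha * m_new, m_new

# Minimal check on f(theta) = theta^2, whose gradient is 2*theta
theta, m = 10.0, 0.0
for _ in range(50):
    theta, m = nesterov_step(theta, m, lambda t: 2 * t)
print(theta)  # near 0
```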


Example 2: Momentum vs plain SGD

Problem: Minimize $f(\theta_1, \theta_2) = 10\theta_1^2 + \theta_2^2$.

This is a narrow valley: the curvature in $\theta_1$ is 10x the curvature in $\theta_2$. Gradient descent oscillates in $\theta_1$ and converges slowly in $\theta_2$.

Starting point: $(\theta_1, \theta_2) = (1, 10)$. Learning rate $\alpha = 0.05$.

For simplicity, we use the full gradient (not stochastic) to isolate the effect of momentum.

$\nabla f = (20\theta_1, 2\theta_2)$.

Plain GD, 5 steps:

| Step | $\theta_1$ | $\theta_2$ | $f$ |
|---|---|---|---|
| 0 | 1.0 | 10.0 | 110.0 |
| 1 | $1 - 0.05(20) = 0.0$ | $10 - 0.05(20) = 9.0$ | 81.0 |
| 2 | 0.0 | $9 - 0.05(18) = 8.1$ | 65.61 |
| 3 | 0.0 | $8.1 - 0.05(16.2) = 7.29$ | 53.14 |
| 4 | 0.0 | $7.29 - 0.05(14.58) = 6.561$ | 43.05 |
| 5 | 0.0 | 5.905 | 34.87 |

$\theta_1$ converged in one step: with $\alpha = 0.05$ and curvature 20, the multiplier $1 - \alpha \cdot 20 = 0$, so the first update lands exactly at $\theta_1 = 0$. $\theta_2$ converges slowly: $10 \to 9 \to 8.1 \to 7.29 \to 6.561 \to 5.905$.

Gradient descent with momentum ($\beta = 0.9$), same setup:

Initialize $m = (0, 0)$.

Step 0 to 1:

$$m_1 = 0.9(0, 0) + (20, 20) = (20, 20)$$
$$\theta_1 = (1, 10) - 0.05(20, 20) = (0, 9)$$

Step 1 to 2:

$$\nabla f = (0, 18)$$
$$m_2 = 0.9(20, 20) + (0, 18) = (18, 36)$$
$$\theta_2 = (0, 9) - 0.05(18, 36) = (-0.9, 7.2)$$

Step 2 to 3:

$$\nabla f = (-18, 14.4)$$
$$m_3 = 0.9(18, 36) + (-18, 14.4) = (-1.8, 46.8)$$
$$\theta_3 = (-0.9, 7.2) - 0.05(-1.8, 46.8) = (-0.81, 4.86)$$

Step 3 to 4:

$$\nabla f = (-16.2, 9.72)$$
$$m_4 = 0.9(-1.8, 46.8) + (-16.2, 9.72) = (-17.82, 51.84)$$
$$\theta_4 = (-0.81, 4.86) - 0.05(-17.82, 51.84) = (0.081, 2.268)$$

Step 4 to 5:

$$\nabla f = (1.62, 4.536)$$
$$m_5 = 0.9(-17.82, 51.84) + (1.62, 4.536) = (-14.418, 51.192)$$
$$\theta_5 = (0.081, 2.268) - 0.05(-14.418, 51.192) = (0.802, -0.292)$$

With momentum, $\theta_2$ progress: $10 \to 9 \to 7.2 \to 4.86 \to 2.27 \to -0.29$. Much faster convergence in $\theta_2$! But $\theta_1$ oscillates a bit: $1 \to 0 \to -0.9 \to -0.81 \to 0.08 \to 0.80$. The momentum builds up speed in the consistent direction ($\theta_2$ shrinking) while the oscillations in $\theta_1$ partially cancel.

After 5 steps: plain GD has $f = 34.87$, momentum has $f \approx 6.52$. Momentum wins by a wide margin.
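These five steps can be checked numerically with the same constants as the example:

```python
import numpy as np

# Verify Example 2 on f(t1, t2) = 10*t1^2 + t2^2 (full gradients).
def grad(t):
    return np.array([20 * t[0], 2 * t[1]])

alpha, beta = 0.05, 0.9

t_gd = np.array([1.0, 10.0])        # plain gradient descent
for _ in range(5):
    t_gd = t_gd - alpha * grad(t_gd)

t_mom, m = np.array([1.0, 10.0]), np.zeros(2)   # with momentum
for _ in range(5):
    m = beta * m + grad(t_mom)
    t_mom = t_mom - alpha * m

print(t_gd)    # about [0, 5.905]
print(t_mom)   # about [0.802, -0.292]
print(round(10 * t_mom[0]**2 + t_mom[1]**2, 2))  # 6.52
```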


Adaptive learning rates

The learning rate $\alpha$ is the most important hyperparameter in SGD. Too large, you diverge. Too small, you converge painfully slowly. Different parameters might need different learning rates (a bias term vs a weight matrix, for instance).

Adaptive methods give each parameter its own effective learning rate based on the history of its gradients.

AdaGrad

Accumulate the sum of squared gradients for each parameter:

$$v_k = v_{k-1} + (\nabla f_{i_k})^2 \quad \text{(element-wise square)}$$
$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{v_k} + \epsilon} \odot \nabla f_{i_k}$$

Parameters with large accumulated gradients get smaller learning rates. This is great for sparse features (NLP, recommendations) where some features are rare and need larger updates. The downside: $v_k$ only grows, so the learning rate monotonically decreases and can become too small.

RMSProp

Fix AdaGrad’s shrinking learning rate by using an exponential moving average instead of a sum:

$$v_k = \rho \, v_{k-1} + (1 - \rho)(\nabla f_{i_k})^2$$
$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{v_k} + \epsilon} \odot \nabla f_{i_k}$$

The decay rate $\rho$ (typically 0.99) controls how far back the moving average looks. Old gradients are exponentially forgotten, preventing the learning rate from collapsing.
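A minimal RMSProp step as a standalone function (a sketch for a single scalar parameter; real implementations track $v$ per parameter tensor):

```python
import numpy as np

def rmsprop_step(theta, v, g, lr=0.01, rho=0.99, eps=1e-8):
    # EMA of squared gradients; large recent gradients shrink the step
    v = rho * v + (1 - rho) * g**2
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v

# On f(theta) = theta^2, the effective step settles near lr in magnitude,
# so theta descends roughly linearly until it reaches a small noise band
theta, v = 10.0, 0.0
for _ in range(2000):
    theta, v = rmsprop_step(theta, v, 2 * theta)
print(abs(theta) < 0.5)  # True
```

Notice the contrast with AdaGrad: because $v$ forgets old gradients, the step size never collapses to zero even after thousands of iterations.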

Adam

Adam combines momentum and RMSProp:

$$m_k = \beta_1 m_{k-1} + (1 - \beta_1) \nabla f_{i_k} \quad \text{(first moment, like momentum)}$$
$$v_k = \beta_2 v_{k-1} + (1 - \beta_2) (\nabla f_{i_k})^2 \quad \text{(second moment, like RMSProp)}$$

Bias correction (important in early steps when $m$ and $v$ are biased toward zero):

$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}$$

Update:

$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{\hat{v}_k} + \epsilon} \odot \hat{m}_k$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 0.001$.

Adam is the default optimizer for most deep learning tasks. It usually works well out of the box with the default parameters.


Example 3: Five steps of Adam vs plain SGD

Problem: Minimize $f(\theta) = \theta^2$. True gradient: $\nabla f = 2\theta$. Starting point: $\theta_0 = 10$.

We use the true gradient (not stochastic) to focus on the optimizer behavior.

Plain SGD with $\alpha = 0.1$:

| $k$ | $\theta_k$ | $\nabla f$ | $\theta_{k+1}$ |
|---|---|---|---|
| 0 | 10.0 | 20.0 | $10 - 0.1(20) = 8.0$ |
| 1 | 8.0 | 16.0 | $8 - 1.6 = 6.4$ |
| 2 | 6.4 | 12.8 | $6.4 - 1.28 = 5.12$ |
| 3 | 5.12 | 10.24 | $5.12 - 1.024 = 4.096$ |
| 4 | 4.096 | 8.192 | $4.096 - 0.819 = 3.277$ |

After 5 steps: $\theta = 3.277$, $f = 10.74$.

Adam with $\alpha = 1.0$ (Adam’s default is 0.001, but we use 1.0 to see clear steps), $\beta_1 = 0.9$, $\beta_2 = 0.999$:

Initialize $m_0 = 0$, $v_0 = 0$, starting at $\theta_0 = 10.0$ with $g = 2\theta$.

Step 1: $g_1 = 20.0$

$$m_1 = 0.9(0) + 0.1(20) = 2.0$$
$$v_1 = 0.999(0) + 0.001(400) = 0.4$$
$$\hat{m}_1 = \frac{2.0}{1 - 0.9} = 20.0$$
$$\hat{v}_1 = \frac{0.4}{1 - 0.999} = 400.0$$
$$\theta_1 = 10.0 - \frac{1.0}{\sqrt{400} + 10^{-8}} \cdot 20.0 = 10.0 - \frac{20}{20} = 9.0$$

Step 2: $g_2 = 18.0$

$$m_2 = 0.9(2.0) + 0.1(18.0) = 1.8 + 1.8 = 3.6$$
$$v_2 = 0.999(0.4) + 0.001(324) = 0.3996 + 0.324 = 0.7236$$
$$\hat{m}_2 = \frac{3.6}{1 - 0.9^2} = \frac{3.6}{0.19} = 18.95$$
$$\hat{v}_2 = \frac{0.7236}{1 - 0.999^2} \approx \frac{0.7236}{0.002} = 361.8$$
$$\theta_2 = 9.0 - \frac{18.95}{\sqrt{361.8} + 10^{-8}} = 9.0 - \frac{18.95}{19.02} = 9.0 - 0.996 = 8.004$$

Step 3: $g_3 = 16.008$

$$m_3 = 0.9(3.6) + 0.1(16.008) = 3.24 + 1.601 = 4.841$$
$$v_3 = 0.999(0.7236) + 0.001(256.3) = 0.7229 + 0.2563 = 0.9792$$
$$\hat{m}_3 = \frac{4.841}{1 - 0.9^3} = \frac{4.841}{0.271} = 17.86$$
$$\hat{v}_3 = \frac{0.9792}{1 - 0.999^3} \approx \frac{0.9792}{0.003} = 326.4$$
$$\theta_3 = 8.004 - \frac{17.86}{\sqrt{326.4}} = 8.004 - \frac{17.86}{18.07} = 8.004 - 0.988 = 7.016$$

Step 4: $g_4 = 14.032$

$$m_4 = 0.9(4.841) + 0.1(14.032) = 4.357 + 1.403 = 5.760$$
$$v_4 = 0.999(0.9792) + 0.001(196.9) = 0.9782 + 0.1969 = 1.1751$$
$$\hat{m}_4 = \frac{5.760}{1 - 0.9^4} = \frac{5.760}{0.3439} = 16.75$$
$$\hat{v}_4 = \frac{1.1751}{1 - 0.999^4} \approx \frac{1.1751}{0.004} = 293.8$$
$$\theta_4 = 7.016 - \frac{16.75}{\sqrt{293.8}} = 7.016 - \frac{16.75}{17.14} = 7.016 - 0.977 = 6.039$$

Step 5: $g_5 = 12.078$

$$m_5 = 0.9(5.760) + 0.1(12.078) = 5.184 + 1.208 = 6.392$$
$$v_5 = 0.999(1.1751) + 0.001(145.9) = 1.1739 + 0.1459 = 1.3198$$
$$\hat{m}_5 = \frac{6.392}{1 - 0.9^5} = \frac{6.392}{0.4095} = 15.61$$
$$\hat{v}_5 = \frac{1.3198}{1 - 0.999^5} \approx \frac{1.3198}{0.005} = 264.0$$
$$\theta_5 = 6.039 - \frac{15.61}{\sqrt{264.0}} = 6.039 - \frac{15.61}{16.25} = 6.039 - 0.961 = 5.078$$
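The five hand-computed steps can be replayed exactly; small differences past the third decimal come from rounding the intermediate values above:

```python
import numpy as np

# Replay the Adam example: alpha = 1.0 on f(theta) = theta^2
alpha, b1, b2, eps = 1.0, 0.9, 0.999, 1e-8
theta, m, v = 10.0, 0.0, 0.0
for t in range(1, 6):
    g = 2 * theta
    m = b1 * m + (1 - b1) * g          # first moment
    v = b2 * v + (1 - b2) * g**2       # second moment
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(round(theta, 2))  # 5.08, matching the hand calculation up to rounding
```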

Summary after 5 steps:

| Method | $\theta_5$ | $f(\theta_5)$ |
|---|---|---|
| Plain SGD ($\alpha = 0.1$) | 3.277 | 10.74 |
| Adam ($\alpha = 1.0$) | 5.078 | 25.79 |

In this simple case, plain SGD actually wins because it has a well-tuned learning rate for this specific problem. Adam takes steps of roughly constant size (close to $\alpha$) regardless of gradient magnitude, which is a feature for complex loss landscapes but a disadvantage on simple quadratics. Adam shines on problems with:

  • Very different scales across parameters
  • Noisy gradients
  • Saddle points and flat regions

Learning rate schedules

Even with adaptive methods, the learning rate schedule matters. Common choices:

Step decay

$$\alpha_k = \alpha_0 \times \gamma^{\lfloor k / s \rfloor}$$

Drop the learning rate by a factor $\gamma$ (e.g., 0.1) every $s$ epochs. Simple and effective. Used in most ResNet training recipes.

Cosine annealing

$$\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\left(\frac{k\pi}{T}\right)\right)$$

Smoothly decreases from $\alpha_0$ to $\alpha_{\min}$ over $T$ steps. Popular in modern training (vision transformers, language models).

Warmup

Start with a very small learning rate and linearly increase to the target over the first few thousand steps:

$$\alpha_k = \alpha_0 \cdot \min\left(1, \frac{k}{k_{\text{warmup}}}\right)$$

Warmup stabilizes training when the initial gradients are large and unreliable (common with random initialization in transformers).
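The three schedules above as plain functions (the constants here are illustrative defaults, not from any particular training recipe):

```python
import math

def step_decay(k, alpha0=0.1, gamma=0.1, s=30):
    # Multiply by gamma every s epochs
    return alpha0 * gamma ** (k // s)

def cosine_annealing(k, alpha0=0.1, alpha_min=0.0, T=100):
    # Smooth decrease from alpha0 to alpha_min over T steps
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(k * math.pi / T))

def warmup(k, alpha0=0.1, k_warmup=10):
    # Linear ramp from 0 to alpha0 over the first k_warmup steps
    return alpha0 * min(1.0, k / k_warmup)

print(round(step_decay(60), 4))        # 0.001: two drops by a factor of 10
print(round(cosine_annealing(50), 4))  # 0.05: halfway through the cosine
print(round(warmup(5), 4))             # 0.05: halfway through warmup
```

In practice you combine them, e.g. warmup for the first few thousand steps and cosine annealing afterward.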

One-cycle policy

Increase the learning rate from small to large over the first half of training, then decrease it back down. This is surprisingly effective and was popularized by Leslie Smith.


Which optimizer to use

| Scenario | Recommendation |
|---|---|
| Starting a new project | Adam with defaults ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) |
| Training CNNs (image classification) | SGD + momentum ($\beta = 0.9$) + step decay schedule |
| Training transformers | Adam or AdamW with warmup + cosine schedule |
| Sparse features (NLP, recommendations) | Adam or AdaGrad |
| Need best final accuracy | SGD + momentum (often beats Adam at convergence with proper tuning) |
| Quick prototyping | Adam (less tuning required) |

The general pattern: Adam converges faster initially, but well-tuned SGD with momentum often reaches a better final solution. This has been observed repeatedly in computer vision, though Adam with weight decay (AdamW) has narrowed the gap.

Figure: Convergence comparison of SGD variants. Adam and RMSProp adapt their learning rates and converge faster than plain SGD. Momentum smooths out oscillations for faster convergence than vanilla SGD.

Figure: Optimization paths on $f(x, y) = x^2 + 10y^2$. SGD oscillates in the high-curvature $y$ direction. Momentum damps the oscillations. Adam takes a more direct path to the minimum.


AdamW: weight decay done right

Standard $\ell_2$ regularization adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss. In Adam, this interacts poorly with the adaptive learning rate: the regularization gradient gets scaled by $1/\sqrt{v}$, weakening it for parameters with large gradients.

AdamW decouples weight decay from the gradient:

$$\theta_{k+1} = \theta_k - \alpha\left(\frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} + \lambda \theta_k\right)$$

The $\lambda \theta_k$ term is applied directly, not through the adaptive mechanism. This is now the standard optimizer for training language models.
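A sketch of the decoupled update for a single scalar parameter (hyperparameter values are the common defaults; real implementations work per tensor):

```python
import numpy as np

def adamw_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=1e-2):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Weight decay acts on theta directly, outside the 1/sqrt(v_hat) scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta, m, v = 10.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, 2 * theta, t)
print(theta)  # a little below 10: each step moves roughly lr toward 0
```

The key line is the update: had we instead added $\lambda\theta$ into $g$ before computing the moments, the decay would be divided by $\sqrt{\hat{v}}$ and lose its effect on large-gradient parameters.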


Gradient noise and generalization

A surprising finding in deep learning: SGD with small batches often generalizes better than full-batch gradient descent, even when both reach the same training loss. The noise in SGD acts as an implicit regularizer, pushing the optimizer toward “flat” minima that generalize well.

This connects to the bias-variance tradeoff:

  • Large batch, low noise: converges to sharp minima. Low training loss, potentially high test loss.
  • Small batch, high noise: finds flat minima. Slightly higher training loss, better test loss.

The optimal batch size depends on the problem. Too small wastes GPU parallelism. Too large hurts generalization. Most practitioners use batch sizes of 32 to 512 and tune from there.


Practical tips

  1. Start with Adam, $\alpha = 0.001$. If results are not good enough, try SGD + momentum with a carefully tuned schedule.

  2. Use gradient clipping (clip the gradient norm to a maximum value, say 1.0) to prevent exploding gradients, especially in RNNs and transformers.

  3. Monitor the gradient norm during training. If it spikes, your learning rate is too high or there is a data issue.

  4. Learning rate finder: sweep the learning rate from $10^{-7}$ to $10$ over one epoch and plot the loss. The best learning rate is usually about 10x smaller than the one where loss starts increasing.

  5. Weight decay ($10^{-4}$ to $10^{-2}$) usually helps. Use AdamW, not Adam + L2 regularization.
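Tip 2, gradient clipping by norm, fits in a few lines:

```python
import numpy as np

def clip_grad_norm(g, max_norm=1.0):
    # Rescale g so its norm is at most max_norm; direction is unchanged
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

print(clip_grad_norm(np.array([3.0, 4.0])))  # [0.6 0.8]: norm 5 clipped to 1
print(clip_grad_norm(np.array([0.1, 0.2])))  # unchanged: norm already below 1
```

Clipping caps the step length during gradient spikes without changing the update direction, which is why it pairs well with tip 3's gradient-norm monitoring.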


Python: implementing the optimizers

import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    
    def step(self, params, grads):
        return params - self.lr * grads

class SGDMomentum:
    def __init__(self, lr=0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.v = None
    
    def step(self, params, grads):
        if self.v is None:
            self.v = np.zeros_like(params)
        self.v = self.beta * self.v + grads
        return params - self.lr * self.v

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Compare on f(x) = x^2, starting at x = 10
for name, opt in [("SGD", SGD(0.1)), 
                   ("Momentum", SGDMomentum(0.05, 0.9)),
                   ("Adam", Adam(1.0))]:
    x = 10.0
    trajectory = [x]
    for _ in range(20):
        g = 2 * x  # gradient of x^2
        x = opt.step(x, g)
        trajectory.append(float(x))
    print(f"{name:10s}: final x = {x:.4f}, f(x) = {x**2:.4f}")

Summary table

| Optimizer | Update rule (simplified) | Memory | Good for |
|---|---|---|---|
| SGD | $\theta - \alpha g$ | $O(1)$ | Simple problems, theory |
| Momentum | $\theta - \alpha(\beta m + g)$ | $O(d)$ | CNNs, well-tuned training |
| AdaGrad | $\theta - \frac{\alpha}{\sqrt{v}} g$ | $O(d)$ | Sparse features |
| RMSProp | $\theta - \frac{\alpha}{\sqrt{\text{EMA}(g^2)}} g$ | $O(d)$ | RNNs, non-stationary |
| Adam | $\theta - \frac{\alpha \cdot \hat{m}}{\sqrt{\hat{v}}}$ | $O(2d)$ | Default choice for DL |
| AdamW | Adam + decoupled weight decay | $O(2d)$ | Transformers, LLMs |

What comes next

With SGD and its variants, you have the optimizers that power all of modern machine learning. The next step is to see how these tools get applied: from loss functions and regularization to training pipelines for real models. Head over to what is machine learning to start the ML series, where optimization meets data.
