
Stochastic gradient descent and variants

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should understand gradient descent and how the gradient tells you the direction of steepest ascent. Everything in this article builds on that foundation.


Why not just use gradient descent?

Standard (batch) gradient descent computes the gradient using the entire dataset:

$$\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k) = \theta_k - \alpha \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(\theta_k)$$

where $f(\theta) = \frac{1}{N} \sum_{i=1}^{N} f_i(\theta)$ is the average loss over $N$ training examples.

For a dataset with $N = 10$ million images, computing the full gradient requires a forward and backward pass through all 10 million examples. That is one gradient step. You might need hundreds of steps to converge. This is too slow.

Stochastic gradient descent (SGD) fixes this by using a single example (or a small batch) to estimate the gradient. Each step is noisy, but you get to take many more steps in the same wall-clock time.


Stochastic gradient descent

At each step, sample a random index $i$ uniformly from $\{1, \dots, N\}$ and update:

$$\theta_{k+1} = \theta_k - \alpha_k \nabla f_i(\theta_k)$$

The stochastic gradient $\nabla f_i(\theta_k)$ is an unbiased estimate of the true gradient:

$$\mathbb{E}_i[\nabla f_i(\theta)] = \nabla f(\theta)$$

On average, you move in the right direction. But any single step might be off. The noise is both a blessing (helps escape shallow local minima, provides implicit regularization) and a curse (prevents exact convergence without decreasing the learning rate).
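To make this concrete, here is a minimal sketch of single-sample SGD on a toy least-squares problem (the data and all variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example losses: f_i(theta) = 0.5 * (x_i * theta - y_i)^2,
# so grad f_i(theta) = (x_i * theta - y_i) * x_i
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0
for k in range(200):
    i = rng.integers(100)               # sample one index uniformly
    g = (x[i] * theta - y[i]) * x[i]    # stochastic gradient
    theta -= 0.05 * g                   # noisy but cheap step
print(theta)  # close to the true slope 3
```

Each step uses one example instead of all 100, yet the iterates still home in on the minimizer because the stochastic gradients are right on average.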

Learning rate schedule

For SGD to converge, you need:

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$

The classic choice is $\alpha_k = \alpha_0 / k$. In practice, people use step decay (halve the learning rate every few epochs), cosine annealing, or warmup followed by decay.


Mini-batch SGD

Pure SGD uses one sample. Mini-batch SGD uses a batch of $B$ samples:

$$\theta_{k+1} = \theta_k - \alpha_k \frac{1}{B} \sum_{j=1}^{B} \nabla f_{i_j}(\theta_k)$$

The mini-batch gradient has lower variance than the single-sample gradient by a factor of $B$:

$$\text{Var}\left[\frac{1}{B}\sum_{j=1}^B \nabla f_{i_j}\right] = \frac{1}{B} \text{Var}[\nabla f_i]$$

Typical batch sizes: 32, 64, 128, 256. Larger batches give smoother gradients but less frequent updates. There is also a practical reason: GPU parallelism. A batch of 256 processes almost as fast as a batch of 1 on modern hardware, so you get 256x variance reduction nearly for free.
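The $1/B$ variance reduction is easy to check empirically. A small sketch with synthetic per-example "gradients" (the numbers here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
g = rng.normal(loc=2.0, scale=5.0, size=N)  # fake per-example gradients

def minibatch_variance(B, trials=4000):
    # Empirical variance of the mean of B uniformly sampled gradients
    estimates = [g[rng.integers(N, size=B)].mean() for _ in range(trials)]
    return np.var(estimates)

ratio = minibatch_variance(1) / minibatch_variance(64)
print(round(ratio))  # roughly 64
```

The single-sample estimator has about 64x the variance of a batch-64 estimator, matching the formula up to sampling noise.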


Example 1: SGD vs batch GD on a quadratic

Problem: Minimize the average loss $f(\theta)$ over $N = 3$ “data points” with individual losses:

$$f_1(\theta) = \frac{1}{2}(1.5\theta)^2 = 1.125\theta^2, \quad f_2(\theta) = \frac{1}{2}(0.5\theta)^2 = 0.125\theta^2, \quad f_3(\theta) = \frac{1}{2}(1.0\theta)^2 = 0.5\theta^2$$

Average: $f(\theta) = \frac{1}{3}(1.125 + 0.125 + 0.5)\theta^2 = \frac{1.75}{3}\theta^2 \approx 0.583\theta^2$.

True gradient: $\nabla f(\theta) = 1.167\theta$.

Stochastic gradients: $\nabla f_1 = 2.25\theta$, $\nabla f_2 = 0.25\theta$, $\nabla f_3 = \theta$.

Batch GD with $\alpha = 0.5$, starting at $\theta_0 = 4$:

$$\theta_1 = 4 - 0.5(1.167)(4) = 4 - 2.333 = 1.667$$
$$\theta_2 = 1.667 - 0.5(1.167)(1.667) = 1.667 - 0.972 = 0.695$$
$$\theta_3 = 0.695 - 0.5(1.167)(0.695) = 0.695 - 0.406 = 0.289$$

Smooth, steady convergence: $4.0 \to 1.667 \to 0.695 \to 0.289$.

SGD with $\alpha = 0.5$, starting at $\theta_0 = 4$. Random sample order: $i = 1, 3, 2$.

$$\theta_1 = 4 - 0.5(2.25)(4) = 4 - 4.5 = -0.5$$
$$\theta_2 = -0.5 - 0.5(1.0)(-0.5) = -0.5 + 0.25 = -0.25$$
$$\theta_3 = -0.25 - 0.5(0.25)(-0.25) = -0.25 + 0.03125 = -0.219$$

Noisy path: $4.0 \to -0.5 \to -0.25 \to -0.219$. The first step overshoots dramatically (the stochastic gradient was about twice the true gradient), but we still make progress. After three steps, $|\theta|$ went from 4 to 0.219, which is actually better than batch GD’s 0.289.

This is SGD in a nutshell: noisy individual steps, but fast overall progress because each step is cheap.
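The arithmetic above is easy to replay in a few lines (coefficients taken straight from the example):

```python
# Verify Example 1: batch GD vs SGD on the three per-example losses.
grads = {1: 2.25, 2: 0.25, 3: 1.0}    # grad f_i(theta) = c_i * theta
alpha = 0.5
avg = sum(grads.values()) / 3          # true gradient coefficient, 1.1667

# Batch GD: three steps with the averaged gradient
theta_batch = 4.0
for _ in range(3):
    theta_batch -= alpha * avg * theta_batch
print(round(theta_batch, 3))  # 0.289

# SGD: three steps with sample order 1, 3, 2
theta_sgd = 4.0
for i in [1, 3, 2]:
    theta_sgd -= alpha * grads[i] * theta_sgd
print(round(theta_sgd, 3))  # -0.219
```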


Momentum

SGD oscillates, especially in narrow valleys where the gradient direction swings back and forth. Momentum smooths this out by maintaining a running average of past gradients:

$$m_{k+1} = \beta m_k + \nabla f_{i_k}(\theta_k)$$
$$\theta_{k+1} = \theta_k - \alpha \, m_{k+1}$$

Here $\beta \in [0, 1)$ is the momentum coefficient, typically $0.9$. The velocity $m$ accumulates gradient history: consistent gradient directions build up speed, while oscillating directions cancel out.

Think of a ball rolling down a hill. Without momentum, it changes direction instantly based on the local slope. With momentum, it has inertia. It rolls faster down consistent slopes and resists sudden direction changes.

Nesterov momentum

A slight but important twist: compute the gradient at the “lookahead” position instead of the current position:

$$m_{k+1} = \beta m_k + \nabla f_{i_k}(\theta_k - \alpha \beta m_k)$$
$$\theta_{k+1} = \theta_k - \alpha \, m_{k+1}$$

Nesterov momentum has better theoretical convergence for convex problems and tends to work slightly better in practice.
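The lookahead step translates directly into code. A sketch of one common formulation (libraries differ slightly in where they place $\alpha$ and $\beta$):

```python
import numpy as np

def nesterov_step(theta, m, grad_fn, alpha=0.05, beta=0.9):
    # Evaluate the gradient at the lookahead point theta - alpha*beta*m,
    # then update the velocity and the parameters
    g = grad_fn(theta - alpha * beta * m)
    m_new = beta * m + g
    return theta - alpha * m_new, m_new

# Minimal check on f(theta) = theta^2, whose gradient is 2*theta
theta, m = 10.0, 0.0
for _ in range(50):
    theta, m = nesterov_step(theta, m, lambda t: 2 * t)
print(theta)  # near 0
```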


Example 2: Momentum vs plain SGD

Problem: Minimize $f(\theta_1, \theta_2) = 10\theta_1^2 + \theta_2^2$.

This is a narrow valley: the curvature in $\theta_1$ is 10x the curvature in $\theta_2$. Gradient descent oscillates in $\theta_1$ and converges slowly in $\theta_2$.

Starting point: $(\theta_1, \theta_2) = (1, 10)$. Learning rate $\alpha = 0.05$.

For simplicity, we use the full gradient (not stochastic) to isolate the effect of momentum.

$\nabla f = (20\theta_1, 2\theta_2)$.

Plain GD, 5 steps:

| Step | $\theta_1$ | $\theta_2$ | $f$ |
|---|---|---|---|
| 0 | 1.0 | 10.0 | 110.0 |
| 1 | $1 - 0.05(20) = 0.0$ | $10 - 0.05(20) = 9.0$ | 81.0 |
| 2 | 0.0 | $9 - 0.05(18) = 8.1$ | 65.61 |
| 3 | 0.0 | $8.1 - 0.05(16.2) = 7.29$ | 53.14 |
| 4 | 0.0 | $7.29 - 0.05(14.58) = 6.561$ | 43.05 |
| 5 | 0.0 | 5.905 | 34.87 |

$\theta_1$ converged in one step: with $\alpha = 0.05$ and curvature 20, the multiplier $1 - \alpha \cdot 20 = 0$, so the first update lands exactly at $\theta_1 = 0$. $\theta_2$ converges slowly: $10 \to 9 \to 8.1 \to 7.29 \to 6.561 \to 5.905$.

Gradient descent with momentum ($\beta = 0.9$), same setup:

Initialize $m = (0, 0)$.

Step 0 to 1:

$$m_1 = 0.9(0, 0) + (20, 20) = (20, 20)$$
$$\theta_1 = (1, 10) - 0.05(20, 20) = (0, 9)$$

Step 1 to 2:

$$\nabla f = (0, 18)$$
$$m_2 = 0.9(20, 20) + (0, 18) = (18, 36)$$
$$\theta_2 = (0, 9) - 0.05(18, 36) = (-0.9, 7.2)$$

Step 2 to 3:

$$\nabla f = (-18, 14.4)$$
$$m_3 = 0.9(18, 36) + (-18, 14.4) = (-1.8, 46.8)$$
$$\theta_3 = (-0.9, 7.2) - 0.05(-1.8, 46.8) = (-0.81, 4.86)$$

Step 3 to 4:

$$\nabla f = (-16.2, 9.72)$$
$$m_4 = 0.9(-1.8, 46.8) + (-16.2, 9.72) = (-17.82, 51.84)$$
$$\theta_4 = (-0.81, 4.86) - 0.05(-17.82, 51.84) = (0.081, 2.268)$$

Step 4 to 5:

$$\nabla f = (1.62, 4.536)$$
$$m_5 = 0.9(-17.82, 51.84) + (1.62, 4.536) = (-14.418, 51.192)$$
$$\theta_5 = (0.081, 2.268) - 0.05(-14.418, 51.192) = (0.802, -0.292)$$

With momentum, $\theta_2$ progress: $10 \to 9 \to 7.2 \to 4.86 \to 2.27 \to -0.29$. Much faster convergence in $\theta_2$! But $\theta_1$ oscillates a bit: $1 \to 0 \to -0.9 \to -0.81 \to 0.08 \to 0.80$. The momentum builds up speed in the consistent direction ($\theta_2$ shrinking) while the oscillations in $\theta_1$ partially cancel.

After 5 steps: plain GD has $f = 34.87$, momentum has $f \approx 6.52$. Momentum wins by a wide margin.
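These five steps can be checked numerically with the same constants as the example:

```python
import numpy as np

# Verify Example 2 on f(t1, t2) = 10*t1^2 + t2^2 (full gradients).
def grad(t):
    return np.array([20 * t[0], 2 * t[1]])

alpha, beta = 0.05, 0.9

t_gd = np.array([1.0, 10.0])        # plain gradient descent
for _ in range(5):
    t_gd = t_gd - alpha * grad(t_gd)

t_mom, m = np.array([1.0, 10.0]), np.zeros(2)   # with momentum
for _ in range(5):
    m = beta * m + grad(t_mom)
    t_mom = t_mom - alpha * m

print(t_gd)    # about [0, 5.905]
print(t_mom)   # about [0.802, -0.292]
print(round(10 * t_mom[0]**2 + t_mom[1]**2, 2))  # 6.52
```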


Adaptive learning rates

The learning rate $\alpha$ is the most important hyperparameter in SGD. Too large, you diverge. Too small, you converge painfully slowly. Different parameters might need different learning rates (a bias term vs a weight matrix, for instance).

Adaptive methods give each parameter its own effective learning rate based on the history of its gradients.

AdaGrad

Accumulate the sum of squared gradients for each parameter:

$$v_k = v_{k-1} + (\nabla f_{i_k})^2 \quad \text{(element-wise square)}$$
$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{v_k} + \epsilon} \odot \nabla f_{i_k}$$

Parameters with large accumulated gradients get smaller learning rates. This is great for sparse features (NLP, recommendations) where some features are rare and need larger updates. The downside: $v_k$ only grows, so the learning rate monotonically decreases and can become too small.

RMSProp

Fix AdaGrad’s shrinking learning rate by using an exponential moving average instead of a sum:

$$v_k = \rho \, v_{k-1} + (1 - \rho)(\nabla f_{i_k})^2$$
$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{v_k} + \epsilon} \odot \nabla f_{i_k}$$

The decay rate $\rho$ (typically 0.99) controls how far back the moving average looks. Old gradients are exponentially forgotten, preventing the learning rate from collapsing.
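A minimal RMSProp step as a standalone function (a sketch for a single scalar parameter; real implementations track $v$ per parameter tensor):

```python
import numpy as np

def rmsprop_step(theta, v, g, lr=0.01, rho=0.99, eps=1e-8):
    # EMA of squared gradients; large recent gradients shrink the step
    v = rho * v + (1 - rho) * g**2
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v

# On f(theta) = theta^2, the effective step settles near lr in magnitude,
# so theta descends roughly linearly until it reaches a small noise band
theta, v = 10.0, 0.0
for _ in range(2000):
    theta, v = rmsprop_step(theta, v, 2 * theta)
print(abs(theta) < 0.5)  # True
```

Notice the contrast with AdaGrad: because $v$ forgets old gradients, the step size never collapses to zero even after thousands of iterations.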

Adam

Adam combines momentum and RMSProp:

$$m_k = \beta_1 m_{k-1} + (1 - \beta_1) \nabla f_{i_k} \quad \text{(first moment, like momentum)}$$
$$v_k = \beta_2 v_{k-1} + (1 - \beta_2) (\nabla f_{i_k})^2 \quad \text{(second moment, like RMSProp)}$$

Bias correction (important in early steps when $m$ and $v$ are biased toward zero):

$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}$$

Update:

$$\theta_{k+1} = \theta_k - \frac{\alpha}{\sqrt{\hat{v}_k} + \epsilon} \odot \hat{m}_k$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 0.001$.

Adam is the default optimizer for most deep learning tasks. It usually works well out of the box with the default parameters.


Example 3: Five steps of Adam vs plain SGD

Problem: Minimize $f(\theta) = \theta^2$. True gradient: $\nabla f = 2\theta$. Starting point: $\theta_0 = 10$.

We use the true gradient (not stochastic) to focus on the optimizer behavior.

Plain SGD with $\alpha = 0.1$:

| $k$ | $\theta_k$ | $\nabla f$ | $\theta_{k+1}$ |
|---|---|---|---|
| 0 | 10.0 | 20.0 | $10 - 0.1(20) = 8.0$ |
| 1 | 8.0 | 16.0 | $8 - 1.6 = 6.4$ |
| 2 | 6.4 | 12.8 | $6.4 - 1.28 = 5.12$ |
| 3 | 5.12 | 10.24 | $5.12 - 1.024 = 4.096$ |
| 4 | 4.096 | 8.192 | $4.096 - 0.819 = 3.277$ |

After 5 steps: $\theta = 3.277$, $f = 10.74$.

Adam with $\alpha = 1.0$ (Adam’s default is 0.001, but we use 1.0 to see clear steps), $\beta_1 = 0.9$, $\beta_2 = 0.999$:

Initialize $m_0 = 0$, $v_0 = 0$, starting at $\theta_0 = 10.0$ with $g = 2\theta$.

Step 1: $g_1 = 20.0$

$$m_1 = 0.9(0) + 0.1(20) = 2.0$$
$$v_1 = 0.999(0) + 0.001(400) = 0.4$$
$$\hat{m}_1 = \frac{2.0}{1 - 0.9} = 20.0$$
$$\hat{v}_1 = \frac{0.4}{1 - 0.999} = 400.0$$
$$\theta_1 = 10.0 - \frac{1.0}{\sqrt{400} + 10^{-8}} \cdot 20.0 = 10.0 - \frac{20}{20} = 9.0$$

Step 2: $g_2 = 18.0$

$$m_2 = 0.9(2.0) + 0.1(18.0) = 1.8 + 1.8 = 3.6$$
$$v_2 = 0.999(0.4) + 0.001(324) = 0.3996 + 0.324 = 0.7236$$
$$\hat{m}_2 = \frac{3.6}{1 - 0.9^2} = \frac{3.6}{0.19} = 18.95$$
$$\hat{v}_2 = \frac{0.7236}{1 - 0.999^2} \approx \frac{0.7236}{0.002} = 361.8$$
$$\theta_2 = 9.0 - \frac{18.95}{\sqrt{361.8} + 10^{-8}} = 9.0 - \frac{18.95}{19.02} = 9.0 - 0.996 = 8.004$$

Step 3: $g_3 = 16.008$

$$m_3 = 0.9(3.6) + 0.1(16.008) = 3.24 + 1.601 = 4.841$$
$$v_3 = 0.999(0.7236) + 0.001(256.3) = 0.7229 + 0.2563 = 0.9792$$
$$\hat{m}_3 = \frac{4.841}{1 - 0.9^3} = \frac{4.841}{0.271} = 17.86$$
$$\hat{v}_3 = \frac{0.9792}{1 - 0.999^3} \approx \frac{0.9792}{0.003} = 326.4$$
$$\theta_3 = 8.004 - \frac{17.86}{\sqrt{326.4}} = 8.004 - \frac{17.86}{18.07} = 8.004 - 0.988 = 7.016$$

Step 4: $g_4 = 14.032$

$$m_4 = 0.9(4.841) + 0.1(14.032) = 4.357 + 1.403 = 5.760$$
$$v_4 = 0.999(0.9792) + 0.001(196.9) = 0.9782 + 0.1969 = 1.1751$$
$$\hat{m}_4 = \frac{5.760}{1 - 0.9^4} = \frac{5.760}{0.3439} = 16.75$$
$$\hat{v}_4 = \frac{1.1751}{1 - 0.999^4} \approx \frac{1.1751}{0.004} = 293.8$$
$$\theta_4 = 7.016 - \frac{16.75}{\sqrt{293.8}} = 7.016 - \frac{16.75}{17.14} = 7.016 - 0.977 = 6.039$$

Step 5: $g_5 = 12.078$

$$m_5 = 0.9(5.760) + 0.1(12.078) = 5.184 + 1.208 = 6.392$$
$$v_5 = 0.999(1.1751) + 0.001(145.9) = 1.1739 + 0.1459 = 1.3198$$
$$\hat{m}_5 = \frac{6.392}{1 - 0.9^5} = \frac{6.392}{0.4095} = 15.61$$
$$\hat{v}_5 = \frac{1.3198}{1 - 0.999^5} \approx \frac{1.3198}{0.005} = 264.0$$
$$\theta_5 = 6.039 - \frac{15.61}{\sqrt{264.0}} = 6.039 - \frac{15.61}{16.25} = 6.039 - 0.961 = 5.078$$
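The five hand-computed steps can be replayed exactly; small differences past the third decimal come from rounding the intermediate values above:

```python
import numpy as np

# Replay the Adam example: alpha = 1.0 on f(theta) = theta^2
alpha, b1, b2, eps = 1.0, 0.9, 0.999, 1e-8
theta, m, v = 10.0, 0.0, 0.0
for t in range(1, 6):
    g = 2 * theta
    m = b1 * m + (1 - b1) * g          # first moment
    v = b2 * v + (1 - b2) * g**2       # second moment
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(round(theta, 2))  # 5.08, matching the hand calculation up to rounding
```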

Summary after 5 steps:

| Method | $\theta_5$ | $f(\theta_5)$ |
|---|---|---|
| Plain SGD ($\alpha = 0.1$) | 3.277 | 10.74 |
| Adam ($\alpha = 1.0$) | 5.078 | 25.79 |

In this simple case, plain SGD actually wins because it has a well-tuned learning rate for this specific problem. Adam takes steps of roughly constant size (close to $\alpha$) regardless of gradient magnitude, which is a feature for complex loss landscapes but a disadvantage on simple quadratics. Adam shines on problems with:

  • Very different scales across parameters
  • Noisy gradients
  • Saddle points and flat regions

Learning rate schedules

Even with adaptive methods, the learning rate schedule matters. Common choices:

Step decay

$$\alpha_k = \alpha_0 \times \gamma^{\lfloor k / s \rfloor}$$

Drop the learning rate by a factor $\gamma$ (e.g., 0.1) every $s$ epochs. Simple and effective. Used in most ResNet training recipes.

Cosine annealing

$$\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\left(\frac{k\pi}{T}\right)\right)$$

Smoothly decreases from $\alpha_0$ to $\alpha_{\min}$ over $T$ steps. Popular in modern training (vision transformers, language models).

Warmup

Start with a very small learning rate and linearly increase to the target over the first few thousand steps:

$$\alpha_k = \alpha_0 \cdot \min\left(1, \frac{k}{k_{\text{warmup}}}\right)$$

Warmup stabilizes training when the initial gradients are large and unreliable (common with random initialization in transformers).
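The three schedules above as plain functions (the constants here are illustrative defaults, not from any particular training recipe):

```python
import math

def step_decay(k, alpha0=0.1, gamma=0.1, s=30):
    # Multiply by gamma every s epochs
    return alpha0 * gamma ** (k // s)

def cosine_annealing(k, alpha0=0.1, alpha_min=0.0, T=100):
    # Smooth decrease from alpha0 to alpha_min over T steps
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(k * math.pi / T))

def warmup(k, alpha0=0.1, k_warmup=10):
    # Linear ramp from 0 to alpha0 over the first k_warmup steps
    return alpha0 * min(1.0, k / k_warmup)

print(round(step_decay(60), 4))        # 0.001: two drops by a factor of 10
print(round(cosine_annealing(50), 4))  # 0.05: halfway through the cosine
print(round(warmup(5), 4))             # 0.05: halfway through warmup
```

In practice you combine them, e.g. warmup for the first few thousand steps and cosine annealing afterward.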

One-cycle policy

Increase the learning rate from small to large over the first half of training, then decrease it back down. This is surprisingly effective and was popularized by Leslie Smith.


Which optimizer to use

| Scenario | Recommendation |
|---|---|
| Starting a new project | Adam with defaults ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) |
| Training CNNs (image classification) | SGD + momentum ($\beta = 0.9$) + step decay schedule |
| Training transformers | Adam or AdamW with warmup + cosine schedule |
| Sparse features (NLP, recommendations) | Adam or AdaGrad |
| Need best final accuracy | SGD + momentum (often beats Adam at convergence with proper tuning) |
| Quick prototyping | Adam (less tuning required) |

The general pattern: Adam converges faster initially, but well-tuned SGD with momentum often reaches a better final solution. This has been observed repeatedly in computer vision, though Adam with weight decay (AdamW) has narrowed the gap.

Figure: Convergence comparison of SGD variants. Adam and RMSProp adapt their learning rates and converge faster than plain SGD. Momentum smooths out oscillations for faster convergence than vanilla SGD.

Figure: Optimization paths on $f(x, y) = x^2 + 10y^2$. SGD oscillates in the high-curvature $y$ direction. Momentum damps the oscillations. Adam takes a more direct path to the minimum.


AdamW: weight decay done right

Standard $\ell_2$ regularization adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss. In Adam, this interacts poorly with the adaptive learning rate: the regularization gradient gets scaled by $1/\sqrt{v}$, weakening it for parameters with large gradients.

AdamW decouples weight decay from the gradient:

$$\theta_{k+1} = \theta_k - \alpha\left(\frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} + \lambda \theta_k\right)$$

The $\lambda \theta_k$ term is applied directly, not through the adaptive mechanism. This is now the standard optimizer for training language models.
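A sketch of the decoupled update for a single scalar parameter (hyperparameter values are the common defaults; real implementations work per tensor):

```python
import numpy as np

def adamw_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=1e-2):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Weight decay acts on theta directly, outside the 1/sqrt(v_hat) scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta, m, v = 10.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, 2 * theta, t)
print(theta)  # a little below 10: each step moves roughly lr toward 0
```

The key line is the update: had we instead added $\lambda\theta$ into $g$ before computing the moments, the decay would be divided by $\sqrt{\hat{v}}$ and lose its effect on large-gradient parameters.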


Gradient noise and generalization

A surprising finding in deep learning: SGD with small batches often generalizes better than full-batch gradient descent, even when both reach the same training loss. The noise in SGD acts as an implicit regularizer, pushing the optimizer toward “flat” minima that generalize well.

This connects to the bias-variance tradeoff:

  • Large batch, low noise: converges to sharp minima. Low training loss, potentially high test loss.
  • Small batch, high noise: finds flat minima. Slightly higher training loss, better test loss.

The optimal batch size depends on the problem. Too small wastes GPU parallelism. Too large hurts generalization. Most practitioners use batch sizes of 32 to 512 and tune from there.


Practical tips

  1. Start with Adam, $\alpha = 0.001$. If results are not good enough, try SGD + momentum with a carefully tuned schedule.

  2. Use gradient clipping (clip the gradient norm to a maximum value, say 1.0) to prevent exploding gradients, especially in RNNs and transformers.

  3. Monitor the gradient norm during training. If it spikes, your learning rate is too high or there is a data issue.

  4. Learning rate finder: sweep the learning rate from $10^{-7}$ to $10$ over one epoch and plot the loss. The best learning rate is usually about 10x smaller than the one where loss starts increasing.

  5. Weight decay ($10^{-4}$ to $10^{-2}$) usually helps. Use AdamW, not Adam + L2 regularization.
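Tip 2, gradient clipping by norm, fits in a few lines:

```python
import numpy as np

def clip_grad_norm(g, max_norm=1.0):
    # Rescale g so its norm is at most max_norm; direction is unchanged
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

print(clip_grad_norm(np.array([3.0, 4.0])))  # [0.6 0.8]: norm 5 clipped to 1
print(clip_grad_norm(np.array([0.1, 0.2])))  # unchanged: norm already below 1
```

Clipping caps the step length during gradient spikes without changing the update direction, which is why it pairs well with tip 3's gradient-norm monitoring.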


Python: implementing the optimizers

import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    
    def step(self, params, grads):
        return params - self.lr * grads

class SGDMomentum:
    def __init__(self, lr=0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.v = None
    
    def step(self, params, grads):
        if self.v is None:
            self.v = np.zeros_like(params)
        self.v = self.beta * self.v + grads
        return params - self.lr * self.v

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Compare on f(x) = x^2, starting at x = 10
for name, opt in [("SGD", SGD(0.1)), 
                   ("Momentum", SGDMomentum(0.05, 0.9)),
                   ("Adam", Adam(1.0))]:
    x = 10.0
    trajectory = [x]
    for _ in range(20):
        g = 2 * x  # gradient of x^2
        x = opt.step(x, g)
        trajectory.append(float(x))
    print(f"{name:10s}: final x = {x:.4f}, f(x) = {x**2:.4f}")

Summary table

| Optimizer | Update rule (simplified) | Memory | Good for |
|---|---|---|---|
| SGD | $\theta - \alpha g$ | $O(1)$ | Simple problems, theory |
| Momentum | $\theta - \alpha(\beta m + g)$ | $O(d)$ | CNNs, well-tuned training |
| AdaGrad | $\theta - \frac{\alpha}{\sqrt{v}} g$ | $O(d)$ | Sparse features |
| RMSProp | $\theta - \frac{\alpha}{\sqrt{\text{EMA}(g^2)}} g$ | $O(d)$ | RNNs, non-stationary |
| Adam | $\theta - \frac{\alpha \cdot \hat{m}}{\sqrt{\hat{v}}}$ | $O(2d)$ | Default choice for DL |
| AdamW | Adam + decoupled weight decay | $O(2d)$ | Transformers, LLMs |

What comes next

With SGD and its variants, you have the optimizers that power all of modern machine learning. The next step is to see how these tools get applied: from loss functions and regularization to training pipelines for real models. Head over to what is machine learning to start the ML series, where optimization meets data.
