Optimization techniques for deep networks
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: SGD and its variants, partial derivatives and gradients, and training neural networks.
Training a deep network means minimizing a loss function with millions of parameters. Standard gradient descent works fine on simple problems, but it breaks down on the complex, high-dimensional loss surfaces that deep networks produce. This article covers the optimizers that actually work: momentum, RMSProp, Adam, and the practical tricks (gradient clipping, learning rate schedules, weight decay) that make training stable.
The big picture
Gradient descent works, but it is slow and gets stuck. Better optimizers fix specific failure modes: oscillation, saddle points, and the need to tune one learning rate for millions of parameters.
| Optimizer | How it works | Speed | Oscillation | Tuning needed |
|---|---|---|---|---|
| SGD | Follow the gradient, fixed step | Slow | High (zig-zags) | Must hand-tune lr |
| SGD + Momentum | Accumulate velocity from past gradients | Faster | Less (smoothed) | lr and momentum |
| RMSProp | Scale each parameter by its recent gradient size | Fast | Low | lr and decay |
| Adam | Momentum + per-parameter scaling + bias correction | Fast | Low | Defaults usually work |
Momentum: a ball rolling downhill. Adam: a ball that also adjusts its step size per dimension.
```mermaid
graph LR
    SGD["SGD: takes one step in gradient direction"] --> MOM["Momentum: accumulates velocity, rolls through flat spots"]
    MOM --> ADAM["Adam: adapts step size per parameter, combines both ideas"]
```
Now let’s see why standard gradient descent fails and how each optimizer addresses it.
Why standard gradient descent struggles
Contour plot of f(x, y) = x^2 + 50y^2. SGD zigzags along the narrow valley, while Adam takes a more direct path to the minimum.
A deep network’s loss surface is nothing like the smooth bowl you see in textbook examples. Three properties make it hard to optimize.
Ill-conditioning. The Hessian of the loss has eigenvalues that span many orders of magnitude. Some directions curve steeply while others are nearly flat. Gradient descent with a single learning rate oscillates along steep directions and crawls along flat ones. You cannot fix this by just picking a smaller learning rate, because that makes the flat directions even slower.
Saddle points. In high dimensions, saddle points vastly outnumber local minima. At a saddle point the gradient is zero, but the surface curves up in some directions and down in others. Standard gradient descent can get stuck near these points for many iterations because the gradient magnitude is tiny.
Flat regions. Parts of the loss surface have very small gradients. This is especially common with saturating activations like sigmoid or tanh. Backpropagation multiplies small numbers through many layers via the chain rule, and the gradient can shrink to near zero.
We need optimizers that handle these problems. The core ideas are: (1) use history of past gradients to build up speed in consistent directions, and (2) adapt the learning rate per parameter so each weight gets an appropriately sized update.
```mermaid
flowchart TD
    START["Initialize weights"] --> GD["Standard GD"]
    GD --> SADDLE["Stuck near saddle point (tiny gradient, slow progress)"]
    GD --> OSCILLATE["Oscillates in steep directions (zig-zag path, wasted steps)"]
    GD --> FLAT["Crawls through flat region (vanishing gradient)"]
    SADDLE --> FIX["Solution: Momentum builds velocity to escape"]
    OSCILLATE --> FIX2["Solution: Adaptive LR per-parameter scaling"]
    FLAT --> FIX3["Solution: Adam combines both ideas"]
    FIX --> CONVERGE["Faster convergence"]
    FIX2 --> CONVERGE
    FIX3 --> CONVERGE
```
Momentum
Plain SGD computes a gradient and takes one step. Momentum keeps a running average of past gradients, called the velocity $v_t$. When the gradient points in the same direction for several steps, the velocity builds up and the optimizer moves faster. When the gradient oscillates, the velocity averages out the noise.
The update rules are:

$$v_t = \beta\, v_{t-1} + g_t$$
$$\theta_t = \theta_{t-1} - \alpha\, v_t$$

Here $\beta$ is the momentum coefficient (typically 0.9), $\alpha$ is the learning rate, and $g_t$ is the gradient at step $t$.
Think of a ball rolling downhill. On a flat stretch it keeps rolling because of its built-up velocity. In a narrow valley it does not bounce side to side as much because the sideways components cancel out.
The trade-off is simple: a higher $\beta$ means smoother updates but slower reaction to sudden changes in the loss surface. A lower $\beta$ tracks the current gradient more closely but smooths less.
Momentum accumulation over time
```mermaid
graph LR
    G1["Step 1<br/>g = 2.0<br/>v = 2.0"] --> G2["Step 2<br/>g = 1.8<br/>v = 3.6"]
    G2 --> G3["Step 3<br/>g = 1.5<br/>v = 4.7"]
    G3 --> G4["Step 4<br/>g = -0.5<br/>v = 3.7"]
```
When gradients point consistently in the same direction, the velocity builds up. When the gradient reverses (step 4), the accumulated velocity absorbs the change, preventing abrupt direction shifts.
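The velocity accumulation in the diagram can be reproduced in a few lines of plain Python (a minimal sketch; `momentum_velocities` is an illustrative helper name, not a library function):

```python
def momentum_velocities(grads, beta=0.9):
    """Accumulate velocity v_t = beta * v_{t-1} + g_t over a gradient sequence."""
    v = 0.0
    history = []
    for g in grads:
        v = beta * v + g
        history.append(v)
    return history

# Gradient sequence from the diagram above
for t, v in enumerate(momentum_velocities([2.0, 1.8, 1.5, -0.5]), start=1):
    print(f"step {t}: v = {v:.2f}")
```

The first three velocities grow (2.00, 3.60, 4.74) even as the gradients shrink, and the sign flip at step 4 only dents the accumulated velocity rather than reversing the direction.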
RMSProp
Momentum helps with direction but does not fix the scale problem. Some parameters have consistently large gradients while others have small ones. RMSProp (Root Mean Square Propagation) tracks the running average of squared gradients and uses it to normalize the update:

$$s_t = \rho\, s_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$$

The term $\sqrt{s_t}$ estimates the recent magnitude of the gradient for each parameter. Dividing by it means parameters with large gradients get smaller updates, and parameters with small gradients get larger updates. The small constant $\epsilon$ (typically $10^{-8}$) prevents division by zero.

RMSProp adapts the learning rate per parameter, so you no longer need to manually tune different rates for different layers. A typical decay rate $\rho$ for RMSProp is 0.9 or 0.99.
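A minimal NumPy sketch of one RMSProp step (variable names are illustrative) shows the per-parameter scaling in action:

```python
import numpy as np

def rmsprop_step(theta, grad, s, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: an EMA of squared gradients normalizes the step."""
    s = rho * s + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Two parameters with wildly different gradient scales...
theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([100.0, 0.1])
theta, s = rmsprop_step(theta, grad, s)
# ...receive nearly identical step sizes after normalization
print(1.0 - theta)
```

On the first step both entries move by roughly $\alpha / \sqrt{1 - \rho} \approx 0.0316$, despite a 1000x difference in gradient magnitude.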
Adam
Adam (Adaptive Moment Estimation) combines both ideas. It maintains a first moment estimate (like momentum) and a second moment estimate (like RMSProp):

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$

Since $m_0 = 0$ and $v_0 = 0$, the estimates are biased toward zero in early steps. Adam corrects for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The parameter update is:

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. These defaults work well for most problems, which is one reason Adam is so popular.

The bias correction matters most in early training. At $t = 1$, $1 - \beta_1^t = 0.1$ and $1 - \beta_2^t = 0.001$. Without correction, the tiny uncorrected second moment in the denominator would make the first few updates far too large.
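Put together, a single Adam step fits in a few lines of NumPy (a sketch of the update rules above; `adam_step` is an illustrative name):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; t counts steps from 1."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Calling it in a loop with `t = 1, 2, 3, ...` is all a basic training step needs; frameworks add conveniences like parameter groups and weight decay on top.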
Adam’s two moment estimates
```mermaid
graph TD
    G["Gradient g_t"] --> M["First moment m_t (mean of gradients, like momentum)"]
    G --> V["Second moment v_t (mean of squared gradients, like RMSProp)"]
    M --> BC1["Bias-corrected m-hat"]
    V --> BC2["Bias-corrected v-hat"]
    BC1 --> UPDATE["Update: step = lr * m-hat / sqrt(v-hat)"]
    BC2 --> UPDATE
```
The first moment tracks which direction to go. The second moment tracks how big recent gradients have been, scaling the step accordingly.
```mermaid
flowchart LR
    SGD["SGD (basic gradient step)"] --> MOM["+ Momentum (velocity from past gradients)"]
    MOM --> ADA["+ Adaptive LR (RMSProp: per-param scaling)"]
    ADA --> ADAM["= Adam (both combined + bias correction)"]
```
AdaGrad
Before RMSProp, there was AdaGrad. It accumulates all past squared gradients from the start of training:

$$s_t = s_{t-1} + g_t^2, \qquad \theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$$

Since $s_t$ only grows, the effective learning rate monotonically decreases. This is good for sparse gradients (like word embeddings in NLP) because rare features get larger updates. But for deep networks, the learning rate eventually shrinks to near zero and training stalls. RMSProp fixes this by using an exponential moving average instead of a running sum, so old gradients gradually fade out.
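The shrinking effective learning rate is easy to see numerically (a toy sketch with a constant gradient):

```python
import math

# AdaGrad's effective step shrinks as squared gradients accumulate
s, lr = 0.0, 0.1
for t in range(1, 6):
    g = 1.0                       # constant gradient of 1
    s += g * g                    # running sum, never decays
    step = lr * g / (math.sqrt(s) + 1e-8)
    print(f"step {t}: update = {step:.4f}")
```

The update decays like $1/\sqrt{t}$ (0.1000, 0.0707, 0.0577, ...) even though the gradient never changes; RMSProp's exponential moving average would hold the step constant in this situation.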
Optimizer comparison
Loss vs epoch for four optimizers. Adam converges fastest, followed by RMSProp, then Momentum, with vanilla SGD the slowest.
| Name | Update rule (brief) | Extra memory | Key hyperparams | Best for | Known issue |
|---|---|---|---|---|---|
| SGD | $\theta \leftarrow \theta - \alpha g$ | None | $\alpha$ | Simple models, convex problems | Slow on ill-conditioned surfaces |
| SGD + Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \alpha v$ | $v$ (same size as $\theta$) | $\alpha$, $\beta$ | CNNs with tuned schedule | Requires careful LR tuning |
| AdaGrad | $s \leftarrow s + g^2$; $\theta \leftarrow \theta - \alpha g / (\sqrt{s} + \epsilon)$ | $s$ (same size as $\theta$) | $\alpha$ | Sparse gradients (NLP embeddings) | LR goes to zero over time |
| RMSProp | $s \leftarrow \rho s + (1-\rho) g^2$; $\theta \leftarrow \theta - \alpha g / (\sqrt{s} + \epsilon)$ | $s$ (same size as $\theta$) | $\alpha$, $\rho$ | RNNs, non-stationary objectives | No bias correction |
| Adam | $m$, $v$ updates plus bias correction | $m$, $v$ (2x size of $\theta$) | $\alpha$, $\beta_1$, $\beta_2$ | Default choice for most tasks | Can generalize worse than SGD+momentum |
Convergence paths: SGD vs momentum vs Adam
```mermaid
graph TD
    START["Start"] --> SGD_PATH["SGD: zig-zag path, many steps, oscillates in narrow valleys"]
    START --> MOM_PATH["Momentum: smoother curve, fewer steps, builds speed in consistent directions"]
    START --> ADAM_PATH["Adam: nearly direct path, adapts per parameter, fewest steps"]
    SGD_PATH --> GOAL["Minimum"]
    MOM_PATH --> GOAL
    ADAM_PATH --> GOAL
```
Weight decay and L2 regularization
L2 regularization adds a penalty $\frac{\lambda}{2}\|\theta\|^2$ to the loss. The gradient of this penalty is $\lambda\theta$, so the SGD update becomes:

$$\theta_t = \theta_{t-1} - \alpha\,(g_t + \lambda\,\theta_{t-1})$$

The term $-\alpha\lambda\,\theta_{t-1}$ shrinks the weights every step. For plain SGD, L2 regularization and weight decay are mathematically equivalent.

For Adam, they differ. L2 regularization adds $\lambda\theta$ to the gradient before the adaptive scaling, which means the regularization strength varies per parameter because of the $\sqrt{\hat{v}_t}$ denominator. Weight decay, in contrast, subtracts $\alpha\lambda\,\theta_{t-1}$ after the adaptive update:

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha\lambda\,\theta_{t-1}$$

This decoupled version is called AdamW. In practice, AdamW gives better generalization than Adam with L2, especially for large models, because the decay is applied uniformly to every weight instead of being attenuated for parameters with large second-moment estimates.
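The two couplings differ by a single line, which a side-by-side sketch makes concrete (minimal single-tensor versions; `lam` is the decay strength and the function names are illustrative):

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """L2 coupling: the penalty gradient passes through the adaptive scaling."""
    g = grad + lam * theta                    # penalty folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """Decoupled weight decay: shrink weights after the adaptive update."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * theta, m, v
```

With a zero gradient, AdamW still shrinks the weight by exactly `lr * lam * theta` per step, while the L2 version routes the penalty through the $\sqrt{\hat{v}_t}$ denominator and so applies a parameter-dependent amount of decay.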
Gradient clipping
When gradients explode, a single update can ruin the model. This is common in RNNs and transformers where long sequences cause the gradient norm to grow exponentially through the chain rule.
Clip by value. Clamp each element of the gradient independently:

$$g_i \leftarrow \max(-c,\ \min(c,\ g_i))$$
This is simple but changes the direction of the gradient vector.
Clip by global norm. Compute the norm of the entire gradient vector. If it exceeds a threshold $c$, scale the whole vector down:

$$g \leftarrow c \cdot \frac{g}{\|g\|} \quad \text{if } \|g\| > c$$
Clipping by norm preserves the gradient direction, which is usually what you want. A typical threshold is 1.0 or 5.0, depending on the model.
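Clipping by global norm takes only a few lines of NumPy (a standalone sketch; `clip_by_global_norm` is an illustrative helper name, though frameworks ship equivalents):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads                    # under the threshold: leave untouched
    scale = max_norm / total_norm       # same factor for every element
    return [g * scale for g in grads]

clipped = clip_by_global_norm([np.array([6.0, 8.0])], max_norm=5.0)
print(clipped[0])  # [3. 4.] -- direction preserved, norm reduced to 5.0
```

Because every element is multiplied by the same scalar, the clipped gradient points in exactly the same direction as the original.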
Learning rate warmup and cosine decay
The learning rate schedule controls how $\alpha_t$ changes during training. Two widely used techniques are warmup and cosine decay.
Warmup. Start with a very small learning rate and linearly increase it to the peak value over a fixed number of warmup steps. This prevents early instability when the model weights are still random and gradients are noisy. Warmup is especially important for transformers and large batch training.
During warmup (step $t \le T_{\text{warmup}}$):

$$\alpha_t = \alpha_{\text{peak}} \cdot \frac{t}{T_{\text{warmup}}}$$
Cosine decay. After warmup, decrease the learning rate following a cosine curve. This gives a slow start to the decay, a faster middle phase, and a gentle landing at zero.
During decay (step $t > T_{\text{warmup}}$, total steps $T$):

$$\alpha_t = \alpha_{\text{peak}} \cdot \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}}\right)\right)$$
Learning rate schedule comparison
| Name | Formula / description | Use case | Pros | Cons |
|---|---|---|---|---|
| Constant | $\alpha_t = \alpha_0$ | Debugging, small experiments | Simple | Rarely optimal for full training |
| Step decay | Multiply $\alpha$ by 0.1 every $k$ epochs | CNNs (ResNet-style training) | Easy to implement | Requires manual milestone tuning |
| Cosine annealing | $\alpha_t = \alpha_0 \cdot \frac{1}{2}(1 + \cos(\pi t / T))$ | Transformers, modern CNNs | Smooth, no milestones to set | Needs total step count in advance |
| Warmup + cosine | Linear ramp then cosine decay | Transformers, large batch | Prevents early instability | Two extra hyperparams ($T_{\text{warmup}}$, $\alpha_{\text{peak}}$) |
| Reduce on plateau | Cut $\alpha$ when val loss stalls | General fine-tuning | Adaptive to training dynamics | Can react too slowly |
Learning rate warmup and cosine decay
```mermaid
graph LR
    A["Step 0<br/>lr = 0"] -->|"Linear ramp"| B["Warmup end<br/>lr = peak"]
    B -->|"Slow decay"| C["Mid training<br/>lr = 0.5 * peak"]
    C -->|"Faster decay"| D["Late training<br/>lr = 0.1 * peak"]
    D -->|"Gentle landing"| E["End<br/>lr near 0"]
```
Warmup prevents destructive early updates when Adam’s moment estimates are unreliable. Cosine decay avoids the sudden drops of step-decay schedules and gives a smooth landing.
Example 1: three steps of Adam on $f(x) = x^2$

We minimize $f(x) = x^2$ starting at $x_0 = 1.0$. Hyperparameters: $\alpha = 0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Initialize $m_0 = 0$, $v_0 = 0$.

The gradient of $f(x) = x^2$ is $g = 2x$.

Step 1 ($x_0 = 1.0$, $g_1 = 2.0$):

$$m_1 = 0.9 \cdot 0 + 0.1 \cdot 2.0 = 0.2, \qquad v_1 = 0.999 \cdot 0 + 0.001 \cdot 4.0 = 0.004$$

Bias correction ($1 - \beta_1^1 = 0.1$, $1 - \beta_2^1 = 0.001$):

$$\hat{m}_1 = \frac{0.2}{0.1} = 2.0, \qquad \hat{v}_1 = \frac{0.004}{0.001} = 4.0$$
$$x_1 = 1.0 - 0.01 \cdot \frac{2.0}{\sqrt{4.0} + 10^{-8}} \approx 1.0 - 0.0100 = 0.9900$$

Step 2 ($x_1 = 0.99$, $g_2 = 1.98$):

$$m_2 = 0.9 \cdot 0.2 + 0.1 \cdot 1.98 = 0.378, \qquad v_2 = 0.999 \cdot 0.004 + 0.001 \cdot 3.9204 \approx 0.007916$$

Bias correction ($1 - \beta_1^2 = 0.19$, $1 - \beta_2^2 \approx 0.001999$):

$$\hat{m}_2 = \frac{0.378}{0.19} \approx 1.9895, \qquad \hat{v}_2 = \frac{0.007916}{0.001999} \approx 3.9602$$
$$x_2 = 0.99 - 0.01 \cdot \frac{1.9895}{\sqrt{3.9602}} \approx 0.99 - 0.0100 = 0.9800$$

Step 3 ($x_2 \approx 0.98$, $g_3 \approx 1.96$):

$$m_3 = 0.9 \cdot 0.378 + 0.1 \cdot 1.96 = 0.5362, \qquad v_3 = 0.999 \cdot 0.007916 + 0.001 \cdot 3.8416 \approx 0.011750$$

Bias correction ($1 - \beta_1^3 = 0.271$, $1 - \beta_2^3 \approx 0.002997$):

$$\hat{m}_3 = \frac{0.5362}{0.271} \approx 1.9786, \qquad \hat{v}_3 = \frac{0.011750}{0.002997} \approx 3.9206$$
$$x_3 = 0.98 - 0.01 \cdot \frac{1.9786}{\sqrt{3.9206}} \approx 0.98 - 0.0100 = 0.9700$$

Notice the pattern: Adam takes steps of almost exactly 0.01 regardless of the gradient magnitude. The bias-corrected second moment $\hat{v}_t$ stays close to $g_t^2$, so $\hat{m}_t / \sqrt{\hat{v}_t} \approx 1$. The effective step size is governed by $\alpha$, not the gradient scale. This is a key property of Adam.
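A short script checks the arithmetic (assumed setup: minimize $f(x) = x^2$ from $x_0 = 1.0$ with $\alpha = 0.01$ and default $\beta_1$, $\beta_2$):

```python
import math

x, m, v = 1.0, 0.0, 0.0
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 4):
    g = 2 * x                         # gradient of f(x) = x^2
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)         # bias correction
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    print(f"step {t}: x = {x:.4f}")
```

Each iteration moves $x$ by almost exactly 0.01, regardless of the shrinking gradient.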
Example 2: gradient clipping by global norm
Suppose we have a gradient vector $g = (6.0, 8.0)$ and a clipping threshold $c = 5.0$.

Step 1: compute the global norm.

$$\|g\| = \sqrt{6.0^2 + 8.0^2} = \sqrt{36 + 64} = 10.0$$

Step 2: check against threshold. Since $\|g\| = 10.0 > 5.0$, we need to clip.

Step 3: compute the scale factor.

$$\text{scale} = \frac{c}{\|g\|} = \frac{5.0}{10.0} = 0.5$$

Step 4: multiply each element by the scale factor.

$$g_{\text{clipped}} = 0.5 \cdot (6.0, 8.0) = (3.0, 4.0)$$

Verify: the norm of the clipped gradient should be 5.0:

$$\|g_{\text{clipped}}\| = \sqrt{3.0^2 + 4.0^2} = \sqrt{25} = 5.0$$
The direction is preserved. Every element was scaled by the same factor, so the gradient still points the same way. Only the magnitude changed.
Example 3: learning rate with warmup and cosine decay
Total training steps $T = 1000$. Warmup steps $T_{\text{warmup}} = 100$. Peak learning rate $\alpha_{\text{peak}} = 0.001$.

During warmup ($t \le 100$): $\alpha_t = 0.001 \cdot t / 100$

During decay ($t > 100$): $\alpha_t = 0.001 \cdot \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{t - 100}{900}\right)\right)$

| Step | Phase | Calculation | $\alpha_t$ |
|---|---|---|---|
| 0 | Warmup | $0.001 \cdot 0 / 100$ | 0.000000 |
| 50 | Warmup | $0.001 \cdot 50 / 100$ | 0.000500 |
| 100 | Warmup (peak) | $0.001 \cdot 100 / 100$ | 0.001000 |
| 300 | Cosine decay | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot 200/900))$ | 0.000883 |
| 700 | Cosine decay | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot 600/900))$ | 0.000250 |
| 1000 | Cosine decay (end) | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi))$ | 0.000000 |
The schedule ramps up quickly, holds near the peak for a while, then smoothly decays to zero. At step 300 (early in decay) the rate is still 88% of peak. By step 700, it has dropped to 25%. This smooth curve avoids the sudden drops of step decay schedules.
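The full schedule fits in one small function (a sketch matching the numbers in the table; `warmup_cosine_lr` is an illustrative name):

```python
import math

def warmup_cosine_lr(t, total_steps=1000, warmup_steps=100, peak=0.001):
    """Linear warmup to `peak`, then cosine decay to zero."""
    if t <= warmup_steps:
        return peak * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

for t in (0, 50, 100, 300, 700, 1000):
    print(f"step {t:4d}: lr = {warmup_cosine_lr(t):.6f}")
```

Printing the sampled steps reproduces the table: 0, 0.0005, and 0.001 through warmup, then 0.000883, 0.000250, and 0 during the cosine decay.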
Practical advice: when to use what
Start with Adam (or AdamW). It works well out of the box for most tasks. Set $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. Add weight decay of 0.01. This is the baseline.
Consider SGD + momentum for CNNs. Research shows SGD with momentum can generalize better than Adam on image classification, but you need to tune the learning rate and schedule carefully. If you have the compute budget for hyperparameter search, try it.
Use gradient clipping for RNNs and transformers. Set the clipping threshold to 1.0 and adjust if you see training instability. For transformers, clipping by global norm is standard.
Use learning rate warmup for transformers. Without warmup, the initial updates can be large and destabilizing because the Adam second moment estimates are unreliable when $t$ is small. Typical warmup is 1% to 10% of total training steps.
Use weight decay always. Even a small amount (0.01 or 0.1) acts as regularization and usually improves generalization.
Batch normalization normalizes activations within each layer, which smooths the loss surface and allows higher learning rates. That optimization benefit is why it belongs in this article; its regularization effects are discussed in regularization for deep networks.
Dropout is not an optimizer, but it interacts with optimization. It adds noise to training, which can slow convergence. When using dropout, you may need a slightly higher learning rate.
What comes next
You now have the tools to train deep networks efficiently. But training fast is only half the problem. Deep networks overfit easily, especially when the model is large relative to the dataset. The next article, regularization for deep networks, covers dropout, batch normalization as a regularizer, data augmentation, early stopping, and other techniques that keep your model from memorizing the training set.