Optimization techniques for deep networks
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: SGD and its variants, partial derivatives and gradients, and training neural networks.
Training a deep network means minimizing a loss function with millions of parameters. Standard gradient descent works fine on simple problems, but it breaks down on the complex, high-dimensional loss surfaces that deep networks produce. This article covers the optimizers that actually work: momentum, RMSProp, Adam, and the practical tricks (gradient clipping, learning rate schedules, weight decay) that make training stable.
The big picture
Gradient descent works, but it is slow and gets stuck. Better optimizers fix specific failure modes: oscillation, saddle points, and the need to tune one learning rate for millions of parameters.
| Optimizer | How it works | Speed | Oscillation | Tuning needed |
|---|---|---|---|---|
| SGD | Follow the gradient, fixed step | Slow | High (zig-zags) | Must hand-tune lr |
| SGD + Momentum | Accumulate velocity from past gradients | Faster | Less (smoothed) | lr and momentum |
| RMSProp | Scale each parameter by its recent gradient size | Fast | Low | lr and decay |
| Adam | Momentum + per-parameter scaling + bias correction | Fast | Low | Defaults usually work |
Momentum: a ball rolling downhill. Adam: a ball that also adjusts its step size per dimension.
```mermaid
graph LR
    SGD["SGD: takes one step in gradient direction"] --> MOM["Momentum: accumulates velocity, rolls through flat spots"]
    MOM --> ADAM["Adam: adapts step size per parameter, combines both ideas"]
```
Now let’s see why standard gradient descent fails and how each optimizer addresses it.
Why standard gradient descent struggles
Contour plot of f(x, y) = x^2 + 50y^2. SGD zigzags along the narrow valley, while Adam takes a more direct path to the minimum.
A deep network’s loss surface is nothing like the smooth bowl you see in textbook examples. Three properties make it hard to optimize.
Ill-conditioning. The Hessian of the loss has eigenvalues that span many orders of magnitude. Some directions curve steeply while others are nearly flat. Gradient descent with a single learning rate oscillates along steep directions and crawls along flat ones. You cannot fix this by just picking a smaller learning rate, because that makes the flat directions even slower.
Saddle points. In high dimensions, saddle points vastly outnumber local minima. At a saddle point the gradient is zero, but the surface curves up in some directions and down in others. Standard gradient descent can get stuck near these points for many iterations because the gradient magnitude is tiny.
Flat regions. Parts of the loss surface have very small gradients. This is especially common with saturating activations like sigmoid or tanh. Backpropagation multiplies small numbers through many layers via the chain rule, and the gradient can shrink to near zero.
We need optimizers that handle these problems. The core ideas are: (1) use history of past gradients to build up speed in consistent directions, and (2) adapt the learning rate per parameter so each weight gets an appropriately sized update.
```mermaid
flowchart TD
    START["Initialize weights"] --> GD["Standard GD"]
    GD --> SADDLE["Stuck near saddle point (tiny gradient, slow progress)"]
    GD --> OSCILLATE["Oscillates in steep directions (zig-zag path, wasted steps)"]
    GD --> FLAT["Crawls through flat region (vanishing gradient)"]
    SADDLE --> FIX["Solution: Momentum builds velocity to escape"]
    OSCILLATE --> FIX2["Solution: Adaptive LR per-parameter scaling"]
    FLAT --> FIX3["Solution: Adam combines both ideas"]
    FIX --> CONVERGE["Faster convergence"]
    FIX2 --> CONVERGE
    FIX3 --> CONVERGE
```
Momentum
Plain SGD computes a gradient and takes one step. Momentum keeps a running average of past gradients, called the velocity $v_t$. When the gradient points in the same direction for several steps, the velocity builds up and the optimizer moves faster. When the gradient oscillates, the velocity averages out the noise.
The update rules are:

$$v_t = \beta\, v_{t-1} + g_t$$
$$\theta_t = \theta_{t-1} - \alpha\, v_t$$

Here $\beta$ is the momentum coefficient (typically 0.9), $\alpha$ is the learning rate, and $g_t$ is the gradient at step $t$.
Think of a ball rolling downhill. On a flat stretch it keeps rolling because of its built-up velocity. In a narrow valley it does not bounce side to side as much because the sideways components cancel out.
The trade-off is simple: a higher $\beta$ means smoother updates but slower reaction to sudden changes in the loss surface. A lower $\beta$ tracks the current gradient more closely but smooths less.
Momentum accumulation over time
```mermaid
graph LR
    G1["Step 1<br/>g = 2.0<br/>v = 2.0"] --> G2["Step 2<br/>g = 1.8<br/>v = 3.6"]
    G2 --> G3["Step 3<br/>g = 1.5<br/>v = 4.7"]
    G3 --> G4["Step 4<br/>g = -0.5<br/>v = 3.7"]
```
When gradients point consistently in the same direction, the velocity builds up. When the gradient reverses (step 4), the accumulated velocity absorbs the change, preventing abrupt direction shifts.
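The velocity accumulation in the diagram can be reproduced in a few lines of plain Python (a minimal sketch; `momentum_velocities` is an illustrative helper name, not a library function):

```python
def momentum_velocities(grads, beta=0.9):
    """Accumulate velocity v_t = beta * v_{t-1} + g_t over a gradient sequence."""
    v = 0.0
    history = []
    for g in grads:
        v = beta * v + g
        history.append(v)
    return history

# Gradient sequence from the diagram above
for t, v in enumerate(momentum_velocities([2.0, 1.8, 1.5, -0.5]), start=1):
    print(f"step {t}: v = {v:.2f}")
```

The first three velocities grow (2.00, 3.60, 4.74) even as the gradients shrink, and the sign flip at step 4 only dents the accumulated velocity rather than reversing the direction.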
RMSProp
Momentum helps with direction but does not fix the scale problem. Some parameters have consistently large gradients while others have small ones. RMSProp (Root Mean Square Propagation) tracks the running average of squared gradients and uses it to normalize the update:

$$s_t = \rho\, s_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$$

The term $\sqrt{s_t}$ estimates the recent magnitude of the gradient for each parameter. Dividing by it means parameters with large gradients get smaller updates, and parameters with small gradients get larger updates. The small constant $\epsilon$ (typically $10^{-8}$) prevents division by zero.

RMSProp adapts the learning rate per parameter, so you no longer need to manually tune different rates for different layers. A typical decay rate $\rho$ for RMSProp is 0.9 or 0.99.
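A minimal NumPy sketch of one RMSProp step (variable names are illustrative) shows the per-parameter scaling in action:

```python
import numpy as np

def rmsprop_step(theta, grad, s, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: an EMA of squared gradients normalizes the step."""
    s = rho * s + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Two parameters with wildly different gradient scales...
theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([100.0, 0.1])
theta, s = rmsprop_step(theta, grad, s)
# ...receive nearly identical step sizes after normalization
print(1.0 - theta)
```

On the first step both entries move by roughly $\alpha / \sqrt{1 - \rho} \approx 0.0316$, despite a 1000x difference in gradient magnitude.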
Adam
Adam (Adaptive Moment Estimation) combines both ideas. It maintains a first moment estimate (like momentum) and a second moment estimate (like RMSProp):

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$

Since $m_0 = 0$ and $v_0 = 0$, the estimates are biased toward zero in early steps. Adam corrects for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The parameter update is:

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. These defaults work well for most problems, which is one reason Adam is so popular.

The bias correction matters most in early training. At $t = 1$, $1 - \beta_1^t = 0.1$ and $1 - \beta_2^t = 0.001$. Without correction, the tiny uncorrected second moment in the denominator would make the first few updates far too large.
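Put together, a single Adam step fits in a few lines of NumPy (a sketch of the update rules above; `adam_step` is an illustrative name):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; t counts steps from 1."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Calling it in a loop with `t = 1, 2, 3, ...` is all a basic training step needs; frameworks add conveniences like parameter groups and weight decay on top.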
Adam’s two moment estimates
```mermaid
graph TD
    G["Gradient g_t"] --> M["First moment m_t (mean of gradients, like momentum)"]
    G --> V["Second moment v_t (mean of squared gradients, like RMSProp)"]
    M --> BC1["Bias-corrected m-hat"]
    V --> BC2["Bias-corrected v-hat"]
    BC1 --> UPDATE["Update: step = lr * m-hat / sqrt(v-hat)"]
    BC2 --> UPDATE
```
The first moment tracks which direction to go. The second moment tracks how big recent gradients have been, scaling the step accordingly.
```mermaid
flowchart LR
    SGD["SGD (basic gradient step)"] --> MOM["+ Momentum (velocity from past gradients)"]
    MOM --> ADA["+ Adaptive LR (RMSProp: per-param scaling)"]
    ADA --> ADAM["= Adam (both combined + bias correction)"]
```
AdaGrad
Before RMSProp, there was AdaGrad. It accumulates all past squared gradients from the start of training:

$$s_t = s_{t-1} + g_t^2, \qquad \theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$$

Since $s_t$ only grows, the effective learning rate monotonically decreases. This is good for sparse gradients (like word embeddings in NLP) because rare features get larger updates. But for deep networks, the learning rate eventually shrinks to near zero and training stalls. RMSProp fixes this by using an exponential moving average instead of a running sum, so old gradients gradually fade out.
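The shrinking effective learning rate is easy to see numerically (a toy sketch with a constant gradient):

```python
import math

# AdaGrad's effective step shrinks as squared gradients accumulate
s, lr = 0.0, 0.1
for t in range(1, 6):
    g = 1.0                       # constant gradient of 1
    s += g * g                    # running sum, never decays
    step = lr * g / (math.sqrt(s) + 1e-8)
    print(f"step {t}: update = {step:.4f}")
```

The update decays like $1/\sqrt{t}$ (0.1000, 0.0707, 0.0577, ...) even though the gradient never changes; RMSProp's exponential moving average would hold the step constant in this situation.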
Optimizer comparison
Loss vs epoch for four optimizers. Adam converges fastest, followed by RMSProp, then Momentum, with vanilla SGD the slowest.
| Name | Update rule (brief) | Extra memory | Key hyperparams | Best for | Known issue |
|---|---|---|---|---|---|
| SGD | $\theta \leftarrow \theta - \alpha g$ | None | $\alpha$ | Simple models, convex problems | Slow on ill-conditioned surfaces |
| SGD + Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \alpha v$ | $v$ (same size as $\theta$) | $\alpha$, $\beta$ | CNNs with tuned schedule | Requires careful LR tuning |
| AdaGrad | $s \leftarrow s + g^2$; $\theta \leftarrow \theta - \alpha g / (\sqrt{s} + \epsilon)$ | $s$ (same size as $\theta$) | $\alpha$ | Sparse gradients (NLP embeddings) | LR goes to zero over time |
| RMSProp | $s \leftarrow \rho s + (1-\rho) g^2$; $\theta \leftarrow \theta - \alpha g / (\sqrt{s} + \epsilon)$ | $s$ (same size as $\theta$) | $\alpha$, $\rho$ | RNNs, non-stationary objectives | No bias correction |
| Adam | $m$, $v$ updates plus bias correction | $m$, $v$ (2x size of $\theta$) | $\alpha$, $\beta_1$, $\beta_2$ | Default choice for most tasks | Can generalize worse than SGD+momentum |
Convergence paths: SGD vs momentum vs Adam
```mermaid
graph TD
    START["Start"] --> SGD_PATH["SGD: zig-zag path, many steps, oscillates in narrow valleys"]
    START --> MOM_PATH["Momentum: smoother curve, fewer steps, builds speed in consistent directions"]
    START --> ADAM_PATH["Adam: nearly direct path, adapts per parameter, fewest steps"]
    SGD_PATH --> GOAL["Minimum"]
    MOM_PATH --> GOAL
    ADAM_PATH --> GOAL
```
Weight decay and L2 regularization
L2 regularization adds a penalty $\frac{\lambda}{2}\|\theta\|^2$ to the loss. The gradient of this penalty is $\lambda\theta$, so the SGD update becomes:

$$\theta_t = \theta_{t-1} - \alpha\,(g_t + \lambda\,\theta_{t-1})$$

The term $-\alpha\lambda\,\theta_{t-1}$ shrinks the weights every step. For plain SGD, L2 regularization and weight decay are mathematically equivalent.

For Adam, they differ. L2 regularization adds $\lambda\theta$ to the gradient before the adaptive scaling, which means the regularization strength varies per parameter because of the $\sqrt{\hat{v}_t}$ denominator. Weight decay, in contrast, subtracts $\alpha\lambda\,\theta_{t-1}$ after the adaptive update:

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha\lambda\,\theta_{t-1}$$

This decoupled version is called AdamW. In practice, AdamW gives better generalization than Adam with L2, especially for large models, because the decay is applied uniformly to every weight instead of being attenuated for parameters with large second-moment estimates.
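The two couplings differ by a single line, which a side-by-side sketch makes concrete (minimal single-tensor versions; `lam` is the decay strength and the function names are illustrative):

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """L2 coupling: the penalty gradient passes through the adaptive scaling."""
    g = grad + lam * theta                    # penalty folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """Decoupled weight decay: shrink weights after the adaptive update."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * theta, m, v
```

With a zero gradient, AdamW still shrinks the weight by exactly `lr * lam * theta` per step, while the L2 version routes the penalty through the $\sqrt{\hat{v}_t}$ denominator and so applies a parameter-dependent amount of decay.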
Gradient clipping
When gradients explode, a single update can ruin the model. This is common in RNNs and transformers where long sequences cause the gradient norm to grow exponentially through the chain rule.
Clip by value. Clamp each element of the gradient independently:

$$g_i \leftarrow \max(-c,\ \min(c,\ g_i))$$
This is simple but changes the direction of the gradient vector.
Clip by global norm. Compute the norm of the entire gradient vector. If it exceeds a threshold $c$, scale the whole vector down:

$$g \leftarrow c \cdot \frac{g}{\|g\|} \quad \text{if } \|g\| > c$$
Clipping by norm preserves the gradient direction, which is usually what you want. A typical threshold is 1.0 or 5.0, depending on the model.
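Clipping by global norm takes only a few lines of NumPy (a standalone sketch; `clip_by_global_norm` is an illustrative helper name, though frameworks ship equivalents):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads                    # under the threshold: leave untouched
    scale = max_norm / total_norm       # same factor for every element
    return [g * scale for g in grads]

clipped = clip_by_global_norm([np.array([6.0, 8.0])], max_norm=5.0)
print(clipped[0])  # [3. 4.] -- direction preserved, norm reduced to 5.0
```

Because every element is multiplied by the same scalar, the clipped gradient points in exactly the same direction as the original.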
Learning rate warmup and cosine decay
The learning rate schedule controls how $\alpha_t$ changes during training. Two widely used techniques are warmup and cosine decay.
Warmup. Start with a very small learning rate and linearly increase it to the peak value over a fixed number of warmup steps. This prevents early instability when the model weights are still random and gradients are noisy. Warmup is especially important for transformers and large batch training.
During warmup (step $t \le T_{\text{warmup}}$):

$$\alpha_t = \alpha_{\text{peak}} \cdot \frac{t}{T_{\text{warmup}}}$$
Cosine decay. After warmup, decrease the learning rate following a cosine curve. This gives a slow start to the decay, a faster middle phase, and a gentle landing at zero.
During decay (step $t > T_{\text{warmup}}$, total steps $T$):

$$\alpha_t = \alpha_{\text{peak}} \cdot \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}}\right)\right)$$
Learning rate schedule comparison
| Name | Formula / description | Use case | Pros | Cons |
|---|---|---|---|---|
| Constant | $\alpha_t = \alpha_0$ | Debugging, small experiments | Simple | Rarely optimal for full training |
| Step decay | Multiply $\alpha$ by 0.1 every $k$ epochs | CNNs (ResNet-style training) | Easy to implement | Requires manual milestone tuning |
| Cosine annealing | $\alpha_t = \alpha_0 \cdot \frac{1}{2}(1 + \cos(\pi t / T))$ | Transformers, modern CNNs | Smooth, no milestones to set | Needs total step count in advance |
| Warmup + cosine | Linear ramp then cosine decay | Transformers, large batch | Prevents early instability | Two extra hyperparams ($T_{\text{warmup}}$, $\alpha_{\text{peak}}$) |
| Reduce on plateau | Cut $\alpha$ when val loss stalls | General fine-tuning | Adaptive to training dynamics | Can react too slowly |
Learning rate warmup and cosine decay
```mermaid
graph LR
    A["Step 0<br/>lr = 0"] -->|"Linear ramp"| B["Warmup end<br/>lr = peak"]
    B -->|"Slow decay"| C["Mid training<br/>lr = 0.5 * peak"]
    C -->|"Faster decay"| D["Late training<br/>lr = 0.1 * peak"]
    D -->|"Gentle landing"| E["End<br/>lr near 0"]
```
Warmup prevents destructive early updates when Adam’s moment estimates are unreliable. Cosine decay avoids the sudden drops of step-decay schedules and gives a smooth landing.
Example 1: three steps of Adam on $f(x) = x^2$

We minimize $f(x) = x^2$ starting at $x_0 = 1.0$. Hyperparameters: $\alpha = 0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Initialize $m_0 = 0$, $v_0 = 0$.

The gradient of $f(x) = x^2$ is $g = 2x$.

Step 1 ($x_0 = 1.0$, $g_1 = 2.0$):

$$m_1 = 0.9 \cdot 0 + 0.1 \cdot 2.0 = 0.2, \qquad v_1 = 0.999 \cdot 0 + 0.001 \cdot 4.0 = 0.004$$

Bias correction ($1 - \beta_1^1 = 0.1$, $1 - \beta_2^1 = 0.001$):

$$\hat{m}_1 = \frac{0.2}{0.1} = 2.0, \qquad \hat{v}_1 = \frac{0.004}{0.001} = 4.0$$
$$x_1 = 1.0 - 0.01 \cdot \frac{2.0}{\sqrt{4.0} + 10^{-8}} \approx 1.0 - 0.0100 = 0.9900$$

Step 2 ($x_1 = 0.99$, $g_2 = 1.98$):

$$m_2 = 0.9 \cdot 0.2 + 0.1 \cdot 1.98 = 0.378, \qquad v_2 = 0.999 \cdot 0.004 + 0.001 \cdot 3.9204 \approx 0.007916$$

Bias correction ($1 - \beta_1^2 = 0.19$, $1 - \beta_2^2 \approx 0.001999$):

$$\hat{m}_2 = \frac{0.378}{0.19} \approx 1.9895, \qquad \hat{v}_2 = \frac{0.007916}{0.001999} \approx 3.9602$$
$$x_2 = 0.99 - 0.01 \cdot \frac{1.9895}{\sqrt{3.9602}} \approx 0.99 - 0.0100 = 0.9800$$

Step 3 ($x_2 \approx 0.98$, $g_3 \approx 1.96$):

$$m_3 = 0.9 \cdot 0.378 + 0.1 \cdot 1.96 = 0.5362, \qquad v_3 = 0.999 \cdot 0.007916 + 0.001 \cdot 3.8416 \approx 0.011750$$

Bias correction ($1 - \beta_1^3 = 0.271$, $1 - \beta_2^3 \approx 0.002997$):

$$\hat{m}_3 = \frac{0.5362}{0.271} \approx 1.9786, \qquad \hat{v}_3 = \frac{0.011750}{0.002997} \approx 3.9206$$
$$x_3 = 0.98 - 0.01 \cdot \frac{1.9786}{\sqrt{3.9206}} \approx 0.98 - 0.0100 = 0.9700$$

Notice the pattern: Adam takes steps of almost exactly 0.01 regardless of the gradient magnitude. The bias-corrected second moment $\hat{v}_t$ stays close to $g_t^2$, so $\hat{m}_t / \sqrt{\hat{v}_t} \approx 1$. The effective step size is governed by $\alpha$, not the gradient scale. This is a key property of Adam.
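A short script checks the arithmetic (assumed setup: minimize $f(x) = x^2$ from $x_0 = 1.0$ with $\alpha = 0.01$ and default $\beta_1$, $\beta_2$):

```python
import math

x, m, v = 1.0, 0.0, 0.0
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 4):
    g = 2 * x                         # gradient of f(x) = x^2
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)         # bias correction
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    print(f"step {t}: x = {x:.4f}")
```

Each iteration moves $x$ by almost exactly 0.01, regardless of the shrinking gradient.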
Example 2: gradient clipping by global norm
Suppose we have a gradient vector $g = (6.0, 8.0)$ and a clipping threshold $c = 5.0$.

Step 1: compute the global norm.

$$\|g\| = \sqrt{6.0^2 + 8.0^2} = \sqrt{36 + 64} = 10.0$$

Step 2: check against threshold. Since $\|g\| = 10.0 > 5.0$, we need to clip.

Step 3: compute the scale factor.

$$\text{scale} = \frac{c}{\|g\|} = \frac{5.0}{10.0} = 0.5$$

Step 4: multiply each element by the scale factor.

$$g_{\text{clipped}} = 0.5 \cdot (6.0, 8.0) = (3.0, 4.0)$$

Verify: the norm of the clipped gradient should be 5.0:

$$\|g_{\text{clipped}}\| = \sqrt{3.0^2 + 4.0^2} = \sqrt{25} = 5.0$$
The direction is preserved. Every element was scaled by the same factor, so the gradient still points the same way. Only the magnitude changed.
Example 3: learning rate with warmup and cosine decay
Total training steps $T = 1000$. Warmup steps $T_{\text{warmup}} = 100$. Peak learning rate $\alpha_{\text{peak}} = 0.001$.

During warmup ($t \le 100$): $\alpha_t = 0.001 \cdot t / 100$

During decay ($t > 100$): $\alpha_t = 0.001 \cdot \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{t - 100}{900}\right)\right)$

| Step | Phase | Calculation | $\alpha_t$ |
|---|---|---|---|
| 0 | Warmup | $0.001 \cdot 0 / 100$ | 0.000000 |
| 50 | Warmup | $0.001 \cdot 50 / 100$ | 0.000500 |
| 100 | Warmup (peak) | $0.001 \cdot 100 / 100$ | 0.001000 |
| 300 | Cosine decay | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot 200/900))$ | 0.000883 |
| 700 | Cosine decay | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot 600/900))$ | 0.000250 |
| 1000 | Cosine decay (end) | $0.001 \cdot \frac{1}{2}(1 + \cos(\pi))$ | 0.000000 |
The schedule ramps up quickly, holds near the peak for a while, then smoothly decays to zero. At step 300 (early in decay) the rate is still 88% of peak. By step 700, it has dropped to 25%. This smooth curve avoids the sudden drops of step decay schedules.
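The full schedule fits in one small function (a sketch matching the numbers in the table; `warmup_cosine_lr` is an illustrative name):

```python
import math

def warmup_cosine_lr(t, total_steps=1000, warmup_steps=100, peak=0.001):
    """Linear warmup to `peak`, then cosine decay to zero."""
    if t <= warmup_steps:
        return peak * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

for t in (0, 50, 100, 300, 700, 1000):
    print(f"step {t:4d}: lr = {warmup_cosine_lr(t):.6f}")
```

Printing the sampled steps reproduces the table: 0, 0.0005, and 0.001 through warmup, then 0.000883, 0.000250, and 0 during the cosine decay.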
Practical advice: when to use what
Start with Adam (or AdamW). It works well out of the box for most tasks. Set $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. Add weight decay of 0.01. This is the baseline.
Consider SGD + momentum for CNNs. Research shows SGD with momentum can generalize better than Adam on image classification, but you need to tune the learning rate and schedule carefully. If you have the compute budget for hyperparameter search, try it.
Use gradient clipping for RNNs and transformers. Set the clipping threshold to 1.0 and adjust if you see training instability. For transformers, clipping by global norm is standard.
Use learning rate warmup for transformers. Without warmup, the initial updates can be large and destabilizing because the Adam second moment estimates are unreliable when $t$ is small. Typical warmup is 1% to 10% of total training steps.
Use weight decay always. Even a small amount (0.01 or 0.1) acts as regularization and usually improves generalization.
Batch normalization normalizes activations within each layer, which smooths the loss surface and allows higher learning rates. That optimization benefit is why it belongs in this article; its regularization effects are discussed in regularization for deep networks.
Dropout is not an optimizer, but it interacts with optimization. It adds noise to training, which can slow convergence. When using dropout, you may need a slightly higher learning rate.
What comes next
You now have the tools to train deep networks efficiently. But training fast is only half the problem. Deep networks overfit easily, especially when the model is large relative to the dataset. The next article, regularization for deep networks, covers dropout, batch normalization as a regularizer, data augmentation, early stopping, and other techniques that keep your model from memorizing the training set.