
Optimization techniques for deep networks

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: SGD and its variants, partial derivatives and gradients, and training neural networks.

Training a deep network means minimizing a loss function with millions of parameters. Standard gradient descent works fine on simple problems, but it breaks down on the complex, high-dimensional loss surfaces that deep networks produce. This article covers the optimizers that actually work: momentum, RMSProp, Adam, and the practical tricks (gradient clipping, learning rate schedules, weight decay) that make training stable.

The big picture

Gradient descent works, but it is slow and gets stuck. Better optimizers fix specific failure modes: oscillation, saddle points, and the need to tune one learning rate for millions of parameters.

| Optimizer | How it works | Speed | Oscillation | Tuning needed |
| --- | --- | --- | --- | --- |
| SGD | Follow the gradient, fixed step | Slow | High (zig-zags) | Must hand-tune lr |
| SGD + Momentum | Accumulate velocity from past gradients | Faster | Less (smoothed) | lr and momentum |
| RMSProp | Scale each parameter by its recent gradient size | Fast | Low | lr and decay |
| Adam | Momentum + per-parameter scaling + bias correction | Fast | Low | Defaults usually work |

Momentum: a ball rolling downhill. Adam: a ball that also adjusts its step size per dimension.

graph LR
  SGD["SGD:
takes one step
in gradient direction"] --> MOM["Momentum:
accumulates velocity,
rolls through flat spots"]
  MOM --> ADAM["Adam:
adapts step size
per parameter,
combines both ideas"]

Now let’s see why standard gradient descent fails and how each optimizer addresses it.

Why standard gradient descent struggles

Contour plot of f(x, y) = x^2 + 50y^2. SGD zigzags along the narrow valley, while Adam takes a more direct path to the minimum.

A deep network’s loss surface is nothing like the smooth bowl you see in textbook examples. Three properties make it hard to optimize.

Ill-conditioning. The Hessian of the loss has eigenvalues that span many orders of magnitude. Some directions curve steeply while others are nearly flat. Gradient descent with a single learning rate oscillates along steep directions and crawls along flat ones. You cannot fix this by just picking a smaller learning rate, because that makes the flat directions even slower.

Saddle points. In high dimensions, saddle points vastly outnumber local minima. At a saddle point the gradient is zero, but the surface curves up in some directions and down in others. Standard gradient descent can get stuck near these points for many iterations because the gradient magnitude is tiny.

Flat regions. Parts of the loss surface have very small gradients. This is especially common with saturating activations like sigmoid or tanh. Backpropagation multiplies small numbers through many layers via the chain rule, and the gradient can shrink to near zero.

We need optimizers that handle these problems. The core ideas are: (1) use history of past gradients to build up speed in consistent directions, and (2) adapt the learning rate per parameter so each weight gets an appropriately sized update.

flowchart TD
  START["Initialize weights"] --> GD["Standard GD"]
  GD --> SADDLE["Stuck near saddle point
(tiny gradient, slow progress)"]
  GD --> OSCILLATE["Oscillates in steep directions
(zig-zag path, wasted steps)"]
  GD --> FLAT["Crawls through flat region
(vanishing gradient)"]
  SADDLE --> FIX["Solution: Momentum
builds velocity to escape"]
  OSCILLATE --> FIX2["Solution: Adaptive LR
per-parameter scaling"]
  FLAT --> FIX3["Solution: Adam
combines both ideas"]
  FIX --> CONVERGE["Faster convergence"]
  FIX2 --> CONVERGE
  FIX3 --> CONVERGE

Momentum

Plain SGD computes a gradient $g_t$ and takes one step. Momentum keeps a running average of past gradients, called the velocity $v_t$. When the gradient points in the same direction for several steps, the velocity builds up and the optimizer moves faster. When the gradient oscillates, the velocity averages out the noise.

The update rules are:

$$v_t = \beta \, v_{t-1} + g_t$$
$$\theta_t = \theta_{t-1} - \alpha \, v_t$$

Here $\beta$ is the momentum coefficient (typically 0.9), $\alpha$ is the learning rate, and $g_t = \nabla_\theta L(\theta_{t-1})$ is the gradient at step $t$.

Think of a ball rolling downhill. On a flat stretch it keeps rolling because of its built-up velocity. In a narrow valley it does not bounce side to side as much because the sideways components cancel out.

The trade-off is simple: higher $\beta$ means smoother updates but slower reaction to sudden changes in the loss surface. Lower $\beta$ tracks the current gradient more closely but smooths less.

Momentum accumulation over time

graph LR
  G1["Step 1
g = 2.0
v = 2.0"] --> G2["Step 2
g = 1.8
v = 3.6"]
  G2 --> G3["Step 3
g = 1.5
v = 4.7"]
  G3 --> G4["Step 4
g = -0.5
v = 3.7"]

When gradients point consistently in the same direction, the velocity builds up. When the gradient reverses (step 4), the accumulated velocity absorbs the change, preventing abrupt direction shifts.
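The numbers in the diagram can be reproduced directly from the update rule. This is a minimal sketch of velocity accumulation (with $\beta = 0.9$ and the gradients from the diagram), not a full optimizer:

```python
# Momentum velocity accumulation: v_t = beta * v_{t-1} + g_t
beta = 0.9
gradients = [2.0, 1.8, 1.5, -0.5]  # gradient sequence from the diagram above

v = 0.0
velocities = []
for g in gradients:
    v = beta * v + g
    velocities.append(v)

print([round(x, 2) for x in velocities])  # → [2.0, 3.6, 4.74, 3.77]
```

The first three gradients agree in sign, so the velocity grows past any single gradient's magnitude; the reversal at step 4 only dents it.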

RMSProp

Momentum helps with direction but does not fix the scale problem. Some parameters have consistently large gradients while others have small ones. RMSProp (Root Mean Square Propagation) tracks the running average of squared gradients and uses it to normalize the update.

$$s_t = \beta \, s_{t-1} + (1 - \beta) \, g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha \, g_t}{\sqrt{s_t} + \epsilon}$$

The term $\sqrt{s_t}$ estimates the recent magnitude of the gradient for each parameter. Dividing by it means parameters with large gradients get smaller updates, and parameters with small gradients get larger updates. The small constant $\epsilon$ (typically $10^{-8}$) prevents division by zero.

RMSProp adapts the learning rate per parameter. You no longer need to manually tune different rates for different layers. Typical $\beta$ for RMSProp is 0.9 or 0.99.
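For a single parameter, one RMSProp update might be sketched like this (the function name and signature are illustrative, not a library API):

```python
import math

def rmsprop_step(theta, g, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter; returns (new_theta, new_s)."""
    s = beta * s + (1 - beta) * g ** 2             # running average of squared gradients
    theta = theta - lr * g / (math.sqrt(s) + eps)  # step scaled by recent gradient size
    return theta, s

# Minimizing f(x) = x^2 (gradient g = 2x) starting from x = 2.0
theta, s = 2.0, 0.0
for _ in range(5):
    theta, s = rmsprop_step(theta, 2 * theta, s)
print(theta)
```

Note how the denominator $\sqrt{s_t}$ makes the step size depend on recent gradient magnitude rather than on the raw learning rate alone.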

Adam

Adam (Adaptive Moment Estimation) combines both ideas. It maintains a first moment estimate $m_t$ (like momentum) and a second moment estimate $v_t$ (like RMSProp).

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t$$
$$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$$

Since $m_0 = 0$ and $v_0 = 0$, the estimates are biased toward zero in early steps. Adam corrects for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The parameter update is:

$$\theta_t = \theta_{t-1} - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and $\alpha = 0.001$. These defaults work well for most problems, which is one reason Adam is so popular.

The bias correction matters most in early training. At $t = 1$, $m_1 = 0.1 \cdot g_1$ and $\hat{m}_1 = g_1$. Without correction, the first few updates would be far too small.
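The full update fits in a small function. A minimal single-parameter sketch with the state passed explicitly (names are illustrative, not a framework API):

```python
import math

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g          # first moment (momentum-like)
    v = b2 * v + (1 - b2) * g ** 2     # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step on f(x) = x^2 from x = 2.0 with lr = 0.01
x, m, v = 2.0, 0.0, 0.0
x, m, v = adam_step(x, 2 * x, m, v, t=1, lr=0.01)
print(round(x, 4))  # → 1.99
```

Because the bias correction divides by $1 - \beta_1^t$ and $1 - \beta_2^t$, the very first step already has a sensible magnitude despite $m_0 = v_0 = 0$.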

Adam’s two moment estimates

graph TD
  G["Gradient g_t"] --> M["First moment m_t
(mean of gradients,
like momentum)"]
  G --> V["Second moment v_t
(mean of squared gradients,
like RMSProp)"]
  M --> BC1["Bias-corrected m-hat"]
  V --> BC2["Bias-corrected v-hat"]
  BC1 --> UPDATE["Update:
step = lr * m-hat / sqrt(v-hat)"]
  BC2 --> UPDATE

The first moment tracks which direction to go. The second moment tracks how big recent gradients have been, scaling the step accordingly.

flowchart LR
  SGD["SGD
(basic gradient step)"]
  MOM["+ Momentum
(velocity from past gradients)"]
  ADA["+ Adaptive LR
(RMSProp: per-param scaling)"]
  ADAM["= Adam
(both combined +
bias correction)"]
  SGD --> MOM --> ADA --> ADAM

AdaGrad

Before RMSProp, there was AdaGrad. It accumulates all past squared gradients from the start of training:

$$s_t = s_{t-1} + g_t^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha \, g_t}{\sqrt{s_t} + \epsilon}$$

Since $s_t$ only grows, the effective learning rate monotonically decreases. This is good for sparse gradients (like word embeddings in NLP) because rare features get larger updates. But for deep networks, the learning rate eventually shrinks to near zero and training stalls. RMSProp fixes this by using an exponential moving average instead of a running sum, so old gradients gradually fade out.
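The stall is easy to see numerically. In this sketch the gradient is a constant 1.0, so the accumulator grows by 1 per step and the effective rate falls as $\alpha/\sqrt{t}$:

```python
import math

alpha, eps = 0.1, 1e-8
s = 0.0
effective = []
for t in range(1, 10001):
    g = 1.0                  # constant gradient for illustration
    s += g ** 2              # AdaGrad accumulator only ever grows
    effective.append(alpha / (math.sqrt(s) + eps))

# ~0.1 at step 1, ~0.01 at step 100, ~0.001 at step 10000 -- updates stall
print(effective[0], effective[99], effective[9999])
```

After 10,000 steps the effective rate is 100× smaller than at the start, even though nothing about the loss surface changed.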

Optimizer comparison

Loss vs epoch for four optimizers. Adam converges fastest, followed by RMSProp, then Momentum, with vanilla SGD the slowest.

| Name | Update rule (brief) | Extra memory | Key hyperparams | Best for | Known issue |
| --- | --- | --- | --- | --- | --- |
| SGD | $\theta - \alpha g$ | None | $\alpha$ | Simple models, convex problems | Slow on ill-conditioned surfaces |
| SGD + Momentum | $\theta - \alpha v_t$ | $v$ (same size as $\theta$) | $\alpha$, $\beta$ | CNNs with tuned schedule | Requires careful LR tuning |
| AdaGrad | $\theta - \alpha g / \sqrt{s_t}$ | $s$ (same size as $\theta$) | $\alpha$ | Sparse gradients (NLP embeddings) | LR goes to zero over time |
| RMSProp | $\theta - \alpha g / \sqrt{s_t}$ | $s$ (same size as $\theta$) | $\alpha$, $\beta$ | RNNs, non-stationary objectives | No bias correction |
| Adam | $\theta - \alpha \hat{m}_t / \sqrt{\hat{v}_t}$ | $m$, $v$ (2x size of $\theta$) | $\alpha$, $\beta_1$, $\beta_2$ | Default choice for most tasks | Can generalize worse than SGD+momentum |

Convergence paths: SGD vs momentum vs Adam

graph TD
  START["Start"] --> SGD_PATH["SGD:
zig-zag path,
many steps,
oscillates in narrow valleys"]
  START --> MOM_PATH["Momentum:
smoother curve,
fewer steps,
builds speed in
consistent directions"]
  START --> ADAM_PATH["Adam:
nearly direct path,
adapts per parameter,
fewest steps"]
  SGD_PATH --> GOAL["Minimum"]
  MOM_PATH --> GOAL
  ADAM_PATH --> GOAL

Weight decay and L2 regularization

L2 regularization adds a penalty $\frac{\lambda}{2} \|\theta\|^2$ to the loss. The gradient of this penalty is $\lambda \theta$, so the SGD update becomes:

$$\theta_t = \theta_{t-1} - \alpha (g_t + \lambda \theta_{t-1}) = (1 - \alpha \lambda) \theta_{t-1} - \alpha g_t$$

The term $(1 - \alpha \lambda)$ shrinks the weights every step. For plain SGD, L2 regularization and weight decay are mathematically equivalent.

For Adam, they differ. L2 regularization adds $\lambda \theta$ to the gradient before the adaptive scaling. This means the regularization strength varies per parameter because of the $\sqrt{\hat{v}_t}$ denominator. Weight decay, in contrast, subtracts $\alpha \lambda \theta$ after the adaptive update:

$$\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

This decoupled version is called AdamW. In practice, AdamW generalizes better than Adam with L2 regularization, especially for large models, because decoupled decay shrinks every weight by the same factor instead of a factor distorted by the adaptive denominator.
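The difference is where the decay term enters the update. A scalar sketch of both variants (illustrative names, not a framework API):

```python
import math

def adam_l2_step(theta, g, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """L2 regularization: penalty enters the gradient, so it gets adaptively rescaled."""
    g = g + wd * theta                 # decay term added *before* the adaptive scaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: decay applied *after* the adaptive update, uniformly for all weights."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta), m, v

# One step from theta = 2.0 with g = 4.0
no_wd_l2, _, _ = adam_l2_step(2.0, 4.0, 0.0, 0.0, 1, lr=0.01, wd=0.0)
no_wd_w, _, _ = adamw_step(2.0, 4.0, 0.0, 0.0, 1, lr=0.01, wd=0.0)
with_wd_l2, _, _ = adam_l2_step(2.0, 4.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
with_wd_w, _, _ = adamw_step(2.0, 4.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
print(no_wd_l2 == no_wd_w, with_wd_l2 == with_wd_w)  # → True False
```

With `wd = 0` the two are identical; with `wd > 0` they diverge because the L2 term is divided by $\sqrt{\hat{v}_t}$ in the first variant but not in the second.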

Gradient clipping

When gradients explode, a single update can ruin the model. This is common in RNNs and transformers where long sequences cause the gradient norm to grow exponentially through the chain rule.

Clip by value. Clamp each element of the gradient independently:

$$g_i' = \max(-c, \min(c, g_i))$$

This is simple but changes the direction of the gradient vector.

Clip by global norm. Compute the norm of the entire gradient vector. If it exceeds a threshold $c$, scale the whole vector down:

$$g' = \begin{cases} g & \text{if } \|g\| \leq c \\ \frac{c}{\|g\|} \, g & \text{if } \|g\| > c \end{cases}$$

Clipping by norm preserves the gradient direction, which is usually what you want. A typical threshold is 1.0 or 5.0, depending on the model.
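A plain-Python sketch of clipping by global norm (deep learning frameworks ship their own utilities for this; the name here is illustrative):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads                    # within threshold: leave untouched
    scale = max_norm / norm             # one scale factor for every element
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0, scaled down to 1.0
print([round(g, 6) for g in clipped])  # → [0.6, 0.8]
```

Because every element is multiplied by the same scale factor, the clipped vector points in exactly the same direction as the original.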

Learning rate warmup and cosine decay

The learning rate schedule controls how $\alpha$ changes during training. Two widely used techniques are warmup and cosine decay.

Warmup. Start with a very small learning rate and linearly increase it to the peak value over a fixed number of warmup steps. This prevents early instability when the model weights are still random and gradients are noisy. Warmup is especially important for transformers and large batch training.

During warmup (step $s \leq T_w$):

$$\alpha_s = \alpha_{\text{peak}} \cdot \frac{s}{T_w}$$

Cosine decay. After warmup, decrease the learning rate following a cosine curve. This gives a slow start to the decay, a faster middle phase, and a gentle landing at zero.

During decay (step $s > T_w$, total steps $T$):

$$\alpha_s = \alpha_{\text{peak}} \cdot \frac{1}{2} \left(1 + \cos\left(\pi \cdot \frac{s - T_w}{T - T_w}\right)\right)$$
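The two phases combine into a single function of the step count. A minimal sketch (function and parameter names are illustrative):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step <= warmup_steps:
        return peak_lr * (step / warmup_steps)                 # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

# Peak at the end of warmup, zero at the end of training
print(warmup_cosine_lr(100, 1000, 100, 0.001))   # → 0.001
print(warmup_cosine_lr(1000, 1000, 100, 0.001))  # → 0.0
```

In a training loop, this value would be recomputed each step and written into the optimizer's learning rate before the update.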

Learning rate schedule comparison

| Name | Formula / description | Use case | Pros | Cons |
| --- | --- | --- | --- | --- |
| Constant | $\alpha_s = \alpha_0$ | Debugging, small experiments | Simple | Rarely optimal for full training |
| Step decay | Multiply $\alpha$ by 0.1 every $k$ epochs | CNNs (ResNet-style training) | Easy to implement | Requires manual milestone tuning |
| Cosine annealing | $\alpha_s = \frac{\alpha_0}{2}(1 + \cos(\pi s / T))$ | Transformers, modern CNNs | Smooth, no milestones to set | Needs total step count in advance |
| Warmup + cosine | Linear ramp then cosine decay | Transformers, large batch | Prevents early instability | Two extra hyperparams ($T_w$, $T$) |
| Reduce on plateau | Cut $\alpha$ when val loss stalls | General fine-tuning | Adaptive to training dynamics | Can react too slowly |

Learning rate warmup and cosine decay

graph LR
  A["Step 0
lr = 0"] -->|"Linear ramp"| B["Warmup end
lr = peak"]
  B -->|"Slow decay"| C["Mid training
lr = 0.5 * peak"]
  C -->|"Faster decay"| D["Late training
lr = 0.1 * peak"]
  D -->|"Gentle landing"| E["End
lr near 0"]

Warmup prevents destructive early updates when Adam’s moment estimates are unreliable. Cosine decay avoids the sudden drops of step-decay schedules and gives a smooth landing.

Example 1: three steps of Adam on $f(x) = x^2$

We minimize $f(x) = x^2$ starting at $x_0 = 2.0$. Hyperparameters: $\alpha = 0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Initialize $m_0 = 0$, $v_0 = 0$.

The gradient of $f(x) = x^2$ is $g = 2x$.

Step 1 ($t = 1$, $x_0 = 2.0$):

$$g_1 = 2 \times 2.0 = 4.0$$
$$m_1 = 0.9 \times 0 + 0.1 \times 4.0 = 0.4000$$
$$v_1 = 0.999 \times 0 + 0.001 \times 16.0 = 0.0160$$

Bias correction ($\beta_1^1 = 0.9$, $\beta_2^1 = 0.999$):

$$\hat{m}_1 = \frac{0.4000}{1 - 0.9} = \frac{0.4000}{0.1} = 4.0000$$
$$\hat{v}_1 = \frac{0.0160}{1 - 0.999} = \frac{0.0160}{0.001} = 16.0000$$
$$x_1 = 2.0 - 0.01 \times \frac{4.0000}{\sqrt{16.0000} + 10^{-8}} = 2.0 - 0.01 \times \frac{4.0000}{4.0000} = 2.0 - 0.01 = 1.9900$$

Step 2 ($t = 2$, $x_1 = 1.9900$):

$$g_2 = 2 \times 1.9900 = 3.9800$$
$$m_2 = 0.9 \times 0.4000 + 0.1 \times 3.9800 = 0.3600 + 0.3980 = 0.7580$$
$$v_2 = 0.999 \times 0.0160 + 0.001 \times 15.8404 = 0.015984 + 0.015840 = 0.031824$$

Bias correction ($\beta_1^2 = 0.81$, $\beta_2^2 = 0.998001$):

$$\hat{m}_2 = \frac{0.7580}{1 - 0.81} = \frac{0.7580}{0.19} = 3.9895$$
$$\hat{v}_2 = \frac{0.031824}{1 - 0.998001} = \frac{0.031824}{0.001999} = 15.9200$$
$$x_2 = 1.9900 - 0.01 \times \frac{3.9895}{\sqrt{15.9200} + 10^{-8}} = 1.9900 - 0.01 \times \frac{3.9895}{3.9900} = 1.9900 - 0.01 \times 0.9999 = 1.9800$$

Step 3 ($t = 3$, $x_2 = 1.9800$):

$$g_3 = 2 \times 1.9800 = 3.9600$$
$$m_3 = 0.9 \times 0.7580 + 0.1 \times 3.9600 = 0.6822 + 0.3960 = 1.0782$$
$$v_3 = 0.999 \times 0.031824 + 0.001 \times 15.6816 = 0.031792 + 0.015682 = 0.047474$$

Bias correction ($\beta_1^3 = 0.729$, $\beta_2^3 = 0.997003$):

$$\hat{m}_3 = \frac{1.0782}{1 - 0.729} = \frac{1.0782}{0.271} = 3.9786$$
$$\hat{v}_3 = \frac{0.047474}{1 - 0.997003} = \frac{0.047474}{0.002997} = 15.8406$$
$$x_3 = 1.9800 - 0.01 \times \frac{3.9786}{\sqrt{15.8406} + 10^{-8}} = 1.9800 - 0.01 \times \frac{3.9786}{3.9800} = 1.9800 - 0.01 \times 0.9996 = 1.9700$$

Notice the pattern: Adam takes steps of almost exactly 0.01 regardless of the gradient magnitude. The bias-corrected second moment $\hat{v}_t$ is close to $g_t^2$, so $\hat{m}_t / \sqrt{\hat{v}_t} \approx \pm 1$. The effective step size is governed by $\alpha$, not the gradient scale. This is a key property of Adam.
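The three steps can also be checked in code. This sketch just replays the same arithmetic:

```python
import math

b1, b2, lr, eps = 0.9, 0.999, 0.01, 1e-8
x, m, v = 2.0, 0.0, 0.0
for t in range(1, 4):
    g = 2 * x                          # gradient of f(x) = x^2
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    x -= step
    print(f"t={t}: x = {x:.4f}, step = {step:.6f}")
```

Each printed step is within a fraction of a percent of 0.01, confirming that the effective step size is set by $\alpha$, not by the gradient.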

Example 2: gradient clipping by global norm

Suppose we have a gradient vector $g = [6, -8, 3, -4, 2]$ and a clipping threshold $c = 5.0$.

Step 1: compute the global norm.

$$\|g\| = \sqrt{6^2 + (-8)^2 + 3^2 + (-4)^2 + 2^2} = \sqrt{36 + 64 + 9 + 16 + 4} = \sqrt{129} \approx 11.3578$$

Step 2: check against the threshold.

Since $11.3578 > 5.0$, we need to clip.

Step 3: compute the scale factor.

$$\text{scale} = \frac{c}{\|g\|} = \frac{5.0}{11.3578} \approx 0.4402$$

Step 4: multiply each element by the scale factor.

$$g' = 0.4402 \times [6, -8, 3, -4, 2] = [2.6414, -3.5218, 1.3207, -1.7609, 0.8805]$$

Verify: the norm of the clipped gradient should be 5.0:

$$\|g'\| = \sqrt{2.6414^2 + (-3.5218)^2 + 1.3207^2 + (-1.7609)^2 + 0.8805^2} = \sqrt{6.977 + 12.403 + 1.744 + 3.101 + 0.775} = \sqrt{25.0} = 5.0 \;\checkmark$$

The direction is preserved. Every element was scaled by the same factor, so the gradient still points the same way. Only the magnitude changed.
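The same numbers, replayed in code as a quick sanity check:

```python
import math

g = [6.0, -8.0, 3.0, -4.0, 2.0]
c = 5.0

norm = math.sqrt(sum(x * x for x in g))    # sqrt(129)
scale = c / norm if norm > c else 1.0      # one factor for every element
clipped = [x * scale for x in g]

print(round(norm, 4), round(scale, 4))     # → 11.3578 0.4402
print(round(math.sqrt(sum(x * x for x in clipped)), 4))  # → 5.0
```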

Example 3: learning rate with warmup and cosine decay

Total training steps $T = 1000$. Warmup steps $T_w = 100$. Peak learning rate $\alpha_{\text{peak}} = 0.001$.

During warmup ($s \leq 100$): $\alpha_s = 0.001 \times \frac{s}{100}$

During decay ($s > 100$): $\alpha_s = 0.001 \times \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{s - 100}{900}\right)\right)$

| Step $s$ | Phase | Calculation | $\alpha_s$ |
| --- | --- | --- | --- |
| 0 | Warmup | $0.001 \times 0/100$ | 0.000000 |
| 50 | Warmup | $0.001 \times 50/100$ | 0.000500 |
| 100 | Warmup (peak) | $0.001 \times 100/100$ | 0.001000 |
| 300 | Cosine decay | $0.001 \times 0.5 \times (1 + \cos(\pi \times 200/900)) = 0.001 \times 0.5 \times 1.7660$ | 0.000883 |
| 700 | Cosine decay | $0.001 \times 0.5 \times (1 + \cos(\pi \times 600/900)) = 0.001 \times 0.5 \times 0.5000$ | 0.000250 |
| 1000 | Cosine decay (end) | $0.001 \times 0.5 \times (1 + \cos(\pi)) = 0$ | 0.000000 |

The schedule ramps up quickly, holds near the peak for a while, then smoothly decays to zero. At step 300 (early in decay) the rate is still 88% of peak. By step 700, it has dropped to 25%. This smooth curve avoids the sudden drops of step decay schedules.

Practical advice: when to use what

Start with Adam (or AdamW). It works well out of the box for most tasks. Set $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. Add weight decay of 0.01. This is the baseline.

Consider SGD + momentum for CNNs. Research shows SGD with momentum can generalize better than Adam on image classification, but you need to tune the learning rate and schedule carefully. If you have the compute budget for hyperparameter search, try it.

Use gradient clipping for RNNs and transformers. Set the clipping threshold to 1.0 and adjust if you see training instability. For transformers, clipping by global norm is standard.

Use learning rate warmup for transformers. Without warmup, the initial updates can be large and destabilizing because the Adam second moment estimates are unreliable when $t$ is small. Typical warmup is 1% to 10% of total training steps.

Use weight decay always. Even a small amount (0.01 or 0.1) acts as regularization and usually improves generalization.

Batch normalization normalizes activations within each layer, which smooths the loss surface and allows higher learning rates. That optimization effect is why it appears here; its regularization effects are discussed in regularization for deep networks.

Dropout is not an optimizer, but it interacts with optimization. It adds noise to training, which can slow convergence. When using dropout, you may need a slightly higher learning rate.

What comes next

You now have the tools to train deep networks efficiently. But training fast is only half the problem. Deep networks overfit easily, especially when the model is large relative to the dataset. The next article, regularization for deep networks, covers dropout, batch normalization as a regularizer, data augmentation, early stopping, and other techniques that keep your model from memorizing the training set.
