
Training neural networks: a practical guide

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Make sure you understand:

  1. Neural networks: the basic building block (part 1)
  2. Forward pass and backpropagation (part 2)

The big picture

Before you train, you need to set initial weights, pick a learning rate, and decide when to stop. Get any of these wrong and the network either fails to learn or blows up.

| Decision | Bad choice | What happens | Good choice |
|---|---|---|---|
| Weight init | All zeros | Every neuron computes the same thing forever | Small random (Xavier or He) |
| Weight init | Too large | Activations saturate, gradients explode | Scale based on layer size |
| Weight init | Too small | Signals shrink to zero across layers | Scale based on layer size |
| Learning rate | Too high | Loss oscillates or diverges | Start at 0.001, tune from there |
| Learning rate | Too low | Training crawls, gets stuck | Use a schedule: start high, decay |
| Stopping | Too early | Underfitting | Monitor validation loss |
| Stopping | Too late | Overfitting | Early stopping with patience |

The full training loop in plain English

graph TD
  INIT["Initialize weights
(Xavier or He)"] --> SAMPLE["Pick a mini-batch"]
  SAMPLE --> FWD["Forward pass:
compute prediction"]
  FWD --> LOSS["Compute loss:
how wrong?"]
  LOSS --> BACK["Backward pass:
compute gradients"]
  BACK --> UPDATE["Update weights:
w = w - lr * gradient"]
  UPDATE --> MORE{"More batches
in this epoch?"}
  MORE -->|"Yes"| SAMPLE
  MORE -->|"No"| EVAL["Evaluate on
validation set"]
  EVAL --> DONE{"Converged?"}
  DONE -->|"No"| SAMPLE
  DONE -->|"Yes"| STOP["Done"]

Now let’s formalize each decision.


Weight initialization

Before training starts, you need to set the initial weights. This choice matters more than most people think.

Why zero initialization fails

If you set all weights to zero, every neuron in a layer computes the exact same output. During backpropagation, they all receive the exact same gradient. They update identically. They stay identical forever. The network can never break this symmetry, so it behaves as if each layer has just one neuron. This is called the symmetry problem.

The fix: initialize with small random values. But the scale matters.

Xavier (Glorot) initialization

Designed for sigmoid and tanh activations. The idea: keep the variance of activations roughly the same across layers. If variance grows, activations saturate. If it shrinks, signals die.

$$\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$

He initialization

Designed for ReLU activations. ReLU zeros out roughly half the neurons, so we need to compensate with larger initial weights.

$$\text{Var}(w) = \frac{2}{n_{\text{in}}}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}}}},\; \sqrt{\frac{6}{n_{\text{in}}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}}}\right)$

Initialization methods comparison

| Method | Variance formula | Designed for | Key assumption |
|---|---|---|---|
| Zero init | $\text{Var} = 0$ | Nothing (do not use) | None; symmetry is never broken |
| Random small | $\text{Var} \approx 0.01$ | Shallow networks | Ad hoc; no theoretical basis |
| Xavier / Glorot | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ | Sigmoid, tanh | Activation is roughly linear near zero |
| He / Kaiming | $\frac{2}{n_{\text{in}}}$ | ReLU, Leaky ReLU | Half of the neurons are zeroed |

Rule of thumb: use He init for ReLU networks (which is most modern networks). Use Xavier for sigmoid or tanh.
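Both schemes take only a few lines of NumPy. This is an illustrative sketch (function names are ours; deep learning frameworks ship equivalent built-in initializers):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Xavier/Glorot uniform: bounds are +/- sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """He/Kaiming normal: zero mean, std = sqrt(2 / n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_tanh = xavier_uniform(512, 256, rng)  # for a tanh layer
W_relu = he_normal(512, 256, rng)       # for a ReLU layer
```

Note that He draws come out wider than Xavier draws for the same layer, exactly as the variance formulas predict.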

When to use which initialization

graph TD
  Q{"What activation
function?"}
  Q -->|"Sigmoid or tanh"| XAVIER["Xavier / Glorot
Var = 2 / (n_in + n_out)"]
  Q -->|"ReLU or Leaky ReLU"| HE["He / Kaiming
Var = 2 / n_in"]
  XAVIER --> R1["Assumes linear regime
near zero"]
  HE --> R2["Compensates for ReLU
zeroing half the neurons"]

The training loop

Training loss vs validation loss over 50 epochs. Validation loss starts increasing after epoch 25, signaling overfitting.

The training loop is the heartbeat of deep learning. Every iteration does the same four steps:

graph TD
  A["Sample mini-batch"] --> B["Forward pass: compute ŷ"]
  B --> C["Compute loss L(ŷ, y)"]
  C --> D["Backward pass: compute gradients"]
  D --> E["Update weights: w ← w - η∇L"]
  E --> F{"More batches?"}
  F -->|Yes| A
  F -->|No| G["End of epoch"]
  G --> H{"Converged?"}
  H -->|No| A
  H -->|Yes| I["Done"]

One pass through the entire training set is called an epoch. You typically train for many epochs. Within each epoch, you iterate over mini-batches.

Mini-batch vs full batch vs SGD

| Mode | Batch size | Gradient quality | Memory | Speed |
|---|---|---|---|---|
| Full batch GD | Entire dataset | Exact | Very high | Slow per update |
| Mini-batch GD | 32, 64, 128, 256 | Noisy but useful | Moderate | Best tradeoff |
| Pure SGD | 1 sample | Very noisy | Minimal | Fast but erratic |

Mini-batch gradient descent is the standard. Batch sizes of 32 to 256 work well for most problems. The noise in mini-batch gradients actually helps escape shallow local minima, acting as a form of implicit regularization.
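As a concrete sketch of the loop above, here is mini-batch gradient descent on a toy linear model with squared loss (all names, sizes, and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)  # targets with a little noise

w = rng.normal(0.0, 0.1, size=3)               # small random init
lr, batch_size = 0.1, 32

for epoch in range(20):                        # one epoch = one full pass
    perm = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]   # pick a mini-batch
        pred = X[idx] @ w                      # forward pass
        err = pred - y[idx]                    # how wrong?
        grad = 2 * X[idx].T @ err / len(idx)   # backward pass (MSE gradient)
        w -= lr * grad                         # update step
```

After a handful of epochs `w` lands very close to `true_w`, despite each update seeing only 32 noisy samples.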


Learning rate

The learning rate $\eta$ controls how big each weight update is. It is the single most important hyperparameter in training.

Too large: the optimizer overshoots the minimum, bouncing back and forth or diverging entirely.

Too small: the optimizer creeps toward the minimum so slowly that training takes forever, and it may get stuck in a poor local minimum.

Just right: the loss decreases steadily, then levels off near a good solution.

A common starting point: try $\eta = 0.001$ with Adam, or $\eta = 0.01$ with plain SGD. Then adjust based on the loss curve. If the loss oscillates wildly, reduce it. If it plateaus too early, try increasing it or using a schedule.


Learning rate schedules

Learning rate schedule: linear warmup for 5 epochs followed by cosine decay.

A fixed learning rate rarely works best. You want to start large (for fast progress) and reduce it over time (for fine-grained convergence). Schedules automate this.

| Schedule | Formula | When to use |
|---|---|---|
| Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$ | Simple baseline; drop by factor $\gamma$ every $s$ epochs |
| Cosine annealing | $\eta_t = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right)$ | Smooth decay; popular in vision tasks |
| Warmup + decay | Linear ramp from 0 to $\eta_0$ over $w$ steps, then decay | Transformer training; stabilizes early steps |

Warmup is especially important for large models. At the start, the randomly initialized weights produce large, unreliable gradients, so a small learning rate during warmup prevents destructive early updates. Once training stabilizes, the rate reaches its peak and the decay phase takes over.

Warmup then cosine decay

graph LR
  A["Step 0
lr = 0"] -->|"Linear warmup"| B["Warmup end
lr = peak"]
  B -->|"Cosine decay"| C["Mid training
lr = 0.5 * peak"]
  C -->|"Cosine decay"| D["End of training
lr near 0"]

Warmup prevents destructive updates when weights are still random. Cosine decay gives smooth convergence as the model approaches a good solution.
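The warmup-then-cosine shape is easy to express as a function of the step index. This is a sketch; real schedulers in deep learning frameworks add options such as a minimum learning rate floor:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(100)]
```

At the warmup midpoint of the decay phase the rate is exactly half of the peak, matching the diagram above.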


Gradient clipping

Sometimes gradients explode, especially in recurrent networks or very deep models. A single enormous gradient can destroy the weights in one update. Gradient clipping caps the gradient norm before the update.

There are two flavors:

  1. Clip by value: cap each gradient element to $[-\tau, \tau]$. Simple but can change the gradient direction.
  2. Clip by norm: if $\|\mathbf{g}\| > \tau$, rescale $\mathbf{g} \leftarrow \mathbf{g} \cdot \frac{\tau}{\|\mathbf{g}\|}$. Preserves direction; this is the standard approach.

graph TD
  A["Compute gradient g"] --> B{"‖g‖ > threshold τ?"}
  B -->|Yes| C["Rescale: g ← g × (τ / ‖g‖)"]
  B -->|No| D["Keep g unchanged"]
  C --> E["Update weights with g"]
  D --> E

A typical threshold is $\tau = 1.0$ or $\tau = 5.0$. You can monitor gradient norms during training to pick a good value.
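Clip-by-norm is only a few lines in NumPy (an illustrative sketch; frameworks provide built-in equivalents):

```python
import numpy as np

def clip_by_norm(g, tau):
    """If the gradient's L2 norm exceeds tau, rescale it to norm tau.
    The direction is preserved; only the magnitude changes."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * (tau / norm)
    return g

g = clip_by_norm(np.array([3.0, -4.0]), 1.0)  # norm 5.0 -> rescaled to norm 1.0
```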

Batch normalization: normalize, scale, shift

graph LR
  IN["Activations
from previous layer"] --> NORM["Normalize:
subtract mean,
divide by std"]
  NORM --> SCALE["Scale by
learned gamma"]
  SCALE --> SHIFT["Shift by
learned beta"]
  SHIFT --> OUT["Stable activations
for next layer"]

Batch normalization forces each layer’s activations to have zero mean and unit variance, then lets the network learn the optimal scale and shift. This smooths the loss surface and allows higher learning rates.
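The normalize-scale-shift pipeline looks like this for a fully connected layer. This is a training-mode sketch only; at inference time batch norm uses running averages of the batch statistics, which are omitted here:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features). Normalize each feature over the batch,
    then apply the learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
acts = 5.0 + 3.0 * rng.normal(size=(64, 8))            # shifted, spread-out activations
out = batchnorm_forward(acts, np.ones(8), np.zeros(8))  # gamma=1, beta=0
```

With `gamma = 1` and `beta = 0` the output has zero mean and unit variance per feature; during training the network is free to learn other values.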


Practical checklist

Before you start training, check these:

  1. Overfit a single batch first. If the model cannot memorize 1 batch, something is broken.
  2. Use He init for ReLU, Xavier for sigmoid/tanh.
  3. Start with Adam at lr=0.001. Switch to SGD with momentum later if needed.
  4. Watch the loss curve. It should decrease smoothly. Spikes mean the learning rate is too high.
  5. Enable gradient clipping (norm = 1.0) if training is unstable.
  6. Normalize your inputs. Zero mean, unit variance. This helps every layer downstream.
  7. Use batch normalization or layer normalization for deep networks.
  8. Start simple. Train a small model first. Scale up once you know the pipeline works.
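The "early stopping with patience" rule from the first table reduces to a small bookkeeping helper (an illustrative sketch):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        """Call once per epoch with the current validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0   # improvement: reset the counter
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
decisions = [stopper.should_stop(v) for v in [1.00, 0.90, 0.95, 0.96]]
# decisions == [False, False, False, True]: stop after two epochs without improvement
```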

Example 1: Xavier vs He initialization

Compute the initialization bounds for a layer with $n_{\text{in}} = 512$ and $n_{\text{out}} = 256$.

Xavier (Glorot) initialization

Variance:

$$\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}} = \frac{2}{512 + 256} = \frac{2}{768} \approx 0.00260$$

Uniform bounds:

$$\pm\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} = \pm\sqrt{\frac{6}{768}} = \pm\sqrt{0.00781} \approx \pm 0.0884$$

So weights are drawn from $U[-0.0884, 0.0884]$.

Normal distribution: $w \sim \mathcal{N}(0, 0.0510^2)$, where std $= \sqrt{0.00260} \approx 0.0510$.

He initialization

Variance:

$$\text{Var}(w) = \frac{2}{n_{\text{in}}} = \frac{2}{512} \approx 0.00391$$

Uniform bounds:

$$\pm\sqrt{\frac{6}{n_{\text{in}}}} = \pm\sqrt{\frac{6}{512}} = \pm\sqrt{0.01172} \approx \pm 0.1082$$

So weights are drawn from $U[-0.1082, 0.1082]$.

Normal distribution: $w \sim \mathcal{N}(0, 0.0625^2)$, where std $= \sqrt{0.00391} \approx 0.0625$.

He init gives wider bounds (0.1082 vs 0.0884) and larger std (0.0625 vs 0.0510). This compensates for ReLU zeroing out about half the neurons. Without this extra scale, activations would shrink layer by layer, making deep ReLU networks hard to train.
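These numbers are quick to check in NumPy:

```python
import numpy as np

xavier_bound = np.sqrt(6 / (512 + 256))  # Xavier uniform bound, ~0.0884
xavier_std = np.sqrt(2 / (512 + 256))    # Xavier normal std, ~0.0510
he_bound = np.sqrt(6 / 512)              # He uniform bound, ~0.1082
he_std = np.sqrt(2 / 512)                # He normal std, exactly 0.0625
```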


Example 2: Learning rate too large vs too small

Minimize $f(x) = x^2$ using gradient descent. The gradient is $f'(x) = 2x$. Start at $x_0 = 1$.

With $\eta = 2.0$ (too large)

$$x_1 = x_0 - \eta \cdot f'(x_0) = 1 - 2.0 \times 2(1) = 1 - 4 = -3$$

$$x_2 = -3 - 2.0 \times 2(-3) = -3 + 12 = 9$$

$$x_3 = 9 - 2.0 \times 2(9) = 9 - 36 = -27$$

The values are diverging: $1 \to -3 \to 9 \to -27$. Each step takes us further from the minimum at $x = 0$. The learning rate is so large that the optimizer overshoots and bounces wildly.

With $\eta = 0.1$ (good)

$$x_1 = 1 - 0.1 \times 2(1) = 1 - 0.2 = 0.8$$

$$x_2 = 0.8 - 0.1 \times 2(0.8) = 0.8 - 0.16 = 0.64$$

$$x_3 = 0.64 - 0.1 \times 2(0.64) = 0.64 - 0.128 = 0.512$$

The values converge smoothly: $1 \to 0.8 \to 0.64 \to 0.512 \to \cdots \to 0$. Each step makes steady progress toward the minimum.

The function values tell the story even more clearly:

| Step | $x$ with $\eta = 2.0$ | $f(x)$ | $x$ with $\eta = 0.1$ | $f(x)$ |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | -3.0 | 9.0 | 0.8 | 0.64 |
| 2 | 9.0 | 81.0 | 0.64 | 0.41 |
| 3 | -27.0 | 729.0 | 0.512 | 0.26 |
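The iterates above can be reproduced in a few lines (an illustrative sketch):

```python
def gd_steps(x0, lr, n_steps):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - lr * 2 * xs[-1])
    return xs

diverging = gd_steps(1.0, 2.0, 3)   # 1 -> -3 -> 9 -> -27: diverges
converging = gd_steps(1.0, 0.1, 3)  # 1 -> 0.8 -> 0.64 -> 0.512: converges
```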

Example 3: Gradient clipping by norm

Given: gradient vector $\mathbf{g} = [3.0, -4.0, 1.5, -2.5]$, threshold $\tau = 3.0$.

Step 1: Compute the gradient norm.

$$\|\mathbf{g}\| = \sqrt{3.0^2 + (-4.0)^2 + 1.5^2 + (-2.5)^2} = \sqrt{9.0 + 16.0 + 2.25 + 6.25} = \sqrt{33.5} \approx 5.788$$

Step 2: Compare to threshold.

$$\|\mathbf{g}\| = 5.788 > \tau = 3.0$$

The norm exceeds the threshold, so we need to clip.

Step 3: Compute scaling factor.

$$\text{scale} = \frac{\tau}{\|\mathbf{g}\|} = \frac{3.0}{5.788} \approx 0.5183$$

Step 4: Rescale the gradient.

$$\mathbf{g}_{\text{clipped}} = \mathbf{g} \times 0.5183 = [1.555,\; -2.073,\; 0.777,\; -1.296]$$

Verify: $\|\mathbf{g}_{\text{clipped}}\| = \sqrt{1.555^2 + 2.073^2 + 0.777^2 + 1.296^2} = \sqrt{2.418 + 4.297 + 0.604 + 1.680} = \sqrt{8.999} \approx 3.0$

The gradient direction is preserved, but its magnitude is capped at 3.0. Without clipping, a gradient of norm 5.788 might cause a weight update nearly twice as large as intended, potentially destabilizing training.
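The same arithmetic in NumPy:

```python
import numpy as np

g = np.array([3.0, -4.0, 1.5, -2.5])
tau = 3.0
norm = np.linalg.norm(g)                           # sqrt(33.5), about 5.788
g_clipped = g * (tau / norm) if norm > tau else g  # rescale only when too large
```

The rescaled vector has norm exactly `tau`, and dividing each vector by its own norm gives the same unit direction.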


What comes next

You now have the tools to train a basic neural network: proper initialization, a well-tuned learning rate, and gradient clipping as a safety net. The fully connected networks we have discussed so far treat every input feature independently.

But images, for example, have spatial structure. A pixel’s neighbors matter. The next article introduces convolutional neural networks, which exploit this structure by sharing weights across spatial positions, dramatically reducing parameters and improving performance on visual tasks.
