Training neural networks: a practical guide
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Make sure you understand:
- Forward pass and backpropagation: how gradients flow through a network
- SGD and its variants: the optimizers that use those gradients to update weights
The big picture
Before you train, you need to set initial weights, pick a learning rate, and decide when to stop. Get any of these wrong and the network either fails to learn or blows up.
| Decision | Bad choice | What happens | Good choice |
|---|---|---|---|
| Weight init | All zeros | Every neuron computes the same thing forever | Small random (Xavier or He) |
| Weight init | Too large | Activations saturate, gradients explode | Scale based on layer size |
| Weight init | Too small | Signals shrink to zero across layers | Scale based on layer size |
| Learning rate | Too high | Loss oscillates or diverges | Start at 0.001, tune from there |
| Learning rate | Too low | Training crawls, gets stuck | Use a schedule: start high, decay |
| Stopping | Too early | Underfitting | Monitor validation loss |
| Stopping | Too late | Overfitting | Early stopping with patience |
The full training loop in plain English
graph TD
INIT["Initialize weights
(Xavier or He)"] --> SAMPLE["Pick a mini-batch"]
SAMPLE --> FWD["Forward pass:
compute prediction"]
FWD --> LOSS["Compute loss:
how wrong?"]
LOSS --> BACK["Backward pass:
compute gradients"]
BACK --> UPDATE["Update weights:
w = w - lr * gradient"]
UPDATE --> MORE{"More batches
in this epoch?"}
MORE -->|"Yes"| SAMPLE
MORE -->|"No"| EVAL["Evaluate on
validation set"]
EVAL --> DONE{"Converged?"}
DONE -->|"No"| SAMPLE
DONE -->|"Yes"| STOP["Done"]
Now let’s formalize each decision.
Weight initialization
Before training starts, you need to set the initial weights. This choice matters more than most people think.
Why zero initialization fails
If you set all weights to zero, every neuron in a layer computes the exact same output. During backpropagation, they all receive the exact same gradient. They update identically. They stay identical forever. The network can never break this symmetry, so it behaves as if each layer has just one neuron. This is called the symmetry problem.
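The symmetry problem is easy to see in a minimal NumPy sketch (the layer sizes and the constant 0.5 are illustrative, not from a real network): when every weight starts at the same value, every hidden neuron computes the same output and receives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
t = rng.normal(size=(4, 1))          # regression targets

W1 = np.full((3, 5), 0.5)            # every weight identical: symmetric init
W2 = np.full((5, 1), 0.5)

# Forward pass: all 5 hidden neurons compute exactly the same value
h = np.tanh(x @ W1)
y = h @ W2

# Backward pass for a squared-error loss
dy = y - t
dW2 = h.T @ dy
dh = dy @ W2.T
dW1 = x.T @ (dh * (1 - h ** 2))

# Every column of dW1 is identical, so every neuron gets the same
# update and the neurons stay interchangeable forever.
print(np.allclose(dW1, dW1[:, :1]))  # True
```

Replacing `np.full` with small random values breaks the tie: each neuron starts different, gets a different gradient, and can specialize.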
The fix: initialize with small random values. But the scale matters.
Xavier (Glorot) initialization
Designed for sigmoid and tanh activations. The idea: keep the variance of activations roughly the same across layers. If variance grows, activations saturate. If it shrinks, signals die.
For a uniform distribution: W ~ U[-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))]
For a normal distribution: W ~ N(0, 2 / (n_in + n_out))
He initialization
Designed for ReLU activations. ReLU zeros out roughly half the neurons, so we need to compensate with larger initial weights.
For a uniform distribution: W ~ U[-sqrt(6 / n_in), +sqrt(6 / n_in)]
For a normal distribution: W ~ N(0, 2 / n_in)
Initialization methods comparison
| Method | Variance formula | Designed for | Key assumption |
|---|---|---|---|
| Zero init | 0 | Nothing (do not use) | None; breaks symmetry |
| Random small | Small fixed constant | Shallow networks | Ad hoc; no theoretical basis |
| Xavier / Glorot | 2 / (n_in + n_out) | Sigmoid, tanh | Linear activation near zero |
| He / Kaiming | 2 / n_in | ReLU, Leaky ReLU | Half of neurons are zeroed |
Rule of thumb: use He init for ReLU networks (which is most modern networks). Use Xavier for sigmoid or tanh.
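Both schemes are a couple of lines of NumPy each. This is a sketch (the function names and the 512-in / 256-out layer sizes are just for illustration):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # Var(W) = 2 / (n_in + n_out); the matching uniform bound is sqrt(6 / (n_in + n_out))
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    # Var(W) = 2 / n_in, i.e. std = sqrt(2 / n_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W_x = xavier_uniform(512, 256, rng)
W_h = he_normal(512, 256, rng)
print(W_x.std(), W_h.std())  # roughly 0.051 and 0.0625
```

The empirical standard deviations land close to the theoretical values, and He's is larger, reflecting the extra scale needed to compensate for ReLU.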
When to use which initialization
graph TD
Q{"What activation
function?"}
Q -->|"Sigmoid or tanh"| XAVIER["Xavier / Glorot
Var = 2 / (n_in + n_out)"]
Q -->|"ReLU or Leaky ReLU"| HE["He / Kaiming
Var = 2 / n_in"]
XAVIER --> R1["Assumes linear regime
near zero"]
HE --> R2["Compensates for ReLU
zeroing half the neurons"]
The training loop
Training loss vs validation loss over 50 epochs. Validation loss starts increasing after epoch 25, signaling overfitting.
The training loop is the heartbeat of deep learning. Every iteration does the same four steps:
graph TD
A["Sample mini-batch"] --> B["Forward pass: compute ŷ"]
B --> C["Compute loss L(ŷ, y)"]
C --> D["Backward pass: compute gradients"]
D --> E["Update weights: w ← w - η∇L"]
E --> F{"More batches?"}
F -->|Yes| A
F -->|No| G["End of epoch"]
G --> H{"Converged?"}
H -->|No| A
H -->|Yes| I["Done"]
One pass through the entire training set is called an epoch. You typically train for many epochs. Within each epoch, you iterate over mini-batches.
Mini-batch vs full batch vs SGD
| Mode | Batch size | Gradient quality | Memory | Speed |
|---|---|---|---|---|
| Full batch GD | Entire dataset | Exact | Very high | Slow per update |
| Mini-batch GD | 32, 64, 128, 256 | Noisy but useful | Moderate | Best tradeoff |
| Pure SGD | 1 sample | Very noisy | Minimal | Fast but erratic |
Mini-batch gradient descent is the standard. Batch sizes of 32 to 256 work well for most problems. The noise in mini-batch gradients actually helps escape shallow local minima, acting as a form of implicit regularization.
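The whole loop fits in a short NumPy sketch if we use a linear model with MSE loss to keep it self-contained (the synthetic data, batch size of 32, and learning rate are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # training inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # targets with a little noise

w = 0.01 * rng.normal(size=3)                   # small random init
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]       # sample a mini-batch
        pred = X[b] @ w                         # forward pass
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)  # gradient of MSE
        w -= lr * grad                          # update step

print(np.round(w, 2))  # close to [1.0, -2.0, 0.5]
```

Swapping the linear model for a neural network changes only the forward and backward passes; the sample-update-repeat structure stays the same.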
Learning rate
The learning rate controls how big each weight update is. It is the single most important hyperparameter in training.
Too large: the optimizer overshoots the minimum, bouncing back and forth or diverging entirely.
Too small: the optimizer creeps toward the minimum so slowly that training takes forever, and it may get stuck in a poor local minimum.
Just right: the loss decreases steadily, then levels off near a good solution.
A common starting point: try 0.001 with Adam, or 0.01 with plain SGD. Then adjust based on the loss curve. If the loss oscillates wildly, reduce the learning rate. If it plateaus too early, try increasing it or using a schedule.
Learning rate schedules
Learning rate schedule: linear warmup for 5 epochs followed by cosine decay.
A fixed learning rate rarely works best. You want to start large (for fast progress) and reduce it over time (for fine-grained convergence). Schedules automate this.
| Schedule | Formula | When to use |
|---|---|---|
| Step decay | lr_t = lr_0 * gamma^floor(t / k) | Simple baseline; drop by a factor gamma every k epochs |
| Cosine annealing | lr_t = 0.5 * lr_0 * (1 + cos(pi * t / T)) | Smooth decay; popular in vision tasks |
| Warmup + decay | Linear ramp from 0 to lr_peak over the warmup steps, then decay | Transformer training; stabilizes early steps |
Warmup is especially important for large models. At the start, the random weights produce large, unreliable gradients. A small learning rate during warmup prevents destructive early updates. After the model stabilizes, the rate ramps up to full speed.
Warmup then cosine decay
graph LR
A["Step 0
lr = 0"] -->|"Linear warmup"| B["Warmup end
lr = peak"]
B -->|"Cosine decay"| C["Mid training
lr = 0.5 * peak"]
C -->|"Cosine decay"| D["End of training
lr near 0"]
Warmup prevents destructive updates when weights are still random. Cosine decay gives smooth convergence as the model approaches a good solution.
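The warmup-then-cosine schedule is only a few lines of Python (the step counts and peak rate below are illustrative):

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0, 10_000, 500, 1e-3))       # 0.0 at the very first step
print(lr_at(500, 10_000, 500, 1e-3))     # 0.001, the peak after warmup
print(lr_at(10_000, 10_000, 500, 1e-3))  # near 0 at the end of training
```

In practice you would call `lr_at` once per optimizer step and assign the result to the optimizer's learning rate before the update.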
Gradient clipping
Sometimes gradients explode, especially in recurrent networks or very deep models. A single enormous gradient can destroy the weights in one update. Gradient clipping caps the gradient norm before the update.
There are two flavors:
- Clip by value: cap each gradient element to the range [-c, c]. Simple, but it can change the gradient direction.
- Clip by norm: if ‖g‖ > τ, rescale g ← g · (τ / ‖g‖). Preserves direction; this is the standard approach.
graph TD
A["Compute gradient g"] --> B{"‖g‖ > threshold τ?"}
B -->|Yes| C["Rescale: g ← g × (τ / ‖g‖)"]
B -->|No| D["Keep g unchanged"]
C --> E["Update weights with g"]
D --> E
A typical threshold is τ = 1.0 or τ = 5.0. You can monitor gradient norms during training to pick a good value.
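Clip-by-norm is a one-branch function; here is a NumPy sketch (the function name and the sample gradient are illustrative):

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so its L2 norm never exceeds tau; direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)
    return g

g = np.array([3.0, 4.0])          # norm 5.0, above the threshold
clipped = clip_by_norm(g, 3.0)
print(clipped, np.linalg.norm(clipped))  # same direction, norm capped at 3.0
```

Gradients already under the threshold pass through unchanged, so the clip only fires on the outlier updates it is meant to catch.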
Batch normalization: normalize, scale, shift
graph LR
IN["Activations from previous layer"] --> NORM["Normalize: subtract mean, divide by std"]
NORM --> SCALE["Scale by learned gamma"]
SCALE --> SHIFT["Shift by learned beta"]
SHIFT --> OUT["Stable activations for next layer"]
Batch normalization forces each layer’s activations to have zero mean and unit variance, then lets the network learn the optimal scale and shift. This smooths the loss surface and allows higher learning rates.
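The forward pass of batch normalization is a direct translation of the diagram. A minimal NumPy sketch of training-time behavior (inference uses running statistics instead, and the batch shape here is illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the batch,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # shifted, scaled activations
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(6))  # ~0 for every feature
```

With gamma = 1 and beta = 0 the output is purely normalized; during training the network learns gamma and beta, so it can undo the normalization wherever that helps.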
Practical checklist
Before you start training, check these:
- ✓ Overfit a single batch first. If the model cannot memorize 1 batch, something is broken.
- ✓ Use He init for ReLU, Xavier for sigmoid/tanh.
- ✓ Start with Adam at lr=0.001. Switch to SGD with momentum later if needed.
- ✓ Watch the loss curve. It should decrease smoothly. Spikes mean the learning rate is too high.
- ✓ Enable gradient clipping (norm = 1.0) if training is unstable.
- ✓ Normalize your inputs. Zero mean, unit variance. This helps every layer downstream.
- ✓ Use batch normalization or layer normalization for deep networks.
- ✓ Start simple. Train a small model first. Scale up once you know the pipeline works.
Example 1: Xavier vs He initialization
Compute the initialization bounds for a layer with n_in = 512 and n_out = 256.
Xavier (Glorot) initialization
Variance: Var(W) = 2 / (n_in + n_out) = 2 / (512 + 256) = 2 / 768 ≈ 0.00260
Uniform bounds: sqrt(6 / 768) ≈ 0.0884
So weights are drawn from U[-0.0884, 0.0884].
Normal distribution: W ~ N(0, 0.00260), where std = sqrt(2 / 768) ≈ 0.0510.
He initialization
Variance: Var(W) = 2 / n_in = 2 / 512 ≈ 0.00391
Uniform bounds: sqrt(6 / 512) ≈ 0.1082
So weights are drawn from U[-0.1082, 0.1082].
Normal distribution: W ~ N(0, 0.00391), where std = sqrt(2 / 512) = 0.0625.
He init gives wider bounds (0.1082 vs 0.0884) and larger std (0.0625 vs 0.0510). This compensates for ReLU zeroing out about half the neurons. Without this extra scale, activations would shrink layer by layer, making deep ReLU networks hard to train.
Example 2: Learning rate too large vs too small
Minimize f(x) = x² using gradient descent. The gradient is f'(x) = 2x. Start at x_0 = 1.
With lr = 2 (too large)
The values diverge: 1 → -3 → 9 → -27. Each step takes us further from the minimum at x = 0. The learning rate is so large that the optimizer overshoots and bounces wildly.
With lr = 0.1 (good)
The values converge smoothly: 1 → 0.8 → 0.64 → 0.512. Each step makes steady progress toward the minimum.
The function values tell the story even more clearly:
| Step | x (lr = 2) | f(x) (lr = 2) | x (lr = 0.1) | f(x) (lr = 0.1) |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | -3.0 | 9.0 | 0.8 | 0.64 |
| 2 | 9.0 | 81.0 | 0.64 | 0.41 |
| 3 | -27.0 | 729.0 | 0.512 | 0.26 |
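Both columns of the table can be reproduced in a few lines, assuming f(x) = x² with learning rates 2 and 0.1 (rates inferred from the sequences; the function name is arbitrary):

```python
def gradient_descent(lr, steps=3, x=1.0):
    xs = [x]
    for _ in range(steps):
        x = x - lr * 2 * x   # f(x) = x^2, so the gradient is 2x
        xs.append(x)
    return xs

print(gradient_descent(2.0))   # [1.0, -3.0, 9.0, -27.0]: diverges
print(gradient_descent(0.1))   # approximately [1.0, 0.8, 0.64, 0.512]: converges
```

For this quadratic the update is x ← (1 − 2·lr)·x, so any lr above 1 flips the sign and grows the magnitude every step, while any lr below 1 contracts toward zero.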
Example 3: Gradient clipping by norm
Given: a gradient vector g with L2 norm ‖g‖ = 5.788, and clipping threshold τ = 3.0.
Step 1: Compute the gradient norm: ‖g‖ = sqrt(g_1² + g_2² + ... + g_n²) = 5.788.
Step 2: Compare to the threshold: 5.788 > 3.0.
The norm exceeds the threshold, so we need to clip.
Step 3: Compute the scaling factor: τ / ‖g‖ = 3.0 / 5.788 ≈ 0.5183.
Step 4: Rescale the gradient: g_clipped = 0.5183 · g.
Verify: ‖g_clipped‖ = 0.5183 × 5.788 ≈ 3.0 ✓
The gradient direction is preserved, but its magnitude is capped at 3.0. Without clipping, a gradient of norm 5.788 might cause a weight update nearly twice as large as intended, potentially destabilizing training.
What comes next
You now have the tools to train a basic neural network: proper initialization, a well-tuned learning rate, and gradient clipping as a safety net. The fully connected networks we have discussed so far treat every input feature independently.
But images, for example, have spatial structure. A pixel’s neighbors matter. The next article introduces convolutional neural networks, which exploit this structure by sharing weights across spatial positions, dramatically reducing parameters and improving performance on visual tasks.