
Training neural networks: a practical guide

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Make sure you understand:

  1. Neural networks: the basic building block (part 1)
  2. Forward pass and backpropagation (part 2)

The big picture

Before you train, you need to set initial weights, pick a learning rate, and decide when to stop. Get any of these wrong and the network either fails to learn or blows up.

| Decision | Bad choice | What happens | Good choice |
|---|---|---|---|
| Weight init | All zeros | Every neuron computes the same thing forever | Small random (Xavier or He) |
| Weight init | Too large | Activations saturate, gradients explode | Scale based on layer size |
| Weight init | Too small | Signals shrink to zero across layers | Scale based on layer size |
| Learning rate | Too high | Loss oscillates or diverges | Start at 0.001, tune from there |
| Learning rate | Too low | Training crawls, gets stuck | Use a schedule: start high, decay |
| Stopping | Too early | Underfitting | Monitor validation loss |
| Stopping | Too late | Overfitting | Early stopping with patience |

The full training loop in plain English

graph TD
  INIT["Initialize weights
(Xavier or He)"] --> SAMPLE["Pick a mini-batch"]
  SAMPLE --> FWD["Forward pass:
compute prediction"]
  FWD --> LOSS["Compute loss:
how wrong?"]
  LOSS --> BACK["Backward pass:
compute gradients"]
  BACK --> UPDATE["Update weights:
w = w - lr * gradient"]
  UPDATE --> MORE{"More batches
in this epoch?"}
  MORE -->|"Yes"| SAMPLE
  MORE -->|"No"| EVAL["Evaluate on
validation set"]
  EVAL --> DONE{"Converged?"}
  DONE -->|"No"| SAMPLE
  DONE -->|"Yes"| STOP["Done"]

Now let’s formalize each decision.


Weight initialization

Before training starts, you need to set the initial weights. This choice matters more than most people think.

Why zero initialization fails

If you set all weights to zero, every neuron in a layer computes the exact same output. During backpropagation, they all receive the exact same gradient. They update identically. They stay identical forever. The network can never break this symmetry, so it behaves as if each layer has just one neuron. This is called the symmetry problem.

The fix: initialize with small random values. But the scale matters.

Xavier (Glorot) initialization

Designed for sigmoid and tanh activations. The idea: keep the variance of activations roughly the same across layers. If variance grows, activations saturate. If it shrinks, signals die.

$$\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$

He initialization

Designed for ReLU activations. ReLU zeros out roughly half the neurons, so we need to compensate with larger initial weights.

$$\text{Var}(w) = \frac{2}{n_{\text{in}}}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}}}},\; \sqrt{\frac{6}{n_{\text{in}}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}}}\right)$

Initialization methods comparison

| Method | Variance formula | Designed for | Key assumption |
|---|---|---|---|
| Zero init | $\text{Var} = 0$ | Nothing (do not use) | None; symmetry is never broken |
| Random small | $\text{Var} \approx 0.01$ | Shallow networks | Ad hoc; no theoretical basis |
| Xavier / Glorot | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ | Sigmoid, tanh | Activation is roughly linear near zero |
| He / Kaiming | $\frac{2}{n_{\text{in}}}$ | ReLU, Leaky ReLU | Half of the neurons are zeroed |

Rule of thumb: use He init for ReLU networks (which is most modern networks). Use Xavier for sigmoid or tanh.
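Both schemes take only a few lines of NumPy. This is an illustrative sketch (function names are ours; deep learning frameworks ship equivalent built-in initializers):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Xavier/Glorot uniform: bounds are +/- sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """He/Kaiming normal: zero mean, std = sqrt(2 / n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_tanh = xavier_uniform(512, 256, rng)  # for a tanh layer
W_relu = he_normal(512, 256, rng)       # for a ReLU layer
```

Note that He draws come out wider than Xavier draws for the same layer, exactly as the variance formulas predict.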

When to use which initialization

graph TD
  Q{"What activation
function?"}
  Q -->|"Sigmoid or tanh"| XAVIER["Xavier / Glorot
Var = 2 / (n_in + n_out)"]
  Q -->|"ReLU or Leaky ReLU"| HE["He / Kaiming
Var = 2 / n_in"]
  XAVIER --> R1["Assumes linear regime
near zero"]
  HE --> R2["Compensates for ReLU
zeroing half the neurons"]

The training loop

Training loss vs validation loss over 50 epochs. Validation loss starts increasing after epoch 25, signaling overfitting.

The training loop is the heartbeat of deep learning. Every iteration does the same four steps:

graph TD
  A["Sample mini-batch"] --> B["Forward pass: compute ŷ"]
  B --> C["Compute loss L(ŷ, y)"]
  C --> D["Backward pass: compute gradients"]
  D --> E["Update weights: w ← w - η∇L"]
  E --> F{"More batches?"}
  F -->|Yes| A
  F -->|No| G["End of epoch"]
  G --> H{"Converged?"}
  H -->|No| A
  H -->|Yes| I["Done"]

One pass through the entire training set is called an epoch. You typically train for many epochs. Within each epoch, you iterate over mini-batches.

Mini-batch vs full batch vs SGD

| Mode | Batch size | Gradient quality | Memory | Speed |
|---|---|---|---|---|
| Full batch GD | Entire dataset | Exact | Very high | Slow per update |
| Mini-batch GD | 32, 64, 128, 256 | Noisy but useful | Moderate | Best tradeoff |
| Pure SGD | 1 sample | Very noisy | Minimal | Fast but erratic |

Mini-batch gradient descent is the standard. Batch sizes of 32 to 256 work well for most problems. The noise in mini-batch gradients actually helps escape shallow local minima, acting as a form of implicit regularization.
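As a concrete sketch of the loop above, here is mini-batch gradient descent on a toy linear model with squared loss (all names, sizes, and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)  # targets with a little noise

w = rng.normal(0.0, 0.1, size=3)               # small random init
lr, batch_size = 0.1, 32

for epoch in range(20):                        # one epoch = one full pass
    perm = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]   # pick a mini-batch
        pred = X[idx] @ w                      # forward pass
        err = pred - y[idx]                    # how wrong?
        grad = 2 * X[idx].T @ err / len(idx)   # backward pass (MSE gradient)
        w -= lr * grad                         # update step
```

After a handful of epochs `w` lands very close to `true_w`, despite each update seeing only 32 noisy samples.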


Learning rate

The learning rate $\eta$ controls how big each weight update is. It is the single most important hyperparameter in training.

Too large: the optimizer overshoots the minimum, bouncing back and forth or diverging entirely.

Too small: the optimizer creeps toward the minimum so slowly that training takes forever, and it may get stuck in a poor local minimum.

Just right: the loss decreases steadily, then levels off near a good solution.

A common starting point: try $\eta = 0.001$ with Adam, or $\eta = 0.01$ with plain SGD. Then adjust based on the loss curve. If the loss oscillates wildly, reduce it. If it plateaus too early, try increasing it or using a schedule.


Learning rate schedules

Learning rate schedule: linear warmup for 5 epochs followed by cosine decay.

A fixed learning rate rarely works best. You want to start large (for fast progress) and reduce it over time (for fine-grained convergence). Schedules automate this.

| Schedule | Formula | When to use |
|---|---|---|
| Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$ | Simple baseline; drop by factor $\gamma$ every $s$ epochs |
| Cosine annealing | $\eta_t = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right)$ | Smooth decay; popular in vision tasks |
| Warmup + decay | Linear ramp from 0 to $\eta_0$ over $w$ steps, then decay | Transformer training; stabilizes early steps |

Warmup is especially important for large models. At the start, the randomly initialized weights produce large, unreliable gradients, so a small learning rate during warmup prevents destructive early updates. Once training stabilizes, the rate reaches its peak and the decay phase takes over.

Warmup then cosine decay

graph LR
  A["Step 0
lr = 0"] -->|"Linear warmup"| B["Warmup end
lr = peak"]
  B -->|"Cosine decay"| C["Mid training
lr = 0.5 * peak"]
  C -->|"Cosine decay"| D["End of training
lr near 0"]

Warmup prevents destructive updates when weights are still random. Cosine decay gives smooth convergence as the model approaches a good solution.
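The warmup-then-cosine shape is easy to express as a function of the step index. This is a sketch; real schedulers in deep learning frameworks add options such as a minimum learning rate floor:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(100)]
```

At the warmup midpoint of the decay phase the rate is exactly half of the peak, matching the diagram above.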


Gradient clipping

Sometimes gradients explode, especially in recurrent networks or very deep models. A single enormous gradient can destroy the weights in one update. Gradient clipping caps the gradient norm before the update.

There are two flavors:

  1. Clip by value: cap each gradient element to $[-\tau, \tau]$. Simple but can change the gradient direction.
  2. Clip by norm: if $\|\mathbf{g}\| > \tau$, rescale $\mathbf{g} \leftarrow \mathbf{g} \cdot \frac{\tau}{\|\mathbf{g}\|}$. Preserves direction; this is the standard approach.

graph TD
  A["Compute gradient g"] --> B{"‖g‖ > threshold τ?"}
  B -->|Yes| C["Rescale: g ← g × (τ / ‖g‖)"]
  B -->|No| D["Keep g unchanged"]
  C --> E["Update weights with g"]
  D --> E

A typical threshold is $\tau = 1.0$ or $\tau = 5.0$. You can monitor gradient norms during training to pick a good value.
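Clip-by-norm is only a few lines in NumPy (an illustrative sketch; frameworks provide built-in equivalents):

```python
import numpy as np

def clip_by_norm(g, tau):
    """If the gradient's L2 norm exceeds tau, rescale it to norm tau.
    The direction is preserved; only the magnitude changes."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * (tau / norm)
    return g

g = clip_by_norm(np.array([3.0, -4.0]), 1.0)  # norm 5.0 -> rescaled to norm 1.0
```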

Batch normalization: normalize, scale, shift

graph LR
  IN["Activations
from previous layer"] --> NORM["Normalize:
subtract mean,
divide by std"]
  NORM --> SCALE["Scale by
learned gamma"]
  SCALE --> SHIFT["Shift by
learned beta"]
  SHIFT --> OUT["Stable activations
for next layer"]

Batch normalization forces each layer’s activations to have zero mean and unit variance, then lets the network learn the optimal scale and shift. This smooths the loss surface and allows higher learning rates.
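The normalize-scale-shift pipeline looks like this for a fully connected layer. This is a training-mode sketch only; at inference time batch norm uses running averages of the batch statistics, which are omitted here:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features). Normalize each feature over the batch,
    then apply the learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
acts = 5.0 + 3.0 * rng.normal(size=(64, 8))            # shifted, spread-out activations
out = batchnorm_forward(acts, np.ones(8), np.zeros(8))  # gamma=1, beta=0
```

With `gamma = 1` and `beta = 0` the output has zero mean and unit variance per feature; during training the network is free to learn other values.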


Practical checklist

Before you start training, check these:

  1. Overfit a single batch first. If the model cannot memorize 1 batch, something is broken.
  2. Use He init for ReLU, Xavier for sigmoid/tanh.
  3. Start with Adam at lr=0.001. Switch to SGD with momentum later if needed.
  4. Watch the loss curve. It should decrease smoothly. Spikes mean the learning rate is too high.
  5. Enable gradient clipping (norm = 1.0) if training is unstable.
  6. Normalize your inputs. Zero mean, unit variance. This helps every layer downstream.
  7. Use batch normalization or layer normalization for deep networks.
  8. Start simple. Train a small model first. Scale up once you know the pipeline works.
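The "early stopping with patience" rule from the first table reduces to a small bookkeeping helper (an illustrative sketch):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        """Call once per epoch with the current validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0   # improvement: reset the counter
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
decisions = [stopper.should_stop(v) for v in [1.00, 0.90, 0.95, 0.96]]
# decisions == [False, False, False, True]: stop after two epochs without improvement
```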

Example 1: Xavier vs He initialization

Compute the initialization bounds for a layer with $n_{\text{in}} = 512$ and $n_{\text{out}} = 256$.

Xavier (Glorot) initialization

Variance:

$$\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}} = \frac{2}{512 + 256} = \frac{2}{768} \approx 0.00260$$

Uniform bounds:

$$\pm\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} = \pm\sqrt{\frac{6}{768}} = \pm\sqrt{0.00781} \approx \pm 0.0884$$

So weights are drawn from $U[-0.0884, 0.0884]$.

Normal distribution: $w \sim \mathcal{N}(0, 0.0510^2)$, where std $= \sqrt{0.00260} \approx 0.0510$.

He initialization

Variance:

$$\text{Var}(w) = \frac{2}{n_{\text{in}}} = \frac{2}{512} \approx 0.00391$$

Uniform bounds:

$$\pm\sqrt{\frac{6}{n_{\text{in}}}} = \pm\sqrt{\frac{6}{512}} = \pm\sqrt{0.01172} \approx \pm 0.1082$$

So weights are drawn from $U[-0.1082, 0.1082]$.

Normal distribution: $w \sim \mathcal{N}(0, 0.0625^2)$, where std $= \sqrt{0.00391} \approx 0.0625$.

He init gives wider bounds (0.1082 vs 0.0884) and larger std (0.0625 vs 0.0510). This compensates for ReLU zeroing out about half the neurons. Without this extra scale, activations would shrink layer by layer, making deep ReLU networks hard to train.
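These numbers are quick to check in NumPy:

```python
import numpy as np

xavier_bound = np.sqrt(6 / (512 + 256))  # Xavier uniform bound, ~0.0884
xavier_std = np.sqrt(2 / (512 + 256))    # Xavier normal std, ~0.0510
he_bound = np.sqrt(6 / 512)              # He uniform bound, ~0.1082
he_std = np.sqrt(2 / 512)                # He normal std, exactly 0.0625
```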


Example 2: Learning rate too large vs too small

Minimize $f(x) = x^2$ using gradient descent. The gradient is $f'(x) = 2x$. Start at $x_0 = 1$.

With $\eta = 2.0$ (too large)

$$x_1 = x_0 - \eta \cdot f'(x_0) = 1 - 2.0 \times 2(1) = 1 - 4 = -3$$

$$x_2 = -3 - 2.0 \times 2(-3) = -3 + 12 = 9$$

$$x_3 = 9 - 2.0 \times 2(9) = 9 - 36 = -27$$

The values are diverging: $1 \to -3 \to 9 \to -27$. Each step takes us further from the minimum at $x = 0$. The learning rate is so large that the optimizer overshoots and bounces wildly.

With $\eta = 0.1$ (good)

$$x_1 = 1 - 0.1 \times 2(1) = 1 - 0.2 = 0.8$$

$$x_2 = 0.8 - 0.1 \times 2(0.8) = 0.8 - 0.16 = 0.64$$

$$x_3 = 0.64 - 0.1 \times 2(0.64) = 0.64 - 0.128 = 0.512$$

The values converge smoothly: $1 \to 0.8 \to 0.64 \to 0.512 \to \cdots \to 0$. Each step makes steady progress toward the minimum.

The function values tell the story even more clearly:

| Step | $x$ with $\eta = 2.0$ | $f(x)$ | $x$ with $\eta = 0.1$ | $f(x)$ |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | -3.0 | 9.0 | 0.8 | 0.64 |
| 2 | 9.0 | 81.0 | 0.64 | 0.41 |
| 3 | -27.0 | 729.0 | 0.512 | 0.26 |
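The iterates above can be reproduced in a few lines (an illustrative sketch):

```python
def gd_steps(x0, lr, n_steps):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - lr * 2 * xs[-1])
    return xs

diverging = gd_steps(1.0, 2.0, 3)   # 1 -> -3 -> 9 -> -27: diverges
converging = gd_steps(1.0, 0.1, 3)  # 1 -> 0.8 -> 0.64 -> 0.512: converges
```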

Example 3: Gradient clipping by norm

Given: gradient vector $\mathbf{g} = [3.0, -4.0, 1.5, -2.5]$, threshold $\tau = 3.0$.

Step 1: Compute the gradient norm.

$$\|\mathbf{g}\| = \sqrt{3.0^2 + (-4.0)^2 + 1.5^2 + (-2.5)^2} = \sqrt{9.0 + 16.0 + 2.25 + 6.25} = \sqrt{33.5} \approx 5.788$$

Step 2: Compare to threshold.

$$\|\mathbf{g}\| = 5.788 > \tau = 3.0$$

The norm exceeds the threshold, so we need to clip.

Step 3: Compute scaling factor.

$$\text{scale} = \frac{\tau}{\|\mathbf{g}\|} = \frac{3.0}{5.788} \approx 0.5183$$

Step 4: Rescale the gradient.

$$\mathbf{g}_{\text{clipped}} = \mathbf{g} \times 0.5183 = [1.555,\; -2.073,\; 0.777,\; -1.296]$$

Verify: $\|\mathbf{g}_{\text{clipped}}\| = \sqrt{1.555^2 + 2.073^2 + 0.777^2 + 1.296^2} = \sqrt{2.418 + 4.297 + 0.604 + 1.680} = \sqrt{8.999} \approx 3.0$

The gradient direction is preserved, but its magnitude is capped at 3.0. Without clipping, a gradient of norm 5.788 might cause a weight update nearly twice as large as intended, potentially destabilizing training.
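The same arithmetic in NumPy:

```python
import numpy as np

g = np.array([3.0, -4.0, 1.5, -2.5])
tau = 3.0
norm = np.linalg.norm(g)                           # sqrt(33.5), about 5.788
g_clipped = g * (tau / norm) if norm > tau else g  # rescale only when too large
```

The rescaled vector has norm exactly `tau`, and dividing each vector by its own norm gives the same unit direction.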


What comes next

You now have the tools to train a basic neural network: proper initialization, a well-tuned learning rate, and gradient clipping as a safety net. The fully connected networks we have discussed so far treat every input feature independently.

But images, for example, have spatial structure. A pixel’s neighbors matter. The next article introduces convolutional neural networks, which exploit this structure by sharing weights across spatial positions, dramatically reducing parameters and improving performance on visual tasks.
