Practical deep learning: debugging and tuning
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Knowing how backpropagation works and what Adam does is necessary but not sufficient. The gap between understanding deep learning and getting a model to actually work is enormous. Most of your time will be spent debugging: figuring out why the loss won’t go down, why validation accuracy plateaus, why training suddenly diverges. This article covers the practical skills that close that gap.
Prerequisites
You should be familiar with training neural networks (forward/backward pass, loss functions), optimization techniques (batch normalization, learning rate schedules), and regularization (dropout, weight decay). Experience training at least one model yourself will make this much more concrete.
Before training: sanity checks
Before investing hours of GPU time, run these cheap checks. They catch the majority of bugs in minutes.
Overfit one batch
Take a single small batch (8-32 samples) and train on it. Your model should memorize it perfectly within a few hundred steps. If it can’t drive the loss to near zero on 8 examples, something is fundamentally broken. This test catches:
- Incorrect loss function implementation
- Wrong labels or data loading bugs
- Architecture errors that prevent learning
- Activation functions applied incorrectly
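A minimal sketch of this test in PyTorch (the model, data, and hyperparameters here are stand-ins, not a recommended architecture):

```python
import torch
import torch.nn as nn

# Grab one tiny batch and repeatedly step on it. If the pipeline is
# healthy, the loss should fall to near zero within a few hundred steps.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 20)             # one batch of 8 samples
y = torch.randint(0, 10, (8,))     # 8 random labels to memorize
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(f"final loss on the batch: {loss.item():.4f}")  # should be near zero
```

If the loss stalls well above zero here, stop and debug before touching the full dataset.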
Check loss at initialization
Before any training, compute the loss on a random batch. For classification with C classes using cross-entropy loss, a freshly initialized model should predict roughly uniformly, giving

loss ≈ −ln(1/C) = ln(C)

For 10 classes: ln(10) ≈ 2.30. If your initial loss is 15.0, something is wrong (maybe the loss is not averaged, or labels are misaligned).
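You can verify the ln(C) baseline directly. Zero logits produce an exactly uniform softmax, so cross-entropy equals ln(10) regardless of the labels:

```python
import math
import torch
import torch.nn.functional as F

# Zero logits -> uniform predictions -> cross-entropy = ln(C) exactly.
logits = torch.zeros(32, 10)             # 32 samples, 10 classes
labels = torch.randint(0, 10, (32,))
initial = F.cross_entropy(logits, labels).item()
print(initial, math.log(10))             # both ≈ 2.3026
```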
Verify data pipeline
Visualize a few training samples with their labels. Confirm that augmentations look correct. Check that normalization statistics are right. A shocking number of bugs come from the data pipeline: channels in the wrong order, labels shifted by one, normalization applied twice.
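One cheap pipeline check, sketched here with random stand-in data: after normalizing with the training-set statistics, a training batch should have per-channel mean ≈ 0 and std ≈ 1, and normalizing twice (a common bug) visibly breaks that:

```python
import torch

# Stand-in for a batch of raw images with nonzero mean and small std.
batch = torch.randn(64, 3, 32, 32) * 0.25 + 0.5
mean = batch.mean(dim=(0, 2, 3), keepdim=True)   # "training-set" statistics
std = batch.std(dim=(0, 2, 3), keepdim=True)

normalized = (batch - mean) / std
print(normalized.mean().item(), normalized.std().item())  # ≈ 0 and ≈ 1

double = (normalized - mean) / std   # the bug: normalization applied twice
print(double.mean().item())          # no longer ≈ 0
```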
Loss curves: what they tell you
The training loss curve is your primary diagnostic tool. Learn to read it like a doctor reads an EKG.
A debugging flowchart for when the loss isn't decreasing
```mermaid
graph TD
    Start["Loss not decreasing"] --> Q1{"Initial loss correct?"}
    Q1 -->|"No"| F1["Check loss function, labels, output layer"]
    Q1 -->|"Yes"| Q2{"Can overfit one batch?"}
    Q2 -->|"No"| F2["Bug in model: check architecture, activations, data flow"]
    Q2 -->|"Yes"| Q3{"Learning rate too high?"}
    Q3 -->|"Loss oscillates wildly"| F3["Reduce learning rate by 10x"]
    Q3 -->|"Loss flat"| Q4{"Learning rate too low?"}
    Q4 -->|"Maybe"| F4["Increase learning rate by 10x"]
    Q4 -->|"Already tried"| Q5{"Gradient flow blocked?"}
    Q5 -->|"Yes"| F5["Check for dead ReLUs, vanishing gradients, bad initialization"]
    Q5 -->|"No"| F6["Check data quality, task difficulty, model capacity"]
```
Five loss curve patterns
```mermaid
graph LR
    subgraph "Healthy"
        H["Train ↓ steadily; Val ↓ then levels off; small gap"]
    end
    subgraph "Overfitting"
        O["Train ↓ keeps going; Val ↓ then ↑; growing gap"]
    end
    subgraph "Underfitting"
        U["Train ↓ slowly; Val ↓ slowly; both still high"]
    end
    subgraph "Diverging"
        D["Loss ↑ or explodes to NaN/Inf"]
    end
    subgraph "Oscillating"
        OS["Loss bounces up and down; no clear trend"]
    end
```
Healthy: training loss decreases smoothly, validation loss follows but levels off. The gap between them is small. This is what you want.
Overfitting: training loss keeps decreasing but validation loss starts increasing. The model memorizes training data instead of generalizing. Fixes: more data, regularization (dropout, weight decay), data augmentation, smaller model.
Underfitting: both training and validation loss are high and decrease slowly. The model doesn’t have enough capacity or isn’t learning fast enough. Fixes: larger model, more layers, higher learning rate, train longer.
Diverging: loss increases or explodes to NaN/Inf. Almost always a learning rate problem. Fixes: reduce learning rate, check for numerical issues (log of zero, division by zero), add gradient clipping.
Oscillating: loss bounces up and down without a clear downward trend. Usually means the learning rate is too high or the batch size is too small. Fixes: reduce learning rate, increase batch size, use learning rate warmup.
Gradient diagnostics
If loss curves don’t tell you enough, look at the gradients directly.
Gradient norms per layer
Compute the L2 norm of gradients for each layer after each backward pass. Healthy training shows gradient norms in a similar range across all layers.
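A sketch of this diagnostic in PyTorch (the toy model is illustrative; in practice you'd call this on your own model after each backward pass):

```python
import torch
import torch.nn as nn

# After loss.backward(), collect the L2 norm of each parameter's gradient.
# Norms shrinking by orders of magnitude toward the input suggest vanishing
# gradients; norms growing toward the input suggest exploding ones.
def gradient_norms(model: nn.Module) -> dict:
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

model = nn.Sequential(nn.Linear(16, 16), nn.Sigmoid(), nn.Linear(16, 1))
loss = model(torch.randn(4, 16)).sum()
loss.backward()
for name, norm in gradient_norms(model).items():
    print(f"{name}: {norm:.6f}")
```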
Example 2: Diagnosing vanishing gradients
After one backward pass, you observe these gradient norms per layer:
| Layer | Gradient norm |
|---|---|
| Layer 1 (closest to input) | 0.0003 |
| Layer 2 | 0.001 |
| Layer 3 | 0.8 |
| Layer 4 (closest to output) | 2.1 |
The pattern is clear: gradients decrease by orders of magnitude as you go deeper (closer to the input). Layer 1’s gradient is 7,000 times smaller than Layer 4’s. This is vanishing gradients.
Diagnosis: the first layers are barely learning. By the chain rule, gradients multiply through each layer. If each layer’s Jacobian has spectral norm less than 1, the product shrinks exponentially with depth.
Three concrete fixes:
- Use residual connections: skip connections let gradients flow directly to earlier layers, bypassing the vanishing chain of multiplications.
- Switch activation functions: replace sigmoid/tanh with ReLU or its variants (LeakyReLU, GELU). Sigmoid saturates for large inputs, producing near-zero gradients.
- Apply batch normalization: normalizing activations at each layer keeps them in a range where gradients flow well. Place batch norm before or after the activation function.
If gradient norms are large everywhere and increasing, you have exploding gradients instead. Apply gradient clipping: cap the global gradient norm to a maximum value (commonly 1.0 or 5.0).
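In PyTorch this is a one-liner with `clip_grad_norm_`, which rescales all gradients in place when their combined norm exceeds `max_norm` (the toy model and inflated inputs below just manufacture a large gradient):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
loss = model(torch.randn(4, 10) * 100).sum()   # deliberately large gradients
loss.backward()

# Returns the total norm *before* clipping; gradients are rescaled in place.
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"norm before clipping: {float(before):.2f}, after: {float(after):.4f}")
```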
The debugging checklist
When your model isn’t working, go through these 10 things in order. Most problems are caught by the first five.
| # | Check | What to look for |
|---|---|---|
| 1 | Data loading | Correct labels, no off-by-one, proper normalization |
| 2 | Loss function | Matches the task (cross-entropy for classification, MSE for regression) |
| 3 | Output layer | Correct size, correct activation (softmax for classification, none for regression before loss) |
| 4 | Initial loss | Matches ln(C) for cross-entropy with C classes (≈2.3 for 10 classes) |
| 5 | Overfit one batch | Loss reaches near zero on a tiny batch |
| 6 | Learning rate | Try 1e-1, 1e-2, 1e-3, 1e-4 on a small subset |
| 7 | Gradient flow | Norms are reasonable and similar across layers |
| 8 | Shapes | Print tensor shapes at each layer; look for broadcasting bugs |
| 9 | Data augmentation | Visualize augmented samples; make sure they still look correct |
| 10 | Preprocessing | Train/val/test use the same normalization statistics (from training set only) |
Example 1: Sanity check failure
Setup: 8 training samples, model with 10,000 parameters. Learning rate 0.001. After 1,000 steps, the loss is 1.8 (for a 10-class problem, initial should be ~2.3).
The loss decreased from 2.3 to 1.8 but didn’t reach zero. With 10K parameters and 8 samples, the model has more than enough capacity to memorize. What’s wrong?
Five possible causes to walk through:
- Wrong labels: the 8 samples might have incorrect or shuffled labels. Verify by printing each sample with its label.
- Output dimension mismatch: if the model outputs 5 classes but labels go up to 9, the loss can’t reach zero. Check model.output_dim == num_classes.
- Activation after output: if you apply softmax in the model and also in the loss function (CrossEntropyLoss in PyTorch expects raw logits and applies log-softmax internally), you’re double-softmaxing. The gradients will be wrong.
- Learning rate too low: 0.001 might be too conservative for memorizing 8 samples. Try 0.01 or 0.1.
- Data not varying: if all 8 samples are nearly identical but have different labels, the model needs to memorize noise, which requires a very specific learning rate and many more steps. Check that the batch has diverse examples.
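The double-softmax bug from cause 3 is easy to demonstrate: `F.cross_entropy` expects raw logits, and feeding it softmax outputs compresses the input range so the loss can never reach zero, no matter how confident the model is:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[10.0, -10.0]])   # confidently correct prediction
label = torch.tensor([0])

correct = F.cross_entropy(logits, label)                   # ≈ 0
buggy = F.cross_entropy(F.softmax(logits, dim=1), label)   # stuck above 0.3
print(float(correct), float(buggy))
```

Softmax squashes the logits into [0, 1], so even a perfect prediction passed through softmax twice looks only mildly confident to the loss.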
Ablation studies
An ablation study removes or disables one component at a time to measure its contribution. It answers the question: “Does this thing actually help?”
Example 3: Designing an ablation
You built a ResNet with attention pooling, data augmentation, and dropout. Test accuracy is 94.2%. Which components matter?
| Experiment | Attention pooling | Augmentation | Dropout | Test accuracy |
|---|---|---|---|---|
| Full model | ✓ | ✓ | ✓ | 94.2% |
| No attention | ✗ | ✓ | ✓ | 93.1% |
| No augmentation | ✓ | ✗ | ✓ | 91.5% |
| No dropout | ✓ | ✓ | ✗ | 93.8% |
Reading the results:
- Removing augmentation causes the biggest drop (2.7 percentage points). Data augmentation is the most important component.
- Removing attention pooling drops 1.1 points. Worth keeping.
- Removing dropout drops only 0.4 points. It helps a little, but you might drop it if inference speed matters.
Run each experiment 3 times with different seeds and report mean and standard deviation. A 0.4% improvement is meaningless if the standard deviation is 0.5%.
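A toy illustration with made-up accuracies from three hypothetical seeds: a 0.4-point mean difference is hard to trust when the per-seed spread is of similar size.

```python
import statistics

# Hypothetical test accuracies from three runs with different seeds.
full_model = [94.2, 93.7, 94.6]
no_dropout = [93.8, 94.1, 93.5]

for name, runs in [("full model", full_model), ("no dropout", no_dropout)]:
    print(f"{name}: {statistics.mean(runs):.2f} ± {statistics.stdev(runs):.2f}")
```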
Common failure modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss = NaN after a few steps | Learning rate too high, log(0), division by zero | Reduce LR by 10x, add epsilon to denominators |
| Loss stuck at ln(C) | Model outputting uniform predictions | Check gradients, verify output layer, increase LR |
| Train loss great, val loss terrible | Overfitting | Add regularization, more data, early stopping |
| Both losses high | Underfitting or data issue | Bigger model, check data quality, train longer |
| Val loss increases after epoch 1 | Severe overfitting or data leak | Check for data leakage between train/val |
| Loss oscillates wildly | LR too high or batch size too small | Halve LR, double batch size |
| Accuracy = 1/num_classes | Random guessing | Model not learning at all, start from checklist item 1 |
| Training suddenly diverges mid-run | LR schedule issue, data corruption | Check LR schedule, inspect data at that step |
Hyperparameter sensitivity and priority
Not all hyperparameters deserve equal tuning effort. Here’s a practical priority ranking:
| Parameter | Sensitivity | Tune first? | Typical range |
|---|---|---|---|
| Learning rate | Very high | ✓ Yes, always | 1e-4 to 1e-1 |
| Batch size | High | ✓ Early | 16 to 512 |
| Weight decay | Medium | After LR | 1e-5 to 1e-2 |
| Dropout rate | Medium | After LR | 0.1 to 0.5 |
| Number of layers | Medium-high | During arch search | Task dependent |
| Hidden units | Medium | During arch search | 64 to 2048 |
| Optimizer (Adam vs SGD) | Medium | Try both early | Adam first, SGD+momentum if time |
| LR schedule | Medium | After base LR | Cosine, step, or reduce-on-plateau |
| Data augmentation | High (for vision) | Early | Task dependent |
Start with learning rate. Get that right and many other things fall into place. Use learning rate finder techniques: start very small, increase exponentially each step, plot loss vs LR, and pick the LR where loss decreases fastest.
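A rough sketch of the LR range test (the model and data are stand-ins; as a crude proxy, this picks the LR with the lowest recorded loss, where in practice you would plot loss vs. LR and pick the point of steepest descent before the loss blows up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 20), torch.randint(0, 10, (64,))

# Start tiny, multiply the LR each step, and record (lr, loss).
lr, factor = 1e-6, 1.5
history = []
while lr < 1.0:
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    history.append((lr, loss.item()))
    lr *= factor

best_lr, _ = min(history, key=lambda t: t[1])
print(f"lowest loss observed around lr ≈ {best_lr:.2e}")
```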
Reproducibility
Results you can’t reproduce are not results. Three things to get right:
- Random seeds: set seeds for Python, NumPy, PyTorch/TensorFlow, and CUDA. Note that some GPU operations are inherently non-deterministic.
```python
import torch
import numpy as np
import random

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
- Version pinning: record exact versions of all libraries. A PyTorch update can change default initialization or optimizer behavior.
- Logging everything: for every run, save the full config (hyperparameters, data paths, git commit hash), all metrics over time, and the best model checkpoint. You will need this when you try to figure out what worked three weeks ago.
Experiment tracking
Once you run more than a handful of experiments, spreadsheets break down. Use proper experiment tracking tools:
- What to log: hyperparameters, training/validation metrics per epoch, gradient norms, learning rate schedule, system metrics (GPU utilization, memory), final test metrics
- What to save: model checkpoints (at least the best one), training config, random seeds, git commit hash
A good experiment log lets you answer questions like:
- “What was the best learning rate for the ResNet-50 experiments last week?”
- “Did adding dropout actually help, or was it the augmentation change I made at the same time?”
- “What hyperparameters produced the model currently in production?”
Tools like Weights & Biases, MLflow, or even a well-structured directory of JSON configs can work. The key is consistency: log the same things every time.
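A minimal "well-structured directory of JSON configs" can be this simple (field names here are illustrative, not a fixed schema; `log_run` is a hypothetical helper):

```python
import json
import subprocess
import time
from pathlib import Path

def log_run(config: dict, out_dir: str = "runs") -> Path:
    """Write the run config, git commit, and timestamp to one JSON file."""
    record = dict(config)
    record["timestamp"] = time.strftime("%Y-%m-%dT%H-%M-%S")
    try:
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = "unknown"   # not in a git repo
    path = Path(out_dir)
    path.mkdir(exist_ok=True)
    out = path / f"run_{record['timestamp']}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

saved = log_run({"lr": 1e-3, "batch_size": 64, "model": "resnet18"})
print(f"config saved to {saved}")
```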
Practical workflow
Here’s the order I recommend for a new project:
- Get a baseline working fast. Small model, default hyperparameters, no augmentation. Just make sure the pipeline runs end-to-end.
- Sanity checks. Overfit one batch. Check initial loss. Visualize data.
- Tune learning rate. This is the single most impactful hyperparameter.
- Scale up the model until it overfits the training set. If it never overfits, your model is too small or your data is too noisy.
- Add regularization to close the train-val gap. Dropout, weight decay, data augmentation.
- Run ablations to confirm each component helps.
- Final tuning with Bayesian optimization or Hyperband on the most sensitive hyperparameters.
What comes next
This is the final article in the Deep Learning from Scratch series. If you’ve followed along from the beginning, you now have a solid foundation: from single neurons to CNNs, RNNs, Transformers, graph neural networks, architecture search, compression, and the practical skills to make it all work.
The field moves fast, but the fundamentals change slowly. Backpropagation, gradient descent, regularization, and careful experimentation remain at the core of everything. Master those, and you can pick up any new architecture or technique by reading the paper and implementing it yourself.
For the full curriculum, including the mathematics, optimization, and classical machine learning foundations that support deep learning, see the complete series index.