Practical deep learning: debugging and tuning
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Knowing how backpropagation works and what Adam does is necessary but not sufficient. The gap between understanding deep learning and getting a model to actually work is enormous. Most of your time will be spent debugging: figuring out why the loss won’t go down, why validation accuracy plateaus, why training suddenly diverges. This article covers the practical skills that close that gap.
Prerequisites
You should be familiar with training neural networks (forward/backward pass, loss functions), optimization techniques (batch normalization, learning rate schedules), and regularization (dropout, weight decay). Experience training at least one model yourself will make this much more concrete.
Before training: sanity checks
Before investing hours of GPU time, run these cheap checks. They catch the majority of bugs in minutes.
Overfit one batch
Take a single small batch (8-32 samples) and train on it. Your model should memorize it perfectly within a few hundred steps. If it can’t drive the loss to near zero on 8 examples, something is fundamentally broken. This test catches:
- Incorrect loss function implementation
- Wrong labels or data loading bugs
- Architecture errors that prevent learning
- Activation functions applied incorrectly
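A minimal sketch of this test in PyTorch (the model, data, and hyperparameters here are stand-ins, not a recommended architecture):

```python
import torch
import torch.nn as nn

# Grab one tiny batch and repeatedly step on it. If the pipeline is
# healthy, the loss should fall to near zero within a few hundred steps.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 20)             # one batch of 8 samples
y = torch.randint(0, 10, (8,))     # 8 random labels to memorize
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(f"final loss on the batch: {loss.item():.4f}")  # should be near zero
```

If the loss stalls well above zero here, stop and debug before touching the full dataset.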
Check loss at initialization
Before any training, compute the loss on a random batch. For classification with C classes using cross-entropy loss, a freshly initialized model should predict roughly uniformly, giving

loss ≈ −ln(1/C) = ln(C)

For 10 classes: ln(10) ≈ 2.30. If your initial loss is 15.0, something is wrong (maybe the loss is not averaged, or labels are misaligned).
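You can verify the ln(C) baseline directly. Zero logits produce an exactly uniform softmax, so cross-entropy equals ln(10) regardless of the labels:

```python
import math
import torch
import torch.nn.functional as F

# Zero logits -> uniform predictions -> cross-entropy = ln(C) exactly.
logits = torch.zeros(32, 10)             # 32 samples, 10 classes
labels = torch.randint(0, 10, (32,))
initial = F.cross_entropy(logits, labels).item()
print(initial, math.log(10))             # both ≈ 2.3026
```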
Verify data pipeline
Visualize a few training samples with their labels. Confirm that augmentations look correct. Check that normalization statistics are right. A shocking number of bugs come from the data pipeline: channels in the wrong order, labels shifted by one, normalization applied twice.
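One cheap pipeline check, sketched here with random stand-in data: after normalizing with the training-set statistics, a training batch should have per-channel mean ≈ 0 and std ≈ 1, and normalizing twice (a common bug) visibly breaks that:

```python
import torch

# Stand-in for a batch of raw images with nonzero mean and small std.
batch = torch.randn(64, 3, 32, 32) * 0.25 + 0.5
mean = batch.mean(dim=(0, 2, 3), keepdim=True)   # "training-set" statistics
std = batch.std(dim=(0, 2, 3), keepdim=True)

normalized = (batch - mean) / std
print(normalized.mean().item(), normalized.std().item())  # ≈ 0 and ≈ 1

double = (normalized - mean) / std   # the bug: normalization applied twice
print(double.mean().item())          # no longer ≈ 0
```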
Loss curves: what they tell you
The training loss curve is your primary diagnostic tool. Learn to read it like a doctor reads an EKG.
A debugging flowchart for when the loss isn't decreasing
```mermaid
graph TD
    Start["Loss not decreasing"] --> Q1{"Initial loss correct?"}
    Q1 -->|"No"| F1["Check loss function, labels, output layer"]
    Q1 -->|"Yes"| Q2{"Can overfit one batch?"}
    Q2 -->|"No"| F2["Bug in model: check architecture, activations, data flow"]
    Q2 -->|"Yes"| Q3{"Learning rate too high?"}
    Q3 -->|"Loss oscillates wildly"| F3["Reduce learning rate by 10x"]
    Q3 -->|"Loss flat"| Q4{"Learning rate too low?"}
    Q4 -->|"Maybe"| F4["Increase learning rate by 10x"]
    Q4 -->|"Already tried"| Q5{"Gradient flow blocked?"}
    Q5 -->|"Yes"| F5["Check for dead ReLUs, vanishing gradients, bad initialization"]
    Q5 -->|"No"| F6["Check data quality, task difficulty, model capacity"]
```
Five loss curve patterns
```mermaid
graph LR
    subgraph "Healthy"
        H["Train ↓ steadily; Val ↓ then levels off; small gap"]
    end
    subgraph "Overfitting"
        O["Train ↓ keeps going; Val ↓ then ↑; growing gap"]
    end
    subgraph "Underfitting"
        U["Train ↓ slowly; Val ↓ slowly; both still high"]
    end
    subgraph "Diverging"
        D["Loss ↑ or explodes to NaN/Inf"]
    end
    subgraph "Oscillating"
        OS["Loss bounces up and down; no clear trend"]
    end
```
Healthy: training loss decreases smoothly, validation loss follows but levels off. The gap between them is small. This is what you want.
Overfitting: training loss keeps decreasing but validation loss starts increasing. The model memorizes training data instead of generalizing. Fixes: more data, regularization (dropout, weight decay), data augmentation, smaller model.
Underfitting: both training and validation loss are high and decrease slowly. The model doesn’t have enough capacity or isn’t learning fast enough. Fixes: larger model, more layers, higher learning rate, train longer.
Diverging: loss increases or explodes to NaN/Inf. Almost always a learning rate problem. Fixes: reduce learning rate, check for numerical issues (log of zero, division by zero), add gradient clipping.
Oscillating: loss bounces up and down without a clear downward trend. Usually means the learning rate is too high or the batch size is too small. Fixes: reduce learning rate, increase batch size, use learning rate warmup.
Gradient diagnostics
If loss curves don’t tell you enough, look at the gradients directly.
Gradient norms per layer
Compute the L2 norm of gradients for each layer after each backward pass. Healthy training shows gradient norms in a similar range across all layers.
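A sketch of this diagnostic in PyTorch (the toy model is illustrative; in practice you'd call this on your own model after each backward pass):

```python
import torch
import torch.nn as nn

# After loss.backward(), collect the L2 norm of each parameter's gradient.
# Norms shrinking by orders of magnitude toward the input suggest vanishing
# gradients; norms growing toward the input suggest exploding ones.
def gradient_norms(model: nn.Module) -> dict:
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

model = nn.Sequential(nn.Linear(16, 16), nn.Sigmoid(), nn.Linear(16, 1))
loss = model(torch.randn(4, 16)).sum()
loss.backward()
for name, norm in gradient_norms(model).items():
    print(f"{name}: {norm:.6f}")
```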
Example 2: Diagnosing vanishing gradients
After one backward pass, you observe these gradient norms per layer:
| Layer | Gradient norm |
|---|---|
| Layer 1 (closest to input) | 0.0003 |
| Layer 2 | 0.001 |
| Layer 3 | 0.8 |
| Layer 4 (closest to output) | 2.1 |
The pattern is clear: gradients decrease by orders of magnitude as you go deeper (closer to the input). Layer 1’s gradient is 7,000 times smaller than Layer 4’s. This is vanishing gradients.
Diagnosis: the first layers are barely learning. By the chain rule, gradients multiply through each layer. If each layer’s Jacobian has spectral norm less than 1, the product shrinks exponentially with depth.
Three concrete fixes:
- Use residual connections: skip connections let gradients flow directly to earlier layers, bypassing the vanishing chain of multiplications.
- Switch activation functions: replace sigmoid/tanh with ReLU or its variants (LeakyReLU, GELU). Sigmoid saturates for large inputs, producing near-zero gradients.
- Apply batch normalization: normalizing activations at each layer keeps them in a range where gradients flow well. Place batch norm before or after the activation function.
If gradient norms are large everywhere and increasing, you have exploding gradients instead. Apply gradient clipping: cap the global gradient norm to a maximum value (commonly 1.0 or 5.0).
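In PyTorch this is a one-liner with `clip_grad_norm_`, which rescales all gradients in place when their combined norm exceeds `max_norm` (the toy model and inflated inputs below just manufacture a large gradient):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
loss = model(torch.randn(4, 10) * 100).sum()   # deliberately large gradients
loss.backward()

# Returns the total norm *before* clipping; gradients are rescaled in place.
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"norm before clipping: {float(before):.2f}, after: {float(after):.4f}")
```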
The debugging checklist
When your model isn’t working, go through these 10 things in order. Most problems are caught by the first five.
| # | Check | What to look for |
|---|---|---|
| 1 | Data loading | Correct labels, no off-by-one, proper normalization |
| 2 | Loss function | Matches the task (cross-entropy for classification, MSE for regression) |
| 3 | Output layer | Correct size, correct activation (softmax for classification, none for regression before loss) |
| 4 | Initial loss | Matches ln(C) for cross-entropy with C classes (≈2.3 for 10 classes) |
| 5 | Overfit one batch | Loss reaches near zero on a tiny batch |
| 6 | Learning rate | Try 1e-1, 1e-2, 1e-3, 1e-4 on a small subset |
| 7 | Gradient flow | Norms are reasonable and similar across layers |
| 8 | Shapes | Print tensor shapes at each layer; look for broadcasting bugs |
| 9 | Data augmentation | Visualize augmented samples; make sure they still look correct |
| 10 | Preprocessing | Train/val/test use the same normalization statistics (from training set only) |
Example 1: Sanity check failure
Setup: 8 training samples, model with 10,000 parameters. Learning rate 0.001. After 1,000 steps, the loss is 1.8 (for a 10-class problem, initial should be ~2.3).
The loss decreased from 2.3 to 1.8 but didn’t reach zero. With 10K parameters and 8 samples, the model has more than enough capacity to memorize. What’s wrong?
Five possible causes to walk through:
- Wrong labels: the 8 samples might have incorrect or shuffled labels. Verify by printing each sample with its label.
- Output dimension mismatch: if the model outputs 5 classes but labels go up to 9, the loss can’t reach zero. Check model.output_dim == num_classes.
- Activation after output: if you apply softmax in the model and also in the loss function (CrossEntropyLoss in PyTorch expects raw logits and applies log-softmax internally), you’re double-softmaxing. The gradients will be wrong.
- Learning rate too low: 0.001 might be too conservative for memorizing 8 samples. Try 0.01 or 0.1.
- Data not varying: if all 8 samples are nearly identical but have different labels, the model needs to memorize noise, which requires a very specific learning rate and many more steps. Check that the batch has diverse examples.
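The double-softmax bug from cause 3 is easy to demonstrate: `F.cross_entropy` expects raw logits, and feeding it softmax outputs compresses the input range so the loss can never reach zero, no matter how confident the model is:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[10.0, -10.0]])   # confidently correct prediction
label = torch.tensor([0])

correct = F.cross_entropy(logits, label)                   # ≈ 0
buggy = F.cross_entropy(F.softmax(logits, dim=1), label)   # stuck above 0.3
print(float(correct), float(buggy))
```

Softmax squashes the logits into [0, 1], so even a perfect prediction passed through softmax twice looks only mildly confident to the loss.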
Ablation studies
An ablation study removes or disables one component at a time to measure its contribution. It answers the question: “Does this thing actually help?”
Example 3: Designing an ablation
You built a ResNet with attention pooling, data augmentation, and dropout. Test accuracy is 94.2%. Which components matter?
| Experiment | Attention pooling | Augmentation | Dropout | Test accuracy |
|---|---|---|---|---|
| Full model | ✓ | ✓ | ✓ | 94.2% |
| No attention | ✗ | ✓ | ✓ | 93.1% |
| No augmentation | ✓ | ✗ | ✓ | 91.5% |
| No dropout | ✓ | ✓ | ✗ | 93.8% |
Reading the results:
- Removing augmentation causes the biggest drop (2.7 percentage points). Data augmentation is the most important component.
- Removing attention pooling drops 1.1 points. Worth keeping.
- Removing dropout drops only 0.4 points. It helps a little, but you might drop it if inference speed matters.
Run each experiment 3 times with different seeds and report mean and standard deviation. A 0.4% improvement is meaningless if the standard deviation is 0.5%.
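A toy illustration with made-up accuracies from three hypothetical seeds: a 0.4-point mean difference is hard to trust when the per-seed spread is of similar size.

```python
import statistics

# Hypothetical test accuracies from three runs with different seeds.
full_model = [94.2, 93.7, 94.6]
no_dropout = [93.8, 94.1, 93.5]

for name, runs in [("full model", full_model), ("no dropout", no_dropout)]:
    print(f"{name}: {statistics.mean(runs):.2f} ± {statistics.stdev(runs):.2f}")
```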
Common failure modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss = NaN after a few steps | Learning rate too high, log(0), division by zero | Reduce LR by 10x, add epsilon to denominators |
| Loss stuck at ln(C) | Model outputting uniform predictions | Check gradients, verify output layer, increase LR |
| Train loss great, val loss terrible | Overfitting | Add regularization, more data, early stopping |
| Both losses high | Underfitting or data issue | Bigger model, check data quality, train longer |
| Val loss increases after epoch 1 | Severe overfitting or data leak | Check for data leakage between train/val |
| Loss oscillates wildly | LR too high or batch size too small | Halve LR, double batch size |
| Accuracy = 1/num_classes | Random guessing | Model not learning at all, start from checklist item 1 |
| Training suddenly diverges mid-run | LR schedule issue, data corruption | Check LR schedule, inspect data at that step |
Hyperparameter sensitivity and priority
Not all hyperparameters deserve equal tuning effort. Here’s a practical priority ranking:
| Parameter | Sensitivity | Tune first? | Typical range |
|---|---|---|---|
| Learning rate | Very high | ✓ Yes, always | 1e-4 to 1e-1 |
| Batch size | High | ✓ Early | 16 to 512 |
| Weight decay | Medium | After LR | 1e-5 to 1e-2 |
| Dropout rate | Medium | After LR | 0.1 to 0.5 |
| Number of layers | Medium-high | During arch search | Task dependent |
| Hidden units | Medium | During arch search | 64 to 2048 |
| Optimizer (Adam vs SGD) | Medium | Try both early | Adam first, SGD+momentum if time |
| LR schedule | Medium | After base LR | Cosine, step, or reduce-on-plateau |
| Data augmentation | High (for vision) | Early | Task dependent |
Start with learning rate. Get that right and many other things fall into place. Use learning rate finder techniques: start very small, increase exponentially each step, plot loss vs LR, and pick the LR where loss decreases fastest.
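A rough sketch of the LR range test (the model and data are stand-ins; as a crude proxy, this picks the LR with the lowest recorded loss, where in practice you would plot loss vs. LR and pick the point of steepest descent before the loss blows up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 20), torch.randint(0, 10, (64,))

# Start tiny, multiply the LR each step, and record (lr, loss).
lr, factor = 1e-6, 1.5
history = []
while lr < 1.0:
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    history.append((lr, loss.item()))
    lr *= factor

best_lr, _ = min(history, key=lambda t: t[1])
print(f"lowest loss observed around lr ≈ {best_lr:.2e}")
```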
Reproducibility
Results you can’t reproduce are not results. Three things to get right:
- Random seeds: set seeds for Python, NumPy, PyTorch/TensorFlow, and CUDA. Note that some GPU operations are inherently non-deterministic.
```python
import torch
import numpy as np
import random

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
- Version pinning: record exact versions of all libraries. A PyTorch update can change default initialization or optimizer behavior.
- Logging everything: for every run, save the full config (hyperparameters, data paths, git commit hash), all metrics over time, and the best model checkpoint. You will need this when you try to figure out what worked three weeks ago.
Experiment tracking
Once you run more than a handful of experiments, spreadsheets break down. Use proper experiment tracking tools:
- What to log: hyperparameters, training/validation metrics per epoch, gradient norms, learning rate schedule, system metrics (GPU utilization, memory), final test metrics
- What to save: model checkpoints (at least the best one), training config, random seeds, git commit hash
A good experiment log lets you answer questions like:
- “What was the best learning rate for the ResNet-50 experiments last week?”
- “Did adding dropout actually help, or was it the augmentation change I made at the same time?”
- “What hyperparameters produced the model currently in production?”
Tools like Weights & Biases, MLflow, or even a well-structured directory of JSON configs can work. The key is consistency: log the same things every time.
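A minimal "well-structured directory of JSON configs" can be this simple (field names here are illustrative, not a fixed schema; `log_run` is a hypothetical helper):

```python
import json
import subprocess
import time
from pathlib import Path

def log_run(config: dict, out_dir: str = "runs") -> Path:
    """Write the run config, git commit, and timestamp to one JSON file."""
    record = dict(config)
    record["timestamp"] = time.strftime("%Y-%m-%dT%H-%M-%S")
    try:
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = "unknown"   # not in a git repo
    path = Path(out_dir)
    path.mkdir(exist_ok=True)
    out = path / f"run_{record['timestamp']}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

saved = log_run({"lr": 1e-3, "batch_size": 64, "model": "resnet18"})
print(f"config saved to {saved}")
```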
Practical workflow
Here’s the order I recommend for a new project:
- Get a baseline working fast. Small model, default hyperparameters, no augmentation. Just make sure the pipeline runs end-to-end.
- Sanity checks. Overfit one batch. Check initial loss. Visualize data.
- Tune learning rate. This is the single most impactful hyperparameter.
- Scale up the model until it overfits the training set. If it never overfits, your model is too small or your data is too noisy.
- Add regularization to close the train-val gap. Dropout, weight decay, data augmentation.
- Run ablations to confirm each component helps.
- Final tuning with Bayesian optimization or Hyperband on the most sensitive hyperparameters.
What comes next
This is the final article in the Deep Learning from Scratch series. If you’ve followed along from the beginning, you now have a solid foundation: from single neurons to CNNs, RNNs, Transformers, graph neural networks, architecture search, compression, and the practical skills to make it all work.
The field moves fast, but the fundamentals change slowly. Backpropagation, gradient descent, regularization, and careful experimentation remain at the core of everything. Master those, and you can pick up any new architecture or technique by reading the paper and implementing it yourself.
For the full curriculum, including the mathematics, optimization, and classical machine learning foundations that support deep learning, see the complete series index.