
Practical deep learning: debugging and tuning

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Knowing how backpropagation works and what Adam does is necessary but not sufficient. The gap between understanding deep learning and getting a model to actually work is enormous. Most of your time will be spent debugging: figuring out why the loss won’t go down, why validation accuracy plateaus, why training suddenly diverges. This article covers the practical skills that close that gap.

Prerequisites

You should be familiar with training neural networks (forward/backward pass, loss functions), optimization techniques (batch normalization, learning rate schedules), and regularization (dropout, weight decay). Experience training at least one model yourself will make this much more concrete.

Before training: sanity checks

Before investing hours of GPU time, run these cheap checks. They catch the majority of bugs in minutes.

Overfit one batch

Take a single small batch (8-32 samples) and train on it. Your model should memorize it perfectly within a few hundred steps. If it can’t drive the loss to near zero on 8 examples, something is fundamentally broken. This test catches:

  • Incorrect loss function implementation
  • Wrong labels or data loading bugs
  • Architecture errors that prevent learning
  • Activation functions applied incorrectly
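As a sketch, this check takes only a few lines in PyTorch (the model, shapes, and step count below are stand-ins; substitute your own):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One small batch: 8 samples, 20 features, 10 classes (hypothetical shapes).
x = torch.randn(8, 20)
y = torch.randint(0, 10, (8,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# If the model can't memorize 8 samples, something upstream is broken.
print(f"final loss: {loss.item():.4f}")  # should be near zero
```

If the final loss here stays high, stop and debug before touching the full dataset.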

Check loss at initialization

Before any training, compute the loss on a random batch. For classification with C classes using cross-entropy loss:

Expected initial loss = -log(1/C) = log(C)

For 10 classes: log(10) ≈ 2.30. If your initial loss is 15.0, something is wrong (maybe the loss is not averaged, or labels are misaligned).
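A quick way to run this check in plain Python (the observed value and tolerance below are illustrative):

```python
import math

def expected_initial_loss(num_classes: int) -> float:
    # Cross-entropy of a uniform prediction over C classes: -log(1/C) = log(C).
    return math.log(num_classes)

observed = 2.31  # hypothetical initial loss measured on your first batch
expected = expected_initial_loss(10)

# Flag anything far from log(C); a large factor often means the loss
# isn't averaged over the batch, or labels and outputs are misaligned.
assert abs(observed - expected) < 0.5, f"expected ~{expected:.2f}, got {observed}"
```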

Verify data pipeline

Visualize a few training samples with their labels. Confirm that augmentations look correct. Check that normalization statistics are right. A shocking number of bugs come from the data pipeline: channels in the wrong order, labels shifted by one, normalization applied twice.
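One cheap version of the normalization check, with synthetic stand-in data (in practice, pull a real batch from your loader):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an image batch that has already gone through your pipeline.
batch = rng.normal(loc=0.1, scale=1.05, size=(32, 3, 32, 32))

# After normalization, per-channel mean should be near 0 and std near 1.
mean = batch.mean(axis=(0, 2, 3))
std = batch.std(axis=(0, 2, 3))
print("per-channel mean:", mean.round(2))
print("per-channel std:", std.round(2))
# A mean around 120 suggests normalization was never applied;
# a std near 0.004 suggests it was applied twice.
```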

Loss curves: what they tell you

The training loss curve is your primary diagnostic tool. Learn to read it like a doctor reads an EKG.

When the loss won't decrease, this decision tree covers the usual suspects:

graph TD
  Start["Loss not
decreasing"] --> Q1{"Initial loss
correct?"}
  Q1 -->|"No"| F1["Check loss function,
labels, output layer"]
  Q1 -->|"Yes"| Q2{"Can overfit
one batch?"}
  Q2 -->|"No"| F2["Bug in model:
check architecture,
activations, data flow"]
  Q2 -->|"Yes"| Q3{"Learning rate
too high?"}
  Q3 -->|"Loss oscillates wildly"| F3["Reduce learning rate
by 10x"]
  Q3 -->|"Loss flat"| Q4{"Learning rate
too low?"}
  Q4 -->|"Maybe"| F4["Increase learning rate
by 10x"]
  Q4 -->|"Already tried"| Q5{"Gradient flow
blocked?"}
  Q5 -->|"Yes"| F5["Check for dead ReLUs,
vanishing gradients,
bad initialization"]
  Q5 -->|"No"| F6["Check data quality,
task difficulty,
model capacity"]

Five loss curve patterns

graph LR
  subgraph "Healthy"
      H["Train ↓ steadily
Val ↓ then levels off
Small gap"]
  end
  subgraph "Overfitting"
      O["Train ↓ keeps going
Val ↓ then ↑
Growing gap"]
  end
  subgraph "Underfitting"
      U["Train ↓ slowly
Val ↓ slowly
Both still high"]
  end
  subgraph "Diverging"
      D["Loss ↑ or explodes
to NaN/Inf"]
  end
  subgraph "Oscillating"
      OS["Loss bounces
up and down
No clear trend"]
  end

Healthy: training loss decreases smoothly, validation loss follows but levels off. The gap between them is small. This is what you want.

Overfitting: training loss keeps decreasing but validation loss starts increasing. The model memorizes training data instead of generalizing. Fixes: more data, regularization (dropout, weight decay), data augmentation, smaller model.

Underfitting: both training and validation loss are high and decrease slowly. The model doesn’t have enough capacity or isn’t learning fast enough. Fixes: larger model, more layers, higher learning rate, train longer.

Diverging: loss increases or explodes to NaN/Inf. Almost always a learning rate problem. Fixes: reduce learning rate, check for numerical issues (log of zero, division by zero), add gradient clipping.

Oscillating: loss bounces up and down without a clear downward trend. Usually means the learning rate is too high or the batch size is too small. Fixes: reduce learning rate, increase batch size, use learning rate warmup.
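The warmup fix mentioned above can be a few lines; a linear ramp is the simplest form (the base LR and warmup length here are illustrative):

```python
def warmup_lr(step: int, base_lr: float = 1e-3, warmup_steps: int = 500) -> float:
    # Ramp linearly from ~0 to base_lr over the first warmup_steps, then hold.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Apply it each step, e.g. in PyTorch:
#   for g in optimizer.param_groups:
#       g["lr"] = warmup_lr(step)
```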

Gradient diagnostics

If loss curves don’t tell you enough, look at the gradients directly.

Gradient norms per layer

Compute the L2 norm of gradients for each layer after each backward pass. Healthy training shows gradient norms in a similar range across all layers.
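One way to collect these numbers in PyTorch (the model and data are stand-ins):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.Tanh(), nn.Linear(32, 10))
x, y = torch.randn(16, 20), torch.randint(0, 10, (16,))

nn.CrossEntropyLoss()(model(x), y).backward()

# L2 norm of the gradient for each parameter tensor, in layer order.
norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}
for name, norm in norms.items():
    print(f"{name:12s} {norm:.4f}")
# Norms shrinking by orders of magnitude toward the input indicate vanishing
# gradients; norms growing toward the input indicate exploding ones.
```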

Example 2: Diagnosing vanishing gradients

After one backward pass, you observe these gradient norms per layer:

| Layer | Gradient norm |
|---|---|
| Layer 1 (closest to input) | 0.0003 |
| Layer 2 | 0.001 |
| Layer 3 | 0.8 |
| Layer 4 (closest to output) | 2.1 |

The pattern is clear: gradients decrease by orders of magnitude as you go deeper (closer to the input). Layer 1’s gradient is 7,000 times smaller than Layer 4’s. This is vanishing gradients.

Diagnosis: the first layers are barely learning. By the chain rule, gradients multiply through each layer. If each layer’s Jacobian has spectral norm less than 1, the product shrinks exponentially with depth.

Three concrete fixes:

  1. Use residual connections: skip connections let gradients flow directly to earlier layers, bypassing the vanishing chain of multiplications.
  2. Switch activation functions: replace sigmoid/tanh with ReLU or its variants (LeakyReLU, GELU). Sigmoid saturates for large inputs, producing near-zero gradients.
  3. Apply batch normalization: normalizing activations at each layer keeps them in a range where gradients flow well. Place batch norm before or after the activation function.

If gradient norms are large everywhere and increasing, you have exploding gradients instead. Apply gradient clipping: cap the global gradient norm to a maximum value (commonly 1.0 or 5.0).
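In PyTorch, clipping the global norm is a single call, placed between `backward()` and `step()` (the model and data here are stand-ins):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
opt.zero_grad()
loss_fn(model(x), y).backward()

# Rescale all gradients so their combined L2 norm is at most 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
print(f"pre-clip gradient norm: {total_norm:.4f}")
```

`clip_grad_norm_` returns the norm before clipping, which is worth logging: a sudden spike there often precedes divergence.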

The debugging checklist

When your model isn’t working, go through these 10 things in order. Most problems are caught by the first five.

| # | Check | What to look for |
|---|---|---|
| 1 | Data loading | Correct labels, no off-by-one, proper normalization |
| 2 | Loss function | Matches the task (cross-entropy for classification, MSE for regression) |
| 3 | Output layer | Correct size, correct activation (softmax for classification, none for regression before loss) |
| 4 | Initial loss | Matches log(C) for cross-entropy with C classes |
| 5 | Overfit one batch | Loss reaches near zero on a tiny batch |
| 6 | Learning rate | Try 1e-1, 1e-2, 1e-3, 1e-4 on a small subset |
| 7 | Gradient flow | Norms are reasonable and similar across layers |
| 8 | Shapes | Print tensor shapes at each layer; look for broadcasting bugs |
| 9 | Data augmentation | Visualize augmented samples; make sure they still look correct |
| 10 | Preprocessing | Train/val/test use the same normalization statistics (from training set only) |

Example 1: Sanity check failure

Setup: 8 training samples, model with 10,000 parameters. Learning rate 0.001. After 1,000 steps, the loss is 1.8 (for a 10-class problem, initial should be ~2.3).

The loss decreased from 2.3 to 1.8 but didn’t reach zero. With 10K parameters and 8 samples, the model has more than enough capacity to memorize. What’s wrong?

Walk through 5 causes:

  1. Wrong labels: the 8 samples might have incorrect or shuffled labels. Verify by printing each sample with its label.
  2. Output dimension mismatch: if the model outputs 5 classes but labels go up to 9, the loss can’t reach zero. Check model.output_dim == num_classes.
  3. Activation after output: if you apply softmax in the model and also in the loss function (CrossEntropyLoss in PyTorch includes softmax), you’re double-softmaxing. The gradients will be wrong.
  4. Learning rate too low: 0.001 might be too conservative for memorizing 8 samples. Try 0.01 or 0.1.
  5. Data not varying: if all 8 samples are nearly identical but have different labels, the model needs to memorize noise, which requires a very specific learning rate and many more steps. Check that the batch has diverse examples.
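The double-softmax bug from cause 3 is easy to demonstrate with synthetic logits:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 10)            # raw model outputs
targets = torch.randint(0, 10, (8,))
loss_fn = nn.CrossEntropyLoss()        # applies log-softmax internally

correct = loss_fn(logits, targets)
# Bug: softmax applied in the model AND inside the loss.
buggy = loss_fn(torch.softmax(logits, dim=1), targets)

print(f"correct loss: {correct:.4f}, buggy loss: {buggy:.4f}")
# The buggy loss is squashed toward log(C) and its gradients are tiny,
# which is exactly why a single-batch loss stalls instead of reaching zero.
```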

Ablation studies

An ablation study removes or disables one component at a time to measure its contribution. It answers the question: “Does this thing actually help?”

Example 3: Designing an ablation

You built a ResNet with attention pooling, data augmentation, and dropout. Test accuracy is 94.2%. Which components matter?

| Experiment | Attention pooling | Augmentation | Dropout | Test accuracy |
|---|---|---|---|---|
| Full model | ✓ | ✓ | ✓ | 94.2% |
| No attention | – | ✓ | ✓ | 93.1% |
| No augmentation | ✓ | – | ✓ | 91.5% |
| No dropout | ✓ | ✓ | – | 93.8% |

Reading the results:

  • Removing augmentation causes the biggest drop (2.7 percentage points). Data augmentation is the most important component.
  • Removing attention pooling drops 1.1 points. Worth keeping.
  • Removing dropout drops only 0.4 points. It helps a little, but you might drop it if inference speed matters.

Run each experiment 3 times with different seeds and report mean and standard deviation. A 0.4% improvement is meaningless if the standard deviation is 0.5%.
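To make that seed-to-seed comparison concrete (the accuracies below are illustrative, not taken from the table above):

```python
from statistics import mean, stdev

# Hypothetical test accuracies over 3 seeds for two configurations.
with_dropout = [93.8, 94.4, 93.2]
without_dropout = [93.5, 94.1, 92.9]

diff = mean(with_dropout) - mean(without_dropout)
spread = max(stdev(with_dropout), stdev(without_dropout))
print(f"mean difference: {diff:.2f} pp, per-config std: {spread:.2f} pp")
# A difference well inside the run-to-run std is noise, not a finding.
```

Here the 0.3-point gap sits inside a 0.6-point standard deviation, so dropping dropout would be defensible.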

Common failure modes

| Symptom | Likely cause | Fix |
|---|---|---|
| Loss = NaN after a few steps | Learning rate too high, log(0), division by zero | Reduce LR by 10x, add epsilon to denominators |
| Loss stuck at log(C) | Model outputting uniform predictions | Check gradients, verify output layer, increase LR |
| Train loss great, val loss terrible | Overfitting | Add regularization, more data, early stopping |
| Both losses high | Underfitting or data issue | Bigger model, check data quality, train longer |
| Val loss increases after epoch 1 | Severe overfitting or data leak | Check for data leakage between train/val |
| Loss oscillates wildly | LR too high or batch size too small | Halve LR, double batch size |
| Accuracy = 1/num_classes | Random guessing | Model not learning at all, start from checklist item 1 |
| Training suddenly diverges mid-run | LR schedule issue, data corruption | Check LR schedule, inspect data at that step |

Hyperparameter sensitivity and priority

Not all hyperparameters deserve equal tuning effort. Here’s a practical priority ranking:

| Parameter | Sensitivity | Tune first? | Typical range |
|---|---|---|---|
| Learning rate | Very high | ✓ Yes, always | 1e-4 to 1e-1 |
| Batch size | High | ✓ Early | 16 to 512 |
| Weight decay | Medium | After LR | 1e-5 to 1e-2 |
| Dropout rate | Medium | After LR | 0.1 to 0.5 |
| Number of layers | Medium-high | During arch search | Task dependent |
| Hidden units | Medium | During arch search | 64 to 2048 |
| Optimizer (Adam vs SGD) | Medium | Try both early | Adam first, SGD+momentum if time |
| LR schedule | Medium | After base LR | Cosine, step, or reduce-on-plateau |
| Data augmentation | High (for vision) | Early | Task dependent |
Start with learning rate. Get that right and many other things fall into place. Use learning rate finder techniques: start very small, increase exponentially each step, plot loss vs LR, and pick the LR where loss decreases fastest.
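A bare-bones version of that LR range test, as a sketch (the model and data are stand-ins; real implementations smooth the loss and handle divergence more carefully):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 20), torch.randint(0, 10, (64,))

lr, factor = 1e-6, 1.5  # start tiny, grow exponentially each step
opt = torch.optim.SGD(model.parameters(), lr=lr)
history = []
for _ in range(40):
    for g in opt.param_groups:
        g["lr"] = lr
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if not torch.isfinite(loss):
        break  # diverged; stop the sweep
    history.append((lr, loss.item()))
    lr *= factor

# Pick the LR where the loss dropped fastest (steepest negative slope),
# then train with something a bit below that point.
best_lr = min(
    (history[i + 1][1] - history[i][1], history[i][0])
    for i in range(len(history) - 1)
)[1]
print(f"steepest-descent LR: {best_lr:.2e}")
```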

Reproducibility

Results you can’t reproduce are not results. Three things to get right:

  1. Random seeds: set seeds for Python, NumPy, PyTorch/TensorFlow, and CUDA. Note that some GPU operations are inherently non-deterministic.
import torch
import numpy as np
import random

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
  2. Version pinning: record exact versions of all libraries. A PyTorch update can change default initialization or optimizer behavior.

  3. Logging everything: for every run, save the full config (hyperparameters, data paths, git commit hash), all metrics over time, and the best model checkpoint. You will need this when you try to figure out what worked three weeks ago.

Experiment tracking

Once you run more than a handful of experiments, spreadsheets break down. Use proper experiment tracking tools:

  • What to log: hyperparameters, training/validation metrics per epoch, gradient norms, learning rate schedule, system metrics (GPU utilization, memory), final test metrics
  • What to save: model checkpoints (at least the best one), training config, random seeds, git commit hash

A good experiment log lets you answer questions like:

  • “What was the best learning rate for the ResNet-50 experiments last week?”
  • “Did adding dropout actually help, or was it the augmentation change I made at the same time?”
  • “What hyperparameters produced the model currently in production?”

Tools like Weights & Biases, MLflow, or even a well-structured directory of JSON configs can work. The key is consistency: log the same things every time.
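The "well-structured directory of JSON configs" option can be as simple as the sketch below (the directory layout and field names are illustrative; the git lookup degrades gracefully outside a repo):

```python
import json
import subprocess
import time
from pathlib import Path

def log_run(config: dict, metrics: dict, out_dir: str = "runs") -> Path:
    """Save one experiment's config and metrics as a timestamped JSON file."""
    record = {
        "config": config,
        "metrics": metrics,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    try:
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = None  # git unavailable or not a repo
    path = Path(out_dir)
    path.mkdir(exist_ok=True)
    out = path / f"run_{int(time.time() * 1000)}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

# Log the same fields every run so results stay comparable.
saved = log_run({"lr": 1e-3, "batch_size": 64, "seed": 42}, {"val_acc": 0.91})
print("saved", saved)
```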

Practical workflow

Here’s the order I recommend for a new project:

  1. Get a baseline working fast. Small model, default hyperparameters, no augmentation. Just make sure the pipeline runs end-to-end.
  2. Sanity checks. Overfit one batch. Check initial loss. Visualize data.
  3. Tune learning rate. This is the single most impactful hyperparameter.
  4. Scale up the model until it overfits the training set. If it never overfits, your model is too small or your data is too noisy.
  5. Add regularization to close the train-val gap. Dropout, weight decay, data augmentation.
  6. Run ablations to confirm each component helps.
  7. Final tuning with Bayesian optimization or Hyperband on the most sensitive hyperparameters.

What comes next

This is the final article in the Deep Learning from Scratch series. If you’ve followed along from the beginning, you now have a solid foundation: from single neurons to CNNs, RNNs, Transformers, graph neural networks, architecture search, compression, and the practical skills to make it all work.

The field moves fast, but the fundamentals change slowly. Backpropagation, gradient descent, regularization, and careful experimentation remain at the core of everything. Master those, and you can pick up any new architecture or technique by reading the paper and implementing it yourself.

For the full curriculum, including the mathematics, optimization, and classical machine learning foundations that support deep learning, see the complete series index.
