Forward pass and backpropagation
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this article, you should be comfortable with:
- Neural networks intro: what neurons compute, how layers compose
- Derivatives and the chain rule: the mathematical backbone of backpropagation
The big picture
Training has two phases: make a prediction (forward), then figure out how wrong you were and adjust (backward). Repeat thousands of times and the network learns.
Consider a tiny network: 2 inputs, 2 hidden neurons (sigmoid), 1 output (sigmoid), MSE loss. One forward pass with input $x$ and target $y$:

| Step | What happens | Formula |
|---|---|---|
| Input | Feed raw features | $a^{(0)} = x$ |
| Hidden pre-activation | Weighted sum | $z^{(1)} = W^{(1)} x + b^{(1)}$ |
| Hidden activation | Sigmoid squash | $h = \sigma(z^{(1)})$ |
| Output pre-activation | Second weighted sum | $z^{(2)} = W^{(2)} h + b^{(2)}$ |
| Output activation | Final prediction | $\hat{y} = \sigma(z^{(2)})$ |
| Loss (MSE) | How wrong we are | $L = \tfrac{1}{2}(\hat{y} - y)^2$ |
Data flows forward to produce a prediction. Gradients flow backward to assign blame to each weight.
Forward and backward data flow
```mermaid
graph LR
    X["Input x"] --> W1["Multiply by W1"]
    W1 --> ACT1["Sigmoid"]
    ACT1 --> W2["Multiply by W2"]
    W2 --> ACT2["Sigmoid"]
    ACT2 --> LOSS["Loss"]
    Y["True label y"] --> LOSS
    LOSS -.->|"gradient"| ACT2
    ACT2 -.->|"gradient"| W2
    W2 -.->|"gradient"| ACT1
    ACT1 -.->|"gradient"| W1
    style LOSS fill:#f96,stroke:#333,color:#000
```
Now let’s formalize each phase.
The forward pass
Training a neural network has two phases: forward and backward. The forward pass is the easy part. You feed an input through the network, layer by layer, and get an output.
Each layer computes:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)})$$

where $a^{(0)} = x$ is the input. The forward pass is just function composition: you pipe the output of one layer into the next.

The key detail: you need to save every intermediate value ($z^{(l)}$ and $a^{(l)}$) during the forward pass. The backward pass needs them to compute gradients.
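In code, a forward pass with caching can look like this minimal NumPy sketch (the two-layer shape and random weights are illustrative choices, not values from a specific example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Run the forward pass, caching every z and a for the backward pass."""
    cache = {"a0": x}
    a = x
    for l, (W, b) in enumerate(params, start=1):
        z = W @ a + b          # pre-activation: weighted sum plus bias
        a = sigmoid(z)         # activation: sigmoid squash
        cache[f"z{l}"] = z
        cache[f"a{l}"] = a
    return a, cache

# Illustrative network: 2 inputs -> 2 hidden -> 1 output
rng = np.random.default_rng(0)
params = [(rng.normal(size=(2, 2)), np.zeros(2)),
          (rng.normal(size=(1, 2)), np.zeros(1))]
y_hat, cache = forward(np.array([0.5, -0.2]), params)
```

The cache dictionary is exactly the "save every intermediate value" requirement: the backward pass will read `z1`, `a1`, and so on from it.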
Loss functions
After the forward pass produces a prediction $\hat{y}$, we need to measure how wrong it is. That is what the loss function does. It takes the prediction $\hat{y}$ and the true label $y$ and returns a single number: lower is better.
Common loss functions
| Loss | Formula | Use case | Derivative w.r.t. $\hat{y}$ |
|---|---|---|---|
| MSE | $\tfrac{1}{2}(\hat{y} - y)^2$ | Regression | $\hat{y} - y$ |
| Binary cross-entropy | $-[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$ | Binary classification | $\frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$ |
| Categorical cross-entropy | $-\sum_i y_i \log \hat{y}_i$ | Multi-class classification | $\hat{y} - y$ (with softmax) |
MSE penalizes large errors quadratically, so outliers have outsized influence. Cross-entropy works better for classification because it directly measures the gap between predicted probabilities and true labels.
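A minimal NumPy sketch of the first two losses with their derivatives (the `eps` clipping in cross-entropy is an implementation guard against `log(0)`, not part of the formula):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error with the conventional 1/2 factor, plus dL/dy_hat."""
    loss = 0.5 * np.mean((y_hat - y) ** 2)
    grad = (y_hat - y) / y_hat.size
    return loss, grad

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy plus dL/dy_hat; eps keeps log() finite."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    grad = (y_hat - y) / (y_hat * (1 - y_hat)) / y_hat.size
    return loss, grad
```

A quick sanity check: a perfect prediction gives MSE of 0, and a maximally uncertain prediction of 0.5 gives cross-entropy of $\log 2 \approx 0.693$.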
Computational graphs
A neural network’s computation can be drawn as a directed graph. Each node is an operation (multiply, add, sigmoid). Each edge carries a tensor flowing between operations.
```mermaid
graph LR
    x["x"] --> mul1["W₁ · x"]
    W1["W₁"] --> mul1
    mul1 --> add1["+b₁"]
    b1["b₁"] --> add1
    add1 --> sig1["σ"]
    sig1 --> mul2["W₂ · h"]
    W2["W₂"] --> mul2
    mul2 --> add2["+b₂"]
    b2["b₂"] --> add2
    add2 --> sig2["σ"]
    sig2 --> loss["Loss"]
    y["y (true)"] --> loss
    style loss fill:#f96,stroke:#333,color:#000
```
The forward pass flows left to right. The backward pass flows right to left, carrying gradients. This graph structure is what frameworks like PyTorch and TensorFlow build internally. When you call `.backward()`, the framework walks this graph in reverse.
The backward pass: backpropagation
Backpropagation is just the chain rule applied systematically through the computational graph.
The chain rule says: if $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
In a neural network, the loss is a deeply nested function of the weights. Backpropagation starts at the loss and works backward, computing how each parameter contributed to the error.
For each layer $l$, going from output toward input:

Step 1: Compute the gradient of the loss with respect to the pre-activation:

$$\delta^{(l)} = \frac{\partial L}{\partial a^{(l)}} \odot \sigma'(z^{(l)})$$

Step 2: Compute gradients for the weights and biases:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^\top, \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

Step 3: Propagate the gradient to the previous layer:

$$\frac{\partial L}{\partial a^{(l-1)}} = (W^{(l)})^\top \delta^{(l)}$$

Then multiply by the activation derivative to get $\delta^{(l-1)}$, and repeat.
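The three steps can be sketched as one NumPy function for a single sigmoid layer (the name `backward_layer` and its argument layout are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_layer(dL_da, z, a_prev, W):
    """One backprop step through a sigmoid layer.

    dL_da:  gradient of the loss w.r.t. this layer's activations
    z:      cached pre-activations from the forward pass
    a_prev: cached activations of the previous layer
    W:      this layer's weight matrix
    """
    s = sigmoid(z)
    delta = dL_da * s * (1 - s)    # Step 1: dL/dz = dL/da * sigma'(z)
    dW = np.outer(delta, a_prev)   # Step 2: dL/dW = delta * a_prev^T
    db = delta                     #         dL/db = delta
    dL_da_prev = W.T @ delta       # Step 3: gradient for the previous layer
    return dW, db, dL_da_prev
```

Calling this once per layer, output to input, is the whole backward pass.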
```mermaid
sequenceDiagram
    participant I as Input x
    participant H as Hidden layers
    participant O as Output ŷ
    participant L as Loss L
    I->>H: Forward: compute activations
    H->>O: Forward: compute prediction
    O->>L: Compute loss L(ŷ, y)
    L->>O: ∂L/∂ŷ
    O->>H: Backward: propagate gradients
    H->>I: Backward: gradients reach all weights
```
Chain rule applied layer by layer
```mermaid
graph RL
    L["dL/dL = 1"] -->|"times dL/dy-hat"| DOUT["Output delta"]
    DOUT -->|"times sigmoid prime"| GRAD_W2["Gradient for W2"]
    DOUT -->|"times W2 transpose"| DHID["Hidden delta"]
    DHID -->|"times sigmoid prime"| GRAD_W1["Gradient for W1"]
```
Each layer peels off one link in the chain. The gradient at any layer equals the product of all derivative terms from the loss back to that point.
Gradient accumulation at shared nodes
When a variable feeds into multiple downstream operations, its gradient is the sum of the gradients from all paths. This comes directly from the multivariate chain rule. If node $x$ feeds into both $f$ and $g$:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x} + \frac{\partial L}{\partial g} \frac{\partial g}{\partial x}$$
You add up all the contributions. Frameworks handle this automatically when traversing the graph.
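A tiny numeric illustration, using an assumed toy function $L = x^2 + 3x$ so the two paths are easy to see; a central difference confirms the summed gradient:

```python
# x feeds two downstream ops: f = x**2 and g = 3*x, and L = f + g.
# The gradient of L w.r.t. x is the SUM of the two path gradients.
def L(x):
    f = x ** 2
    g = 3 * x
    return f + g

x = 2.0
grad_via_f = 2 * x                    # dL/df * df/dx = 1 * 2x
grad_via_g = 3.0                      # dL/dg * dg/dx = 1 * 3
analytic = grad_via_f + grad_via_g    # paths accumulate: 2x + 3 = 7

# Numerical check by central difference
eps = 1e-6
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)
```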
The vanishing gradient problem
Gradient magnitudes per layer in a 5-layer network, illustrating the vanishing gradient problem.
Backpropagation multiplies gradient terms at each layer. If those terms are consistently less than 1, the gradient shrinks exponentially as it travels backward. By the time it reaches the first few layers, it is essentially zero. Those layers stop learning.
This happens naturally with sigmoid and tanh activations, because their derivatives are always less than 1. Sigmoid's maximum derivative is 0.25 (at $z = 0$), and for saturated neurons it is much smaller.
We will see this concretely in Example 3. ReLU helps because its gradient is exactly 1 for positive inputs. But deeper solutions like batch normalization, residual connections, and LSTMs are often needed for very deep networks.
Gradient magnitude across layers (sigmoid, max derivative 0.25)
```mermaid
graph LR
    L5["Layer 5: grad = 1.0"] --> L4["Layer 4: grad = 0.25"]
    L4 --> L3["Layer 3: grad = 0.063"]
    L3 --> L2["Layer 2: grad = 0.016"]
    L2 --> L1["Layer 1: grad = 0.004"]
    style L5 fill:#4CAF50,stroke:#333,color:#fff
    style L4 fill:#8BC34A,stroke:#333,color:#000
    style L3 fill:#FFC107,stroke:#333,color:#000
    style L2 fill:#FF9800,stroke:#333,color:#000
    style L1 fill:#f44336,stroke:#333,color:#fff
```
Each layer multiplies the gradient by the activation derivative. With sigmoid, four layers reduce the gradient to less than 0.5% of the original signal.
Example 1: Full forward and backward pass
Let us trace every value through a small network: 2 inputs, 2 hidden units with sigmoid, 1 output with sigmoid, trained using MSE loss.

Weights: $W^{(1)} \in \mathbb{R}^{2 \times 2}$ and $b^{(1)} \in \mathbb{R}^2$ for the hidden layer; $W^{(2)} \in \mathbb{R}^{1 \times 2}$ and $b^{(2)} \in \mathbb{R}$ for the output.

Input: $x \in \mathbb{R}^2$, Target: $y$
Forward pass
Hidden layer pre-activations:

$$z^{(1)} = W^{(1)} x + b^{(1)}$$

Hidden layer activations (sigmoid):

$$h = \sigma(z^{(1)})$$

Output pre-activation:

$$z^{(2)} = W^{(2)} h + b^{(2)}$$

Output activation:

$$\hat{y} = \sigma(z^{(2)})$$

Loss (MSE):

$$L = \tfrac{1}{2}(\hat{y} - y)^2$$
Backward pass
Gradient of loss w.r.t. output:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$

Output layer sigmoid derivative:

$$\sigma'(z^{(2)}) = \hat{y}(1 - \hat{y})$$

Output layer delta:

$$\delta^{(2)} = (\hat{y} - y)\,\hat{y}(1 - \hat{y})$$

Gradients for output weights:

$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} h^\top, \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$

Propagate to hidden layer:

$$\frac{\partial L}{\partial h} = (W^{(2)})^\top \delta^{(2)}$$

Hidden layer sigmoid derivatives:

$$\sigma'(z^{(1)}) = h \odot (1 - h)$$

Hidden layer deltas:

$$\delta^{(1)} = \frac{\partial L}{\partial h} \odot h \odot (1 - h)$$

Gradients for hidden weights:

$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^\top, \qquad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$
When the target is larger than the prediction, all gradients come out negative, meaning we should increase the weights to push $\hat{y}$ closer to the target. With a learning rate $\eta$, each weight is updated as $w \leftarrow w - \eta \frac{\partial L}{\partial w}$.
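The whole trace fits in a short NumPy script. The specific weights, input, and target below are illustrative choices (not canonical values), picked so that the prediction falls short of the target and every gradient is negative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 2-2-1 sigmoid network with MSE loss
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = np.array([0.35, 0.35])
W2 = np.array([[0.40, 0.45]]);               b2 = np.array([0.60])
x  = np.array([0.05, 0.10]);                 y  = 1.0

# Forward pass (caching z1, h, z2, y_hat)
z1 = W1 @ x + b1;  h = sigmoid(z1)          # hidden layer
z2 = W2 @ h + b2;  y_hat = sigmoid(z2)      # output layer
loss = 0.5 * (y_hat[0] - y) ** 2

# Backward pass
dL_dyhat = y_hat - y                        # MSE derivative
delta2 = dL_dyhat * y_hat * (1 - y_hat)     # output delta
dW2 = np.outer(delta2, h); db2 = delta2     # output weight/bias gradients
dL_dh = W2.T @ delta2                       # propagate to hidden layer
delta1 = dL_dh * h * (1 - h)                # hidden deltas
dW1 = np.outer(delta1, x); db1 = delta1     # hidden weight/bias gradients

# Gradient descent update
lr = 0.5
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```

With these values the prediction is below the target of 1, so every weight gradient is negative and the update nudges every weight upward.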
Example 2: Backprop through a simple graph
Compute the gradients of $f(x, y) = (x + y)(x - y)$ with respect to $x$ and $y$.
Step 1: Break into nodes.
Let $a = x + y$ and $b = x - y$. Then $f = a \cdot b$.
Step 2: Forward pass. Compute and cache $a = x + y$, $b = x - y$, and $f = ab$.
Step 3: Backward pass.
Start from the output: $\frac{\partial f}{\partial f} = 1$.

At the multiplication node:

$$\frac{\partial f}{\partial a} = b, \qquad \frac{\partial f}{\partial b} = a$$

At the addition node ($a = x + y$):

$$\frac{\partial a}{\partial x} = 1, \qquad \frac{\partial a}{\partial y} = 1$$

At the subtraction node ($b = x - y$):

$$\frac{\partial b}{\partial x} = 1, \qquad \frac{\partial b}{\partial y} = -1$$
Step 4: Apply the chain rule.

For $x$, there are two paths (gradient accumulation):

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial x} = b + a = 2x$$

Likewise for $y$:

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial y} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial y} = b - a = -2y$$

Verification: Expand $f = (x + y)(x - y) = x^2 - y^2$, which gives $\frac{\partial f}{\partial x} = 2x$ and $\frac{\partial f}{\partial y} = -2y$, matching the graph computation.

Notice how the two paths into $x$ reinforce each other ($b + a$), while the two paths into $y$ partially cancel ($b - a$). Gradient accumulation sums every path, whatever its sign.
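The same graph can be walked by hand in a few lines of Python (the evaluation point $x = 3$, $y = 2$ is an assumed illustrative choice):

```python
# Backprop through the three-node graph: a = x + y, b = x - y, f = a * b
def grads(x, y):
    a, b = x + y, x - y            # forward pass, caching a and b
    df_da, df_db = b, a            # multiplication node
    dx = df_da * 1 + df_db * 1     # two paths into x: via a and via b
    dy = df_da * 1 + df_db * (-1)  # two paths into y, opposite signs
    return dx, dy

dx, dy = grads(3.0, 2.0)           # dx = 2x = 6, dy = -2y = -4
```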
Verifying gradients: analytical vs numerical
```mermaid
graph TD
    A["Analytical gradient: backprop via chain rule (fast)"] --> C["Compare"]
    B["Numerical gradient: perturb input, measure change (slow but reliable)"] --> C
    C -->|"Difference less than 1e-5"| OK["Implementation correct"]
    C -->|"Difference greater than 1e-3"| BUG["Bug in backprop code"]
```
The analytical gradient uses the chain rule and is fast. The numerical gradient perturbs each parameter by a tiny , measures the change in loss, and divides. Comparing the two is the standard way to debug backpropagation.
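A sketch of this check for a single sigmoid neuron with MSE loss (the function names and test values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, x, y):
    """Toy loss: L = 0.5 * (sigmoid(w . x) - y)^2."""
    y_hat = sigmoid(w @ x)
    return 0.5 * (y_hat - y) ** 2

def analytic_grad(w, x, y):
    """Chain rule: dL/dw = (y_hat - y) * sigma'(z) * x."""
    y_hat = sigmoid(w @ x)
    return (y_hat - y) * y_hat * (1 - y_hat) * x

def numeric_grad(w, x, y, eps=1e-5):
    """Central difference: perturb one parameter at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus, x, y) - loss_fn(w_minus, x, y)) / (2 * eps)
    return grad

w = np.array([0.3, -0.8]); x = np.array([1.0, 2.0]); y = 1.0
diff = np.max(np.abs(analytic_grad(w, x, y) - numeric_grad(w, x, y)))
```

If `diff` stays well below 1e-5, the analytical gradient is almost certainly correct; a large gap points to a bug in the backward pass.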
Example 3: Vanishing gradient in action
Let us see why sigmoid causes problems in deep networks. Consider a 5-layer network where each hidden layer uses sigmoid. Suppose each layer’s sigmoid outputs are around 0.9 (mildly saturated, which is very common).
The sigmoid derivative at output $a = 0.9$ is:

$$\sigma'(z) = a(1 - a) = 0.9 \times 0.1 = 0.09$$
As the gradient passes backward through each layer, it gets multiplied by this derivative (among other factors). Tracking just the sigmoid contribution:
| Layer (from output) | Gradient factor | Cumulative gradient |
|---|---|---|
| 5 (output side) | 0.09 | 0.09 |
| 4 | 0.09 | 0.0081 |
| 3 | 0.09 | 0.00073 |
| 2 | 0.09 | 0.000066 |
| 1 (input side) | 0.09 | 0.0000059 |
After just 5 layers, the gradient is roughly 0.0006% of what it was at the output. Layer 1 barely receives any learning signal. This is the vanishing gradient problem.
Even if activations are at 0.5 (where the sigmoid derivative is maximized at 0.25), after 5 layers you get $0.25^5 \approx 0.001$. That is still a factor of 1000 reduction.
ReLU helps because its derivative is 1 for positive inputs. The gradient passes through unchanged. But ReLU alone does not fully solve the problem in very deep networks (50+ layers). Solutions like residual connections and batch normalization become essential, and we will cover those in upcoming articles.
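The decay in the table above can be reproduced in a few lines (a sketch tracking only the sigmoid-derivative factor per layer):

```python
# With activations near a = 0.9, each layer multiplies the gradient by
# sigma'(z) = a * (1 - a) = 0.09.
a = 0.9
factor = a * (1 - a)
grads = [factor ** n for n in range(1, 6)]   # cumulative gradient, layers 5..1

# Best case: activations at 0.5, where the sigmoid derivative peaks at 0.25.
best_case = 0.25 ** 5                        # exactly 1/1024
```

Even in the best case the signal shrinks a thousandfold over five layers; in the mildly saturated case it is effectively gone.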
What comes next
You now understand the complete learning loop: forward pass computes a prediction, the loss measures error, and backpropagation sends gradients back so gradient descent can update every weight.
But there are many practical decisions that determine whether training actually works: how to initialize weights, how to set the learning rate, when to clip gradients, and more. The next article covers all of these in training neural networks: a practical guide.