
Forward pass and backpropagation

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, you should be comfortable with:

- the basic structure of a neural network (neurons, weights, biases, activations), covered in the previous article in this series
- basic calculus: what a derivative is and how the chain rule works

The big picture

Training has two phases: make a prediction (forward), then figure out how wrong you were and adjust (backward). Repeat thousands of times and the network learns.

Consider a tiny network: 2 inputs, 2 hidden neurons (sigmoid), 1 output (sigmoid), MSE loss. One forward pass with $\mathbf{x} = [1, 2]$ and target $y = 1$:

| Step | What happens | Numbers |
|---|---|---|
| Input | Feed raw features | $x_1 = 1, \; x_2 = 2$ |
| Hidden pre-activation | Weighted sum | $z_1 = 0.1(1) + 0.2(2) = 0.5$ |
| Hidden activation | Sigmoid squash | $h_1 = \sigma(0.5) \approx 0.622$ |
| Output pre-activation | Second weighted sum | $z_o = 0.5(0.622) + 0.6(0.750) = 0.761$ |
| Output activation | Final prediction | $\hat{y} = \sigma(0.761) \approx 0.682$ |
| Loss (MSE) | How wrong we are | $L = (0.682 - 1)^2 \approx 0.101$ |

Data flows forward to produce a prediction. Gradients flow backward to assign blame to each weight.

Forward and backward data flow

graph LR
  X["Input x"] --> W1["Multiply by W1"]
  W1 --> ACT1["Sigmoid"]
  ACT1 --> W2["Multiply by W2"]
  W2 --> ACT2["Sigmoid"]
  ACT2 --> LOSS["Loss"]
  Y["True label y"] --> LOSS
  LOSS -.->|"gradient"| ACT2
  ACT2 -.->|"gradient"| W2
  W2 -.->|"gradient"| ACT1
  ACT1 -.->|"gradient"| W1

  style LOSS fill:#f96,stroke:#333,color:#000

Now let’s formalize each phase.


The forward pass

As we saw above, training alternates between a forward and a backward phase. The forward pass is the easy part: you feed an input through the network, layer by layer, and get an output.

Each layer computes:

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

$$\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

where $\mathbf{a}^{(0)} = \mathbf{x}$ is the input. The forward pass is just function composition: you pipe the output of one layer into the next.

The key detail: you need to save every intermediate value ($\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$) during the forward pass. The backward pass needs them to compute gradients.
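In code, the caching requirement falls out naturally: each layer appends its pre-activation and activation to a cache as it goes. Here is a minimal pure-Python sketch (the function names and cache layout are illustrative, not from any framework):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Forward pass through a list of (W, b) layers, caching every
    intermediate value so the backward pass can reuse it."""
    a = x
    cache = [{"a": x}]  # a^(0) is the input itself
    for W, b in layers:
        # z^(l) = W^(l) a^(l-1) + b^(l)
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = [sigmoid(z_i) for z_i in z]  # a^(l) = sigma(z^(l))
        cache.append({"z": z, "a": a})
    return a, cache
```

With the weights from the table above (0.1, 0.2, 0.3, 0.4 in the hidden layer; 0.5, 0.6 at the output), this reproduces the prediction $\hat{y} \approx 0.682$.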


Loss functions

After the forward pass produces a prediction $\hat{y}$, we need to measure how wrong it is. That is what the loss function does. It takes the prediction and the true label and returns a single number: lower is better.

Common loss functions

| Loss | Formula | Use case | Derivative w.r.t. $\hat{y}$ |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Regression | $\frac{2}{n}(\hat{y}_i - y_i)$ |
| Binary cross-entropy | $-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$ | Binary classification | $\frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$ |
| Categorical cross-entropy | $-\sum_{c=1}^{C} y_c \log \hat{y}_c$ | Multi-class classification | $\hat{y}_c - y_c$ (w.r.t. the logits, when combined with softmax) |

MSE penalizes large errors quadratically, so outliers have outsized influence. Cross-entropy works better for classification because it directly measures the gap between predicted probabilities and true labels.
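As a quick sketch, these losses look like this in plain Python, including the clamping trick implementations use to avoid $\log(0)$; the function names are illustrative:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = min(max(y_hat, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    return -sum(y_c * math.log(max(p_c, eps))
                for y_c, p_c in zip(y_onehot, y_hat))
```

For the running example, `mse([1.0], [0.6816])` gives the loss of roughly 0.101 from the table in the previous section.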


Computational graphs

A neural network’s computation can be drawn as a directed graph. Each node is an operation (multiply, add, sigmoid). Each edge carries a tensor flowing between operations.

graph LR
  x["x"] --> mul1["W₁ · x"]
  W1["W₁"] --> mul1
  mul1 --> add1["+b₁"]
  b1["b₁"] --> add1
  add1 --> sig1["σ"]
  sig1 --> mul2["W₂ · h"]
  W2["W₂"] --> mul2
  mul2 --> add2["+b₂"]
  b2["b₂"] --> add2
  add2 --> sig2["σ"]
  sig2 --> loss["Loss"]
  y["y (true)"] --> loss

  style loss fill:#f96,stroke:#333,color:#000

The forward pass flows left to right. The backward pass flows right to left, carrying gradients. This graph structure is what frameworks like PyTorch and TensorFlow build internally. When you call .backward(), the framework walks this graph in reverse.


The backward pass: backpropagation

Backpropagation is just the chain rule applied systematically through the computational graph.

The chain rule says: if $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

In a neural network, the loss $L$ is a deeply nested function of the weights. Backpropagation starts at the loss and works backward, computing how each parameter contributed to the error.

For each layer $l$, going from output toward input:

Step 1: Compute the gradient of the loss with respect to the pre-activation:

$$\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$$

Step 2: Compute gradients for the weights and biases:

$$\frac{\partial L}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^\top$$

$$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

Step 3: Propagate the gradient to the previous layer:

$$\frac{\partial L}{\partial \mathbf{a}^{(l-1)}} = (W^{(l)})^\top \boldsymbol{\delta}^{(l)}$$

Then multiply elementwise by the activation derivative $\sigma'(\mathbf{z}^{(l-1)})$ to get $\boldsymbol{\delta}^{(l-1)}$, and repeat.
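The three steps translate into a short reverse loop. Here is a NumPy sketch, assuming sigmoid activations in every layer and a cache like the one saved during the forward pass (the cache layout and names are illustrative, not from any framework):

```python
import numpy as np

def backward(cache, Ws, delta_L):
    """Walk the layers in reverse. cache[l] holds {"a": a^(l)} for layer l
    (cache[0] is the input); delta_L is dL/dz at the output layer.
    Returns one (dW, db) pair per layer, ordered input to output."""
    grads = [None] * len(Ws)
    delta = delta_L
    for l in range(len(Ws), 0, -1):
        a_prev = cache[l - 1]["a"]
        grads[l - 1] = (np.outer(delta, a_prev), delta)  # Step 2
        if l > 1:
            da = Ws[l - 1].T @ delta                     # Step 3
            a = cache[l - 1]["a"]
            delta = da * a * (1.0 - a)  # sigmoid'(z) = a(1 - a), elementwise
    return grads
```

Note that Step 1 for the hidden layers is the last line of the loop: the propagated gradient times the local activation derivative gives the next delta.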

sequenceDiagram
  participant I as Input x
  participant H as Hidden layers
  participant O as Output ŷ
  participant L as Loss L

  I->>H: Forward: compute activations
  H->>O: Forward: compute prediction
  O->>L: Compute loss L(ŷ, y)
  L->>O: ∂L/∂ŷ
  O->>H: Backward: propagate gradients
  H->>I: Backward: gradients reach all weights

Chain rule applied layer by layer

graph RL
  L["dL/dL = 1"] -->|"times dL/dy-hat"| DOUT["Output delta"]
  DOUT -->|"times sigmoid prime"| GRAD_W2["Gradient for W2"]
  DOUT -->|"times W2 transpose"| DHID["Hidden delta"]
  DHID -->|"times sigmoid prime"| GRAD_W1["Gradient for W1"]

Each layer peels off one link in the chain. The gradient at any layer equals the product of all derivative terms from the loss back to that point.

Gradient accumulation at shared nodes

When a variable feeds into multiple downstream operations, its gradient is the sum of the gradients from all paths. This comes directly from the multivariate chain rule. If node $h$ feeds into both $f_1$ and $f_2$:

$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial f_1} \cdot \frac{\partial f_1}{\partial h} + \frac{\partial L}{\partial f_2} \cdot \frac{\partial f_2}{\partial h}$$

You add up all the contributions. Frameworks handle this automatically when traversing the graph.
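The accumulation is literally a `+=`. A toy reverse-mode scalar class makes it visible (illustrative only, nothing like a production autodiff engine):

```python
class Node:
    """Minimal reverse-mode autodiff scalar for illustration."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0          # accumulates contributions from all paths
        self._parents = parents  # (parent_node, local_derivative) pairs

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # local derivatives of a product: d(uv)/du = v, d(uv)/dv = u
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        self.grad += upstream  # sum over paths (multivariate chain rule)
        for parent, local in self._parents:
            parent.backward(upstream * local)
```

When a node feeds two consumers, `backward` reaches it once along each path, and the `self.grad += upstream` line adds the contributions together.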


The vanishing gradient problem

Gradient magnitudes per layer in a 5-layer network, illustrating the vanishing gradient problem.

Backpropagation multiplies gradient terms at each layer. If those terms are consistently less than 1, the gradient shrinks exponentially as it travels backward. By the time it reaches the first few layers, it is essentially zero. Those layers stop learning.

This happens naturally with sigmoid and tanh activations, because their derivatives are at most 1 and usually much smaller. Sigmoid's maximum derivative is 0.25 (at $z = 0$), and for saturated neurons it is much smaller.

We will see this concretely in Example 3. ReLU helps because its gradient is exactly 1 for positive inputs. But deeper solutions like batch normalization, residual connections, and LSTMs are often needed for very deep networks.

Gradient magnitude across layers (sigmoid, max derivative 0.25)

graph LR
  L5["Layer 5: grad = 1.0"] --> L4["Layer 4: grad = 0.25"]
  L4 --> L3["Layer 3: grad = 0.063"]
  L3 --> L2["Layer 2: grad = 0.016"]
  L2 --> L1["Layer 1: grad = 0.004"]

  style L5 fill:#4CAF50,stroke:#333,color:#fff
  style L4 fill:#8BC34A,stroke:#333,color:#000
  style L3 fill:#FFC107,stroke:#333,color:#000
  style L2 fill:#FF9800,stroke:#333,color:#000
  style L1 fill:#f44336,stroke:#333,color:#fff

Each layer multiplies the gradient by the activation derivative. With sigmoid, four layers reduce the gradient to less than 0.5% of the original signal.


Example 1: Full forward and backward pass

Let us trace every value through a small network: 2 inputs, 2 hidden units with sigmoid, 1 output with sigmoid, trained using MSE loss.

Weights:

$$W^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$\mathbf{w}^{(2)} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix}, \quad b^{(2)} = 0$$

Input: $\mathbf{x} = [1, 2]^\top$, Target: $y = 1$

Forward pass

Hidden layer pre-activations:

$$z_1^{(1)} = 0.1 \cdot 1 + 0.2 \cdot 2 = 0.5$$

$$z_2^{(1)} = 0.3 \cdot 1 + 0.4 \cdot 2 = 1.1$$

Hidden layer activations (sigmoid):

$$h_1 = \sigma(0.5) = \frac{1}{1 + e^{-0.5}} \approx 0.6225$$

$$h_2 = \sigma(1.1) = \frac{1}{1 + e^{-1.1}} \approx 0.7503$$

Output pre-activation:

$$z^{(2)} = 0.5 \cdot 0.6225 + 0.6 \cdot 0.7503 = 0.3113 + 0.4502 = 0.7615$$

Output activation:

$$\hat{y} = \sigma(0.7615) \approx 0.6816$$

Loss (MSE):

$$L = (\hat{y} - y)^2 = (0.6816 - 1)^2 = (-0.3184)^2 \approx 0.1014$$

Backward pass

Gradient of loss w.r.t. output:

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(0.6816 - 1) = -0.6368$$

Output layer sigmoid derivative:

$$\sigma'(z^{(2)}) = \hat{y}(1 - \hat{y}) = 0.6816 \times 0.3184 \approx 0.2170$$

Output layer delta:

$$\delta^{(2)} = -0.6368 \times 0.2170 = -0.1382$$

Gradients for output weights:

$$\frac{\partial L}{\partial w_1^{(2)}} = \delta^{(2)} \cdot h_1 = -0.1382 \times 0.6225 = -0.0860$$

$$\frac{\partial L}{\partial w_2^{(2)}} = \delta^{(2)} \cdot h_2 = -0.1382 \times 0.7503 = -0.1037$$

$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} = -0.1382$$

Propagate to hidden layer:

$$\frac{\partial L}{\partial h_1} = w_1^{(2)} \cdot \delta^{(2)} = 0.5 \times (-0.1382) = -0.0691$$

$$\frac{\partial L}{\partial h_2} = w_2^{(2)} \cdot \delta^{(2)} = 0.6 \times (-0.1382) = -0.0829$$

Hidden layer sigmoid derivatives:

$$\sigma'(z_1^{(1)}) = h_1(1 - h_1) = 0.6225 \times 0.3775 = 0.2350$$

$$\sigma'(z_2^{(1)}) = h_2(1 - h_2) = 0.7503 \times 0.2497 = 0.1874$$

Hidden layer deltas:

$$\delta_1^{(1)} = -0.0691 \times 0.2350 = -0.0162$$

$$\delta_2^{(1)} = -0.0829 \times 0.1874 = -0.0155$$

Gradients for hidden weights:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top = \begin{bmatrix} -0.0162 \times 1 & -0.0162 \times 2 \\ -0.0155 \times 1 & -0.0155 \times 2 \end{bmatrix} = \begin{bmatrix} -0.0162 & -0.0324 \\ -0.0155 & -0.0310 \end{bmatrix}$$

All gradients are negative, meaning we should increase the weights to push $\hat{y}$ closer to 1. With a learning rate $\eta = 0.5$, each weight would be updated as $w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$.
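Applying that update rule to the output-layer weights with the gradients we just computed takes one line (a sketch; `sgd_update` is an illustrative name, not a library function):

```python
def sgd_update(weights, grads, lr=0.5):
    """One gradient descent step: w <- w - lr * dL/dw, elementwise."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Output-layer weights from Example 1 and their gradients:
new_w2 = sgd_update([0.5, 0.6], [-0.0860, -0.1037])
# Both weights increase (0.5 -> 0.543, 0.6 -> 0.652),
# pushing y-hat toward the target y = 1 on the next forward pass.
```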


Example 2: Backprop through a simple graph

Compute the gradients of $z = (x + y)(y - 3)$ at $x = 2$, $y = 4$.

Step 1: Break into nodes.

Let $a = x + y$ and $b = y - 3$. Then $z = a \cdot b$.

Step 2: Forward pass.

$$a = 2 + 4 = 6$$

$$b = 4 - 3 = 1$$

$$z = 6 \times 1 = 6$$

Step 3: Backward pass.

Start from the output: $\frac{\partial z}{\partial z} = 1$.

At the multiplication node:

$$\frac{\partial z}{\partial a} = b = 1, \quad \frac{\partial z}{\partial b} = a = 6$$

At the addition node ($a = x + y$):

$$\frac{\partial a}{\partial x} = 1, \quad \frac{\partial a}{\partial y} = 1$$

At the subtraction node ($b = y - 3$):

$$\frac{\partial b}{\partial y} = 1$$

Step 4: Apply the chain rule.

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial a} \cdot \frac{\partial a}{\partial x} = 1 \times 1 = 1$$

For $y$, there are two paths (gradient accumulation):

$$\frac{\partial z}{\partial y} = \frac{\partial z}{\partial a} \cdot \frac{\partial a}{\partial y} + \frac{\partial z}{\partial b} \cdot \frac{\partial b}{\partial y} = 1 \times 1 + 6 \times 1 = 7$$

Verification: Expand $z = (x+y)(y-3) = xy - 3x + y^2 - 3y$.

$$\frac{\partial z}{\partial x} = y - 3 = 4 - 3 = 1 \checkmark$$

$$\frac{\partial z}{\partial y} = x + 2y - 3 = 2 + 8 - 3 = 7 \checkmark$$

Notice how $y$ has a larger gradient than $x$. This makes sense: $y$ appears in both factors of the product, so changes to $y$ affect $z$ through two paths.

Verifying gradients: analytical vs numerical

graph TD
  A["Analytical gradient
backprop via chain rule
(fast)"] --> C["Compare"]
  B["Numerical gradient
perturb input, measure change
(slow but reliable)"] --> C
  C -->|"Difference less than 1e-5"| OK["Implementation correct"]
  C -->|"Difference greater than 1e-3"| BUG["Bug in backprop code"]

The analytical gradient uses the chain rule and is fast. The numerical gradient perturbs each parameter by a tiny $\epsilon$, measures the change in loss, and divides. Comparing the two is the standard way to debug backpropagation.
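Here is a central-difference check applied to the graph from Example 2, as a sketch (`numerical_grad` is an illustrative helper, not a library function):

```python
def numerical_grad(f, params, i, eps=1e-5):
    """Central difference: perturb parameter i by +/- eps and
    measure the change in f."""
    up = list(params); up[i] += eps
    down = list(params); down[i] -= eps
    return (f(*up) - f(*down)) / (2 * eps)

def f(x, y):
    return (x + y) * (y - 3)

gx = numerical_grad(f, [2.0, 4.0], 0)  # analytical answer: 1
gy = numerical_grad(f, [2.0, 4.0], 1)  # analytical answer: 7
```

If the numerical and analytical values agree to within about $10^{-5}$, the backprop code is almost certainly correct; a larger gap points to a bug.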


Example 3: Vanishing gradient in action

Let us see why sigmoid causes problems in deep networks. Consider a 5-layer network where each hidden layer uses sigmoid. Suppose each layer’s sigmoid outputs are around 0.9 (mildly saturated, which is very common).

The sigmoid derivative at output $a$ is:

$$\sigma'(z) = a(1 - a) = 0.9 \times 0.1 = 0.09$$

As the gradient passes backward through each layer, it gets multiplied by this derivative (among other factors). Tracking just the sigmoid contribution:

| Layer (from output) | Gradient factor | Cumulative gradient |
|---|---|---|
| 5 (output side) | $0.09$ | $0.0900$ |
| 4 | $\times\, 0.09$ | $0.0081$ |
| 3 | $\times\, 0.09$ | $0.00073$ |
| 2 | $\times\, 0.09$ | $0.000066$ |
| 1 (input side) | $\times\, 0.09$ | $0.0000059$ |

After just 5 layers, the gradient is roughly 0.0006% of what it was at the output. Layer 1 barely receives any learning signal. This is the vanishing gradient problem.

Even if activations are at 0.5 (where the sigmoid derivative is maximized at 0.25), after 5 layers you get $0.25^5 \approx 0.001$. That is still a factor of 1000 reduction.

ReLU helps because its derivative is 1 for positive inputs. The gradient passes through unchanged. But ReLU alone does not fully solve the problem in very deep networks (50+ layers). Solutions like residual connections and batch normalization become essential, and we will cover those in upcoming articles.
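The arithmetic behind the table above takes a few lines to reproduce (a sketch; it tracks only the sigmoid-derivative contribution, assuming activations near 0.9 in every layer):

```python
factor = 0.9 * (1 - 0.9)  # sigmoid'(z) = a(1 - a) = 0.09 at a = 0.9
grad = 1.0
history = []
for _ in range(5):        # one multiplication per layer crossed
    grad *= factor
    history.append(grad)
# history[-1] is the fraction of the gradient surviving at layer 1
```

Swapping `factor` to 1.0 (ReLU on positive inputs) leaves `grad` unchanged at every layer, which is exactly why ReLU mitigates the problem.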


What comes next

You now understand the complete learning loop: forward pass computes a prediction, the loss measures error, and backpropagation sends gradients back so gradient descent can update every weight.

But there are many practical decisions that determine whether training actually works: how to initialize weights, how to set the learning rate, when to clip gradients, and more. The next article covers all of these in training neural networks: a practical guide.
