
Forward pass and backpropagation

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this article, you should be comfortable with:

- the basic structure of a neural network (neurons, weights, biases, activations), covered in the previous article in this series
- basic calculus: what a derivative is and how the chain rule works

The big picture

Training has two phases: make a prediction (forward), then figure out how wrong you were and adjust (backward). Repeat thousands of times and the network learns.

Consider a tiny network: 2 inputs, 2 hidden neurons (sigmoid), 1 output (sigmoid), MSE loss. One forward pass with $\mathbf{x} = [1, 2]$ and target $y = 1$:

| Step | What happens | Numbers |
|---|---|---|
| Input | Feed raw features | $x_1 = 1, \; x_2 = 2$ |
| Hidden pre-activation | Weighted sum | $z_1 = 0.1(1) + 0.2(2) = 0.5$ |
| Hidden activation | Sigmoid squash | $h_1 = \sigma(0.5) \approx 0.622$ |
| Output pre-activation | Second weighted sum | $z_o = 0.5(0.622) + 0.6(0.750) = 0.761$ |
| Output activation | Final prediction | $\hat{y} = \sigma(0.761) \approx 0.682$ |
| Loss (MSE) | How wrong we are | $L = (0.682 - 1)^2 \approx 0.101$ |

Data flows forward to produce a prediction. Gradients flow backward to assign blame to each weight.

Forward and backward data flow

graph LR
  X["Input x"] --> W1["Multiply by W1"]
  W1 --> ACT1["Sigmoid"]
  ACT1 --> W2["Multiply by W2"]
  W2 --> ACT2["Sigmoid"]
  ACT2 --> LOSS["Loss"]
  Y["True label y"] --> LOSS
  LOSS -.->|"gradient"| ACT2
  ACT2 -.->|"gradient"| W2
  W2 -.->|"gradient"| ACT1
  ACT1 -.->|"gradient"| W1

  style LOSS fill:#f96,stroke:#333,color:#000

Now let’s formalize each phase.


The forward pass

As we saw above, training alternates between a forward and a backward phase. The forward pass is the easy part: you feed an input through the network, layer by layer, and get an output.

Each layer computes:

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

$$\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

where $\mathbf{a}^{(0)} = \mathbf{x}$ is the input. The forward pass is just function composition: you pipe the output of one layer into the next.

The key detail: you need to save every intermediate value ($\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$) during the forward pass. The backward pass needs them to compute gradients.
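In code, the caching requirement falls out naturally: each layer appends its pre-activation and activation to a cache as it goes. Here is a minimal pure-Python sketch (the function names and cache layout are illustrative, not from any framework):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Forward pass through a list of (W, b) layers, caching every
    intermediate value so the backward pass can reuse it."""
    a = x
    cache = [{"a": x}]  # a^(0) is the input itself
    for W, b in layers:
        # z^(l) = W^(l) a^(l-1) + b^(l)
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = [sigmoid(z_i) for z_i in z]  # a^(l) = sigma(z^(l))
        cache.append({"z": z, "a": a})
    return a, cache
```

With the weights from the table above (0.1, 0.2, 0.3, 0.4 in the hidden layer; 0.5, 0.6 at the output), this reproduces the prediction $\hat{y} \approx 0.682$.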


Loss functions

After the forward pass produces a prediction $\hat{y}$, we need to measure how wrong it is. That is what the loss function does. It takes the prediction and the true label and returns a single number: lower is better.

Common loss functions

| Loss | Formula | Use case | Derivative w.r.t. $\hat{y}$ |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Regression | $\frac{2}{n}(\hat{y}_i - y_i)$ |
| Binary cross-entropy | $-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$ | Binary classification | $\frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$ |
| Categorical cross-entropy | $-\sum_{c=1}^{C} y_c \log \hat{y}_c$ | Multi-class classification | $\hat{y}_c - y_c$ (w.r.t. the logits, when combined with softmax) |

MSE penalizes large errors quadratically, so outliers have outsized influence. Cross-entropy works better for classification because it directly measures the gap between predicted probabilities and true labels.
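As a quick sketch, these losses look like this in plain Python, including the clamping trick implementations use to avoid $\log(0)$; the function names are illustrative:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = min(max(y_hat, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    return -sum(y_c * math.log(max(p_c, eps))
                for y_c, p_c in zip(y_onehot, y_hat))
```

For the running example, `mse([1.0], [0.6816])` gives the loss of roughly 0.101 from the table in the previous section.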


Computational graphs

A neural network’s computation can be drawn as a directed graph. Each node is an operation (multiply, add, sigmoid). Each edge carries a tensor flowing between operations.

graph LR
  x["x"] --> mul1["W₁ · x"]
  W1["W₁"] --> mul1
  mul1 --> add1["+b₁"]
  b1["b₁"] --> add1
  add1 --> sig1["σ"]
  sig1 --> mul2["W₂ · h"]
  W2["W₂"] --> mul2
  mul2 --> add2["+b₂"]
  b2["b₂"] --> add2
  add2 --> sig2["σ"]
  sig2 --> loss["Loss"]
  y["y (true)"] --> loss

  style loss fill:#f96,stroke:#333,color:#000

The forward pass flows left to right. The backward pass flows right to left, carrying gradients. This graph structure is what frameworks like PyTorch and TensorFlow build internally. When you call .backward(), the framework walks this graph in reverse.


The backward pass: backpropagation

Backpropagation is just the chain rule applied systematically through the computational graph.

The chain rule says: if $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

In a neural network, the loss $L$ is a deeply nested function of the weights. Backpropagation starts at the loss and works backward, computing how each parameter contributed to the error.

For each layer $l$, going from output toward input:

Step 1: Compute the gradient of the loss with respect to the pre-activation:

$$\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$$

Step 2: Compute gradients for the weights and biases:

$$\frac{\partial L}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^\top$$

$$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

Step 3: Propagate the gradient to the previous layer:

$$\frac{\partial L}{\partial \mathbf{a}^{(l-1)}} = (W^{(l)})^\top \boldsymbol{\delta}^{(l)}$$

Then multiply elementwise by the activation derivative $\sigma'(\mathbf{z}^{(l-1)})$ to get $\boldsymbol{\delta}^{(l-1)}$, and repeat.
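The three steps translate into a short reverse loop. Here is a NumPy sketch, assuming sigmoid activations in every layer and a cache like the one saved during the forward pass (the cache layout and names are illustrative, not from any framework):

```python
import numpy as np

def backward(cache, Ws, delta_L):
    """Walk the layers in reverse. cache[l] holds {"a": a^(l)} for layer l
    (cache[0] is the input); delta_L is dL/dz at the output layer.
    Returns one (dW, db) pair per layer, ordered input to output."""
    grads = [None] * len(Ws)
    delta = delta_L
    for l in range(len(Ws), 0, -1):
        a_prev = cache[l - 1]["a"]
        grads[l - 1] = (np.outer(delta, a_prev), delta)  # Step 2
        if l > 1:
            da = Ws[l - 1].T @ delta                     # Step 3
            a = cache[l - 1]["a"]
            delta = da * a * (1.0 - a)  # sigmoid'(z) = a(1 - a), elementwise
    return grads
```

Note that Step 1 for the hidden layers is the last line of the loop: the propagated gradient times the local activation derivative gives the next delta.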

sequenceDiagram
  participant I as Input x
  participant H as Hidden layers
  participant O as Output ŷ
  participant L as Loss L

  I->>H: Forward: compute activations
  H->>O: Forward: compute prediction
  O->>L: Compute loss L(ŷ, y)
  L->>O: ∂L/∂ŷ
  O->>H: Backward: propagate gradients
  H->>I: Backward: gradients reach all weights

Chain rule applied layer by layer

graph RL
  L["dL/dL = 1"] -->|"times dL/dy-hat"| DOUT["Output delta"]
  DOUT -->|"times sigmoid prime"| GRAD_W2["Gradient for W2"]
  DOUT -->|"times W2 transpose"| DHID["Hidden delta"]
  DHID -->|"times sigmoid prime"| GRAD_W1["Gradient for W1"]

Each layer peels off one link in the chain. The gradient at any layer equals the product of all derivative terms from the loss back to that point.

Gradient accumulation at shared nodes

When a variable feeds into multiple downstream operations, its gradient is the sum of the gradients from all paths. This comes directly from the multivariate chain rule. If node $h$ feeds into both $f_1$ and $f_2$:

$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial f_1} \cdot \frac{\partial f_1}{\partial h} + \frac{\partial L}{\partial f_2} \cdot \frac{\partial f_2}{\partial h}$$

You add up all the contributions. Frameworks handle this automatically when traversing the graph.
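The accumulation is literally a `+=`. A toy reverse-mode scalar class makes it visible (illustrative only, nothing like a production autodiff engine):

```python
class Node:
    """Minimal reverse-mode autodiff scalar for illustration."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0          # accumulates contributions from all paths
        self._parents = parents  # (parent_node, local_derivative) pairs

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # local derivatives of a product: d(uv)/du = v, d(uv)/dv = u
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        self.grad += upstream  # sum over paths (multivariate chain rule)
        for parent, local in self._parents:
            parent.backward(upstream * local)
```

When a node feeds two consumers, `backward` reaches it once along each path, and the `self.grad += upstream` line adds the contributions together.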


The vanishing gradient problem

Gradient magnitudes per layer in a 5-layer network, illustrating the vanishing gradient problem.

Backpropagation multiplies gradient terms at each layer. If those terms are consistently less than 1, the gradient shrinks exponentially as it travels backward. By the time it reaches the first few layers, it is essentially zero. Those layers stop learning.

This happens naturally with sigmoid and tanh activations, because their derivatives are at most 1 and usually much smaller. Sigmoid's maximum derivative is 0.25 (at $z = 0$), and for saturated neurons it is much smaller.

We will see this concretely in Example 3. ReLU helps because its gradient is exactly 1 for positive inputs. But deeper solutions like batch normalization, residual connections, and LSTMs are often needed for very deep networks.

Gradient magnitude across layers (sigmoid, max derivative 0.25)

graph LR
  L5["Layer 5: grad = 1.0"] --> L4["Layer 4: grad = 0.25"]
  L4 --> L3["Layer 3: grad = 0.063"]
  L3 --> L2["Layer 2: grad = 0.016"]
  L2 --> L1["Layer 1: grad = 0.004"]

  style L5 fill:#4CAF50,stroke:#333,color:#fff
  style L4 fill:#8BC34A,stroke:#333,color:#000
  style L3 fill:#FFC107,stroke:#333,color:#000
  style L2 fill:#FF9800,stroke:#333,color:#000
  style L1 fill:#f44336,stroke:#333,color:#fff

Each layer multiplies the gradient by the activation derivative. With sigmoid, four layers reduce the gradient to less than 0.5% of the original signal.


Example 1: Full forward and backward pass

Let us trace every value through a small network: 2 inputs, 2 hidden units with sigmoid, 1 output with sigmoid, trained using MSE loss.

Weights:

$$W^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$\mathbf{w}^{(2)} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix}, \quad b^{(2)} = 0$$

Input: $\mathbf{x} = [1, 2]^\top$, Target: $y = 1$

Forward pass

Hidden layer pre-activations:

$$z_1^{(1)} = 0.1 \cdot 1 + 0.2 \cdot 2 = 0.5$$

$$z_2^{(1)} = 0.3 \cdot 1 + 0.4 \cdot 2 = 1.1$$

Hidden layer activations (sigmoid):

$$h_1 = \sigma(0.5) = \frac{1}{1 + e^{-0.5}} \approx 0.6225$$

$$h_2 = \sigma(1.1) = \frac{1}{1 + e^{-1.1}} \approx 0.7503$$

Output pre-activation:

$$z^{(2)} = 0.5 \cdot 0.6225 + 0.6 \cdot 0.7503 = 0.3113 + 0.4502 = 0.7615$$

Output activation:

$$\hat{y} = \sigma(0.7615) \approx 0.6816$$

Loss (MSE):

$$L = (\hat{y} - y)^2 = (0.6816 - 1)^2 = (-0.3184)^2 \approx 0.1014$$

Backward pass

Gradient of loss w.r.t. output:

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(0.6816 - 1) = -0.6368$$

Output layer sigmoid derivative:

$$\sigma'(z^{(2)}) = \hat{y}(1 - \hat{y}) = 0.6816 \times 0.3184 \approx 0.2170$$

Output layer delta:

$$\delta^{(2)} = -0.6368 \times 0.2170 = -0.1382$$

Gradients for output weights:

$$\frac{\partial L}{\partial w_1^{(2)}} = \delta^{(2)} \cdot h_1 = -0.1382 \times 0.6225 = -0.0860$$

$$\frac{\partial L}{\partial w_2^{(2)}} = \delta^{(2)} \cdot h_2 = -0.1382 \times 0.7503 = -0.1037$$

$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} = -0.1382$$

Propagate to hidden layer:

$$\frac{\partial L}{\partial h_1} = w_1^{(2)} \cdot \delta^{(2)} = 0.5 \times (-0.1382) = -0.0691$$

$$\frac{\partial L}{\partial h_2} = w_2^{(2)} \cdot \delta^{(2)} = 0.6 \times (-0.1382) = -0.0829$$

Hidden layer sigmoid derivatives:

$$\sigma'(z_1^{(1)}) = h_1(1 - h_1) = 0.6225 \times 0.3775 = 0.2350$$

$$\sigma'(z_2^{(1)}) = h_2(1 - h_2) = 0.7503 \times 0.2497 = 0.1874$$

Hidden layer deltas:

$$\delta_1^{(1)} = -0.0691 \times 0.2350 = -0.0162$$

$$\delta_2^{(1)} = -0.0829 \times 0.1874 = -0.0155$$

Gradients for hidden weights:

$$\frac{\partial L}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top = \begin{bmatrix} -0.0162 \times 1 & -0.0162 \times 2 \\ -0.0155 \times 1 & -0.0155 \times 2 \end{bmatrix} = \begin{bmatrix} -0.0162 & -0.0324 \\ -0.0155 & -0.0310 \end{bmatrix}$$

All gradients are negative, meaning we should increase the weights to push $\hat{y}$ closer to 1. With a learning rate $\eta = 0.5$, each weight would be updated as $w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$.
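Applying that update rule to the output-layer weights with the gradients we just computed takes one line (a sketch; `sgd_update` is an illustrative name, not a library function):

```python
def sgd_update(weights, grads, lr=0.5):
    """One gradient descent step: w <- w - lr * dL/dw, elementwise."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Output-layer weights from Example 1 and their gradients:
new_w2 = sgd_update([0.5, 0.6], [-0.0860, -0.1037])
# Both weights increase (0.5 -> 0.543, 0.6 -> 0.652),
# pushing y-hat toward the target y = 1 on the next forward pass.
```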


Example 2: Backprop through a simple graph

Compute the gradients of $z = (x + y)(y - 3)$ at $x = 2$, $y = 4$.

Step 1: Break into nodes.

Let $a = x + y$ and $b = y - 3$. Then $z = a \cdot b$.

Step 2: Forward pass.

$$a = 2 + 4 = 6$$

$$b = 4 - 3 = 1$$

$$z = 6 \times 1 = 6$$

Step 3: Backward pass.

Start from the output: $\frac{\partial z}{\partial z} = 1$.

At the multiplication node:

$$\frac{\partial z}{\partial a} = b = 1, \quad \frac{\partial z}{\partial b} = a = 6$$

At the addition node ($a = x + y$):

$$\frac{\partial a}{\partial x} = 1, \quad \frac{\partial a}{\partial y} = 1$$

At the subtraction node ($b = y - 3$):

$$\frac{\partial b}{\partial y} = 1$$

Step 4: Apply the chain rule.

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial a} \cdot \frac{\partial a}{\partial x} = 1 \times 1 = 1$$

For $y$, there are two paths (gradient accumulation):

$$\frac{\partial z}{\partial y} = \frac{\partial z}{\partial a} \cdot \frac{\partial a}{\partial y} + \frac{\partial z}{\partial b} \cdot \frac{\partial b}{\partial y} = 1 \times 1 + 6 \times 1 = 7$$

Verification: Expand $z = (x+y)(y-3) = xy - 3x + y^2 - 3y$.

$$\frac{\partial z}{\partial x} = y - 3 = 4 - 3 = 1 \checkmark$$

$$\frac{\partial z}{\partial y} = x + 2y - 3 = 2 + 8 - 3 = 7 \checkmark$$

Notice how $y$ has a larger gradient than $x$. This makes sense: $y$ appears in both factors of the product, so changes to $y$ affect $z$ through two paths.

Verifying gradients: analytical vs numerical

graph TD
  A["Analytical gradient
backprop via chain rule
(fast)"] --> C["Compare"]
  B["Numerical gradient
perturb input, measure change
(slow but reliable)"] --> C
  C -->|"Difference less than 1e-5"| OK["Implementation correct"]
  C -->|"Difference greater than 1e-3"| BUG["Bug in backprop code"]

The analytical gradient uses the chain rule and is fast. The numerical gradient perturbs each parameter by a tiny $\epsilon$, measures the change in loss, and divides. Comparing the two is the standard way to debug backpropagation.
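Here is a central-difference check applied to the graph from Example 2, as a sketch (`numerical_grad` is an illustrative helper, not a library function):

```python
def numerical_grad(f, params, i, eps=1e-5):
    """Central difference: perturb parameter i by +/- eps and
    measure the change in f."""
    up = list(params); up[i] += eps
    down = list(params); down[i] -= eps
    return (f(*up) - f(*down)) / (2 * eps)

def f(x, y):
    return (x + y) * (y - 3)

gx = numerical_grad(f, [2.0, 4.0], 0)  # analytical answer: 1
gy = numerical_grad(f, [2.0, 4.0], 1)  # analytical answer: 7
```

If the numerical and analytical values agree to within about $10^{-5}$, the backprop code is almost certainly correct; a larger gap points to a bug.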


Example 3: Vanishing gradient in action

Let us see why sigmoid causes problems in deep networks. Consider a 5-layer network where each hidden layer uses sigmoid. Suppose each layer’s sigmoid outputs are around 0.9 (mildly saturated, which is very common).

The sigmoid derivative at output $a$ is:

$$\sigma'(z) = a(1 - a) = 0.9 \times 0.1 = 0.09$$

As the gradient passes backward through each layer, it gets multiplied by this derivative (among other factors). Tracking just the sigmoid contribution:

| Layer (from output) | Gradient factor | Cumulative gradient |
|---|---|---|
| 5 (output side) | $0.09$ | $0.0900$ |
| 4 | $\times\, 0.09$ | $0.0081$ |
| 3 | $\times\, 0.09$ | $0.00073$ |
| 2 | $\times\, 0.09$ | $0.000066$ |
| 1 (input side) | $\times\, 0.09$ | $0.0000059$ |

After just 5 layers, the gradient is roughly 0.0006% of what it was at the output. Layer 1 barely receives any learning signal. This is the vanishing gradient problem.

Even if activations are at 0.5 (where the sigmoid derivative is maximized at 0.25), after 5 layers you get $0.25^5 \approx 0.001$. That is still a factor of 1000 reduction.

ReLU helps because its derivative is 1 for positive inputs. The gradient passes through unchanged. But ReLU alone does not fully solve the problem in very deep networks (50+ layers). Solutions like residual connections and batch normalization become essential, and we will cover those in upcoming articles.
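The arithmetic behind the table above takes a few lines to reproduce (a sketch; it tracks only the sigmoid-derivative contribution, assuming activations near 0.9 in every layer):

```python
factor = 0.9 * (1 - 0.9)  # sigmoid'(z) = a(1 - a) = 0.09 at a = 0.9
grad = 1.0
history = []
for _ in range(5):        # one multiplication per layer crossed
    grad *= factor
    history.append(grad)
# history[-1] is the fraction of the gradient surviving at layer 1
```

Swapping `factor` to 1.0 (ReLU on positive inputs) leaves `grad` unchanged at every layer, which is exactly why ReLU mitigates the problem.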


What comes next

You now understand the complete learning loop: forward pass computes a prediction, the loss measures error, and backpropagation sends gradients back so gradient descent can update every weight.

But there are many practical decisions that determine whether training actually works: how to initialize weights, how to set the learning rate, when to clip gradients, and more. The next article covers all of these in training neural networks: a practical guide.
