
Recurrent neural networks and LSTMs

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this, make sure you understand:


The big picture

Feedforward networks forget everything between inputs: each sample is treated as if the network had never seen anything before. RNNs have memory.

Think of reading the sentence “The cat sat on the ___.” Each word updates your mental model. After “The cat sat on the,” you expect a noun, probably a surface. Your brain carries context forward word by word. An RNN does the same thing: it maintains a hidden state that accumulates information from every token it has seen.

Processing “The cat sat on the” step by step

| Step | Word | What the hidden state captures | Likely next word |
|------|------|--------------------------------|------------------|
| 1 | The | Expecting a noun | cat, dog, man |
| 2 | cat | Subject is "cat" | sat, ran, slept |
| 3 | sat | Cat performed an action | on, in, down |
| 4 | on | Location coming next | the, a, top |
| 5 | the | Expecting a surface or object | mat, chair, floor |

An RNN unrolled through time

graph LR
  x1["x1: The"] --> h1["h1"]
  h1 --> h2["h2"]
  x2["x2: cat"] --> h2
  h2 --> h3["h3"]
  x3["x3: sat"] --> h3
  h3 --> h4["h4"]
  x4["x4: on"] --> h4
  h4 --> h5["h5"]
  x5["x5: the"] --> h5
  h5 --> OUT["Output: mat"]

Each hidden state blends the current input with everything that came before. The same weights are reused at every step.

Now let’s formalize this recurrence.


Why sequences need memory

Feedforward networks process each input independently. They have no concept of order or history. But language, music, stock prices, and sensor data are all sequences where context matters. The meaning of a word depends on the words before it. The next note depends on the melody so far.

To handle sequences, we need a network that can remember what it has seen. That is the core idea behind recurrent neural networks.


RNN: the basic recurrence

A recurrent neural network processes one element at a time, maintaining a hidden state $\mathbf{h}_t$ that summarizes everything it has seen so far.

At each time step $t$:

$$\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$$

$$\mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$$

Here $\mathbf{x}_t$ is the input at time $t$, $\mathbf{h}_{t-1}$ is the previous hidden state, and $\mathbf{y}_t$ is the output. The same weights $W_h$, $W_x$, and $W_y$ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share weights across space.

The hidden state is the RNN’s memory. It carries information from past inputs to influence future outputs.
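The recurrence is only a couple of lines of code. Here is a minimal NumPy sketch (the sizes and random weights are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 3  # illustrative sizes

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x):
    """One step of h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Reuse the same weights at every step of a toy 5-step sequence
h = np.zeros(hidden_size)                 # h_{-1} = 0
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(h, x_t)                  # h accumulates context

print(h.shape)  # (3,)
```

Note that the loop carries `h` forward: that single vector is the network's entire memory of the sequence.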

Unrolled view

When we “unroll” an RNN across time, it looks like a very deep feedforward network where each layer corresponds to one time step:

graph LR
  x0["x₀"] --> H0["h₀ = tanh(W_h·h₋₁ + W_x·x₀ + b)"]
  h_init["h₋₁ = 0"] --> H0
  H0 --> y0["y₀"]

  x1["x₁"] --> H1["h₁ = tanh(W_h·h₀ + W_x·x₁ + b)"]
  H0 -->|h₀| H1
  H1 --> y1["y₁"]

  x2["x₂"] --> H2["h₂ = tanh(W_h·h₁ + W_x·x₂ + b)"]
  H1 -->|h₁| H2
  H2 --> y2["y₂"]

  x3["x₃"] --> H3["h₃ = tanh(W_h·h₂ + W_x·x₃ + b)"]
  H2 -->|h₂| H3
  H3 --> y3["y₃"]

Each time step receives the hidden state from the previous step and the current input. The same weights are used at every step.


Backpropagation through time (BPTT)

Training an RNN means computing gradients through this unrolled graph. We apply the chain rule across all time steps. This is called backpropagation through time.

The gradient of the loss at time $T$ with respect to the hidden state at time $t$ involves a product of Jacobians:

$$\frac{\partial L_T}{\partial \mathbf{h}_t} = \frac{\partial L_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$$

Each factor $\frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$ involves $W_h^\top \cdot \text{diag}(\tanh'(\mathbf{z}_k))$. This product of matrices is where things go wrong.


Vanishing and exploding gradients in RNNs

Gradient magnitude vs time step: vanilla RNN gradients vanish exponentially, while LSTM maintains roughly constant gradients.

The product $\prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$ either shrinks to zero or blows up, depending on the spectral radius of $W_h$ and the activation derivatives.

Vanishing gradient: if $\|W_h\|$ is small and the tanh derivatives are less than 1, the product decays exponentially. Gradients from distant time steps vanish, and the RNN cannot learn long-range dependencies. It effectively "forgets" early inputs.

Exploding gradient: if $\|W_h\|$ is large, the product grows exponentially. Gradients become enormous, and weight updates are destructive. Gradient clipping (covered in the training guide) helps with explosions, but vanishing gradients need a fundamentally different approach.
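A quick numerical sketch makes the dichotomy concrete. With randomly drawn weights and pre-activations (purely illustrative assumptions, not a trained network), the norm of the Jacobian product collapses for a small $W_h$ and explodes for a large one:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, steps = 8, 50

def jacobian_product_norm(scale):
    """Norm of prod_k W_h^T diag(tanh'(z_k)) over `steps` steps.

    W_h and the pre-activations z_k are random stand-ins.
    """
    W_h = rng.normal(scale=scale, size=(hidden_size, hidden_size))
    J = np.eye(hidden_size)
    for _ in range(steps):
        z = rng.normal(size=hidden_size)               # stand-in pre-activations
        J = W_h.T @ np.diag(1 - np.tanh(z) ** 2) @ J   # one chain-rule factor
    return np.linalg.norm(J)

print(f"small W_h: {jacobian_product_norm(0.1):.2e}")  # vanishes toward 0
print(f"large W_h: {jacobian_product_norm(2.0):.2e}")  # blows up
```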


LSTM: Long Short-Term Memory

LSTMs solve the vanishing gradient problem with a clever design: a cell state $\mathbf{c}_t$ that acts as a memory highway. Information can flow along the cell state with minimal modification, allowing gradients to travel many time steps without decaying.

The LSTM has four gates that control information flow:

LSTM gates in plain English

graph TD
  INPUT["New input + previous hidden state"] --> FORGET["Forget gate:
What old info should I discard?"]
  INPUT --> REMEMBER["Input gate:
What new info is worth storing?"]
  INPUT --> CANDIDATE["Candidate:
What is the new info?"]
  INPUT --> OUTPUT["Output gate:
What should I reveal?"]
  FORGET --> CELL["Cell state
(long-term memory)"]
  REMEMBER --> CELL
  CANDIDATE --> CELL
  CELL --> HIDDEN["Hidden state
(working memory)"]
  OUTPUT --> HIDDEN

The forget gate erases irrelevant old information. The input gate and candidate together write new information. The output gate controls what the cell exposes to the rest of the network.

The four gates

Forget gate: decides what to remove from the cell state.

$$\mathbf{f}_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$

Input gate: decides what new information to add.

$$\mathbf{i}_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$

Candidate cell state: proposes new content.

$$\tilde{\mathbf{c}}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$$

Output gate: decides what to expose as the hidden state.

$$\mathbf{o}_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$$

Cell state and hidden state updates

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

The $\odot$ symbol means element-wise multiplication.
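The equations map one-to-one onto code. A minimal NumPy sketch of a single LSTM step (sizes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3            # illustrative sizes
concat = hidden_size + input_size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate/candidate, acting on [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)           # forget gate
    i = sigmoid(W_i @ hx + b_i)           # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)     # candidate
    o = sigmoid(W_o @ hx + b_o)           # output gate
    c = f * c_prev + i * c_tilde          # additive cell update
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

# Run the step over a toy 6-step sequence
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(h, c, x_t)
```

Notice that the only thing touching `c_prev` is an element-wise multiply and an add; no weight matrix stands between old and new cell state.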

graph TD
  subgraph LSTM Cell
      ht1["h_(t-1)"] --> concat["[h_(t-1), x_t]"]
      xt["x_t"] --> concat

      concat --> fg["Forget gate: σ"]
      concat --> ig["Input gate: σ"]
      concat --> cand["Candidate: tanh"]
      concat --> og["Output gate: σ"]

      ct1["c_(t-1)"] --> mul1["× (forget)"]
      fg --> mul1

      ig --> mul2["× (input)"]
      cand --> mul2

      mul1 --> add["+ (update cell)"]
      mul2 --> add

      add --> ct["c_t"]
      ct --> tanh_c["tanh"]
      tanh_c --> mul3["× (output)"]
      og --> mul3
      mul3 --> ht["h_t"]
  end

The key insight: the cell state $\mathbf{c}_t$ is updated through addition (not multiplication by a weight matrix). The gradient through the cell state is:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t$$

If the forget gate is close to 1, the gradient passes through almost unchanged. This is the constant error carousel: a direct path for gradients to flow across many time steps.

Why LSTM gradients survive long sequences

graph LR
  C0["c0"] -->|"times f1
(near 1.0)"| C1["c1"]
  C1 -->|"times f2
(near 1.0)"| C2["c2"]
  C2 -->|"times f3
(near 1.0)"| C3["c3"]
  C3 -->|"times f4
(near 1.0)"| C4["c4"]

  style C0 fill:#4CAF50,stroke:#333,color:#fff
  style C1 fill:#66BB6A,stroke:#333,color:#fff
  style C2 fill:#81C784,stroke:#333,color:#000
  style C3 fill:#A5D6A7,stroke:#333,color:#000
  style C4 fill:#C8E6C9,stroke:#333,color:#000

The cell state updates through addition, not matrix multiplication. When the forget gate stays close to 1, gradients pass through nearly unchanged. This additive path is the key difference from vanilla RNNs, where gradients must pass through a weight matrix at every step.


GRU: Gated Recurrent Unit

The GRU simplifies the LSTM by merging the forget and input gates into a single update gate, and combining the cell state and hidden state.

Update gate: how much of the old state to keep.

$$\mathbf{z}_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$$

Reset gate: how much of the old state to use when computing the candidate.

$$\mathbf{r}_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$$

Candidate hidden state:

$$\tilde{\mathbf{h}}_t = \tanh(W_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$$

Final hidden state (interpolation between old and new):

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

graph TD
  subgraph GRU Cell
      ht1_g["h_(t-1)"] --> concat_g["[h_(t-1), x_t]"]
      xt_g["x_t"] --> concat_g

      concat_g --> zg["Update gate z: σ"]
      concat_g --> rg["Reset gate r: σ"]

      rg --> mul_r["r × h_(t-1)"]
      ht1_g --> mul_r
      mul_r --> cand_g["Candidate: tanh([r⊙h, x])"]
      xt_g --> cand_g

      zg --> interp["h_t = (1-z)⊙h_(t-1) + z⊙h̃_t"]
      ht1_g --> interp
      cand_g --> interp

      interp --> ht_g["h_t"]
  end

GRU has fewer parameters than LSTM (3 weight matrices instead of 4) and often performs comparably. When your dataset is small or you want faster training, GRU is worth trying.
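For comparison, a GRU step needs only three weight matrices. A minimal NumPy sketch, under the same illustrative size and weight assumptions used for the other sketches:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3            # illustrative sizes
concat = hidden_size + input_size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_size) for _ in range(3))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx + b_z)           # update gate
    r = sigmoid(W_r @ hx + b_r)           # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate
    return (1 - z) * h_prev + z * h_tilde # interpolate old and new

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(h, x_t)
```

Because the final state is a convex combination of `h_prev` and a tanh candidate, the GRU hidden state stays bounded without a separate cell state.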

What each architecture adds

graph LR
  RNN["Vanilla RNN
One hidden state
No gates"] -->|"Add cell state
and 4 gates"| LSTM["LSTM
Cell + hidden state
Forget, input,
candidate, output"]
  RNN -->|"Add 2 gates
(simpler)"| GRU["GRU
One hidden state
Update gate,
reset gate"]

When to use what

| Property | Vanilla RNN | LSTM | GRU | Transformer |
|---|---|---|---|---|
| Long-range memory | Poor | Good | Good | Excellent |
| Handles long sequences | No (gradients vanish) | Yes | Yes | Yes (with positional encoding) |
| Parallelizable | No (sequential) | No (sequential) | No (sequential) | Yes (fully parallel) |
| Parameters | Fewest | Most | Moderate | Depends on layers/heads |
| Best use case | Very short sequences | Long sequences, generation | Similar to LSTM, smaller data | Large-scale NLP, vision |

Vanilla RNNs are rarely used in practice. LSTMs and GRUs dominated sequence modeling from 2014 to 2017. Since then, Transformers with attention mechanisms have taken over for most tasks. But LSTMs remain useful for streaming data, edge devices, and cases where you need to process one token at a time.


Example 1: One RNN step

Compute one time step of an RNN with 2D hidden state.

Given:

$$W_h = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad W_x = \begin{bmatrix} 0.5 & -0.3 \\ -0.2 & 0.8 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 0.0 \\ 0.2 \end{bmatrix}$$

$$\mathbf{x}_t = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{h}_{t-1} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Step 1: Compute the pre-activation.

$$\mathbf{z} = W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}$$

Since $\mathbf{h}_{t-1} = \mathbf{0}$, the first term vanishes:

$$W_x \mathbf{x}_t = \begin{bmatrix} 0.5(1) + (-0.3)(0) \\ -0.2(1) + 0.8(0) \end{bmatrix} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix}$$

$$\mathbf{z} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.0 \end{bmatrix}$$

Step 2: Apply tanh.

$$\mathbf{h}_t = \tanh(\mathbf{z}) = \begin{bmatrix} \tanh(0.5) \\ \tanh(0.0) \end{bmatrix} = \begin{bmatrix} 0.4621 \\ 0.0 \end{bmatrix}$$

The hidden state now encodes information about the first input. At the next time step, $\mathbf{h}_t$ feeds back in through $W_h$, mixing with the new input.

Notice that if we had a nonzero previous hidden state, $W_h \mathbf{h}_{t-1}$ would blend past information with the current input. That is how the RNN maintains memory.
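The arithmetic above is easy to double-check in NumPy, plugging in the example's exact numbers:

```python
import numpy as np

W_h = np.array([[0.1, 0.2], [0.3, 0.4]])
W_x = np.array([[0.5, -0.3], [-0.2, 0.8]])
b = np.array([0.0, 0.2])
x_t = np.array([1.0, 0.0])
h_prev = np.zeros(2)

h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
print(np.round(h_t, 4))  # [0.4621 0.    ]
```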


Example 2: One LSTM step

Work through a single LSTM time step with scalar values for clarity.

Given (all weights are $1 \times 2$ vectors operating on $[\mathbf{h}_{t-1}, \mathbf{x}_t]$):

$$W_f = [0.5, 0.4], \; b_f = 0.1 \quad \text{(forget gate)}$$

$$W_i = [0.3, 0.6], \; b_i = -0.1 \quad \text{(input gate)}$$

$$W_c = [0.7, -0.2], \; b_c = 0.0 \quad \text{(candidate)}$$

$$W_o = [0.4, 0.5], \; b_o = 0.2 \quad \text{(output gate)}$$

$$x_t = 0.5, \quad h_{t-1} = 0.3, \quad c_{t-1} = 0.1$$

The concatenated input is $[h_{t-1}, x_t] = [0.3, 0.5]$.

Forget gate:

$$f_t = \sigma(0.5 \cdot 0.3 + 0.4 \cdot 0.5 + 0.1) = \sigma(0.15 + 0.20 + 0.1) = \sigma(0.45) \approx 0.611$$

Input gate:

$$i_t = \sigma(0.3 \cdot 0.3 + 0.6 \cdot 0.5 + (-0.1)) = \sigma(0.09 + 0.30 - 0.1) = \sigma(0.29) \approx 0.572$$

Candidate cell state:

$$\tilde{c}_t = \tanh(0.7 \cdot 0.3 + (-0.2) \cdot 0.5 + 0.0) = \tanh(0.21 - 0.10) = \tanh(0.11) \approx 0.110$$

Output gate:

$$o_t = \sigma(0.4 \cdot 0.3 + 0.5 \cdot 0.5 + 0.2) = \sigma(0.12 + 0.25 + 0.2) = \sigma(0.57) \approx 0.639$$

Update cell state:

$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t = 0.611 \times 0.1 + 0.572 \times 0.110$$

$$c_t = 0.0611 + 0.0629 = 0.124$$

Compute hidden state:

$$h_t = o_t \cdot \tanh(c_t) = 0.639 \times \tanh(0.124) \approx 0.639 \times 0.1234 = 0.0789$$

Summary of this step:

| Gate/State | Pre-activation | Value |
|---|---|---|
| Forget gate $f_t$ | 0.45 | 0.611 |
| Input gate $i_t$ | 0.29 | 0.572 |
| Candidate $\tilde{c}_t$ | 0.11 | 0.110 |
| Output gate $o_t$ | 0.57 | 0.639 |
| Cell state $c_t$ | n/a | 0.124 |
| Hidden state $h_t$ | n/a | 0.079 |

The forget gate at 0.611 means we keep about 61% of the old cell state. The input gate at 0.572 means we incorporate about 57% of the candidate. These gates let the LSTM selectively remember and forget, which is what makes it effective for long sequences.
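Again, the whole step can be verified in a few lines, using the example's exact weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([0.3, 0.5])                  # [h_{t-1}, x_t]
c_prev = 0.1

f = sigmoid(np.dot([0.5, 0.4], hx) + 0.1)  # forget gate
i = sigmoid(np.dot([0.3, 0.6], hx) - 0.1)  # input gate
c_tilde = np.tanh(np.dot([0.7, -0.2], hx)) # candidate (b_c = 0)
o = sigmoid(np.dot([0.4, 0.5], hx) + 0.2)  # output gate

c = f * c_prev + i * c_tilde               # cell state update
h = o * np.tanh(c)                         # hidden state

print(f"f={f:.3f} i={i:.3f} o={o:.3f} c={c:.3f} h={h:.3f}")
# f=0.611 i=0.572 o=0.639 c=0.124 h=0.079
```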


Example 3: Vanishing gradient comparison

Compare gradient flow in a vanilla RNN vs an LSTM over 10 time steps.

Vanilla RNN

At each time step, the gradient is multiplied by $\tanh'(z_k)$. For a typical hidden activation of $\tanh(z) \approx 0.7$:

$$\tanh'(z) = 1 - \tanh^2(z) = 1 - 0.49 = 0.51$$

After 10 time steps, the gradient contribution from the tanh derivatives alone is:

$$\prod_{k=1}^{10} 0.51 = 0.51^{10} \approx 0.00119$$

The gradient reaching the first time step is roughly 0.1% of the gradient at the output. And this does not even account for the $W_h$ factors, which can shrink it further. Early time steps receive almost no learning signal.

LSTM cell state

The gradient through the LSTM cell state is $\prod_{k=1}^{T} f_k$, where $f_k$ is the forget gate value. If the forget gate learns to stay close to 1 (say $f_k \approx 0.95$):

$$\prod_{k=1}^{10} 0.95 = 0.95^{10} \approx 0.599$$

After 10 time steps, about 60% of the gradient survives. That is a massive improvement over the RNN's 0.1%.

| Time steps back | RNN gradient ($0.51^t$) | LSTM gradient ($0.95^t$) |
|---|---|---|
| 1 | 0.510 | 0.950 |
| 3 | 0.133 | 0.857 |
| 5 | 0.035 | 0.774 |
| 10 | 0.00119 | 0.599 |
| 20 | 0.0000014 | 0.358 |

At 20 time steps, the RNN gradient is essentially zero ($1.4 \times 10^{-6}$), while the LSTM still retains 36% of the signal. This is why LSTMs can learn dependencies spanning hundreds of time steps, while vanilla RNNs struggle with sequences longer than 10-20 steps.

The cell state acts as a gradient highway. When the forget gate is close to 1, the cell state passes information (and gradients) forward with minimal loss. This is the fundamental insight that makes LSTMs work.
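The comparison comes straight from the two per-step factors, so a short loop reproduces it (0.51 and 0.95 are the assumed decay rates from above):

```python
rnn_factor, lstm_factor = 0.51, 0.95   # tanh' factor vs forget-gate value

for t in (1, 3, 5, 10, 20):
    print(f"{t:>2} steps back: RNN {rnn_factor**t:.7f}   LSTM {lstm_factor**t:.3f}")
```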


What comes next

RNNs and LSTMs process sequences one step at a time. This sequential nature limits parallelism and makes training slow for very long sequences. The attention mechanism and Transformers solve this by processing all positions in parallel and using attention to directly connect any two positions in the sequence, regardless of distance. Transformers have largely replaced RNNs for NLP and are now expanding into vision, audio, and more.
