Recurrent neural networks and LSTMs
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before reading this, make sure you understand:
- Forward pass and backpropagation: how gradients flow through a network, especially the vanishing gradient problem
The big picture
Feedforward networks forget everything between inputs. They treat each sample as if it were the first thing they had ever seen. RNNs have memory.
Think of reading the sentence “The cat sat on the ___.” Each word updates your mental model. After “The cat sat on the,” you expect a noun, probably a surface. Your brain carries context forward word by word. An RNN does the same thing: it maintains a hidden state that accumulates information from every token it has seen.
Processing “The cat sat on the” step by step
| Step | Word | What the hidden state captures | Likely next word |
|---|---|---|---|
| 1 | The | Expecting a noun | cat, dog, man |
| 2 | cat | Subject is “cat” | sat, ran, slept |
| 3 | sat | Cat performed an action | on, in, down |
| 4 | on | Location coming next | the, a, top |
| 5 | the | Expecting a surface or object | mat, chair, floor |
An RNN unrolled through time
graph LR
x1["x1: The"] --> h1["h1"]
h1 --> h2["h2"]
x2["x2: cat"] --> h2
h2 --> h3["h3"]
x3["x3: sat"] --> h3
h3 --> h4["h4"]
x4["x4: on"] --> h4
h4 --> h5["h5"]
x5["x5: the"] --> h5
h5 --> OUT["Output: mat"]
Each hidden state blends the current input with everything that came before. The same weights are reused at every step.
Now let’s formalize this recurrence.
Why sequences need memory
Feedforward networks process each input independently. They have no concept of order or history. But language, music, stock prices, and sensor data are all sequences where context matters. The meaning of a word depends on the words before it. The next note depends on the melody so far.
To handle sequences, we need a network that can remember what it has seen. That is the core idea behind recurrent neural networks.
RNN: the basic recurrence
A recurrent neural network processes one element at a time, maintaining a hidden state that summarizes everything it has seen so far.
At each time step $t$:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

$$y_t = W_y h_t + b_y$$

Here $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $y_t$ is the output. The same weights $W_h$, $W_x$, $W_y$ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share weights across space.

The hidden state $h_t$ is the RNN’s memory. It carries information from past inputs to influence future outputs.
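To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The shapes, weight scales, and random inputs are illustrative assumptions, not values from this article:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, W_y, b, b_y):
    """Run a vanilla RNN over a sequence, reusing the same weights each step."""
    h = np.zeros(W_h.shape[0])               # h_{-1} = 0
    hs, ys = [], []
    for x in xs:                             # process one element at a time
        h = np.tanh(W_h @ h + W_x @ x + b)   # update the hidden state (memory)
        hs.append(h)
        ys.append(W_y @ h + b_y)             # per-step output
    return hs, ys

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_x = rng.normal(size=(hidden, inp)) * 0.1
W_y = rng.normal(size=(out, hidden)) * 0.1
b, b_y = np.zeros(hidden), np.zeros(out)

xs = rng.normal(size=(5, inp))               # a sequence of 5 inputs
hs, ys = rnn_forward(xs, W_h, W_x, W_y, b, b_y)
print(len(hs), hs[0].shape)                  # 5 hidden states, each of size 4
```

Note that only one set of weights exists no matter how long the sequence is; the loop reuses `W_h`, `W_x`, and `W_y` at every step.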
Unrolled view
When we “unroll” an RNN across time, it looks like a very deep feedforward network where each layer corresponds to one time step:
graph LR
x0["x₀"] --> H0["h₀ = tanh(W_h·h₋₁ + W_x·x₀ + b)"]
h_init["h₋₁ = 0"] --> H0
H0 --> y0["y₀"]
x1["x₁"] --> H1["h₁ = tanh(W_h·h₀ + W_x·x₁ + b)"]
H0 -->|h₀| H1
H1 --> y1["y₁"]
x2["x₂"] --> H2["h₂ = tanh(W_h·h₁ + W_x·x₂ + b)"]
H1 -->|h₁| H2
H2 --> y2["y₂"]
x3["x₃"] --> H3["h₃ = tanh(W_h·h₂ + W_x·x₃ + b)"]
H2 -->|h₂| H3
H3 --> y3["y₃"]
Each time step receives the hidden state from the previous step and the current input. The same weights are used at every step.
Backpropagation through time (BPTT)
Training an RNN means computing gradients through this unrolled graph. We apply the chain rule across all time steps. This is called backpropagation through time.
The gradient of the loss at time $t$ with respect to the hidden state at time $k$ involves a product of Jacobians:

$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Each factor involves $\frac{\partial h_i}{\partial h_{i-1}} = \operatorname{diag}\!\left(\tanh'(\cdot)\right) W_h$. This product of matrices is where things go wrong.
Vanishing and exploding gradients in RNNs
Gradient magnitude vs time step: vanilla RNN gradients vanish exponentially, while LSTM maintains roughly constant gradients.
The product $\prod_{i} \frac{\partial h_i}{\partial h_{i-1}}$ either shrinks to zero or blows up, depending on the spectral radius of $W_h$ and the activation derivatives.

Vanishing gradient: if the spectral radius of $W_h$ is small and the tanh derivatives are less than 1, the product decays exponentially. Gradients from distant time steps vanish, and the RNN cannot learn long-range dependencies. It effectively “forgets” early inputs.

Exploding gradient: if the spectral radius of $W_h$ is large, the product grows exponentially. Gradients become enormous, and weight updates are destructive. Gradient clipping (covered in the training guide) helps with explosions, but vanishing gradients need a fundamentally different approach.
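You can watch both failure modes numerically. The sketch below multiplies random per-step Jacobians of the form diag(1 − h²)·W_h and tracks the product's norm; the weight scales, dimensions, and step count are illustrative assumptions chosen to make the effect obvious:

```python
import numpy as np

def gradient_norm_over_time(scale, steps=30, n=8, seed=0):
    """Norm of the product of RNN Jacobians diag(tanh') @ W_h over many steps."""
    rng = np.random.default_rng(seed)
    W_h = rng.normal(size=(n, n)) * scale
    grad = np.eye(n)
    for _ in range(steps):
        h = np.tanh(rng.normal(size=n))        # a typical hidden activation
        jac = np.diag(1 - h**2) @ W_h          # Jacobian of one time step
        grad = jac @ grad                      # chain rule across steps
    return np.linalg.norm(grad)

print(gradient_norm_over_time(scale=0.1))      # tiny: the gradient vanished
print(gradient_norm_over_time(scale=2.0))      # huge: the gradient exploded
```

Small recurrent weights shrink the product toward zero; large ones blow it up. There is no easy middle ground, which is exactly the problem gated architectures address.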
LSTM: Long Short-Term Memory
LSTMs solve the vanishing gradient problem with a clever design: a cell state that acts as a memory highway. Information can flow along the cell state with minimal modification, allowing gradients to travel many time steps without decaying.
The LSTM has four gates that control information flow:
LSTM gates in plain English
graph TD
INPUT["New input + previous hidden state"] --> FORGET["Forget gate: What old info should I discard?"]
INPUT --> REMEMBER["Input gate: What new info is worth storing?"]
INPUT --> CANDIDATE["Candidate: What is the new info?"]
INPUT --> OUTPUT["Output gate: What should I reveal?"]
FORGET --> CELL["Cell state (long-term memory)"]
REMEMBER --> CELL
CANDIDATE --> CELL
CELL --> HIDDEN["Hidden state (working memory)"]
OUTPUT --> HIDDEN
The forget gate erases irrelevant old information. The input gate and candidate together write new information. The output gate controls what the cell exposes to the rest of the network.
The four gates
Forget gate: decides what to remove from the cell state.

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Input gate: decides what new information to add.

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

Candidate cell state: proposes new content.

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Output gate: decides what to expose as the hidden state.

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
Cell state and hidden state updates

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

$$h_t = o_t \odot \tanh(c_t)$$

The symbol $\odot$ means element-wise multiplication.
graph TD
subgraph LSTM Cell
ht1["h_(t-1)"] --> concat["[h_(t-1), x_t]"]
xt["x_t"] --> concat
concat --> fg["Forget gate: σ"]
concat --> ig["Input gate: σ"]
concat --> cand["Candidate: tanh"]
concat --> og["Output gate: σ"]
ct1["c_(t-1)"] --> mul1["× (forget)"]
fg --> mul1
ig --> mul2["× (input)"]
cand --> mul2
mul1 --> add["+ (update cell)"]
mul2 --> add
add --> ct["c_t"]
ct --> tanh_c["tanh"]
tanh_c --> mul3["× (output)"]
og --> mul3
mul3 --> ht["h_t"]
end
The key insight: the cell state is updated through addition (not multiplication by a weight matrix). The gradient through the cell state, ignoring the indirect paths through the gates, is:

$$\frac{\partial c_t}{\partial c_{t-1}} \approx f_t$$

If the forget gate $f_t$ is close to 1, the gradient passes through almost unchanged. This is the constant error carousel: a direct path for gradients to flow across many time steps.
Why LSTM gradients survive long sequences
graph LR
C0["c0"] -->|"times f1 (near 1.0)"| C1["c1"]
C1 -->|"times f2 (near 1.0)"| C2["c2"]
C2 -->|"times f3 (near 1.0)"| C3["c3"]
C3 -->|"times f4 (near 1.0)"| C4["c4"]
style C0 fill:#4CAF50,stroke:#333,color:#fff
style C1 fill:#66BB6A,stroke:#333,color:#fff
style C2 fill:#81C784,stroke:#333,color:#000
style C3 fill:#A5D6A7,stroke:#333,color:#000
style C4 fill:#C8E6C9,stroke:#333,color:#000
The cell state updates through addition, not matrix multiplication. When the forget gate stays close to 1, gradients pass through nearly unchanged. This additive path is the key difference from vanilla RNNs, where gradients must pass through a weight matrix at every step.
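Putting the four gates and the two state updates together, one LSTM time step looks like this in NumPy. The shapes and weight initialization are illustrative assumptions; real implementations typically fuse the four matrix multiplies into one:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step: four gates, then the two state updates."""
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate
    i = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    o = sigmoid(W_o @ z + b_o)             # output gate
    c = f * c_prev + i * c_tilde           # additive cell update
    h = o * np.tanh(c)                     # hidden state
    return h, c

rng = np.random.default_rng(1)
H, X = 4, 3
Ws = [rng.normal(size=(H, H + X)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(6, X)):          # run over a short sequence
    h, c = lstm_step(x, h, c, *Ws, *bs)
print(h.shape, c.shape)                    # (4,) (4,)
```

Notice that `c` is only ever scaled by a gate and added to; it never passes through a weight matrix, which is what keeps the gradient path open.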
GRU: Gated Recurrent Unit
The GRU simplifies the LSTM by merging the forget and input gates into a single update gate, and combining the cell state and hidden state.
Update gate: how much of the old state to keep.

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$

Reset gate: how much of the old state to use when computing the candidate.

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$

Candidate hidden state:

$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b)$$

Final hidden state (interpolation between old and new):

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
graph TD
subgraph GRU Cell
ht1_g["h_(t-1)"] --> concat_g["[h_(t-1), x_t]"]
xt_g["x_t"] --> concat_g
concat_g --> zg["Update gate z: σ"]
concat_g --> rg["Reset gate r: σ"]
rg --> mul_r["r × h_(t-1)"]
ht1_g --> mul_r
mul_r --> cand_g["Candidate: tanh([r⊙h, x])"]
xt_g --> cand_g
zg --> interp["h_t = (1-z)⊙h_(t-1) + z⊙h̃_t"]
ht1_g --> interp
cand_g --> interp
interp --> ht_g["h_t"]
end
GRU has fewer parameters than LSTM (3 weight matrices instead of 4) and often performs comparably. When your dataset is small or you want faster training, GRU is worth trying.
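A GRU step in the same NumPy style as before (shapes and initialization are illustrative assumptions); note there are only three weight matrices:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_c, b_z, b_r, b_c):
    """One GRU time step: two gates, a candidate, and an interpolation."""
    zx = np.concatenate([h_prev, x])              # [h_{t-1}, x_t]
    z = sigmoid(W_z @ zx + b_z)                   # update gate
    r = sigmoid(W_r @ zx + b_r)                   # reset gate
    cand_in = np.concatenate([r * h_prev, x])     # reset scales the old state
    h_tilde = np.tanh(W_c @ cand_in + b_c)        # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde         # blend old and new

rng = np.random.default_rng(2)
H, X = 4, 3
W_z, W_r, W_c = (rng.normal(size=(H, H + X)) * 0.1 for _ in range(3))
b = np.zeros(H)

h = np.zeros(H)
for x in rng.normal(size=(6, X)):                 # run over a short sequence
    h = gru_step(x, h, W_z, W_r, W_c, b, b, b)
print(h.shape)                                    # (4,)
```

There is no separate cell state: the interpolation `(1 - z) * h_prev + z * h_tilde` plays the same gradient-highway role the LSTM cell state does.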
What each architecture adds
graph LR
RNN["Vanilla RNN: one hidden state, no gates"] -->|"Add cell state and 4 gates"| LSTM["LSTM: cell + hidden state; forget, input, candidate, output"]
RNN -->|"Add 2 gates (simpler)"| GRU["GRU: one hidden state; update gate, reset gate"]
When to use what
| Property | Vanilla RNN | LSTM | GRU | Transformer |
|---|---|---|---|---|
| Long-range memory | Poor | Good | Good | Excellent |
| Handles long sequences | No (gradients vanish) | Yes | Yes | Yes (with positional encoding) |
| Parallelizable | No (sequential) | No (sequential) | No (sequential) | Yes (fully parallel) |
| Parameters | Fewest | Most | Moderate | Depends on layers/heads |
| Best use case | Very short sequences | Long sequences, generation | Similar to LSTM, smaller data | Large-scale NLP, vision |
Vanilla RNNs are rarely used in practice. LSTMs and GRUs dominated sequence modeling from 2014 to 2017. Since then, Transformers with attention mechanisms have taken over for most tasks. But LSTMs remain useful for streaming data, edge devices, and cases where you need to process one token at a time.
Example 1: One RNN step
Compute one time step of an RNN with 2D hidden state.
Given: an input $x_0$, the initial hidden state $h_{-1} = 0$, and weights $W_h$, $W_x$ with bias $b$.

Step 1: Compute the pre-activation $z_0 = W_h h_{-1} + W_x x_0 + b$.

Since $h_{-1} = 0$, the first term vanishes:

$$z_0 = W_x x_0 + b$$

Step 2: Apply tanh.

$$h_0 = \tanh(z_0)$$

The hidden state now encodes information about the first input. At the next time step, $h_0$ feeds back in through $W_h$, mixing with the new input.

Notice that if we had a nonzero previous hidden state, $W_h h_{t-1}$ would blend past information with the current input. That is how the RNN maintains memory.
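The same step with concrete numbers (the weights and input below are illustrative values, not taken from the article):

```python
import numpy as np

W_h = np.array([[0.5, 0.0], [0.0, 0.5]])   # illustrative recurrent weights
W_x = np.array([[1.0, 0.0], [0.0, 1.0]])   # illustrative input weights
b = np.array([0.1, -0.1])

h_prev = np.zeros(2)                        # h_{-1} = 0
x0 = np.array([0.5, -0.5])

z0 = W_h @ h_prev + W_x @ x0 + b            # W_h term vanishes since h_{-1} = 0
h0 = np.tanh(z0)
print(z0)                                   # [ 0.6 -0.6]
print(h0)                                   # roughly [ 0.537 -0.537]
```

Feeding a second input would use the same weights but a nonzero `h_prev`, so the `W_h @ h_prev` term would now carry the memory of `x0` forward.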
Example 2: One LSTM step
Work through a single LSTM time step with scalar values for clarity.
Given (all weights are 1x2 vectors operating on the concatenated input): a previous cell state $c_{t-1} = 0.1$, a previous hidden state $h_{t-1}$, and an input $x_t$ whose gate pre-activations work out to the values below.

The concatenated input is $[h_{t-1}, x_t]$.

Forget gate: $f_t = \sigma(0.45) \approx 0.611$

Input gate: $i_t = \sigma(0.29) \approx 0.572$

Candidate cell state: $\tilde{c}_t = \tanh(0.11) \approx 0.110$

Output gate: $o_t = \sigma(0.57) \approx 0.639$

Update cell state: $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t = 0.611 \times 0.1 + 0.572 \times 0.110 \approx 0.124$

Compute hidden state: $h_t = o_t \cdot \tanh(c_t) = 0.639 \times \tanh(0.124) \approx 0.079$
Summary of this step:
| Gate/State | Pre-activation | Value |
|---|---|---|
| Forget gate | 0.45 | 0.611 |
| Input gate | 0.29 | 0.572 |
| Candidate | 0.11 | 0.110 |
| Output gate | 0.57 | 0.639 |
| Cell state | - | 0.124 |
| Hidden state | - | 0.079 |
The forget gate at 0.611 means we keep about 61% of the old cell state. The input gate at 0.572 means we incorporate about 57% of the candidate. These gates let the LSTM selectively remember and forget, which is what makes it effective for long sequences.
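Every row of the table can be checked in a few lines (the previous cell state is $c_{t-1} = 0.1$; values are rounded to three decimals):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

c_prev = 0.1                     # previous cell state

f = sigmoid(0.45)                # forget gate   -> 0.611
i = sigmoid(0.29)                # input gate    -> 0.572
c_tilde = np.tanh(0.11)          # candidate     -> 0.110
o = sigmoid(0.57)                # output gate   -> 0.639

c = f * c_prev + i * c_tilde     # cell state    -> 0.124
h = o * np.tanh(c)               # hidden state  -> 0.079

print(round(f, 3), round(i, 3), round(c_tilde, 3), round(o, 3))
print(round(c, 3), round(h, 3))
```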
Example 3: Vanishing gradient comparison
Compare gradient flow in a vanilla RNN vs an LSTM over 10 time steps.
Vanilla RNN
At each time step, the gradient is multiplied by $\operatorname{diag}\!\left(\tanh'(\cdot)\right) W_h$. For a typical hidden activation of $h \approx 0.7$:

$$\tanh'(z) = 1 - \tanh^2(z) = 1 - 0.7^2 = 0.51$$

After 10 time steps, the gradient contribution from the tanh derivatives alone:

$$0.51^{10} \approx 0.0012$$

The gradient reaching the first time step is roughly 0.1% of the gradient at the output. And this does not even account for the $W_h$ factors, which can shrink it further. Early time steps receive almost no learning signal.

LSTM cell state

The gradient through the LSTM cell state is approximately $f_t$, where $f_t$ is the forget gate value. If the forget gate learns to stay close to 1 (say $f_t = 0.95$):

$$0.95^{10} \approx 0.60$$
| Time steps back | RNN gradient ($0.51^k$) | LSTM gradient ($0.95^k$) |
|---|---|---|
| 1 | 0.510 | 0.950 |
| 3 | 0.133 | 0.857 |
| 5 | 0.035 | 0.774 |
| 10 | 0.00119 | 0.599 |
| 20 | 0.0000014 | 0.358 |
At 20 time steps, the RNN gradient is essentially zero (about $1.4 \times 10^{-6}$), while the LSTM still retains 36% of the signal. This is why LSTMs can learn dependencies spanning hundreds of time steps, while vanilla RNNs struggle with sequences longer than 10-20 steps.
The cell state acts as a gradient highway. When the forget gate is close to 1, the cell state passes information (and gradients) forward with minimal loss. This is the fundamental insight that makes LSTMs work.
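The comparison table is just powers of the two per-step factors, which you can regenerate directly:

```python
rnn_factor = 0.51    # tanh'(z) for a hidden activation of about 0.7
lstm_factor = 0.95   # a forget gate that has learned to stay near 1

for k in [1, 3, 5, 10, 20]:
    print(f"{k:>2} steps back: "
          f"RNN {rnn_factor**k:.7f}  LSTM {lstm_factor**k:.3f}")
```

Changing `lstm_factor` shows how sensitive long-range memory is to the forget gate: at 0.8 instead of 0.95, only about 1% of the gradient survives 20 steps.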
What comes next
RNNs and LSTMs process sequences one step at a time. This sequential nature limits parallelism and makes training slow for very long sequences. The attention mechanism and Transformers solve this by processing all positions in parallel and using attention to directly connect any two positions in the sequence, regardless of distance. Transformers have largely replaced RNNs for NLP and are now expanding into vision, audio, and more.