
Recurrent neural networks and LSTMs

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before reading this, make sure you understand:


The big picture

Feedforward networks forget everything between inputs: each sample is treated as if the network had never seen anything before. RNNs have memory.

Think of reading the sentence “The cat sat on the ___.” Each word updates your mental model. After “The cat sat on the,” you expect a noun, probably a surface. Your brain carries context forward word by word. An RNN does the same thing: it maintains a hidden state that accumulates information from every token it has seen.

Processing “The cat sat on the” step by step

| Step | Word | What the hidden state captures | Likely next word |
|------|------|--------------------------------|------------------|
| 1 | The | Expecting a noun | cat, dog, man |
| 2 | cat | Subject is "cat" | sat, ran, slept |
| 3 | sat | Cat performed an action | on, in, down |
| 4 | on | Location coming next | the, a, top |
| 5 | the | Expecting a surface or object | mat, chair, floor |

An RNN unrolled through time

graph LR
  x1["x1: The"] --> h1["h1"]
  h1 --> h2["h2"]
  x2["x2: cat"] --> h2
  h2 --> h3["h3"]
  x3["x3: sat"] --> h3
  h3 --> h4["h4"]
  x4["x4: on"] --> h4
  h4 --> h5["h5"]
  x5["x5: the"] --> h5
  h5 --> OUT["Output: mat"]

Each hidden state blends the current input with everything that came before. The same weights are reused at every step.

Now let’s formalize this recurrence.


Why sequences need memory

Feedforward networks process each input independently. They have no concept of order or history. But language, music, stock prices, and sensor data are all sequences where context matters. The meaning of a word depends on the words before it. The next note depends on the melody so far.

To handle sequences, we need a network that can remember what it has seen. That is the core idea behind recurrent neural networks.


RNN: the basic recurrence

A recurrent neural network processes one element at a time, maintaining a hidden state $\mathbf{h}_t$ that summarizes everything it has seen so far.

At each time step $t$:

$$\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$$

$$\mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$$

Here $\mathbf{x}_t$ is the input at time $t$, $\mathbf{h}_{t-1}$ is the previous hidden state, and $\mathbf{y}_t$ is the output. The same weights $W_h$, $W_x$, and $W_y$ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share weights across space.

The hidden state is the RNN’s memory. It carries information from past inputs to influence future outputs.
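The recurrence is only a couple of lines of code. Here is a minimal NumPy sketch (the sizes and random weights are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 3  # illustrative sizes

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x):
    """One step of h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Reuse the same weights at every step of a toy 5-step sequence
h = np.zeros(hidden_size)                 # h_{-1} = 0
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(h, x_t)                  # h accumulates context

print(h.shape)  # (3,)
```

Note that the loop carries `h` forward: that single vector is the network's entire memory of the sequence.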

Unrolled view

When we “unroll” an RNN across time, it looks like a very deep feedforward network where each layer corresponds to one time step:

graph LR
  x0["x₀"] --> H0["h₀ = tanh(W_h·h₋₁ + W_x·x₀ + b)"]
  h_init["h₋₁ = 0"] --> H0
  H0 --> y0["y₀"]

  x1["x₁"] --> H1["h₁ = tanh(W_h·h₀ + W_x·x₁ + b)"]
  H0 -->|h₀| H1
  H1 --> y1["y₁"]

  x2["x₂"] --> H2["h₂ = tanh(W_h·h₁ + W_x·x₂ + b)"]
  H1 -->|h₁| H2
  H2 --> y2["y₂"]

  x3["x₃"] --> H3["h₃ = tanh(W_h·h₂ + W_x·x₃ + b)"]
  H2 -->|h₂| H3
  H3 --> y3["y₃"]

Each time step receives the hidden state from the previous step and the current input. The same weights are used at every step.


Backpropagation through time (BPTT)

Training an RNN means computing gradients through this unrolled graph. We apply the chain rule across all time steps. This is called backpropagation through time.

The gradient of the loss at time $T$ with respect to the hidden state at time $t$ involves a product of Jacobians:

$$\frac{\partial L_T}{\partial \mathbf{h}_t} = \frac{\partial L_T}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$$

Each factor $\frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$ involves $W_h^\top \cdot \text{diag}(\tanh'(\mathbf{z}_k))$. This product of matrices is where things go wrong.


Vanishing and exploding gradients in RNNs

Gradient magnitude vs time step: vanilla RNN gradients vanish exponentially, while LSTM maintains roughly constant gradients.

The product $\prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$ either shrinks to zero or blows up, depending on the spectral radius of $W_h$ and the activation derivatives.

Vanishing gradient: if $\|W_h\|$ is small and the tanh derivatives are less than 1, the product decays exponentially. Gradients from distant time steps vanish, and the RNN cannot learn long-range dependencies. It effectively "forgets" early inputs.

Exploding gradient: if $\|W_h\|$ is large, the product grows exponentially. Gradients become enormous, and weight updates are destructive. Gradient clipping (covered in the training guide) helps with explosions, but vanishing gradients need a fundamentally different approach.
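A quick numerical sketch makes the dichotomy concrete. With randomly drawn weights and pre-activations (purely illustrative assumptions, not a trained network), the norm of the Jacobian product collapses for a small $W_h$ and explodes for a large one:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, steps = 8, 50

def jacobian_product_norm(scale):
    """Norm of prod_k W_h^T diag(tanh'(z_k)) over `steps` steps.

    W_h and the pre-activations z_k are random stand-ins.
    """
    W_h = rng.normal(scale=scale, size=(hidden_size, hidden_size))
    J = np.eye(hidden_size)
    for _ in range(steps):
        z = rng.normal(size=hidden_size)               # stand-in pre-activations
        J = W_h.T @ np.diag(1 - np.tanh(z) ** 2) @ J   # one chain-rule factor
    return np.linalg.norm(J)

print(f"small W_h: {jacobian_product_norm(0.1):.2e}")  # vanishes toward 0
print(f"large W_h: {jacobian_product_norm(2.0):.2e}")  # blows up
```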


LSTM: Long Short-Term Memory

LSTMs solve the vanishing gradient problem with a clever design: a cell state $\mathbf{c}_t$ that acts as a memory highway. Information can flow along the cell state with minimal modification, allowing gradients to travel many time steps without decaying.

The LSTM has four gates that control information flow:

LSTM gates in plain English

graph TD
  INPUT["New input + previous hidden state"] --> FORGET["Forget gate:
What old info should I discard?"]
  INPUT --> REMEMBER["Input gate:
What new info is worth storing?"]
  INPUT --> CANDIDATE["Candidate:
What is the new info?"]
  INPUT --> OUTPUT["Output gate:
What should I reveal?"]
  FORGET --> CELL["Cell state
(long-term memory)"]
  REMEMBER --> CELL
  CANDIDATE --> CELL
  CELL --> HIDDEN["Hidden state
(working memory)"]
  OUTPUT --> HIDDEN

The forget gate erases irrelevant old information. The input gate and candidate together write new information. The output gate controls what the cell exposes to the rest of the network.

The four gates

Forget gate: decides what to remove from the cell state.

$$\mathbf{f}_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$

Input gate: decides what new information to add.

$$\mathbf{i}_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$

Candidate cell state: proposes new content.

$$\tilde{\mathbf{c}}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$$

Output gate: decides what to expose as the hidden state.

$$\mathbf{o}_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$$

Cell state and hidden state updates

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

The $\odot$ symbol means element-wise multiplication.
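The equations map one-to-one onto code. A minimal NumPy sketch of a single LSTM step (sizes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3            # illustrative sizes
concat = hidden_size + input_size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate/candidate, acting on [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)           # forget gate
    i = sigmoid(W_i @ hx + b_i)           # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)     # candidate
    o = sigmoid(W_o @ hx + b_o)           # output gate
    c = f * c_prev + i * c_tilde          # additive cell update
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

# Run the step over a toy 6-step sequence
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(h, c, x_t)
```

Notice that the only thing touching `c_prev` is an element-wise multiply and an add; no weight matrix stands between old and new cell state.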

graph TD
  subgraph LSTM Cell
      ht1["h_(t-1)"] --> concat["[h_(t-1), x_t]"]
      xt["x_t"] --> concat

      concat --> fg["Forget gate: σ"]
      concat --> ig["Input gate: σ"]
      concat --> cand["Candidate: tanh"]
      concat --> og["Output gate: σ"]

      ct1["c_(t-1)"] --> mul1["× (forget)"]
      fg --> mul1

      ig --> mul2["× (input)"]
      cand --> mul2

      mul1 --> add["+ (update cell)"]
      mul2 --> add

      add --> ct["c_t"]
      ct --> tanh_c["tanh"]
      tanh_c --> mul3["× (output)"]
      og --> mul3
      mul3 --> ht["h_t"]
  end

The key insight: the cell state $\mathbf{c}_t$ is updated through addition (not multiplication by a weight matrix). The gradient through the cell state is:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t$$

If the forget gate is close to 1, the gradient passes through almost unchanged. This is the constant error carousel: a direct path for gradients to flow across many time steps.

Why LSTM gradients survive long sequences

graph LR
  C0["c0"] -->|"times f1
(near 1.0)"| C1["c1"]
  C1 -->|"times f2
(near 1.0)"| C2["c2"]
  C2 -->|"times f3
(near 1.0)"| C3["c3"]
  C3 -->|"times f4
(near 1.0)"| C4["c4"]

  style C0 fill:#4CAF50,stroke:#333,color:#fff
  style C1 fill:#66BB6A,stroke:#333,color:#fff
  style C2 fill:#81C784,stroke:#333,color:#000
  style C3 fill:#A5D6A7,stroke:#333,color:#000
  style C4 fill:#C8E6C9,stroke:#333,color:#000

The cell state updates through addition, not matrix multiplication. When the forget gate stays close to 1, gradients pass through nearly unchanged. This additive path is the key difference from vanilla RNNs, where gradients must pass through a weight matrix at every step.


GRU: Gated Recurrent Unit

The GRU simplifies the LSTM by merging the forget and input gates into a single update gate, and combining the cell state and hidden state.

Update gate: how much of the old state to keep.

$$\mathbf{z}_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$$

Reset gate: how much of the old state to use when computing the candidate.

$$\mathbf{r}_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$$

Candidate hidden state:

$$\tilde{\mathbf{h}}_t = \tanh(W_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$$

Final hidden state (interpolation between old and new):

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

graph TD
  subgraph GRU Cell
      ht1_g["h_(t-1)"] --> concat_g["[h_(t-1), x_t]"]
      xt_g["x_t"] --> concat_g

      concat_g --> zg["Update gate z: σ"]
      concat_g --> rg["Reset gate r: σ"]

      rg --> mul_r["r × h_(t-1)"]
      ht1_g --> mul_r
      mul_r --> cand_g["Candidate: tanh([r⊙h, x])"]
      xt_g --> cand_g

      zg --> interp["h_t = (1-z)⊙h_(t-1) + z⊙h̃_t"]
      ht1_g --> interp
      cand_g --> interp

      interp --> ht_g["h_t"]
  end

GRU has fewer parameters than LSTM (3 weight matrices instead of 4) and often performs comparably. When your dataset is small or you want faster training, GRU is worth trying.
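For comparison, a GRU step needs only three weight matrices. A minimal NumPy sketch, under the same illustrative size and weight assumptions used for the other sketches:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3            # illustrative sizes
concat = hidden_size + input_size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_size) for _ in range(3))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx + b_z)           # update gate
    r = sigmoid(W_r @ hx + b_r)           # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate
    return (1 - z) * h_prev + z * h_tilde # interpolate old and new

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(h, x_t)
```

Because the final state is a convex combination of `h_prev` and a tanh candidate, the GRU hidden state stays bounded without a separate cell state.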

What each architecture adds

graph LR
  RNN["Vanilla RNN
One hidden state
No gates"] -->|"Add cell state
and 4 gates"| LSTM["LSTM
Cell + hidden state
Forget, input,
candidate, output"]
  RNN -->|"Add 2 gates
(simpler)"| GRU["GRU
One hidden state
Update gate,
reset gate"]

When to use what

| Property | Vanilla RNN | LSTM | GRU | Transformer |
|---|---|---|---|---|
| Long-range memory | Poor | Good | Good | Excellent |
| Handles long sequences | No (gradients vanish) | Yes | Yes | Yes (with positional encoding) |
| Parallelizable | No (sequential) | No (sequential) | No (sequential) | Yes (fully parallel) |
| Parameters | Fewest | Most | Moderate | Depends on layers/heads |
| Best use case | Very short sequences | Long sequences, generation | Similar to LSTM, smaller data | Large-scale NLP, vision |

Vanilla RNNs are rarely used in practice. LSTMs and GRUs dominated sequence modeling from 2014 to 2017. Since then, Transformers with attention mechanisms have taken over for most tasks. But LSTMs remain useful for streaming data, edge devices, and cases where you need to process one token at a time.


Example 1: One RNN step

Compute one time step of an RNN with 2D hidden state.

Given:

$$W_h = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad W_x = \begin{bmatrix} 0.5 & -0.3 \\ -0.2 & 0.8 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 0.0 \\ 0.2 \end{bmatrix}$$

$$\mathbf{x}_t = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{h}_{t-1} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Step 1: Compute the pre-activation.

$$\mathbf{z} = W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}$$

Since $\mathbf{h}_{t-1} = \mathbf{0}$, the first term vanishes:

$$W_x \mathbf{x}_t = \begin{bmatrix} 0.5(1) + (-0.3)(0) \\ -0.2(1) + 0.8(0) \end{bmatrix} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix}$$

$$\mathbf{z} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.0 \end{bmatrix}$$

Step 2: Apply tanh.

$$\mathbf{h}_t = \tanh(\mathbf{z}) = \begin{bmatrix} \tanh(0.5) \\ \tanh(0.0) \end{bmatrix} = \begin{bmatrix} 0.4621 \\ 0.0 \end{bmatrix}$$

The hidden state now encodes information about the first input. At the next time step, $\mathbf{h}_t$ feeds back in through $W_h$, mixing with the new input.

Notice that if we had a nonzero previous hidden state, $W_h \mathbf{h}_{t-1}$ would blend past information with the current input. That is how the RNN maintains memory.
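The arithmetic above is easy to double-check in NumPy, plugging in the example's exact numbers:

```python
import numpy as np

W_h = np.array([[0.1, 0.2], [0.3, 0.4]])
W_x = np.array([[0.5, -0.3], [-0.2, 0.8]])
b = np.array([0.0, 0.2])
x_t = np.array([1.0, 0.0])
h_prev = np.zeros(2)

h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
print(np.round(h_t, 4))  # [0.4621 0.    ]
```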


Example 2: One LSTM step

Work through a single LSTM time step with scalar values for clarity.

Given (all weights are $1 \times 2$ vectors operating on $[\mathbf{h}_{t-1}, \mathbf{x}_t]$):

$$W_f = [0.5, 0.4], \; b_f = 0.1 \quad \text{(forget gate)}$$

$$W_i = [0.3, 0.6], \; b_i = -0.1 \quad \text{(input gate)}$$

$$W_c = [0.7, -0.2], \; b_c = 0.0 \quad \text{(candidate)}$$

$$W_o = [0.4, 0.5], \; b_o = 0.2 \quad \text{(output gate)}$$

$$x_t = 0.5, \quad h_{t-1} = 0.3, \quad c_{t-1} = 0.1$$

The concatenated input is $[h_{t-1}, x_t] = [0.3, 0.5]$.

Forget gate:

$$f_t = \sigma(0.5 \cdot 0.3 + 0.4 \cdot 0.5 + 0.1) = \sigma(0.15 + 0.20 + 0.1) = \sigma(0.45) \approx 0.611$$

Input gate:

$$i_t = \sigma(0.3 \cdot 0.3 + 0.6 \cdot 0.5 + (-0.1)) = \sigma(0.09 + 0.30 - 0.1) = \sigma(0.29) \approx 0.572$$

Candidate cell state:

$$\tilde{c}_t = \tanh(0.7 \cdot 0.3 + (-0.2) \cdot 0.5 + 0.0) = \tanh(0.21 - 0.10) = \tanh(0.11) \approx 0.110$$

Output gate:

$$o_t = \sigma(0.4 \cdot 0.3 + 0.5 \cdot 0.5 + 0.2) = \sigma(0.12 + 0.25 + 0.2) = \sigma(0.57) \approx 0.639$$

Update cell state:

$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t = 0.611 \times 0.1 + 0.572 \times 0.110$$

$$c_t = 0.0611 + 0.0629 = 0.124$$

Compute hidden state:

$$h_t = o_t \cdot \tanh(c_t) = 0.639 \times \tanh(0.124) \approx 0.639 \times 0.1234 = 0.0789$$

Summary of this step:

| Gate/State | Pre-activation | Value |
|---|---|---|
| Forget gate $f_t$ | 0.45 | 0.611 |
| Input gate $i_t$ | 0.29 | 0.572 |
| Candidate $\tilde{c}_t$ | 0.11 | 0.110 |
| Output gate $o_t$ | 0.57 | 0.639 |
| Cell state $c_t$ | n/a | 0.124 |
| Hidden state $h_t$ | n/a | 0.079 |

The forget gate at 0.611 means we keep about 61% of the old cell state. The input gate at 0.572 means we incorporate about 57% of the candidate. These gates let the LSTM selectively remember and forget, which is what makes it effective for long sequences.
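Again, the whole step can be verified in a few lines, using the example's exact weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([0.3, 0.5])                  # [h_{t-1}, x_t]
c_prev = 0.1

f = sigmoid(np.dot([0.5, 0.4], hx) + 0.1)  # forget gate
i = sigmoid(np.dot([0.3, 0.6], hx) - 0.1)  # input gate
c_tilde = np.tanh(np.dot([0.7, -0.2], hx)) # candidate (b_c = 0)
o = sigmoid(np.dot([0.4, 0.5], hx) + 0.2)  # output gate

c = f * c_prev + i * c_tilde               # cell state update
h = o * np.tanh(c)                         # hidden state

print(f"f={f:.3f} i={i:.3f} o={o:.3f} c={c:.3f} h={h:.3f}")
# f=0.611 i=0.572 o=0.639 c=0.124 h=0.079
```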


Example 3: Vanishing gradient comparison

Compare gradient flow in a vanilla RNN vs an LSTM over 10 time steps.

Vanilla RNN

At each time step, the gradient is multiplied by $\tanh'(z_k)$. For a typical hidden activation of $\tanh(z) \approx 0.7$:

$$\tanh'(z) = 1 - \tanh^2(z) = 1 - 0.49 = 0.51$$

After 10 time steps, the gradient contribution from the tanh derivatives alone is:

$$\prod_{k=1}^{10} 0.51 = 0.51^{10} \approx 0.00119$$

The gradient reaching the first time step is roughly 0.1% of the gradient at the output. And this does not even account for the $W_h$ factors, which can shrink it further. Early time steps receive almost no learning signal.

LSTM cell state

The gradient through the LSTM cell state is $\prod_{k=1}^{T} f_k$, where $f_k$ is the forget gate value. If the forget gate learns to stay close to 1 (say $f_k \approx 0.95$):

$$\prod_{k=1}^{10} 0.95 = 0.95^{10} \approx 0.599$$

After 10 time steps, about 60% of the gradient survives. That is a massive improvement over the RNN's 0.1%.

| Time steps back | RNN gradient ($0.51^t$) | LSTM gradient ($0.95^t$) |
|---|---|---|
| 1 | 0.510 | 0.950 |
| 3 | 0.133 | 0.857 |
| 5 | 0.035 | 0.774 |
| 10 | 0.00119 | 0.599 |
| 20 | 0.0000014 | 0.358 |

At 20 time steps, the RNN gradient is essentially zero ($1.4 \times 10^{-6}$), while the LSTM still retains 36% of the signal. This is why LSTMs can learn dependencies spanning hundreds of time steps, while vanilla RNNs struggle with sequences longer than 10-20 steps.

The cell state acts as a gradient highway. When the forget gate is close to 1, the cell state passes information (and gradients) forward with minimal loss. This is the fundamental insight that makes LSTMs work.
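The comparison comes straight from the two per-step factors, so a short loop reproduces it (0.51 and 0.95 are the assumed decay rates from above):

```python
rnn_factor, lstm_factor = 0.51, 0.95   # tanh' factor vs forget-gate value

for t in (1, 3, 5, 10, 20):
    print(f"{t:>2} steps back: RNN {rnn_factor**t:.7f}   LSTM {lstm_factor**t:.3f}")
```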


What comes next

RNNs and LSTMs process sequences one step at a time. This sequential nature limits parallelism and makes training slow for very long sequences. The attention mechanism and Transformers solve this by processing all positions in parallel and using attention to directly connect any two positions in the sequence, regardless of distance. Transformers have largely replaced RNNs for NLP and are now expanding into vision, audio, and more.
