Encoder-decoder architectures
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites: recurrent neural networks and LSTMs, and the attention mechanism and transformers.
Many real problems have inputs and outputs of different lengths. A French sentence has 7 words; its English translation might have 9. An image has millions of pixels; its segmentation mask has the same spatial size but a completely different meaning. Encoder-decoder architectures handle this by splitting the model into two halves: one reads the input, the other produces the output.
The core idea
Translation is a two-step process: understand the input sentence (encode), then generate the output sentence (decode). A human translator reads the French sentence, builds a mental picture of its meaning, and then expresses that meaning in English. An encoder-decoder model does the same thing, replacing “mental picture” with a numeric vector.
French to English, word by word
| Position | French (source) | English (target) |
|---|---|---|
| 1 | Le | The |
| 2 | chat | cat |
| 3 | est | is |
| 4 | assis | sitting |
| 5 | sur | on |
| 6 | le | the |
| 7 | tapis | mat |
The source has 7 words. The target also has 7 here, but that is a coincidence. Sentences in different languages rarely align one to one. Some words expand (“est assis” to “is sitting”), others compress. The model must handle variable lengths on both sides.
Encoder-decoder overview
graph LR
A["Source sentence (French)"] --> B["Encoder: reads all input words"]
B --> C["Context vector: compressed meaning"]
C --> D["Decoder: generates one word at a time"]
D --> E["Output sentence (English)"]
The encoder reads the full input and compresses it into a fixed-length context vector. The decoder takes that vector and produces the output sequence, one token at a time. This design handles any input length and any output length.
Now let’s formalize each component.
The seq2seq setup
A sequence-to-sequence (seq2seq) model maps a variable-length input sequence to a variable-length output sequence. You cannot do this with a single feedforward network because the dimensions are not fixed. The encoder-decoder design solves this with three components:
- Encoder: processes the entire input and compresses it into a representation.
- Context vector: the compressed representation that bridges encoder and decoder. Also called the bottleneck.
- Decoder: takes the context vector and generates output tokens one at a time.
The encoder and decoder can be RNNs, CNNs, or Transformers. The architecture does not prescribe which. What matters is the flow: read everything, compress, then generate.
graph LR
subgraph Encoder
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H2["h₂"]
X3["x₃"] --> H3["h₃"]
H1 --> H2
H2 --> H3
end
H3 --> C["Context c"]
subgraph Decoder
C --> S1["s₁ → y₁"]
S1 --> S2["s₂ → y₂"]
S2 --> S3["s₃ → y₃"]
end
style C fill:#ff9,stroke:#333,color:#000
Figure 1: Basic encoder-decoder. The encoder reads x₁, x₂, x₃ and produces hidden states. The final hidden state h₃ becomes the context vector c. The decoder generates y₁, y₂, y₃ conditioned on c.
The encoder
The encoder reads the input sequence $x_1, x_2, \dots, x_T$ and produces a sequence of hidden states:

$$h_t = f(h_{t-1}, x_t), \qquad t = 1, \dots, T$$

For an LSTM encoder, $f$ is the LSTM cell update. For a Transformer encoder, it is a stack of self-attention layers. The key point: after processing all input tokens, the encoder has built a representation of the entire input.

In the simplest design, we take the final hidden state as the context vector, $c = h_T$. This single vector must carry everything the decoder needs to know about the input.
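Here is a minimal sketch of such an encoder in PyTorch. The choice of a GRU, the embedding and hidden sizes, and the class interface are illustrative assumptions, not something the architecture prescribes:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """GRU encoder: reads the whole source sequence and returns all hidden
    states plus the final state, which serves as the context vector c."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                 # (batch, src_len)
        embedded = self.embedding(src_tokens)      # (batch, src_len, embed_dim)
        outputs, h_final = self.rnn(embedded)      # outputs: (batch, src_len, hidden_dim)
        return outputs, h_final                    # h_final: (1, batch, hidden_dim) = context c
```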
The decoder
The decoder generates one output token at a time. At each step $t$, it takes:
- The previous hidden state $s_{t-1}$
- The previous output token $y_{t-1}$ (or a start-of-sequence token for $t = 1$)
- The context vector $c$

And computes:

$$s_t = g(s_{t-1}, y_{t-1}, c), \qquad P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_o s_t)$$

The decoder keeps generating until it produces an end-of-sequence token or hits a maximum length. During training, we use teacher forcing: feed the ground-truth $y_{t-1}$ instead of the model’s own prediction.
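A matching decoder sketch, under the same assumptions as the encoder above; the greedy generation loop and the start/end token handling are illustrative:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """GRU decoder: one step consumes the previous token and the previous
    hidden state (initialized from the context vector c)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, prev_hidden):
        # prev_token: (batch, 1), prev_hidden: (1, batch, hidden_dim)
        embedded = self.embedding(prev_token)             # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, prev_hidden)  # (batch, 1, hidden_dim)
        logits = self.out(output.squeeze(1))              # (batch, vocab_size)
        return logits, hidden

def greedy_decode(decoder, context, sos_idx, eos_idx, max_len=50):
    """Generate until the end-of-sequence token or max_len (inference, no teacher forcing)."""
    token = torch.full((context.size(1), 1), sos_idx, dtype=torch.long)
    hidden, generated = context, []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1, keepdim=True)       # greedy pick of the next token
        generated.append(token)
        if (token == eos_idx).all():
            break
    return torch.cat(generated, dim=1)                    # (batch, generated_len)
```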
The bottleneck problem
Figure: BLEU score vs. sequence length. Translation quality degrades significantly as input sequences get longer, motivating attention mechanisms.
Here is the fundamental tension. The context vector is a fixed-size vector, often 256 or 512 dimensions. But the input might be 5 tokens or 500. You are asking a single vector to memorize an entire paragraph.
Short sequences work fine. Long sequences lose information, especially details from early tokens. The encoder’s final hidden state tends to be dominated by recent inputs, because that is how RNNs work: earlier information fades.
This is the bottleneck problem. It limits seq2seq performance on long sequences, and it motivated the invention of attention.
Example 3: Quantifying bottleneck compression
Consider encoding a 10-step sequence where each encoder hidden state is 64-dimensional: $h_1, \dots, h_{10} \in \mathbb{R}^{64}$.

Total information in encoder hidden states: $10 \times 64 = 640$ values.

Context vector dimension: 32.

Compression ratio: $640 / 32 = 20{:}1$.

You are compressing 640 numbers into 32. For a short sequence of 3 steps, the ratio would be $3 \times 64 / 32 = 6{:}1$, which is much more manageable. But as sequences get longer, the compression gets more extreme, and performance degrades.

With attention, the decoder accesses all 10 hidden states directly. That is $10 \times 64 = 640$ values, zero compression. Each decoder step picks which encoder states matter most.
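The same arithmetic as a throwaway helper (purely illustrative):

```python
def compression_ratio(seq_len, hidden_dim, context_dim):
    """Numbers the encoder produces vs. numbers the context vector keeps."""
    return (seq_len * hidden_dim) / context_dim

print(compression_ratio(10, 64, 32))  # 20.0 -> 640 values squeezed into 32
print(compression_ratio(3, 64, 32))   # 6.0  -> much gentler for short inputs
```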
Attention in encoder-decoder models
Attention solves the bottleneck by letting the decoder look at all encoder hidden states, not just the last one. At each decoder step $t$, the decoder computes a weighted sum of encoder states:

$$c_t = \sum_{i=1}^{T} \alpha_{t,i} \, h_i, \qquad \alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{j} \exp(\mathrm{score}(s_{t-1}, h_j))}$$

The score function can be a dot product, additive (Bahdanau), or scaled dot product. The attention weights $\alpha_{t,i}$ tell the decoder how much to focus on each encoder position.
graph LR
subgraph Encoder
H1["h₁"]
H2["h₂"]
H3["h₃"]
end
subgraph Attention
S["s₀ (decoder)"] -->|score| A1["α₁"]
S -->|score| A2["α₂"]
S -->|score| A3["α₃"]
A1 --> CT["c₁ = Σ αᵢhᵢ"]
A2 --> CT
A3 --> CT
end
H1 -->|weighted| CT
H2 -->|weighted| CT
H3 -->|weighted| CT
CT --> D["Decoder step 1"]
style CT fill:#9f9,stroke:#333,color:#000
Figure 2: Encoder-decoder with attention. The decoder state s₀ computes attention scores against all encoder hidden states. The weighted sum c₁ replaces the fixed context vector.
Now the context is different at every decoder step. When translating “the cat sat on the mat,” the decoder focuses on “cat” when generating the subject and on “mat” when generating the object. This is far more expressive than a single fixed vector.
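A minimal dot-product attention sketch in PyTorch; additive (Bahdanau) scoring would replace the dot product with a small feedforward network. Shapes and names here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: (batch, hidden), encoder_states: (batch, src_len, hidden).
    Returns the per-step context vector and the attention weights."""
    # Score each encoder position against the current decoder state
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=-1)                                        # alpha_{t,i}
    # Context = attention-weighted sum of encoder states
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hidden)
    return context, weights
```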
Seq2seq without attention vs with attention
graph TD
subgraph Without Attention
E1["Encoder"] --> C1["Single context vector c"]
C1 --> D1["Decoder uses same c at every step"]
end
subgraph With Attention
E2["Encoder"] --> H1["h1"]
E2 --> H2["h2"]
E2 --> H3["h3"]
H1 --> W1["Weighted sum, different at each step"]
H2 --> W1
H3 --> W1
W1 --> D2["Decoder gets custom context per step"]
end
Without attention, long inputs get crushed into a single vector, losing early details. With attention, the decoder reaches back into the encoder at every step, picking out exactly the information it needs. This is why attention transformed machine translation quality.
Example 1: Attention forward pass for translation
Three encoder hidden states $h_1, h_2, h_3$ and a decoder initial state $s_0 = \mathbf{0}$ (the zero vector).

Step 1: Compute dot-product attention scores, $e_i = s_0^\top h_i$. With $s_0 = \mathbf{0}$, every score is 0.

Step 2: Softmax to get attention weights, $\alpha_i = \mathrm{softmax}(e)_i = 1/3$ for every position.

When the decoder has no information yet (zero state), it pays equal attention to every encoder position. This makes sense: with no context about what to generate, no position is more relevant than another.

Step 3: Compute the context vector, $c_1 = \sum_i \alpha_i h_i = \tfrac{1}{3}(h_1 + h_2 + h_3)$.

Step 4: Decoder update. The new state $s_1$ is computed from $s_0$ and $c_1$ (for example, $s_1 = \tanh(W[s_0; c_1])$).

Step 5: Output probabilities. With an output matrix $W_o$ projecting $s_1$ to a 3-word vocabulary, $P(y_1) = \mathrm{softmax}(W_o s_1)$.

In this example, word 2 gets the highest probability. At the next step, $s_1$ is no longer zero, so the attention weights will shift to focus on the most relevant encoder states.
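You can verify the first three steps numerically. The encoder states below are hypothetical stand-ins (the point is that a zero decoder state yields uniform attention, regardless of the values):

```python
import numpy as np

# Hypothetical encoder hidden states (illustrative values, not from the example)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # 3 encoder positions, hidden size 2
s0 = np.zeros(2)                    # decoder starts with a zero state

scores = H @ s0                     # dot-product scores: all 0 when s0 = 0
weights = np.exp(scores) / np.exp(scores).sum()
print(weights)                      # [0.333 0.333 0.333] -> uniform attention

c1 = weights @ H                    # context = average of the encoder states
print(c1)                           # [0.667 0.667]
```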
Skip connections: the U-Net approach
Not all encoder-decoder models process sequences. In image segmentation, the encoder downsamples an image to extract features, and the decoder upsamples back to the original resolution. The bottleneck loses spatial detail, which is a problem when you need pixel-precise output.
U-Net solves this with skip connections: direct links from encoder layers to decoder layers at the same resolution. The decoder receives both the upsampled features (coarse, semantic) and the encoder features (fine, spatial). It concatenates them along the channel dimension.
graph TD
E1["Encoder 64×64×32"] --> E2["Encoder 32×32×64"]
E2 --> E3["Encoder 16×16×128"]
E3 --> B["Bottleneck 8×8×256"]
B --> D3["Decoder 16×16×128"]
D3 --> D2["Decoder 32×32×64"]
D2 --> D1["Decoder 64×64×32"]
E3 -.->|"skip (concat)"| D3
E2 -.->|"skip (concat)"| D2
E1 -.->|"skip (concat)"| D1
style B fill:#f99,stroke:#333,color:#000
Figure 3: U-Net with skip connections. Dashed arrows show feature maps passed directly from encoder to decoder at matching resolutions. This preserves fine spatial detail.
Example 2: U-Net skip connection in practice
Suppose at one level of the U-Net:

Encoder feature map $E$: 1 channel, 2×2.

Decoder upsampled feature map $D$: 1 channel, 2×2.

Skip connection: concatenate $D$ and $E$ along the channel dimension.

The result has shape (height × width × channels): 2 × 2 × 2.

Channel 0 carries the decoder’s coarse semantic features. Channel 1 carries the encoder’s fine spatial features. A subsequent convolution fuses both into a single representation that has both precise boundaries and correct semantics.

Without the skip connection, the decoder only sees the upsampled map $D$, which lost the sharp edges during downsampling. With it, the network can recover those edges from $E$.
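A small PyTorch sketch of the same operation; the tensor sizes and the 1×1 fusion convolution are illustrative assumptions:

```python
import torch

# Hypothetical feature maps at one U-Net level (batch, channels, height, width)
encoder_feat = torch.randn(1, 1, 2, 2)   # fine spatial detail from the encoder
decoder_feat = torch.randn(1, 1, 2, 2)   # coarse semantics, upsampled in the decoder

# Skip connection: concatenate along the channel dimension
fused = torch.cat([decoder_feat, encoder_feat], dim=1)
print(fused.shape)                        # torch.Size([1, 2, 2, 2]) -> 2 channels

# A subsequent convolution mixes both channels into one fused representation
conv = torch.nn.Conv2d(in_channels=2, out_channels=1, kernel_size=1)
out = conv(fused)
print(out.shape)                          # torch.Size([1, 1, 2, 2])
```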
Encoder-decoder variants
| Architecture | Skip Connections | Attention Type | Main Application | Key Paper |
|---|---|---|---|---|
| Seq2seq (Sutskever 2014) | None | None | Translation | Sutskever et al. 2014 |
| Bahdanau seq2seq | None | Additive | Translation | Bahdanau et al. 2015 |
| Transformer | None | Multi-head self + cross | Translation, text generation | Vaswani et al. 2017 |
| U-Net | Yes (concat) | None | Image segmentation | Ronneberger et al. 2015 |
| Attention U-Net | Yes (gated) | Additive gates on skips | Medical image segmentation | Oktay et al. 2018 |
| Show-Attend-Tell | None | Soft/Hard attention | Image captioning | Xu et al. 2015 |
The table shows how the same encoder-decoder idea adapts across domains. Translation models use attention over temporal positions. Segmentation models use skip connections over spatial resolutions. Some architectures combine both.
Applications
Machine translation. The original and most famous application. The encoder reads a sentence in one language, and the decoder generates it in another. Modern Transformer-based systems (like those behind translation services) are still encoder-decoder models at their core.
Image segmentation. U-Net and its variants dominate medical image segmentation. The encoder (often a pretrained CNN like ResNet) extracts features at multiple scales. The decoder upsamples and produces per-pixel class labels.
Image captioning. The encoder is a CNN that produces a grid of feature vectors from an image. The decoder is an RNN or Transformer that generates a natural language caption, attending to different image regions at each word.
Text summarization. The encoder reads a long document. The decoder generates a shorter summary. Attention is critical here because the decoder needs to identify and focus on the most important parts of the input.
Speech recognition. Audio features go into the encoder, and text comes out of the decoder. The input and output lengths differ significantly, making the encoder-decoder framework a natural fit.
Training encoder-decoder models
Training uses standard backpropagation through the entire network. The loss is typically cross-entropy between the predicted and true output tokens, summed over all decoder steps:

$$\mathcal{L} = -\sum_{t=1}^{T'} \log P(y_t^* \mid y_{<t}^*, x)$$

where $y_t^*$ is the ground-truth token at step $t$.
Teacher forcing feeds the true $y_{t-1}$ to the decoder during training instead of the model’s own prediction. This stabilizes training but creates a mismatch with inference, where the model must use its own predictions. Scheduled sampling gradually shifts from teacher forcing to model predictions during training.
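One possible training step that combines the summed cross-entropy loss with teacher forcing, reusing the hypothetical Encoder and Decoder classes sketched earlier; lowering teacher_forcing_ratio over the course of training gives scheduled sampling:

```python
import random
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, src, tgt, optimizer, teacher_forcing_ratio=1.0):
    """One seq2seq training step with (optionally scheduled) teacher forcing.
    src: (batch, src_len), tgt: (batch, tgt_len) with <sos> at position 0."""
    optimizer.zero_grad()
    _, hidden = encoder(src)                      # context vector initializes the decoder
    token = tgt[:, 0:1]                           # <sos>
    loss = 0.0
    for t in range(1, tgt.size(1)):
        logits, hidden = decoder(token, hidden)
        loss = loss + F.cross_entropy(logits, tgt[:, t])        # summed over decoder steps
        if random.random() < teacher_forcing_ratio:
            token = tgt[:, t:t+1]                 # teacher forcing: feed the ground truth
        else:
            token = logits.argmax(dim=-1, keepdim=True)          # free running: feed own prediction
    loss.backward()
    optimizer.step()
    return loss.item()
```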
Teacher forcing vs free running
graph TD
subgraph Teacher Forcing
A1["Input: ground truth y1"] --> B1["Decoder step 2"]
B1 --> C1["Input: ground truth y2"]
C1 --> D1["Decoder step 3"]
end
subgraph Free Running
A2["Input: model prediction y1"] --> B2["Decoder step 2"]
B2 --> C2["Input: model prediction y2"]
C2 --> D2["Decoder step 3"]
end
With teacher forcing, each step gets the correct previous token, so errors never compound. At inference time, the model must use its own predictions, and a single mistake can cascade. Scheduled sampling mixes both strategies during training to bridge the gap.
Beam search at inference keeps the top-$k$ most probable partial sequences at each step instead of greedily picking the best token. This explores more of the output space and usually produces better results than greedy decoding.
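A compact beam-search sketch against the same hypothetical decoder interface (batch size 1, no length penalty):

```python
import torch
import torch.nn.functional as F

def beam_search(decoder, context, sos_idx, eos_idx, beam_size=4, max_len=50):
    """Keep the beam_size most probable partial sequences at every step.
    Each hypothesis accumulates a sum of log-probabilities."""
    beams = [([sos_idx], 0.0, context)]           # (tokens, log_prob, decoder hidden state)
    for _ in range(max_len):
        candidates = []
        for tokens, score, hidden in beams:
            if tokens[-1] == eos_idx:             # finished hypotheses carry over unchanged
                candidates.append((tokens, score, hidden))
                continue
            prev = torch.tensor([[tokens[-1]]])
            logits, new_hidden = decoder(prev, hidden)
            log_probs = F.log_softmax(logits, dim=-1).squeeze(0)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp, new_hidden))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos_idx for t, _, _ in beams):
            break
    return beams[0][0]                            # highest-scoring token sequence
```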
Common pitfalls
Exposure bias. Teacher forcing means the decoder never sees its own mistakes during training. At inference, one wrong token can cascade into a completely wrong sequence.
Length mismatch. The model may generate outputs that are too short or too long. Length penalties in beam search help, but this remains an active area of research.
Attention collapse. In some cases, the attention weights become too uniform or too peaked, ignoring relevant encoder states. Regularizing the attention distribution can help.
What comes next
Encoder-decoder architectures are the backbone of many modern systems. Now that you understand how they work, the natural question is: can we use neural networks not just to map inputs to outputs, but to generate entirely new data? That is the topic of generative models, where encoder-decoder ideas (especially the decoder half) play a central role.