
Encoder-decoder architectures

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Recurrent neural networks and LSTMs; Attention mechanism and transformers.

Many real problems have inputs and outputs of different lengths. A French sentence has 7 words; its English translation might have 9. An image has millions of pixels; its segmentation mask has the same spatial size but a completely different meaning. Encoder-decoder architectures handle this by splitting the model into two halves: one reads the input, the other produces the output.

The core idea

Translation is a two-step process: understand the input sentence (encode), then generate the output sentence (decode). A human translator reads the French sentence, builds a mental picture of its meaning, and then expresses that meaning in English. An encoder-decoder model does the same thing, replacing “mental picture” with a numeric vector.

French to English, word by word

| Position | French (source) | English (target) |
|---|---|---|
| 1 | Le | The |
| 2 | chat | cat |
| 3 | est | is |
| 4 | assis | sitting |
| 5 | sur | on |
| 6 | le | the |
| 7 | tapis | mat |

The source has 7 words. The target also has 7 here, but that is a coincidence. Sentences in different languages rarely align one to one. Some words expand (“est assis” to “is sitting”), others compress. The model must handle variable lengths on both sides.

Encoder-decoder overview

graph LR
  A["Source sentence
(French)"] --> B["Encoder
Reads all input words"]
  B --> C["Context vector
Compressed meaning"]
  C --> D["Decoder
Generates one word at a time"]
  D --> E["Output sentence
(English)"]

The encoder reads the full input and compresses it into a fixed-length context vector. The decoder takes that vector and produces the output sequence, one token at a time. This design handles any input length and any output length.

Now let’s formalize each component.

The seq2seq setup

A sequence-to-sequence (seq2seq) model maps a variable-length input sequence to a variable-length output sequence. You cannot do this with a single feedforward network because the dimensions are not fixed. The encoder-decoder design solves this with three components:

  1. Encoder: processes the entire input and compresses it into a representation.
  2. Context vector: the compressed representation that bridges encoder and decoder. Also called the bottleneck.
  3. Decoder: takes the context vector and generates output tokens one at a time.

The encoder and decoder can be RNNs, CNNs, or Transformers. The architecture does not prescribe which. What matters is the flow: read everything, compress, then generate.

graph LR
  subgraph Encoder
      X1["x₁"] --> H1["h₁"]
      X2["x₂"] --> H2["h₂"]
      X3["x₃"] --> H3["h₃"]
      H1 --> H2
      H2 --> H3
  end
  H3 --> C["Context c"]
  subgraph Decoder
      C --> S1["s₁ → y₁"]
      S1 --> S2["s₂ → y₂"]
      S2 --> S3["s₃ → y₃"]
  end
  style C fill:#ff9,stroke:#333,color:#000

Figure 1: Basic encoder-decoder. The encoder reads x₁, x₂, x₃ and produces hidden states. The final hidden state h₃ becomes the context vector c. The decoder generates y₁, y₂, y₃ conditioned on c.

The encoder

The encoder reads the input sequence x_1, x_2, \ldots, x_T and produces a sequence of hidden states:

h_t = f_{\text{enc}}(x_t, h_{t-1})

For an LSTM encoder, f_{\text{enc}} is the LSTM cell update. For a Transformer encoder, it is a stack of self-attention layers. The key point: after processing all T input tokens, the encoder has built a representation of the entire input.

In the simplest design, we take the final hidden state h_T as the context vector c = h_T. This single vector must carry everything the decoder needs to know about the input.
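The recurrence above can be sketched in a few lines of NumPy. This is a minimal vanilla-RNN encoder (a tanh cell rather than a full LSTM, for brevity); the weight names `W_xh`, `W_hh` and the random toy dimensions are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def rnn_encoder(xs, W_xh, W_hh, b_h):
    """Vanilla RNN encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
    Returns all hidden states and the final state (the context vector c)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states), h  # states: (T, hidden); context c = h_T

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 8                       # toy sizes
xs = rng.normal(size=(T, d_in))              # 5 input "tokens"
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

states, c = rnn_encoder(xs, W_xh, W_hh, b_h)
print(states.shape, c.shape)  # (5, 8) (8,)
```

Note that `c` is just the last row of `states`: the simplest design throws away everything but the final hidden state.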

The decoder

The decoder generates one output token at a time. At each step t, it takes:

  • The previous hidden state s_{t-1}
  • The previous output token y_{t-1} (or a start-of-sequence token for t = 1)
  • The context vector c

And computes:

s_t = f_{\text{dec}}(y_{t-1}, s_{t-1}, c)
P(y_t \mid y_{<t}, x) = \text{softmax}(W_o \cdot s_t)

The decoder keeps generating until it produces an end-of-sequence token or hits a maximum length. During training, we use teacher forcing: feed the ground-truth y_{t-1} instead of the model’s own prediction.
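A minimal sketch of that generation loop, assuming a tanh decoder cell that concatenates the previous token embedding, the previous state, and the context. All names (`embed`, `W_s`, `W_o`) and the tiny random weights are hypothetical; the point is the loop structure, not the cell.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(c, embed, W_s, W_o, sos_id, eos_id, max_len=10):
    """Minimal decoder loop: s_t = tanh(W_s [y_{t-1}; s_{t-1}; c]),
    then project to vocabulary logits and pick the argmax token."""
    s = np.zeros(W_o.shape[1])
    y = sos_id
    out = []
    for _ in range(max_len):
        inp = np.concatenate([embed[y], s, c])
        s = np.tanh(W_s @ inp)
        probs = softmax(W_o @ s)
        y = int(np.argmax(probs))
        if y == eos_id:          # stop at end-of-sequence
            break
        out.append(y)
    return out

rng = np.random.default_rng(1)
V, d_e, d_h = 6, 4, 8            # vocab, embedding, hidden sizes
embed = rng.normal(size=(V, d_e))
W_s = rng.normal(scale=0.1, size=(d_h, d_e + d_h + d_h))
W_o = rng.normal(scale=0.1, size=(V, d_h))
c = rng.normal(size=d_h)         # context vector from the encoder
tokens = greedy_decode(c, embed, W_s, W_o, sos_id=0, eos_id=1)
print(tokens)
```

Note that the same `c` is fed at every step; with attention (below) it would be recomputed per step.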

The bottleneck problem

BLEU score vs sequence length. Translation quality degrades significantly as input sequences get longer, motivating attention mechanisms.

Here is the fundamental tension. The context vector c is a fixed-size vector, often 256 or 512 dimensions. But the input might be 5 tokens or 500. You are asking a single vector to memorize an entire paragraph.

Short sequences work fine. Long sequences lose information, especially details from early tokens. The encoder’s final hidden state tends to be dominated by recent inputs, because that is how RNNs work: earlier information fades.

This is the bottleneck problem. It limits seq2seq performance on long sequences, and it motivated the invention of attention.

Example 3: Quantifying bottleneck compression

Consider encoding a 10-step sequence where each encoder hidden state is 64-dimensional:

Total information in encoder hidden states:

10 \times 64 = 640 \text{ values}

Context vector dimension: 32

Compression ratio:

\frac{640}{32} = 20\times

You are compressing 640 numbers into 32. For a short sequence of 3 steps, the ratio would be \frac{192}{32} = 6\times, which is much more manageable. But as sequences get longer, the compression gets more extreme, and performance degrades.

With attention, the decoder accesses all 10 hidden states directly. That is 10 \times 64 = 640 values, zero compression. Each decoder step picks which encoder states matter most.
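The arithmetic above is simple enough to wrap in a one-line helper, which makes it easy to see how the ratio scales with sequence length:

```python
def compression_ratio(seq_len, hidden_dim, context_dim):
    """Ratio of total encoder-state values to context-vector values."""
    return seq_len * hidden_dim / context_dim

print(compression_ratio(10, 64, 32))  # 20.0 — the example above
print(compression_ratio(3, 64, 32))   # 6.0  — the short sequence
print(compression_ratio(500, 64, 32)) # 1000.0 — a long document
```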

Attention in encoder-decoder models

Attention solves the bottleneck by letting the decoder look at all encoder hidden states, not just the last one. At each decoder step t, the decoder computes a weighted sum of encoder states:

e_{ti} = \text{score}(s_{t-1}, h_i)
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}
c_t = \sum_{i=1}^{T} \alpha_{ti} \, h_i

The score function can be a dot product, additive (Bahdanau), or scaled dot product. The attention weights \alpha_{ti} tell the decoder how much to focus on each encoder position.
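The three equations above, with the dot-product score, fit in one small NumPy function:

```python
import numpy as np

def dot_attention(s_prev, H):
    """Dot-product attention.
    scores  e_i   = s_prev · h_i
    weights alpha = softmax(e)
    context c     = sum_i alpha_i h_i
    H: (T, d) encoder states; s_prev: (d,) previous decoder state."""
    e = H @ s_prev                  # (T,) scores
    a = np.exp(e - e.max())
    a = a / a.sum()                 # (T,) attention weights, sum to 1
    c = a @ H                       # (d,) per-step context vector
    return c, a

# Toy states: three encoder positions, 2-dimensional
H = np.array([[0.5, 0.3], [-0.1, 0.8], [0.7, -0.2]])
c, alpha = dot_attention(np.zeros(2), H)
print(alpha)  # uniform [1/3, 1/3, 1/3] for a zero decoder state
```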

graph LR
  subgraph Encoder
      H1["h₁"]
      H2["h₂"]
      H3["h₃"]
  end
  subgraph Attention
      S["s₀ (decoder)"] -->|score| A1["α₁"]
      S -->|score| A2["α₂"]
      S -->|score| A3["α₃"]
      A1 --> CT["c₁ = Σ αᵢhᵢ"]
      A2 --> CT
      A3 --> CT
  end
  H1 -->|weighted| CT
  H2 -->|weighted| CT
  H3 -->|weighted| CT
  CT --> D["Decoder step 1"]
  style CT fill:#9f9,stroke:#333,color:#000

Figure 2: Encoder-decoder with attention. The decoder state s₀ computes attention scores against all encoder hidden states. The weighted sum c₁ replaces the fixed context vector.

Now the context is different at every decoder step. When translating “the cat sat on the mat,” the decoder focuses on “cat” when generating the subject and on “mat” when generating the object. This is far more expressive than a single fixed vector.

Seq2seq without attention vs with attention

graph TD
  subgraph Without Attention
      E1["Encoder"] --> C1["Single context
vector c"]
      C1 --> D1["Decoder uses
same c at
every step"]
  end
  subgraph With Attention
      E2["Encoder"] --> H1["h1"]
      E2 --> H2["h2"]
      E2 --> H3["h3"]
      H1 --> W1["Weighted sum
different at
each step"]
      H2 --> W1
      H3 --> W1
      W1 --> D2["Decoder gets
custom context
per step"]
  end

Without attention, long inputs get crushed into a single vector, losing early details. With attention, the decoder reaches back into the encoder at every step, picking out exactly the information it needs. This is why attention transformed machine translation quality.

Example 1: Attention forward pass for translation

Three encoder hidden states and a decoder initial state:

h_1 = [0.5,\; 0.3], \quad h_2 = [-0.1,\; 0.8], \quad h_3 = [0.7,\; -0.2], \quad s_0 = [0,\; 0]

Step 1: Compute dot-product attention scores.

e_1 = s_0^T h_1 = 0 \cdot 0.5 + 0 \cdot 0.3 = 0
e_2 = s_0^T h_2 = 0 \cdot (-0.1) + 0 \cdot 0.8 = 0
e_3 = s_0^T h_3 = 0 \cdot 0.7 + 0 \cdot (-0.2) = 0

Step 2: Softmax to get attention weights.

\alpha = \text{softmax}([0, 0, 0]) = \left[\frac{1}{3},\; \frac{1}{3},\; \frac{1}{3}\right]

When the decoder has no information yet (zero state), it pays equal attention to every encoder position. This makes sense: with no context about what to generate, no position is more relevant than another.

Step 3: Compute context vector.

c_1 = \frac{1}{3}(h_1 + h_2 + h_3) = \frac{1}{3}[0.5 + (-0.1) + 0.7,\; 0.3 + 0.8 + (-0.2)] = \frac{1}{3}[1.1,\; 0.9] = [0.367,\; 0.300]

Step 4: Decoder update. Using W_c = \begin{bmatrix} 0.5 & 0.3 \\ -0.1 & 0.8 \end{bmatrix}:

s_1 = \tanh(W_c \cdot c_1) = \tanh\!\left(\begin{bmatrix} 0.5(0.367) + 0.3(0.300) \\ -0.1(0.367) + 0.8(0.300) \end{bmatrix}\right) = \tanh\!\left(\begin{bmatrix} 0.274 \\ 0.203 \end{bmatrix}\right) = \begin{bmatrix} 0.268 \\ 0.201 \end{bmatrix}

Step 5: Output probabilities. With W_o projecting to a 3-word vocabulary:

W_o = \begin{bmatrix} 0.5 & -0.3 \\ 0.1 & 0.7 \\ -0.4 & 0.2 \end{bmatrix}

\text{logits} = W_o \cdot s_1 = [0.074,\; 0.168,\; -0.067]
P(y_1) = \text{softmax}([0.074,\; 0.168,\; -0.067]) = [0.337,\; 0.370,\; 0.293]

Word 2 gets the highest probability. At the next step, s_1 is no longer zero, so the attention weights will shift to focus on the most relevant encoder states.
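The five hand-computed steps above can be checked in NumPy with the same numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Encoder states and initial decoder state from the worked example
H = np.array([[0.5, 0.3], [-0.1, 0.8], [0.7, -0.2]])
s0 = np.zeros(2)

e = H @ s0                   # Step 1: dot-product scores -> [0, 0, 0]
alpha = softmax(e)           # Step 2: uniform weights [1/3, 1/3, 1/3]
c1 = alpha @ H               # Step 3: context ≈ [0.367, 0.300]

W_c = np.array([[0.5, 0.3], [-0.1, 0.8]])
s1 = np.tanh(W_c @ c1)       # Step 4: ≈ [0.268, 0.201]

W_o = np.array([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]])
probs = softmax(W_o @ s1)    # Step 5: ≈ [0.337, 0.370, 0.293]
print(np.round(probs, 3))
```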

Skip connections: the U-Net approach

Not all encoder-decoder models process sequences. In image segmentation, the encoder downsamples an image to extract features, and the decoder upsamples back to the original resolution. The bottleneck loses spatial detail, which is a problem when you need pixel-precise output.

U-Net solves this with skip connections: direct links from encoder layers to decoder layers at the same resolution. The decoder receives both the upsampled features (coarse, semantic) and the encoder features (fine, spatial). It concatenates them along the channel dimension.

graph TD
  E1["Encoder 64×64×32"] --> E2["Encoder 32×32×64"]
  E2 --> E3["Encoder 16×16×128"]
  E3 --> B["Bottleneck 8×8×256"]
  B --> D3["Decoder 16×16×128"]
  D3 --> D2["Decoder 32×32×64"]
  D2 --> D1["Decoder 64×64×32"]
  E3 -.->|"skip (concat)"| D3
  E2 -.->|"skip (concat)"| D2
  E1 -.->|"skip (concat)"| D1
  style B fill:#f99,stroke:#333,color:#000

Figure 3: U-Net with skip connections. Dashed arrows show feature maps passed directly from encoder to decoder at matching resolutions. This preserves fine spatial detail.

Example 2: U-Net skip connection in practice

Suppose at one level of the U-Net:

Encoder feature map (1 channel, 2x2):

F_{\text{enc}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}

Decoder upsampled feature map (1 channel, 2x2):

F_{\text{dec}} = \begin{bmatrix} 0.5 & 0.3 \\ 0.1 & 0.7 \end{bmatrix}

Skip connection: concatenate along the channel dimension.

The result has shape 2 \times 2 \times 2 (height × width × channels):

F_{\text{combined}}[:,:,0] = \begin{bmatrix} 0.5 & 0.3 \\ 0.1 & 0.7 \end{bmatrix}, \quad F_{\text{combined}}[:,:,1] = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}

Channel 0 carries the decoder’s coarse semantic features. Channel 1 carries the encoder’s fine spatial features. A subsequent 3 \times 3 convolution fuses both into a single representation that has both precise boundaries and correct semantics.

Without the skip connection, the decoder only sees the upsampled F_{\text{dec}}, which lost the sharp edges during downsampling. With it, the network can recover those edges from F_{\text{enc}}.
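The channel concatenation in this example is one call in NumPy (a framework like PyTorch would use `torch.cat` on a channel axis; here we stack single-channel maps into a new last axis):

```python
import numpy as np

F_dec = np.array([[0.5, 0.3], [0.1, 0.7]])   # upsampled decoder features
F_enc = np.array([[1.0, 2.0], [3.0, 4.0]])   # encoder skip features

# Skip connection: stack the two 2x2 maps along a new channel axis (H, W, C)
F_combined = np.stack([F_dec, F_enc], axis=-1)
print(F_combined.shape)        # (2, 2, 2)
print(F_combined[:, :, 1])     # the encoder's fine spatial detail, intact
```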

Encoder-decoder variants

| Architecture | Skip Connections | Attention Type | Main Application | Key Paper |
|---|---|---|---|---|
| Seq2seq | None | None | Translation | Sutskever et al. 2014 |
| Bahdanau seq2seq | None | Additive | Translation | Bahdanau et al. 2015 |
| Transformer | None | Multi-head self + cross | Translation, text generation | Vaswani et al. 2017 |
| U-Net | Yes (concat) | None | Image segmentation | Ronneberger et al. 2015 |
| Attention U-Net | Yes (gated) | Additive gates on skips | Medical image segmentation | Oktay et al. 2018 |
| Show-Attend-Tell | None | Soft/hard attention | Image captioning | Xu et al. 2015 |

The table shows how the same encoder-decoder idea adapts across domains. Translation models use attention over temporal positions. Segmentation models use skip connections over spatial resolutions. Some architectures combine both.

Applications

Machine translation. The original and most famous application. The encoder reads a sentence in one language, and the decoder generates it in another. Modern Transformer-based systems (like those behind translation services) are still encoder-decoder models at their core.

Image segmentation. U-Net and its variants dominate medical image segmentation. The encoder (often a pretrained CNN like ResNet) extracts features at multiple scales. The decoder upsamples and produces per-pixel class labels.

Image captioning. The encoder is a CNN that produces a grid of feature vectors from an image. The decoder is an RNN or Transformer that generates a natural language caption, attending to different image regions at each word.

Text summarization. The encoder reads a long document. The decoder generates a shorter summary. Attention is critical here because the decoder needs to identify and focus on the most important parts of the input.

Speech recognition. Audio features go into the encoder, and text comes out of the decoder. The input and output lengths differ significantly, making the encoder-decoder framework a natural fit.

Training encoder-decoder models

Training uses standard backpropagation through the entire network. The loss is typically cross-entropy between the predicted and true output tokens, summed over all decoder steps:

\mathcal{L} = -\sum_{t=1}^{T'} \log P(y_t^* \mid y_{<t}^*, x)

where y_t^* is the ground-truth token at step t.

Teacher forcing feeds the true y_{t-1}^* to the decoder during training instead of the model’s own prediction. This stabilizes training but creates a mismatch with inference, where the model must use its own predictions. Scheduled sampling gradually shifts from teacher forcing to model predictions during training.
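The loss above is easy to compute given the decoder’s per-step logits. A minimal sketch, assuming the logits were produced with teacher forcing (ground-truth inputs) and toy numbers chosen here for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def teacher_forced_loss(logits, targets):
    """Seq2seq training loss: -sum_t log P(y*_t | y*_{<t}, x).
    logits: (T', V) decoder outputs under teacher forcing;
    targets: (T',) ground-truth token ids."""
    probs = softmax(logits)
    return -np.sum(np.log(probs[np.arange(len(targets)), targets]))

# Three decoder steps, vocabulary of 3 tokens (toy values)
logits = np.array([[ 2.0, 0.1, -1.0],
                   [ 0.3, 1.5,  0.2],
                   [-0.5, 0.0,  2.2]])
targets = np.array([0, 1, 2])
print(teacher_forced_loss(logits, targets))  # ≈ 0.799
```

The loss is small here because each step already assigns most of its probability mass to the correct token.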

Teacher forcing vs free running

graph TD
  subgraph Teacher Forcing
      A1["Input: ground truth y1"] --> B1["Decoder step 2"]
      B1 --> C1["Input: ground truth y2"]
      C1 --> D1["Decoder step 3"]
  end
  subgraph Free Running
      A2["Input: model prediction y1"] --> B2["Decoder step 2"]
      B2 --> C2["Input: model prediction y2"]
      C2 --> D2["Decoder step 3"]
  end

With teacher forcing, each step gets the correct previous token, so errors never compound. At inference time, the model must use its own predictions, and a single mistake can cascade. Scheduled sampling mixes both strategies during training to bridge the gap.

Beam search at inference keeps the top-k most probable partial sequences at each step instead of greedily picking the best token. This explores more of the output space and usually produces better results than greedy decoding.
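A toy sketch of the keep-top-k idea. For simplicity the per-step log-probabilities here come from a fixed table rather than a real decoder, which deliberately omits the prefix conditioning a real model would have:

```python
import math

def beam_search(step_logprobs, k=2):
    """Toy beam search: step_logprobs[t][v] = log P(token v at step t).
    Expands every beam by every token, then keeps the k best prefixes."""
    beams = [([], 0.0)]                      # (token sequence, total log-prob)
    for logprobs in step_logprobs:
        candidates = [
            (seq + [v], score + lp)
            for seq, score in beams
            for v, lp in enumerate(logprobs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]               # prune to the top-k
    return beams

table = [[math.log(0.6), math.log(0.4)],
         [math.log(0.3), math.log(0.7)]]
best_seq, best_score = beam_search(table, k=2)[0]
print(best_seq)  # [0, 1] — probability 0.6 * 0.7 = 0.42
```

Because this toy table is prefix-independent, beam search and greedy decoding agree here; they diverge exactly when an early, slightly worse token leads to a much better continuation.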

Common pitfalls

Exposure bias. Teacher forcing means the decoder never sees its own mistakes during training. At inference, one wrong token can cascade into a completely wrong sequence.

Length mismatch. The model may generate outputs that are too short or too long. Length penalties in beam search help, but this remains an active area of research.

Attention collapse. In some cases, the attention weights become too uniform or too peaked, ignoring relevant encoder states. Regularizing the attention distribution can help.

What comes next

Encoder-decoder architectures are the backbone of many modern systems. Now that you understand how they work, the natural question is: can we use neural networks not just to map inputs to outputs, but to generate entirely new data? That is the topic of generative models, where encoder-decoder ideas (especially the decoder half) play a central role.
