
Encoder-decoder architectures

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites: Recurrent neural networks and LSTMs; Attention mechanism and transformers.

Many real problems have inputs and outputs of different lengths. A French sentence has 7 words; its English translation might have 9. An image has millions of pixels; its segmentation mask has the same spatial size but a completely different meaning. Encoder-decoder architectures handle this by splitting the model into two halves: one reads the input, the other produces the output.

The core idea

Translation is a two-step process: understand the input sentence (encode), then generate the output sentence (decode). A human translator reads the French sentence, builds a mental picture of its meaning, and then expresses that meaning in English. An encoder-decoder model does the same thing, replacing “mental picture” with a numeric vector.

French to English, word by word

| Position | French (source) | English (target) |
|---|---|---|
| 1 | Le | The |
| 2 | chat | cat |
| 3 | est | is |
| 4 | assis | sitting |
| 5 | sur | on |
| 6 | le | the |
| 7 | tapis | mat |

The source has 7 words. The target also has 7 here, but that is a coincidence. Sentences in different languages rarely align one to one. Some words expand (“est assis” to “is sitting”), others compress. The model must handle variable lengths on both sides.

Encoder-decoder overview

graph LR
  A["Source sentence
(French)"] --> B["Encoder
Reads all input words"]
  B --> C["Context vector
Compressed meaning"]
  C --> D["Decoder
Generates one word at a time"]
  D --> E["Output sentence
(English)"]

The encoder reads the full input and compresses it into a fixed-length context vector. The decoder takes that vector and produces the output sequence, one token at a time. This design handles any input length and any output length.

Now let’s formalize each component.

The seq2seq setup

A sequence-to-sequence (seq2seq) model maps a variable-length input sequence to a variable-length output sequence. You cannot do this with a single feedforward network because the dimensions are not fixed. The encoder-decoder design solves this with three components:

  1. Encoder: processes the entire input and compresses it into a representation.
  2. Context vector: the compressed representation that bridges encoder and decoder. Also called the bottleneck.
  3. Decoder: takes the context vector and generates output tokens one at a time.

The encoder and decoder can be RNNs, CNNs, or Transformers. The architecture does not prescribe which. What matters is the flow: read everything, compress, then generate.

graph LR
  subgraph Encoder
      X1["x₁"] --> H1["h₁"]
      X2["x₂"] --> H2["h₂"]
      X3["x₃"] --> H3["h₃"]
      H1 --> H2
      H2 --> H3
  end
  H3 --> C["Context c"]
  subgraph Decoder
      C --> S1["s₁ → y₁"]
      S1 --> S2["s₂ → y₂"]
      S2 --> S3["s₃ → y₃"]
  end
  style C fill:#ff9,stroke:#333,color:#000

Figure 1: Basic encoder-decoder. The encoder reads x₁, x₂, x₃ and produces hidden states. The final hidden state h₃ becomes the context vector c. The decoder generates y₁, y₂, y₃ conditioned on c.

The encoder

The encoder reads the input sequence x_1, x_2, \ldots, x_T and produces a sequence of hidden states:

h_t = f_{\text{enc}}(x_t, h_{t-1})

For an LSTM encoder, f_{\text{enc}} is the LSTM cell update. For a Transformer encoder, it is a stack of self-attention layers. The key point: after processing all T input tokens, the encoder has built a representation of the entire input.

In the simplest design, we take the final hidden state h_T as the context vector c = h_T. This single vector must carry everything the decoder needs to know about the input.
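The recurrence above can be sketched in a few lines of NumPy. This is a minimal vanilla-RNN encoder (a tanh cell rather than a full LSTM, for brevity); the weight names `W_xh`, `W_hh` and the random toy dimensions are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def rnn_encoder(xs, W_xh, W_hh, b_h):
    """Vanilla RNN encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
    Returns all hidden states and the final state (the context vector c)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states), h  # states: (T, hidden); context c = h_T

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 8                       # toy sizes
xs = rng.normal(size=(T, d_in))              # 5 input "tokens"
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

states, c = rnn_encoder(xs, W_xh, W_hh, b_h)
print(states.shape, c.shape)  # (5, 8) (8,)
```

Note that `c` is just the last row of `states`: the simplest design throws away everything but the final hidden state.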

The decoder

The decoder generates one output token at a time. At each step t, it takes:

  • The previous hidden state s_{t-1}
  • The previous output token y_{t-1} (or a start-of-sequence token for t = 1)
  • The context vector c

And computes:

s_t = f_{\text{dec}}(y_{t-1}, s_{t-1}, c)
P(y_t \mid y_{<t}, x) = \text{softmax}(W_o \cdot s_t)

The decoder keeps generating until it produces an end-of-sequence token or hits a maximum length. During training, we use teacher forcing: feed the ground-truth y_{t-1} instead of the model’s own prediction.
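A minimal sketch of that generation loop, assuming a tanh decoder cell that concatenates the previous token embedding, the previous state, and the context. All names (`embed`, `W_s`, `W_o`) and the tiny random weights are hypothetical; the point is the loop structure, not the cell.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(c, embed, W_s, W_o, sos_id, eos_id, max_len=10):
    """Minimal decoder loop: s_t = tanh(W_s [y_{t-1}; s_{t-1}; c]),
    then project to vocabulary logits and pick the argmax token."""
    s = np.zeros(W_o.shape[1])
    y = sos_id
    out = []
    for _ in range(max_len):
        inp = np.concatenate([embed[y], s, c])
        s = np.tanh(W_s @ inp)
        probs = softmax(W_o @ s)
        y = int(np.argmax(probs))
        if y == eos_id:          # stop at end-of-sequence
            break
        out.append(y)
    return out

rng = np.random.default_rng(1)
V, d_e, d_h = 6, 4, 8            # vocab, embedding, hidden sizes
embed = rng.normal(size=(V, d_e))
W_s = rng.normal(scale=0.1, size=(d_h, d_e + d_h + d_h))
W_o = rng.normal(scale=0.1, size=(V, d_h))
c = rng.normal(size=d_h)         # context vector from the encoder
tokens = greedy_decode(c, embed, W_s, W_o, sos_id=0, eos_id=1)
print(tokens)
```

Note that the same `c` is fed at every step; with attention (below) it would be recomputed per step.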

The bottleneck problem

BLEU score vs sequence length. Translation quality degrades significantly as input sequences get longer, motivating attention mechanisms.

Here is the fundamental tension. The context vector c is a fixed-size vector, often 256 or 512 dimensions. But the input might be 5 tokens or 500. You are asking a single vector to memorize an entire paragraph.

Short sequences work fine. Long sequences lose information, especially details from early tokens. The encoder’s final hidden state tends to be dominated by recent inputs, because that is how RNNs work: earlier information fades.

This is the bottleneck problem. It limits seq2seq performance on long sequences, and it motivated the invention of attention.

Example 3: Quantifying bottleneck compression

Consider encoding a 10-step sequence where each encoder hidden state is 64-dimensional:

Total information in encoder hidden states:

10 \times 64 = 640 \text{ values}

Context vector dimension: 32

Compression ratio:

\frac{640}{32} = 20\times

You are compressing 640 numbers into 32. For a short sequence of 3 steps, the ratio would be \frac{192}{32} = 6\times, which is much more manageable. But as sequences get longer, the compression gets more extreme, and performance degrades.

With attention, the decoder accesses all 10 hidden states directly. That is 10 \times 64 = 640 values, zero compression. Each decoder step picks which encoder states matter most.
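The arithmetic above is simple enough to wrap in a one-line helper, which makes it easy to see how the ratio scales with sequence length:

```python
def compression_ratio(seq_len, hidden_dim, context_dim):
    """Ratio of total encoder-state values to context-vector values."""
    return seq_len * hidden_dim / context_dim

print(compression_ratio(10, 64, 32))  # 20.0 — the example above
print(compression_ratio(3, 64, 32))   # 6.0  — the short sequence
print(compression_ratio(500, 64, 32)) # 1000.0 — a long document
```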

Attention in encoder-decoder models

Attention solves the bottleneck by letting the decoder look at all encoder hidden states, not just the last one. At each decoder step t, the decoder computes a weighted sum of encoder states:

e_{ti} = \text{score}(s_{t-1}, h_i)
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}
c_t = \sum_{i=1}^{T} \alpha_{ti} \, h_i

The score function can be a dot product, additive (Bahdanau), or scaled dot product. The attention weights \alpha_{ti} tell the decoder how much to focus on each encoder position.
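The three equations above, with the dot-product score, fit in one small NumPy function:

```python
import numpy as np

def dot_attention(s_prev, H):
    """Dot-product attention.
    scores  e_i   = s_prev · h_i
    weights alpha = softmax(e)
    context c     = sum_i alpha_i h_i
    H: (T, d) encoder states; s_prev: (d,) previous decoder state."""
    e = H @ s_prev                  # (T,) scores
    a = np.exp(e - e.max())
    a = a / a.sum()                 # (T,) attention weights, sum to 1
    c = a @ H                       # (d,) per-step context vector
    return c, a

# Toy states: three encoder positions, 2-dimensional
H = np.array([[0.5, 0.3], [-0.1, 0.8], [0.7, -0.2]])
c, alpha = dot_attention(np.zeros(2), H)
print(alpha)  # uniform [1/3, 1/3, 1/3] for a zero decoder state
```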

graph LR
  subgraph Encoder
      H1["h₁"]
      H2["h₂"]
      H3["h₃"]
  end
  subgraph Attention
      S["s₀ (decoder)"] -->|score| A1["α₁"]
      S -->|score| A2["α₂"]
      S -->|score| A3["α₃"]
      A1 --> CT["c₁ = Σ αᵢhᵢ"]
      A2 --> CT
      A3 --> CT
  end
  H1 -->|weighted| CT
  H2 -->|weighted| CT
  H3 -->|weighted| CT
  CT --> D["Decoder step 1"]
  style CT fill:#9f9,stroke:#333,color:#000

Figure 2: Encoder-decoder with attention. The decoder state s₀ computes attention scores against all encoder hidden states. The weighted sum c₁ replaces the fixed context vector.

Now the context is different at every decoder step. When translating “the cat sat on the mat,” the decoder focuses on “cat” when generating the subject and on “mat” when generating the object. This is far more expressive than a single fixed vector.

Seq2seq without attention vs with attention

graph TD
  subgraph Without Attention
      E1["Encoder"] --> C1["Single context
vector c"]
      C1 --> D1["Decoder uses
same c at
every step"]
  end
  subgraph With Attention
      E2["Encoder"] --> H1["h1"]
      E2 --> H2["h2"]
      E2 --> H3["h3"]
      H1 --> W1["Weighted sum
different at
each step"]
      H2 --> W1
      H3 --> W1
      W1 --> D2["Decoder gets
custom context
per step"]
  end

Without attention, long inputs get crushed into a single vector, losing early details. With attention, the decoder reaches back into the encoder at every step, picking out exactly the information it needs. This is why attention transformed machine translation quality.

Example 1: Attention forward pass for translation

Three encoder hidden states and a decoder initial state:

h_1 = [0.5,\; 0.3], \quad h_2 = [-0.1,\; 0.8], \quad h_3 = [0.7,\; -0.2], \quad s_0 = [0,\; 0]

Step 1: Compute dot-product attention scores.

e_1 = s_0^T h_1 = 0 \cdot 0.5 + 0 \cdot 0.3 = 0
e_2 = s_0^T h_2 = 0 \cdot (-0.1) + 0 \cdot 0.8 = 0
e_3 = s_0^T h_3 = 0 \cdot 0.7 + 0 \cdot (-0.2) = 0

Step 2: Softmax to get attention weights.

\alpha = \text{softmax}([0, 0, 0]) = \left[\frac{1}{3},\; \frac{1}{3},\; \frac{1}{3}\right]

When the decoder has no information yet (zero state), it pays equal attention to every encoder position. This makes sense: with no context about what to generate, no position is more relevant than another.

Step 3: Compute context vector.

c_1 = \frac{1}{3}(h_1 + h_2 + h_3) = \frac{1}{3}[0.5 + (-0.1) + 0.7,\; 0.3 + 0.8 + (-0.2)] = \frac{1}{3}[1.1,\; 0.9] = [0.367,\; 0.300]

Step 4: Decoder update. Using W_c = \begin{bmatrix} 0.5 & 0.3 \\ -0.1 & 0.8 \end{bmatrix}:

s_1 = \tanh(W_c \cdot c_1) = \tanh\!\left(\begin{bmatrix} 0.5(0.367) + 0.3(0.300) \\ -0.1(0.367) + 0.8(0.300) \end{bmatrix}\right) = \tanh\!\left(\begin{bmatrix} 0.274 \\ 0.203 \end{bmatrix}\right) = \begin{bmatrix} 0.268 \\ 0.201 \end{bmatrix}

Step 5: Output probabilities. With W_o projecting to a 3-word vocabulary:

W_o = \begin{bmatrix} 0.5 & -0.3 \\ 0.1 & 0.7 \\ -0.4 & 0.2 \end{bmatrix}

\text{logits} = W_o \cdot s_1 = [0.074,\; 0.168,\; -0.067]
P(y_1) = \text{softmax}([0.074,\; 0.168,\; -0.067]) = [0.337,\; 0.370,\; 0.293]

Word 2 gets the highest probability. At the next step, s_1 is no longer zero, so the attention weights will shift to focus on the most relevant encoder states.
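The five hand-computed steps above can be checked in NumPy with the same numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Encoder states and initial decoder state from the worked example
H = np.array([[0.5, 0.3], [-0.1, 0.8], [0.7, -0.2]])
s0 = np.zeros(2)

e = H @ s0                   # Step 1: dot-product scores -> [0, 0, 0]
alpha = softmax(e)           # Step 2: uniform weights [1/3, 1/3, 1/3]
c1 = alpha @ H               # Step 3: context ≈ [0.367, 0.300]

W_c = np.array([[0.5, 0.3], [-0.1, 0.8]])
s1 = np.tanh(W_c @ c1)       # Step 4: ≈ [0.268, 0.201]

W_o = np.array([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]])
probs = softmax(W_o @ s1)    # Step 5: ≈ [0.337, 0.370, 0.293]
print(np.round(probs, 3))
```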

Skip connections: the U-Net approach

Not all encoder-decoder models process sequences. In image segmentation, the encoder downsamples an image to extract features, and the decoder upsamples back to the original resolution. The bottleneck loses spatial detail, which is a problem when you need pixel-precise output.

U-Net solves this with skip connections: direct links from encoder layers to decoder layers at the same resolution. The decoder receives both the upsampled features (coarse, semantic) and the encoder features (fine, spatial). It concatenates them along the channel dimension.

graph TD
  E1["Encoder 64×64×32"] --> E2["Encoder 32×32×64"]
  E2 --> E3["Encoder 16×16×128"]
  E3 --> B["Bottleneck 8×8×256"]
  B --> D3["Decoder 16×16×128"]
  D3 --> D2["Decoder 32×32×64"]
  D2 --> D1["Decoder 64×64×32"]
  E3 -.->|"skip (concat)"| D3
  E2 -.->|"skip (concat)"| D2
  E1 -.->|"skip (concat)"| D1
  style B fill:#f99,stroke:#333,color:#000

Figure 3: U-Net with skip connections. Dashed arrows show feature maps passed directly from encoder to decoder at matching resolutions. This preserves fine spatial detail.

Example 2: U-Net skip connection in practice

Suppose at one level of the U-Net:

Encoder feature map (1 channel, 2x2):

F_{\text{enc}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}

Decoder upsampled feature map (1 channel, 2x2):

F_{\text{dec}} = \begin{bmatrix} 0.5 & 0.3 \\ 0.1 & 0.7 \end{bmatrix}

Skip connection: concatenate along the channel dimension.

The result has shape 2 \times 2 \times 2 (height × width × channels):

F_{\text{combined}}[:,:,0] = \begin{bmatrix} 0.5 & 0.3 \\ 0.1 & 0.7 \end{bmatrix}, \quad F_{\text{combined}}[:,:,1] = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}

Channel 0 carries the decoder’s coarse semantic features. Channel 1 carries the encoder’s fine spatial features. A subsequent 3 \times 3 convolution fuses both into a single representation that has both precise boundaries and correct semantics.

Without the skip connection, the decoder only sees the upsampled F_{\text{dec}}, which lost the sharp edges during downsampling. With it, the network can recover those edges from F_{\text{enc}}.
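The channel concatenation in this example is one call in NumPy (a framework like PyTorch would use `torch.cat` on a channel axis; here we stack single-channel maps into a new last axis):

```python
import numpy as np

F_dec = np.array([[0.5, 0.3], [0.1, 0.7]])   # upsampled decoder features
F_enc = np.array([[1.0, 2.0], [3.0, 4.0]])   # encoder skip features

# Skip connection: stack the two 2x2 maps along a new channel axis (H, W, C)
F_combined = np.stack([F_dec, F_enc], axis=-1)
print(F_combined.shape)        # (2, 2, 2)
print(F_combined[:, :, 1])     # the encoder's fine spatial detail, intact
```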

Encoder-decoder variants

| Architecture | Skip Connections | Attention Type | Main Application | Key Paper |
|---|---|---|---|---|
| Seq2seq | None | None | Translation | Sutskever et al. 2014 |
| Bahdanau seq2seq | None | Additive | Translation | Bahdanau et al. 2015 |
| Transformer | None | Multi-head self + cross | Translation, text generation | Vaswani et al. 2017 |
| U-Net | Yes (concat) | None | Image segmentation | Ronneberger et al. 2015 |
| Attention U-Net | Yes (gated) | Additive gates on skips | Medical image segmentation | Oktay et al. 2018 |
| Show-Attend-Tell | None | Soft/hard attention | Image captioning | Xu et al. 2015 |

The table shows how the same encoder-decoder idea adapts across domains. Translation models use attention over temporal positions. Segmentation models use skip connections over spatial resolutions. Some architectures combine both.

Applications

Machine translation. The original and most famous application. The encoder reads a sentence in one language, and the decoder generates it in another. Modern Transformer-based systems (like those behind translation services) are still encoder-decoder models at their core.

Image segmentation. U-Net and its variants dominate medical image segmentation. The encoder (often a pretrained CNN like ResNet) extracts features at multiple scales. The decoder upsamples and produces per-pixel class labels.

Image captioning. The encoder is a CNN that produces a grid of feature vectors from an image. The decoder is an RNN or Transformer that generates a natural language caption, attending to different image regions at each word.

Text summarization. The encoder reads a long document. The decoder generates a shorter summary. Attention is critical here because the decoder needs to identify and focus on the most important parts of the input.

Speech recognition. Audio features go into the encoder, and text comes out of the decoder. The input and output lengths differ significantly, making the encoder-decoder framework a natural fit.

Training encoder-decoder models

Training uses standard backpropagation through the entire network. The loss is typically cross-entropy between the predicted and true output tokens, summed over all decoder steps:

\mathcal{L} = -\sum_{t=1}^{T'} \log P(y_t^* \mid y_{<t}^*, x)

where y_t^* is the ground-truth token at step t.

Teacher forcing feeds the true y_{t-1}^* to the decoder during training instead of the model’s own prediction. This stabilizes training but creates a mismatch with inference, where the model must use its own predictions. Scheduled sampling gradually shifts from teacher forcing to model predictions during training.
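The loss above is easy to compute given the decoder’s per-step logits. A minimal sketch, assuming the logits were produced with teacher forcing (ground-truth inputs) and toy numbers chosen here for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def teacher_forced_loss(logits, targets):
    """Seq2seq training loss: -sum_t log P(y*_t | y*_{<t}, x).
    logits: (T', V) decoder outputs under teacher forcing;
    targets: (T',) ground-truth token ids."""
    probs = softmax(logits)
    return -np.sum(np.log(probs[np.arange(len(targets)), targets]))

# Three decoder steps, vocabulary of 3 tokens (toy values)
logits = np.array([[ 2.0, 0.1, -1.0],
                   [ 0.3, 1.5,  0.2],
                   [-0.5, 0.0,  2.2]])
targets = np.array([0, 1, 2])
print(teacher_forced_loss(logits, targets))  # ≈ 0.799
```

The loss is small here because each step already assigns most of its probability mass to the correct token.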

Teacher forcing vs free running

graph TD
  subgraph Teacher Forcing
      A1["Input: ground truth y1"] --> B1["Decoder step 2"]
      B1 --> C1["Input: ground truth y2"]
      C1 --> D1["Decoder step 3"]
  end
  subgraph Free Running
      A2["Input: model prediction y1"] --> B2["Decoder step 2"]
      B2 --> C2["Input: model prediction y2"]
      C2 --> D2["Decoder step 3"]
  end

With teacher forcing, each step gets the correct previous token, so errors never compound. At inference time, the model must use its own predictions, and a single mistake can cascade. Scheduled sampling mixes both strategies during training to bridge the gap.

Beam search at inference keeps the top-k most probable partial sequences at each step instead of greedily picking the best token. This explores more of the output space and usually produces better results than greedy decoding.
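A toy sketch of the keep-top-k idea. For simplicity the per-step log-probabilities here come from a fixed table rather than a real decoder, which deliberately omits the prefix conditioning a real model would have:

```python
import math

def beam_search(step_logprobs, k=2):
    """Toy beam search: step_logprobs[t][v] = log P(token v at step t).
    Expands every beam by every token, then keeps the k best prefixes."""
    beams = [([], 0.0)]                      # (token sequence, total log-prob)
    for logprobs in step_logprobs:
        candidates = [
            (seq + [v], score + lp)
            for seq, score in beams
            for v, lp in enumerate(logprobs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]               # prune to the top-k
    return beams

table = [[math.log(0.6), math.log(0.4)],
         [math.log(0.3), math.log(0.7)]]
best_seq, best_score = beam_search(table, k=2)[0]
print(best_seq)  # [0, 1] — probability 0.6 * 0.7 = 0.42
```

Because this toy table is prefix-independent, beam search and greedy decoding agree here; they diverge exactly when an early, slightly worse token leads to a much better continuation.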

Common pitfalls

Exposure bias. Teacher forcing means the decoder never sees its own mistakes during training. At inference, one wrong token can cascade into a completely wrong sequence.

Length mismatch. The model may generate outputs that are too short or too long. Length penalties in beam search help, but this remains an active area of research.

Attention collapse. In some cases, the attention weights become too uniform or too peaked, ignoring relevant encoder states. Regularizing the attention distribution can help.

What comes next

Encoder-decoder architectures are the backbone of many modern systems. Now that you understand how they work, the natural question is: can we use neural networks not just to map inputs to outputs, but to generate entirely new data? That is the topic of generative models, where encoder-decoder ideas (especially the decoder half) play a central role.
