
Attention mechanism and transformers

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Attention is a mechanism that lets a neural network decide which parts of the input matter most for each part of the output. Instead of compressing an entire sequence into one vector, the model learns to focus on the relevant pieces at each step. This single idea removed the biggest bottleneck in sequence models and led directly to the transformer, the architecture behind every major language model today.

Prerequisites

You should be comfortable with RNNs and LSTMs, especially the encoder-decoder (seq2seq) setup. You will also need matrix multiplication and the dot product. If you are rusty on either, review those articles first.

The big picture

When translating a sentence, some input words matter more for certain output words. Translating “The cat sat on the mat” into French: when generating “chat” (French for cat), the model should focus on “cat,” not “mat” or “on.” Attention makes this possible.

Here is how attention weights might look for this translation:

Output (French) | The  | cat  | sat  | on   | the  | mat
Le              | 0.70 | 0.05 | 0.05 | 0.05 | 0.10 | 0.05
chat            | 0.05 | 0.80 | 0.05 | 0.02 | 0.03 | 0.05
assis           | 0.03 | 0.10 | 0.75 | 0.05 | 0.02 | 0.05
sur             | 0.02 | 0.02 | 0.05 | 0.80 | 0.06 | 0.05
le              | 0.05 | 0.03 | 0.02 | 0.05 | 0.60 | 0.25
tapis           | 0.02 | 0.05 | 0.03 | 0.05 | 0.10 | 0.75

Each row sums to 1. The largest weight in each row shows where the model focuses most.

Attention: weighting input words by relevance

graph TD
  IN1["The"] --> W["Compute relevance
to current output word"]
  IN2["cat"] --> W
  IN3["sat"] --> W
  IN4["on"] --> W
  IN5["the"] --> W
  IN6["mat"] --> W
  W --> SOFT["Softmax:
weights sum to 1"]
  SOFT --> MIX["Weighted combination
of input representations"]
  MIX --> OUT["Output: chat"]

Instead of compressing the entire input into one vector, the model looks back at every input word and picks the relevant ones for each output step.

Now let’s formalize this mechanism.

The fixed-length bottleneck

A standard seq2seq model uses an encoder RNN to read an input sequence token by token, producing a single hidden vector at the end. The decoder RNN then generates the output sequence using only that one vector as its starting context.

This works for short sequences. But consider translating a 50-word sentence. The encoder must pack all meaning, word order, and relationships into a fixed-length vector, typically 256 or 512 dimensions. Important details get lost, especially for tokens near the start of the sequence. Performance drops steadily as input length grows.

The core problem: one vector is not enough to represent an entire sequence. We need a way for the decoder to go back and look at specific parts of the input.

Attention: letting the decoder look back


The fix is straightforward. Instead of one compressed vector, give the decoder access to all encoder hidden states at every step.

The encoder produces hidden states h_1, h_2, \ldots, h_n, one per input token. At decoder step t, the decoder has its own hidden state s_t. We compute an alignment score between s_t and each encoder state:

e_{tj} = \text{score}(s_t, h_j)

Softmax turns these scores into weights that sum to 1:

\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})}

The context vector for step t is the weighted sum of all encoder states:

c_t = \sum_{j=1}^{n} \alpha_{tj} \, h_j

Now the decoder uses both s_t and c_t to produce its output. Different decoder steps can attend to different input positions. When translating “the cat sat on the mat,” the decoder focuses on “cat” when generating its translation of that word, then shifts to “sat” for the next word.
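One attention step fits in a few lines of NumPy. This is a minimal sketch with random vectors standing in for real encoder and decoder states, and a plain dot-product score in place of Bahdanau's learned feed-forward scorer:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 8                      # 6 encoder states, hidden size 8
H = rng.normal(size=(n, d))      # encoder hidden states h_1 .. h_n
s_t = rng.normal(size=d)         # decoder state at step t

# Alignment scores e_tj = score(s_t, h_j); a dot product here for
# simplicity (Bahdanau's original score is a small feed-forward net)
e = H @ s_t                      # shape (n,)

# Softmax turns scores into attention weights alpha_tj that sum to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_t = sum_j alpha_tj * h_j
c_t = alpha @ H                  # shape (d,)

print(alpha.sum())               # 1.0 (up to float error)
```

The context vector c_t is recomputed at every decoder step, so each output token gets its own mixture of encoder states.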

This was proposed by Bahdanau et al. in 2014. It improved translation quality dramatically, especially on longer sentences.

Scaled dot-product attention

Bahdanau’s original attention used a small neural network to compute scores. Luong (2015) simplified this to a plain dot product. The transformer (Vaswani et al., 2017) generalized the idea with three learned projections: Query, Key, and Value.

Given an input sequence, we project it into three matrices:

  • Query (Q): what am I looking for?
  • Key (K): what do I contain?
  • Value (V): what information do I carry?

The attention formula is:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Here d_k is the dimension of the key vectors. The product QK^T gives a matrix of scores: how well each query matches each key. Row-wise softmax converts each row of raw scores into weights that sum to 1. Multiplying by V produces the weighted combination of values.

Query-Key-Value in plain language

graph LR
  Q["Query:
What am I looking for?"] --> SCORE["Dot product:
how well does each
key match my query?"]
  K["Key:
What do I contain?"] --> SCORE
  SCORE --> SOFT["Softmax:
turn scores into weights"]
  SOFT --> MIX["Weighted sum
of values"]
  V["Value:
What info do I carry?"] --> MIX
  MIX --> OUT["Attention output"]

Worked example: three-token attention

Let d_k = 2 with the following matrices:

Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}

Step 1: Compute QK^T (3 × 3)

Each entry is a dot product between one row of Q and one row of K:

QK^T = \begin{bmatrix} 1 \cdot 1 + 0 \cdot 1 & 1 \cdot 0 + 0 \cdot 1 & 1 \cdot 1 + 0 \cdot 0 \\ 0 \cdot 1 + 1 \cdot 1 & 0 \cdot 0 + 1 \cdot 1 & 0 \cdot 1 + 1 \cdot 0 \\ 1 \cdot 1 + 1 \cdot 1 & 1 \cdot 0 + 1 \cdot 1 & 1 \cdot 1 + 1 \cdot 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}

Step 2: Scale by \sqrt{d_k} = \sqrt{2} \approx 1.414

\frac{QK^T}{\sqrt{2}} = \begin{bmatrix} 0.707 & 0.000 & 0.707 \\ 0.707 & 0.707 & 0.000 \\ 1.414 & 0.707 & 0.707 \end{bmatrix}

Step 3: Apply softmax row by row

Row 0: scores [0.707, 0.000, 0.707]

\exp(0.707) \approx 2.028, \quad \exp(0.000) = 1.000, \quad \exp(0.707) \approx 2.028

Sum = 5.056. Dividing: [0.401, 0.198, 0.401].

Row 1: scores [0.707, 0.707, 0.000]

\exp(0.707) \approx 2.028, \quad \exp(0.707) \approx 2.028, \quad \exp(0.000) = 1.000

Sum = 5.056. Dividing: [0.401, 0.401, 0.198].

Row 2: scores [1.414, 0.707, 0.707]

\exp(1.414) \approx 4.113, \quad \exp(0.707) \approx 2.028, \quad \exp(0.707) \approx 2.028

Sum = 8.169. Dividing: [0.503, 0.248, 0.248].

The full attention weight matrix:

A = \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.401 & 0.401 & 0.198 \\ 0.503 & 0.248 & 0.248 \end{bmatrix}

Step 4: Multiply A \times V

\text{Output} = \begin{bmatrix} 0.401 \cdot 1 + 0.198 \cdot 0 + 0.401 \cdot 0 & 0.401 \cdot 0 + 0.198 \cdot 1 + 0.401 \cdot 0 \\ 0.401 \cdot 1 + 0.401 \cdot 0 + 0.198 \cdot 0 & 0.401 \cdot 0 + 0.401 \cdot 1 + 0.198 \cdot 0 \\ 0.503 \cdot 1 + 0.248 \cdot 0 + 0.248 \cdot 0 & 0.503 \cdot 0 + 0.248 \cdot 1 + 0.248 \cdot 0 \end{bmatrix} = \begin{bmatrix} 0.401 & 0.198 \\ 0.401 & 0.401 \\ 0.503 & 0.248 \end{bmatrix}

Notice that token 3 (row 2) places the most weight on key 1, because its query [1, 1] aligns best with key [1, 1]. Token 1 (row 0) splits its attention equally between keys 1 and 3, ignoring key 2.
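The hand computation above can be checked with a few lines of NumPy. The max subtraction inside the softmax is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # raw score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # stability trick
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 1.], [0., 1.], [1., 0.]])
V = np.array([[1., 0.], [0., 1.], [0., 0.]])

out, A = scaled_dot_product_attention(Q, K, V)
print(np.round(A, 3))
# [[0.401 0.198 0.401]
#  [0.401 0.401 0.198]
#  [0.503 0.248 0.248]]
print(np.round(out, 3))
# [[0.401 0.198]
#  [0.401 0.401]
#  [0.503 0.248]]
```

The printed matrices match the worked example digit for digit.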

Why scale by \sqrt{d_k}?

When d_k is large, dot products grow large. Suppose each element of q and k is drawn independently from N(0, 1). The dot product q \cdot k = \sum_{i=1}^{d_k} q_i k_i is a sum of d_k independent terms, each with mean 0 and variance 1. So:

\mathbb{E}[q \cdot k] = 0, \qquad \text{Var}(q \cdot k) = d_k

With d_k = 64, the standard deviation is \sqrt{64} = 8. Dot products regularly hit values like +8 or -8. When softmax receives inputs this large, it produces near-one-hot outputs. Gradients become tiny, and learning stalls.

Dividing by \sqrt{d_k} rescales the variance back to 1:

\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1

Worked example: scaling in practice

Compare softmax on unscaled scores (simulating d_k = 64, where values have std ≈ 8) versus properly scaled scores (std ≈ 1).

Unscaled: softmax([8, -4, 6])

\exp(8) \approx 2981.0, \quad \exp(-4) \approx 0.018, \quad \exp(6) \approx 403.4

Sum ≈ 3384.4

softmax ≈ [0.881, 0.000, 0.119]

Almost all weight collapses to one position. The second position gets essentially zero weight, meaning gradients for that position vanish.

Scaled: softmax([1.0, -0.5, 0.75])

\exp(1.0) \approx 2.718, \quad \exp(-0.5) \approx 0.607, \quad \exp(0.75) \approx 2.117

Sum ≈ 5.442

softmax ≈ [0.500, 0.111, 0.389]

Weight spreads across all positions. Gradients flow everywhere, and the model can adjust all attention weights during backpropagation.
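A quick NumPy experiment confirms both claims: dot products of random N(0, 1) vectors with d_k = 64 have standard deviation near 8, scaling brings it back near 1, and the two softmax examples behave as computed above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Sample many (q, k) pairs and measure the spread of their dot products
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
dots = (q * k).sum(axis=1)
print(dots.std())                    # close to sqrt(64) = 8
print((dots / np.sqrt(d_k)).std())   # close to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(np.array([8.0, -4.0, 6.0])))    # collapses onto one position
print(softmax(np.array([1.0, -0.5, 0.75])))   # spread across all positions
```
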

Multi-head attention

A single attention head learns one type of relationship. Maybe it captures word order, or subject-verb agreement, or coreference. But language has many simultaneous relationships. Multi-head attention runs several attention computations in parallel, each with its own learned projections.

For h heads with model dimension d_model, each head operates on dimension d_k = d_model / h:

\text{head}_i = \text{Attention}(Q W_i^Q, \; K W_i^K, \; V W_i^V)

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O

The projection matrices W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k} and the output projection W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}} are all learned parameters. The total computation cost is comparable to single-head attention at full dimensionality, but the model gains h distinct “views” of the input.

In practice, BERT-base uses h = 12 heads with d_model = 768, giving d_k = 64 per head. GPT-3 uses h = 96 heads with d_model = 12288.
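A minimal NumPy sketch of the split-attend-concatenate pattern. It loops over heads for clarity; real implementations fold the heads into one batched tensor operation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, W_q, W_k, W_v, W_o):
    # One attention computation per head, each with its own projections
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(W_q.shape[0])]
    # Concatenate head outputs, then apply the output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

n, d_model, h = 5, 16, 4
d_k = d_model // h                       # 4 dims per head
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, W_q, W_k, W_v, W_o)
print(out.shape)   # (5, 16): sequence length by d_model
```
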

Multi-head attention: split, attend, concatenate

graph LR
  INPUT["Input"] --> SPLIT["Split into h heads
(each sees d_model / h dims)"]
  SPLIT --> H1["Head 1:
attend independently"]
  SPLIT --> H2["Head 2:
attend independently"]
  SPLIT --> HN["Head h:
attend independently"]
  H1 --> CAT["Concatenate
all head outputs"]
  H2 --> CAT
  HN --> CAT
  CAT --> PROJ["Linear projection W_O"]
  PROJ --> OUT["Multi-head output"]

Each head learns a different type of relationship: one might capture word order, another subject-verb agreement, another coreference. Concatenating them gives the model multiple simultaneous views.

graph TD
  Q["Q"] --> WQ1["W₁Q"] & WQ2["W₂Q"] & WQh["WₕQ"]
  K["K"] --> WK1["W₁K"] & WK2["W₂K"] & WKh["WₕK"]
  V["V"] --> WV1["W₁V"] & WV2["W₂V"] & WVh["WₕV"]
  WQ1 --> H1["Head 1: Attention"]
  WK1 --> H1
  WV1 --> H1
  WQ2 --> H2["Head 2: Attention"]
  WK2 --> H2
  WV2 --> H2
  WQh --> Hh["Head h: Attention"]
  WKh --> Hh
  WVh --> Hh
  H1 --> Cat["Concatenate"]
  H2 --> Cat
  Hh --> Cat
  Cat --> WO["W_O Projection"]
  WO --> Out["Multi-Head Output"]

Positional encoding

Transformers process all tokens in parallel. Unlike RNNs, there is no built-in notion of sequence order. The sentences “dog bites man” and “man bites dog” would produce identical attention patterns without position information.

Positional encodings fix this by adding a position signal directly to the input embeddings. The original transformer uses sinusoidal functions:

PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{\,2i / d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i / d_{\text{model}}}}\right)

Each dimension pair uses a different frequency. Low-index dimensions oscillate quickly, capturing fine position differences. High-index dimensions oscillate slowly, capturing broad position information. A key property: the encoding for position pos + k can be expressed as a linear function of the encoding at position pos, which makes it easy for the model to learn relative position patterns.

Worked example: positional encoding matrix

Let d_model = 4 and compute encodings for positions 0, 1, and 2.

The four dimensions use two frequency pairs:

  • Dimension 0 (i = 0, even): \sin(pos / 10000^{0/4}) = \sin(pos)
  • Dimension 1 (i = 0, odd): \cos(pos / 10000^{0/4}) = \cos(pos)
  • Dimension 2 (i = 1, even): \sin(pos / 10000^{2/4}) = \sin(pos / 100)
  • Dimension 3 (i = 1, odd): \cos(pos / 10000^{2/4}) = \cos(pos / 100)

Computing each value:

Position | sin(pos) | cos(pos) | sin(pos/100) | cos(pos/100)
0        | 0.000    | 1.000    | 0.000        | 1.000
1        | 0.841    | 0.540    | 0.010        | 1.000
2        | 0.909    | -0.416   | 0.020        | 1.000

The full positional encoding matrix:

PE = \begin{bmatrix} 0.000 & 1.000 & 0.000 & 1.000 \\ 0.841 & 0.540 & 0.010 & 1.000 \\ 0.909 & -0.416 & 0.020 & 1.000 \end{bmatrix}

The first two columns change rapidly between positions, while the last two change very slowly. This gives the model both fine-grained and coarse position signals at different dimensions.
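The matrix above can be generated programmatically. A minimal NumPy sketch that reproduces the d_model = 4 table for positions 0 through 2:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    PE = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]        # column of positions
    two_i = np.arange(0, d_model, 2)[None, :]    # even indices 2i
    angle = pos / 10000 ** (two_i / d_model)     # one frequency per pair
    PE[:, 0::2] = np.sin(angle)                  # even dims get sine
    PE[:, 1::2] = np.cos(angle)                  # odd dims get cosine
    return PE

PE = positional_encoding(3, 4)
print(np.round(PE, 3))
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.01   1.   ]
#  [ 0.909 -0.416  0.02   1.   ]]
```
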

Positional encoding methods

Method     | How it works                                             | Extrapolates?        | Learned? | Used in
Sinusoidal | Fixed sin/cos at varying frequencies                     | Yes, in theory       | No       | Original transformer
Learned    | Embedding lookup table indexed by position               | No, fixed max length | Yes      | BERT, GPT-2
RoPE       | Rotates query and key vectors by position-dependent angle | Yes, with decay     | No       | LLaMA, PaLM
ALiBi      | Adds linear bias to attention scores based on token distance | Yes              | No       | BLOOM, MPT

The transformer block

Each transformer encoder block chains four components together with residual connections:

  1. Multi-head attention: every token attends to every other token.
  2. Add and norm: a residual connection adds the input back to the attention output, followed by layer normalization.
  3. Feed-forward network (FFN): two linear layers with a ReLU or GELU activation between them, applied to each position independently.
  4. Add and norm: another residual connection and layer norm after the FFN.

Dropout is applied after each sub-layer (attention and FFN) during training.

The residual connections are critical for training deep stacks. They let gradients flow directly through the network via the chain rule, preventing the vanishing gradient problem. The feed-forward network gives the model per-token transformation capacity beyond what attention provides.

A typical transformer stacks many of these blocks. BERT-base uses 12, the largest GPT-2 variant uses 48, and GPT-3 uses 96.

graph TD
  A["Input Embedding + Positional Encoding"] --> B["Multi-Head Attention"]
  B --> C["Add (residual)"]
  A --> C
  C --> D["Layer Norm"]
  D --> E["Feed-Forward Network"]
  E --> F["Add (residual)"]
  D --> F
  F --> G["Layer Norm"]
  G --> H["Block Output"]
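The block diagram maps directly to code. Below is a minimal single-head NumPy sketch of the forward pass, using the post-norm layout of the original transformer; learnable layer-norm gains and biases, dropout, and multiple heads are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_block(X, p):
    # 1) multi-head attention (single head here), 2) add & norm,
    # 3) position-wise FFN with ReLU, 4) add & norm
    attn = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + attn)                              # residual 1
    ffn = np.maximum(0, X @ p["W_1"]) @ p["W_2"]          # ReLU FFN
    return layer_norm(X + ffn)                            # residual 2

n, d_model, d_ff = 5, 8, 32
params = {name: rng.normal(size=shape) * 0.1
          for name, shape in [("W_q", (d_model, d_model)),
                              ("W_k", (d_model, d_model)),
                              ("W_v", (d_model, d_model)),
                              ("W_1", (d_model, d_ff)),
                              ("W_2", (d_ff, d_model))]}
X = rng.normal(size=(n, d_model))
out = encoder_block(X, params)
print(out.shape)   # (5, 8): same shape in and out, so blocks stack
```

Because input and output shapes match, blocks can be stacked arbitrarily deep, with the residual connections carrying gradients through the whole stack.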

Three transformer architectures

The original transformer has both an encoder and a decoder. But researchers found that using only one half works better for certain tasks. Three variants have become standard.

graph TD
  subgraph "Encoder-Only: BERT"
      direction TB
      B1["Input Tokens"] --> B2["Bidirectional Encoder Blocks"]
      B2 --> B3["Contextual Representations"]
  end
  subgraph "Decoder-Only: GPT"
      direction TB
      G1["Input Tokens"] --> G2["Masked Decoder Blocks"]
      G2 --> G3["Next Token Prediction"]
  end
  subgraph "Encoder-Decoder: T5"
      direction TB
      T1["Source Tokens"] --> T2["Encoder Blocks"]
      T2 --> T3["Cross-Attention"]
      T4["Target Tokens"] --> T5["Masked Decoder Blocks"]
      T5 --> T3
      T3 --> T6["Output Tokens"]
  end

Encoder-only (BERT)

BERT stacks encoder blocks only. Every token can attend to every other token in both directions. This bidirectional context is ideal for tasks that need full understanding of the input: text classification, named entity recognition, and extractive question answering. BERT is pre-trained by masking random tokens and predicting them from context, a task called masked language modeling. It uses cross-entropy loss between the predicted token distribution and the true token.

Decoder-only (GPT)

GPT stacks decoder blocks with causal (masked) self-attention. Each token can only attend to tokens at earlier positions. This left-to-right constraint makes the architecture a natural fit for text generation: given a sequence of tokens, predict the next one. GPT-2, GPT-3, GPT-4, and LLaMA all use this pattern. It has become the dominant architecture for large language models.
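The causal constraint is typically implemented with a mask that sets the scores for future positions to negative infinity before the softmax, so their attention weights become exactly zero. A minimal NumPy sketch:

```python
import numpy as np

n = 4
# Causal mask: True above the diagonal, i.e. position i may only
# attend to positions j <= i
mask = np.triu(np.ones((n, n)), k=1).astype(bool)

scores = np.random.default_rng(0).normal(size=(n, n))
scores[mask] = -np.inf             # block attention to future tokens

# Softmax: exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i has zeros after column i; each row still sums to 1.
```
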

Encoder-decoder (T5)

T5 and the original transformer use both stacks. The encoder reads the full input with bidirectional attention. The decoder generates output using masked self-attention (so it cannot peek at future output tokens) plus cross-attention into the encoder representations. This setup works well for sequence-to-sequence tasks: translation, summarization, and structured generation.

Self-attention vs cross-attention

graph TD
  subgraph Self-Attention
      direction LR
      SA_IN["Same sequence provides
Q, K, and V"] --> SA_OUT["Each token attends
to all other tokens
in its own sequence"]
  end
  subgraph Cross-Attention
      direction LR
      DEC["Decoder provides Q"] --> CA_OUT["Decoder tokens attend
to encoder tokens"]
      ENC["Encoder provides K and V"] --> CA_OUT
  end

Self-attention lets tokens within the same sequence relate to each other. Cross-attention lets the decoder query the encoder, connecting input and output sequences.

Attention variants

Different scoring functions and structural changes have produced several attention variants over the years:

Name                | Complexity       | Use case                   | Year
Bahdanau (additive) | O(n^2 \cdot d)   | Seq2seq translation        | 2014
Luong (dot-product) | O(n^2 \cdot d)   | Seq2seq with simpler scoring | 2015
Scaled dot-product  | O(n^2 \cdot d)   | Transformer core           | 2017
Multi-head          | O(n^2 \cdot d)   | Parallel attention heads   | 2017
Sparse attention    | O(n \sqrt{n})    | Long documents             | 2019
Linear attention    | O(n \cdot d^2)   | Efficient transformers     | 2020

The O(n^2) cost of standard attention is the main bottleneck for long sequences. Sparse and linear variants trade some representational power for better scaling. For most tasks with moderate sequence lengths (under a few thousand tokens), scaled dot-product with multiple heads remains the default.

What comes next

We covered how attention removes the fixed-length bottleneck and how transformers stack attention with feed-forward layers to build deep sequence models. The three architecture variants, encoder-only, decoder-only, and encoder-decoder, serve different tasks but share the same building blocks.

The next article covers word embeddings: the learned vector representations that transformers operate on.
