
Attention mechanism and transformers

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Attention is a mechanism that lets a neural network decide which parts of the input matter most for each part of the output. Instead of compressing an entire sequence into one vector, the model learns to focus on the relevant pieces at each step. This single idea removed the biggest bottleneck in sequence models and led directly to the transformer, the architecture behind every major language model today.

Prerequisites

You should be comfortable with RNNs and LSTMs, especially the encoder-decoder (seq2seq) setup. You will also need matrix multiplication and the dot product. If you are rusty on either, review those articles first.

The big picture

When translating a sentence, some input words matter more for certain output words. Translating “The cat sat on the mat” into French: when generating “chat” (French for cat), the model should focus on “cat,” not “mat” or “on.” Attention makes this possible.

Here is how attention weights might look for this translation:

Output (French) | The  | cat  | sat  | on   | the  | mat
Le              | 0.70 | 0.05 | 0.05 | 0.05 | 0.10 | 0.05
chat            | 0.05 | 0.80 | 0.05 | 0.02 | 0.03 | 0.05
assis           | 0.03 | 0.10 | 0.75 | 0.05 | 0.02 | 0.05
sur             | 0.02 | 0.02 | 0.05 | 0.80 | 0.06 | 0.05
le              | 0.05 | 0.03 | 0.02 | 0.05 | 0.60 | 0.25
tapis           | 0.02 | 0.05 | 0.03 | 0.05 | 0.10 | 0.75

Each row sums to 1. The largest weight in each row shows where the model focuses most.

Attention: weighting input words by relevance

graph TD
  IN1["The"] --> W["Compute relevance
to current output word"]
  IN2["cat"] --> W
  IN3["sat"] --> W
  IN4["on"] --> W
  IN5["the"] --> W
  IN6["mat"] --> W
  W --> SOFT["Softmax:
weights sum to 1"]
  SOFT --> MIX["Weighted combination
of input representations"]
  MIX --> OUT["Output: chat"]

Instead of compressing the entire input into one vector, the model looks back at every input word and picks the relevant ones for each output step.

Now let’s formalize this mechanism.

The fixed-length bottleneck

A standard seq2seq model uses an encoder RNN to read an input sequence token by token, producing a single hidden vector at the end. The decoder RNN then generates the output sequence using only that one vector as its starting context.

This works for short sequences. But consider translating a 50-word sentence. The encoder must pack all meaning, word order, and relationships into a fixed-length vector, typically 256 or 512 dimensions. Important details get lost, especially for tokens near the start of the sequence. Performance drops steadily as input length grows.

The core problem: one vector is not enough to represent an entire sequence. We need a way for the decoder to go back and look at specific parts of the input.

Attention: letting the decoder look back


The fix is straightforward. Instead of one compressed vector, give the decoder access to all encoder hidden states at every step.

The encoder produces hidden states h_1, h_2, \ldots, h_n, one per input token. At decoder step t, the decoder has its own hidden state s_t. We compute an alignment score between s_t and each encoder state:

e_{tj} = \text{score}(s_t, h_j)

Softmax turns these scores into weights that sum to 1:

\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})}

The context vector for step t is the weighted sum of all encoder states:

c_t = \sum_{j=1}^{n} \alpha_{tj} \, h_j

Now the decoder uses both s_t and c_t to produce its output. Different decoder steps can attend to different input positions. When translating “the cat sat on the mat,” the decoder focuses on “cat” when generating its translation of that word, then shifts to “sat” for the next word.
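One attention step fits in a few lines of NumPy. This is a minimal sketch with random vectors standing in for real encoder and decoder states, and a plain dot-product score in place of Bahdanau's learned feed-forward scorer:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 8                      # 6 encoder states, hidden size 8
H = rng.normal(size=(n, d))      # encoder hidden states h_1 .. h_n
s_t = rng.normal(size=d)         # decoder state at step t

# Alignment scores e_tj = score(s_t, h_j); a dot product here for
# simplicity (Bahdanau's original score is a small feed-forward net)
e = H @ s_t                      # shape (n,)

# Softmax turns scores into attention weights alpha_tj that sum to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_t = sum_j alpha_tj * h_j
c_t = alpha @ H                  # shape (d,)

print(alpha.sum())               # 1.0 (up to float error)
```

The context vector c_t is recomputed at every decoder step, so each output token gets its own mixture of encoder states.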

This was proposed by Bahdanau et al. in 2014. It improved translation quality dramatically, especially on longer sentences.

Scaled dot-product attention

Bahdanau’s original attention used a small neural network to compute scores. Luong (2015) simplified this to a plain dot product. The transformer (Vaswani et al., 2017) generalized the idea with three learned projections: Query, Key, and Value.

Given an input sequence, we project it into three matrices:

  • Query (Q): what am I looking for?
  • Key (K): what do I contain?
  • Value (V): what information do I carry?

The attention formula is:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Here d_k is the dimension of the key vectors. The product QK^T gives a matrix of scores: how well each query matches each key. Row-wise softmax converts each row of raw scores into weights that sum to 1. Multiplying by V produces the weighted combination of values.

Query-Key-Value in plain language

graph LR
  Q["Query:
What am I looking for?"] --> SCORE["Dot product:
how well does each
key match my query?"]
  K["Key:
What do I contain?"] --> SCORE
  SCORE --> SOFT["Softmax:
turn scores into weights"]
  SOFT --> MIX["Weighted sum
of values"]
  V["Value:
What info do I carry?"] --> MIX
  MIX --> OUT["Attention output"]

Worked example: three-token attention

Let d_k = 2 with the following matrices:

Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}

Step 1: Compute QK^T (3 × 3)

Each entry is a dot product between one row of Q and one row of K:

QK^T = \begin{bmatrix} 1 \cdot 1 + 0 \cdot 1 & 1 \cdot 0 + 0 \cdot 1 & 1 \cdot 1 + 0 \cdot 0 \\ 0 \cdot 1 + 1 \cdot 1 & 0 \cdot 0 + 1 \cdot 1 & 0 \cdot 1 + 1 \cdot 0 \\ 1 \cdot 1 + 1 \cdot 1 & 1 \cdot 0 + 1 \cdot 1 & 1 \cdot 1 + 1 \cdot 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}

Step 2: Scale by \sqrt{d_k} = \sqrt{2} \approx 1.414

\frac{QK^T}{\sqrt{2}} = \begin{bmatrix} 0.707 & 0.000 & 0.707 \\ 0.707 & 0.707 & 0.000 \\ 1.414 & 0.707 & 0.707 \end{bmatrix}

Step 3: Apply softmax row by row

Row 0: scores [0.707, 0.000, 0.707]

\exp(0.707) \approx 2.028, \quad \exp(0.000) = 1.000, \quad \exp(0.707) \approx 2.028

Sum = 5.056. Dividing: [0.401, 0.198, 0.401].

Row 1: scores [0.707, 0.707, 0.000]

\exp(0.707) \approx 2.028, \quad \exp(0.707) \approx 2.028, \quad \exp(0.000) = 1.000

Sum = 5.056. Dividing: [0.401, 0.401, 0.198].

Row 2: scores [1.414, 0.707, 0.707]

\exp(1.414) \approx 4.113, \quad \exp(0.707) \approx 2.028, \quad \exp(0.707) \approx 2.028

Sum = 8.169. Dividing: [0.503, 0.248, 0.248].

The full attention weight matrix:

A = \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.401 & 0.401 & 0.198 \\ 0.503 & 0.248 & 0.248 \end{bmatrix}

Step 4: Multiply A \times V

\text{Output} = \begin{bmatrix} 0.401 \cdot 1 + 0.198 \cdot 0 + 0.401 \cdot 0 & 0.401 \cdot 0 + 0.198 \cdot 1 + 0.401 \cdot 0 \\ 0.401 \cdot 1 + 0.401 \cdot 0 + 0.198 \cdot 0 & 0.401 \cdot 0 + 0.401 \cdot 1 + 0.198 \cdot 0 \\ 0.503 \cdot 1 + 0.248 \cdot 0 + 0.248 \cdot 0 & 0.503 \cdot 0 + 0.248 \cdot 1 + 0.248 \cdot 0 \end{bmatrix} = \begin{bmatrix} 0.401 & 0.198 \\ 0.401 & 0.401 \\ 0.503 & 0.248 \end{bmatrix}

Notice that token 3 (row 2) places the most weight on key 1, because its query [1, 1] aligns best with key [1, 1]. Token 1 (row 0) splits its attention equally between keys 1 and 3, ignoring key 2.
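The hand computation above can be checked with a few lines of NumPy. The max subtraction inside the softmax is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # raw score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # stability trick
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 1.], [0., 1.], [1., 0.]])
V = np.array([[1., 0.], [0., 1.], [0., 0.]])

out, A = scaled_dot_product_attention(Q, K, V)
print(np.round(A, 3))
# [[0.401 0.198 0.401]
#  [0.401 0.401 0.198]
#  [0.503 0.248 0.248]]
print(np.round(out, 3))
# [[0.401 0.198]
#  [0.401 0.401]
#  [0.503 0.248]]
```

The printed matrices match the worked example digit for digit.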

Why scale by \sqrt{d_k}?

When d_k is large, dot products grow large. Suppose each element of q and k is drawn independently from N(0, 1). The dot product q \cdot k = \sum_{i=1}^{d_k} q_i k_i is a sum of d_k independent terms, each with mean 0 and variance 1. So:

\mathbb{E}[q \cdot k] = 0, \qquad \text{Var}(q \cdot k) = d_k

With d_k = 64, the standard deviation is \sqrt{64} = 8. Dot products regularly hit values like +8 or -8. When softmax receives inputs this large, it produces near-one-hot outputs. Gradients become tiny, and learning stalls.

Dividing by \sqrt{d_k} rescales the variance back to 1:

\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1

Worked example: scaling in practice

Compare softmax on unscaled scores (simulating d_k = 64, where values have std ≈ 8) versus properly scaled scores (std ≈ 1).

Unscaled: softmax([8, -4, 6])

\exp(8) \approx 2981.0, \quad \exp(-4) \approx 0.018, \quad \exp(6) \approx 403.4

Sum ≈ 3384.4

softmax ≈ [0.881, 0.000, 0.119]

Almost all weight collapses to one position. The second position gets essentially zero weight, meaning gradients for that position vanish.

Scaled: softmax([1.0, -0.5, 0.75])

\exp(1.0) \approx 2.718, \quad \exp(-0.5) \approx 0.607, \quad \exp(0.75) \approx 2.117

Sum ≈ 5.442

softmax ≈ [0.500, 0.111, 0.389]

Weight spreads across all positions. Gradients flow everywhere, and the model can adjust all attention weights during backpropagation.
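A quick NumPy experiment confirms both claims: dot products of random N(0, 1) vectors with d_k = 64 have standard deviation near 8, scaling brings it back near 1, and the two softmax examples behave as computed above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Sample many (q, k) pairs and measure the spread of their dot products
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
dots = (q * k).sum(axis=1)
print(dots.std())                    # close to sqrt(64) = 8
print((dots / np.sqrt(d_k)).std())   # close to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(np.array([8.0, -4.0, 6.0])))    # collapses onto one position
print(softmax(np.array([1.0, -0.5, 0.75])))   # spread across all positions
```
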

Multi-head attention

A single attention head learns one type of relationship. Maybe it captures word order, or subject-verb agreement, or coreference. But language has many simultaneous relationships. Multi-head attention runs several attention computations in parallel, each with its own learned projections.

For h heads with model dimension d_model, each head operates on dimension d_k = d_model / h:

\text{head}_i = \text{Attention}(Q W_i^Q, \; K W_i^K, \; V W_i^V)

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O

The projection matrices W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k} and the output projection W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}} are all learned parameters. The total computation cost is comparable to single-head attention at full dimensionality, but the model gains h distinct “views” of the input.

In practice, BERT-base uses h = 12 heads with d_model = 768, giving d_k = 64 per head. GPT-3 uses h = 96 heads with d_model = 12288.
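A minimal NumPy sketch of the split-attend-concatenate pattern. It loops over heads for clarity; real implementations fold the heads into one batched tensor operation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, W_q, W_k, W_v, W_o):
    # One attention computation per head, each with its own projections
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(W_q.shape[0])]
    # Concatenate head outputs, then apply the output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

n, d_model, h = 5, 16, 4
d_k = d_model // h                       # 4 dims per head
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, W_q, W_k, W_v, W_o)
print(out.shape)   # (5, 16): sequence length by d_model
```
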

Multi-head attention: split, attend, concatenate

graph LR
  INPUT["Input"] --> SPLIT["Split into h heads
(each sees d_model / h dims)"]
  SPLIT --> H1["Head 1:
attend independently"]
  SPLIT --> H2["Head 2:
attend independently"]
  SPLIT --> HN["Head h:
attend independently"]
  H1 --> CAT["Concatenate
all head outputs"]
  H2 --> CAT
  HN --> CAT
  CAT --> PROJ["Linear projection W_O"]
  PROJ --> OUT["Multi-head output"]

Each head learns a different type of relationship: one might capture word order, another subject-verb agreement, another coreference. Concatenating them gives the model multiple simultaneous views.

graph TD
  Q["Q"] --> WQ1["W₁Q"] & WQ2["W₂Q"] & WQh["WₕQ"]
  K["K"] --> WK1["W₁K"] & WK2["W₂K"] & WKh["WₕK"]
  V["V"] --> WV1["W₁V"] & WV2["W₂V"] & WVh["WₕV"]
  WQ1 --> H1["Head 1: Attention"]
  WK1 --> H1
  WV1 --> H1
  WQ2 --> H2["Head 2: Attention"]
  WK2 --> H2
  WV2 --> H2
  WQh --> Hh["Head h: Attention"]
  WKh --> Hh
  WVh --> Hh
  H1 --> Cat["Concatenate"]
  H2 --> Cat
  Hh --> Cat
  Cat --> WO["W_O Projection"]
  WO --> Out["Multi-Head Output"]

Positional encoding

Transformers process all tokens in parallel. Unlike RNNs, there is no built-in notion of sequence order. The sentences “dog bites man” and “man bites dog” would produce identical attention patterns without position information.

Positional encodings fix this by adding a position signal directly to the input embeddings. The original transformer uses sinusoidal functions:

PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{\,2i / d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i / d_{\text{model}}}}\right)

Each dimension pair uses a different frequency. Low-index dimensions oscillate quickly, capturing fine position differences. High-index dimensions oscillate slowly, capturing broad position information. A key property: the encoding for position pos + k can be expressed as a linear function of the encoding at position pos, which makes it easy for the model to learn relative position patterns.

Worked example: positional encoding matrix

Let d_model = 4 and compute encodings for positions 0, 1, and 2.

The four dimensions use two frequency pairs:

  • Dimension 0 (i = 0, even): \sin(pos / 10000^{0/4}) = \sin(pos)
  • Dimension 1 (i = 0, odd): \cos(pos / 10000^{0/4}) = \cos(pos)
  • Dimension 2 (i = 1, even): \sin(pos / 10000^{2/4}) = \sin(pos / 100)
  • Dimension 3 (i = 1, odd): \cos(pos / 10000^{2/4}) = \cos(pos / 100)

Computing each value:

Position | sin(pos) | cos(pos) | sin(pos/100) | cos(pos/100)
0        | 0.000    | 1.000    | 0.000        | 1.000
1        | 0.841    | 0.540    | 0.010        | 1.000
2        | 0.909    | -0.416   | 0.020        | 1.000

The full positional encoding matrix:

PE = \begin{bmatrix} 0.000 & 1.000 & 0.000 & 1.000 \\ 0.841 & 0.540 & 0.010 & 1.000 \\ 0.909 & -0.416 & 0.020 & 1.000 \end{bmatrix}

The first two columns change rapidly between positions, while the last two change very slowly. This gives the model both fine-grained and coarse position signals at different dimensions.
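The matrix above can be generated programmatically. A minimal NumPy sketch that reproduces the d_model = 4 table for positions 0 through 2:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    PE = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]        # column of positions
    two_i = np.arange(0, d_model, 2)[None, :]    # even indices 2i
    angle = pos / 10000 ** (two_i / d_model)     # one frequency per pair
    PE[:, 0::2] = np.sin(angle)                  # even dims get sine
    PE[:, 1::2] = np.cos(angle)                  # odd dims get cosine
    return PE

PE = positional_encoding(3, 4)
print(np.round(PE, 3))
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.01   1.   ]
#  [ 0.909 -0.416  0.02   1.   ]]
```
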

Positional encoding methods

Method     | How it works                                             | Extrapolates?        | Learned? | Used in
Sinusoidal | Fixed sin/cos at varying frequencies                     | Yes, in theory       | No       | Original transformer
Learned    | Embedding lookup table indexed by position               | No, fixed max length | Yes      | BERT, GPT-2
RoPE       | Rotates query and key vectors by position-dependent angle | Yes, with decay     | No       | LLaMA, PaLM
ALiBi      | Adds linear bias to attention scores based on token distance | Yes              | No       | BLOOM, MPT

The transformer block

Each transformer encoder block chains four components together with residual connections:

  1. Multi-head attention: every token attends to every other token.
  2. Add and norm: a residual connection adds the input back to the attention output, followed by layer normalization.
  3. Feed-forward network (FFN): two linear layers with a ReLU or GELU activation between them, applied to each position independently.
  4. Add and norm: another residual connection and layer norm after the FFN.

Dropout is applied after each sub-layer (attention and FFN) during training.

The residual connections are critical for training deep stacks. They let gradients flow directly through the network via the chain rule, preventing the vanishing gradient problem. The feed-forward network gives the model per-token transformation capacity beyond what attention provides.

A typical transformer stacks many of these blocks. BERT-base uses 12, the largest GPT-2 variant uses 48, and GPT-3 uses 96.

graph TD
  A["Input Embedding + Positional Encoding"] --> B["Multi-Head Attention"]
  B --> C["Add (residual)"]
  A --> C
  C --> D["Layer Norm"]
  D --> E["Feed-Forward Network"]
  E --> F["Add (residual)"]
  D --> F
  F --> G["Layer Norm"]
  G --> H["Block Output"]
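The block diagram maps directly to code. Below is a minimal single-head NumPy sketch of the forward pass, using the post-norm layout of the original transformer; learnable layer-norm gains and biases, dropout, and multiple heads are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_block(X, p):
    # 1) multi-head attention (single head here), 2) add & norm,
    # 3) position-wise FFN with ReLU, 4) add & norm
    attn = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + attn)                              # residual 1
    ffn = np.maximum(0, X @ p["W_1"]) @ p["W_2"]          # ReLU FFN
    return layer_norm(X + ffn)                            # residual 2

n, d_model, d_ff = 5, 8, 32
params = {name: rng.normal(size=shape) * 0.1
          for name, shape in [("W_q", (d_model, d_model)),
                              ("W_k", (d_model, d_model)),
                              ("W_v", (d_model, d_model)),
                              ("W_1", (d_model, d_ff)),
                              ("W_2", (d_ff, d_model))]}
X = rng.normal(size=(n, d_model))
out = encoder_block(X, params)
print(out.shape)   # (5, 8): same shape in and out, so blocks stack
```

Because input and output shapes match, blocks can be stacked arbitrarily deep, with the residual connections carrying gradients through the whole stack.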

Three transformer architectures

The original transformer has both an encoder and a decoder. But researchers found that using only one half works better for certain tasks. Three variants have become standard.

graph TD
  subgraph "Encoder-Only: BERT"
      direction TB
      B1["Input Tokens"] --> B2["Bidirectional Encoder Blocks"]
      B2 --> B3["Contextual Representations"]
  end
  subgraph "Decoder-Only: GPT"
      direction TB
      G1["Input Tokens"] --> G2["Masked Decoder Blocks"]
      G2 --> G3["Next Token Prediction"]
  end
  subgraph "Encoder-Decoder: T5"
      direction TB
      T1["Source Tokens"] --> T2["Encoder Blocks"]
      T2 --> T3["Cross-Attention"]
      T4["Target Tokens"] --> T5["Masked Decoder Blocks"]
      T5 --> T3
      T3 --> T6["Output Tokens"]
  end

Encoder-only (BERT)

BERT stacks encoder blocks only. Every token can attend to every other token in both directions. This bidirectional context is ideal for tasks that need full understanding of the input: text classification, named entity recognition, and extractive question answering. BERT is pre-trained by masking random tokens and predicting them from context, a task called masked language modeling. It uses cross-entropy loss between the predicted token distribution and the true token.

Decoder-only (GPT)

GPT stacks decoder blocks with causal (masked) self-attention. Each token can only attend to tokens at earlier positions. This left-to-right constraint makes the architecture a natural fit for text generation: given a sequence of tokens, predict the next one. GPT-2, GPT-3, GPT-4, and LLaMA all use this pattern. It has become the dominant architecture for large language models.
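The causal constraint is typically implemented with a mask that sets the scores for future positions to negative infinity before the softmax, so their attention weights become exactly zero. A minimal NumPy sketch:

```python
import numpy as np

n = 4
# Causal mask: True above the diagonal, i.e. position i may only
# attend to positions j <= i
mask = np.triu(np.ones((n, n)), k=1).astype(bool)

scores = np.random.default_rng(0).normal(size=(n, n))
scores[mask] = -np.inf             # block attention to future tokens

# Softmax: exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i has zeros after column i; each row still sums to 1.
```
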

Encoder-decoder (T5)

T5 and the original transformer use both stacks. The encoder reads the full input with bidirectional attention. The decoder generates output using masked self-attention (so it cannot peek at future output tokens) plus cross-attention into the encoder representations. This setup works well for sequence-to-sequence tasks: translation, summarization, and structured generation.

Self-attention vs cross-attention

graph TD
  subgraph Self-Attention
      direction LR
      SA_IN["Same sequence provides
Q, K, and V"] --> SA_OUT["Each token attends
to all other tokens
in its own sequence"]
  end
  subgraph Cross-Attention
      direction LR
      DEC["Decoder provides Q"] --> CA_OUT["Decoder tokens attend
to encoder tokens"]
      ENC["Encoder provides K and V"] --> CA_OUT
  end

Self-attention lets tokens within the same sequence relate to each other. Cross-attention lets the decoder query the encoder, connecting input and output sequences.

Attention variants

Different scoring functions and structural changes have produced several attention variants over the years:

Name                | Complexity       | Use case                   | Year
Bahdanau (additive) | O(n^2 \cdot d)   | Seq2seq translation        | 2014
Luong (dot-product) | O(n^2 \cdot d)   | Seq2seq with simpler scoring | 2015
Scaled dot-product  | O(n^2 \cdot d)   | Transformer core           | 2017
Multi-head          | O(n^2 \cdot d)   | Parallel attention heads   | 2017
Sparse attention    | O(n \sqrt{n})    | Long documents             | 2019
Linear attention    | O(n \cdot d^2)   | Efficient transformers     | 2020

The O(n^2) cost of standard attention is the main bottleneck for long sequences. Sparse and linear variants trade some representational power for better scaling. For most tasks with moderate sequence lengths (under a few thousand tokens), scaled dot-product with multiple heads remains the default.

What comes next

We covered how attention removes the fixed-length bottleneck and how transformers stack attention with feed-forward layers to build deep sequence models. The three architecture variants, encoder-only, decoder-only, and encoder-decoder, serve different tasks but share the same building blocks.

The next article covers word embeddings: the learned vector representations that transformers operate on.
