
Graph neural networks

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Not all data lives on a grid. Social networks, molecules, citation graphs, road networks, protein structures: these are all graphs. Nodes have features, edges encode relationships, and the structure itself carries information. Graph neural networks (GNNs) extend deep learning to handle this irregular, non-Euclidean data.

Prerequisites

You should understand attention mechanisms (especially for GAT) and matrix operations. Familiarity with CNNs helps because GNNs generalize the same local-aggregation idea from grids to graphs.

Graphs: the basics

A graph G = (V, E) has nodes V and edges E. For deep learning, we add:

  • Node features X \in \mathbb{R}^{n \times d}: each node v has a feature vector x_v \in \mathbb{R}^d
  • Adjacency matrix A \in \{0, 1\}^{n \times n}: A_{ij} = 1 if there's an edge from i to j
  • Edge features (optional): attributes on edges, like bond type in a molecule

Why graphs need special treatment: standard neural networks expect fixed-size, ordered inputs. A fully connected layer on n nodes would need n^2 parameters and wouldn't generalize to graphs of different sizes. GNNs solve this by operating locally: each node looks at its neighbors, regardless of the total graph size.

The message passing framework

Almost all GNNs follow the same pattern: message passing. In each layer, every node:

  1. Collects messages from its neighbors
  2. Aggregates them (sum, mean, max, or attention-weighted)
  3. Updates its own representation using the aggregated message
```mermaid
graph TD
  subgraph "Step 1: Collect"
      N1["Neighbor u₁"] -->|"message m₁"| V["Node v"]
      N2["Neighbor u₂"] -->|"message m₂"| V
      N3["Neighbor u₃"] -->|"message m₃"| V
  end
  subgraph "Step 2: Aggregate"
      V --> AGG["AGG(m₁, m₂, m₃)
e.g., mean or sum"]
  end
  subgraph "Step 3: Update"
      AGG --> UPD["UPDATE(h_v, agg)
new node state"]
  end
```

Formally, one message passing layer computes:

h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)},\; \text{AGG}\left(\left\{ \text{MSG}(h_v^{(l)}, h_u^{(l)}) : u \in \mathcal{N}(v) \right\}\right)\right)

where h_v^{(l)} is node v's representation at layer l, \mathcal{N}(v) is the set of neighbors of v, and MSG, AGG, UPDATE are learnable functions.

Stacking k message passing layers lets each node "see" information from nodes up to k hops away. This is analogous to how stacking CNN layers increases the receptive field.
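The three steps can be sketched in a few lines of NumPy. This is a minimal, hypothetical implementation (the names and weight shapes are illustrative): MSG is the identity, AGG is the mean, and UPDATE is a linear combination followed by ReLU, one common choice among many.

```python
import numpy as np

def message_passing_layer(H, adj_list, W_self, W_neigh):
    """One message passing round: MSG = identity, AGG = mean,
    UPDATE = linear combination of own state and aggregate, then ReLU."""
    H_new = np.zeros_like(H @ W_self)
    for v, neighbors in adj_list.items():
        agg = np.mean([H[u] for u in neighbors], axis=0)          # AGG step
        H_new[v] = np.maximum(0, H[v] @ W_self + agg @ W_neigh)   # UPDATE step
    return H_new

# Triangle graph with identity weights: node 0 mixes its own features
# with the mean of its two neighbors
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(message_passing_layer(H, adj, np.eye(2), np.eye(2))[0])  # → [1.5 1. ]
```

Note that the loop touches only each node's neighbor list, never the full n × n adjacency, which is exactly the locality property discussed above.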

CNN as message passing on a grid

Here’s an insight that connects GNNs to what you already know. A 2D convolution on an image is message passing on a grid graph. Each pixel is a node. Its neighbors are the adjacent pixels (defined by the kernel size). The convolution kernel defines the MSG function. Summation is the AGG function. The result is the UPDATE.

The difference: on a grid, every node has the same number of neighbors in the same arrangement. On a general graph, nodes can have any number of neighbors in any arrangement. GNNs handle this by using permutation-invariant aggregation functions (sum, mean, max) that don’t depend on neighbor ordering.

GCN: graph convolutional network

Kipf and Welling (2017) proposed one of the simplest and most influential GNN architectures. The GCN layer computes:

H^{(l+1)} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\right)

where:

  • \hat{A} = A + I is the adjacency matrix with self-loops added
  • \hat{D} is the degree matrix of \hat{A} (diagonal, \hat{D}_{ii} = \sum_j \hat{A}_{ij})
  • W^{(l)} is the learnable weight matrix for layer l
  • \sigma is an activation function (typically ReLU)

The normalization \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} is called symmetric normalization. It prevents the scale of node features from growing with the number of neighbors. Without normalization, high-degree nodes would dominate.
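The normalization is straightforward to compute; a minimal NumPy sketch (gcn_norm is a hypothetical helper name, not a library function):

```python
import numpy as np

def gcn_norm(A):
    """Symmetrically normalized adjacency D̂^{-1/2} Â D̂^{-1/2} with self-loops."""
    A_hat = A + np.eye(len(A))               # Â = A + I
    d_inv_sqrt = A_hat.sum(axis=1) ** -0.5   # diagonal of D̂^{-1/2} as a vector
    # Elementwise: Ã_ij = Â_ij / (√D̂_ii · √D̂_jj)
    return A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)

# The 4-node graph used in Example 1: entries like Ã₀₀ = 1/3 ≈ 0.333
# and Ã₀₂ = 1/(√3·√4) ≈ 0.289
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
print(np.round(gcn_norm(A), 3))
```

The outer-product form avoids building the dense diagonal matrix explicitly; the result is symmetric whenever A is.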

Example 1: GCN forward pass

Consider a 4-node graph:

Edges: (0,1), (0,2), (1,2), (2,3). Undirected.

Adjacency matrix A:

A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}

Add self-loops: \hat{A} = A + I:

\hat{A} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}

Degree matrix \hat{D}: row sums are [3, 3, 4, 2].

\hat{D}^{-1/2} = \text{diag}\left(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{4}}, \frac{1}{\sqrt{2}}\right) = \text{diag}(0.577, 0.577, 0.500, 0.707)

Normalized adjacency \tilde{A} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}:

For entry (i, j): \tilde{A}_{ij} = \frac{\hat{A}_{ij}}{\sqrt{\hat{D}_{ii}} \cdot \sqrt{\hat{D}_{jj}}}

\tilde{A} = \begin{bmatrix} 0.333 & 0.333 & 0.289 & 0 \\ 0.333 & 0.333 & 0.289 & 0 \\ 0.289 & 0.289 & 0.250 & 0.354 \\ 0 & 0 & 0.354 & 0.500 \end{bmatrix}

Node features X (4 nodes, 2 features each):

X = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \\ 1.0 & 1.0 \\ 0.5 & 0.5 \end{bmatrix}

Weight matrix W (2×2):

W = \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}

Step 1: Compute \tilde{A} X. For node 0:

(\tilde{A}X)_0 = 0.333 \times [1,0] + 0.333 \times [0,1] + 0.289 \times [1,1] + 0 \times [0.5,0.5]
= [0.333, 0] + [0, 0.333] + [0.289, 0.289] = [0.622, 0.622]

Step 2: Multiply by W:

(\tilde{A}XW)_0 = [0.622, 0.622] \times \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix} = [0.622 + 0.311, 0.311 + 0.622] = [0.933, 0.933]

Step 3: Apply ReLU (all positive, so no change): h_0^{(1)} = [0.933, 0.933].

The output for node 0 is a smooth mix of its own features and its neighbors' features, weighted by the normalized adjacency.
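The whole worked example can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

# Graph, features, and weights from Example 1
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
W = np.array([[1.0, 0.5], [0.5, 1.0]])

A_hat = A + np.eye(4)                                  # add self-loops
d_inv_sqrt = A_hat.sum(axis=1) ** -0.5
A_tilde = A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)     # symmetric normalization

H1 = np.maximum(0, A_tilde @ X @ W)                    # one GCN layer + ReLU
print(np.round(H1[0], 3))                              # → [0.933 0.933]
```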

GraphSAGE: sampling for scalability

GCN requires the full adjacency matrix, which is impractical for graphs with millions of nodes. GraphSAGE (Hamilton et al., 2017) solves this by sampling a fixed number of neighbors per node and aggregating only from the sample.

For each node, GraphSAGE:

  1. Samples k neighbors (e.g., k = 10)
  2. Aggregates their features (mean, LSTM, or max pooling)
  3. Concatenates the aggregated features with the node’s own features
  4. Applies a linear transformation and activation

This makes the computational cost per node constant, regardless of the actual degree. Training uses mini-batches of nodes rather than the full graph.

GAT: attention over neighbors

Graph Attention Networks (Velickovic et al., 2018) apply the attention mechanism to graph neighborhoods. Instead of treating all neighbors equally (GCN) or sampling randomly (GraphSAGE), GAT learns to weight neighbors by importance.

For each edge (v, u), GAT computes an attention coefficient:

e_{vu} = \text{LeakyReLU}\left(a^T [W h_v \| W h_u]\right)

where \| denotes concatenation, W is a shared linear transformation, and a is a learnable attention vector. These raw scores are then normalized across all neighbors of v:

\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}

The updated node representation:

h_v' = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu} \, W h_u\right)

Multi-head attention (just like in Transformers) is used to stabilize learning: run K independent attention heads and concatenate (or average) their outputs.
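A single attention head can be sketched as follows (gat_attention is a hypothetical helper; shapes are illustrative):

```python
import numpy as np

def gat_attention(h_v, H_neigh, W, a, slope=0.2):
    """Single-head GAT attention over the neighbors of v.
    W: (d, d') shared transform; a: (2*d',) attention vector.
    Returns attention weights and the pre-activation aggregate."""
    Wh_v = h_v @ W
    Wh_u = H_neigh @ W
    # e_vu = LeakyReLU(a^T [Wh_v ‖ Wh_u]) for each neighbor u
    cat = np.hstack([np.tile(Wh_v, (len(Wh_u), 1)), Wh_u])
    e = cat @ a
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    alpha = np.exp(e) / np.exp(e).sum()          # softmax over neighbors
    return alpha, alpha @ Wh_u                   # attention-weighted sum

# Toy numbers (matching Example 3 below, with W = identity and a second
# neighbor chosen so its raw score comes out to 0.2)
alpha, _ = gat_attention(np.array([1.0, 1.0]),
                         np.array([[2.0, 0.0], [0.0, 1.0]]),
                         np.eye(2),
                         np.array([0.2, -0.1, 0.3, 0.1]))
print(np.round(alpha, 3))                        # → [0.622 0.378]
```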

```mermaid
graph TD
  subgraph "GCN"
      GCN_A["All neighbors
weighted equally
by degree normalization"]
  end
  subgraph "GraphSAGE"
      SAGE_A["Sample fixed k
neighbors, aggregate
with mean/max/LSTM"]
  end
  subgraph "GAT"
      GAT_A["All neighbors
weighted by learned
attention coefficients"]
  end
```

Example 2: Message passing step

Node v with features h_v = [1, 2] has three neighbors:

  • u_1: h_{u_1} = [0.5, 1.0]
  • u_2: h_{u_2} = [1.5, 0.5]
  • u_3: h_{u_3} = [0.0, 2.0]

Step 1: Mean aggregation of neighbor features:

\text{AGG} = \frac{1}{3}([0.5, 1.0] + [1.5, 0.5] + [0.0, 2.0]) = \frac{[2.0, 3.5]}{3} = [0.667, 1.167]

Step 2: Concatenate with node's own features:

[h_v \| \text{AGG}] = [1, 2, 0.667, 1.167]

Step 3: Linear transformation with W \in \mathbb{R}^{4 \times 2}:

W = \begin{bmatrix} 0.5 & 0.3 \\ -0.2 & 0.8 \\ 0.7 & -0.1 \\ 0.1 & 0.6 \end{bmatrix}

z = W^T [1, 2, 0.667, 1.167]
z_1 = 0.5(1) + (-0.2)(2) + 0.7(0.667) + 0.1(1.167) = 0.5 - 0.4 + 0.467 + 0.117 = 0.684
z_2 = 0.3(1) + 0.8(2) + (-0.1)(0.667) + 0.6(1.167) = 0.3 + 1.6 - 0.067 + 0.700 = 2.533

Step 4: ReLU activation:

h_v' = \text{ReLU}([0.684, 2.533]) = [0.684, 2.533]

Node v's new representation now encodes information from both itself and its neighbors.
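The arithmetic is easy to check with NumPy. (Full precision gives 0.683 for the first component; the 0.684 above comes from rounding the intermediate values.)

```python
import numpy as np

# Values from Example 2
h_v = np.array([1.0, 2.0])
neighbors = np.array([[0.5, 1.0], [1.5, 0.5], [0.0, 2.0]])
W = np.array([[0.5, 0.3], [-0.2, 0.8], [0.7, -0.1], [0.1, 0.6]])

agg = neighbors.mean(axis=0)                 # Step 1: mean aggregation
z = np.concatenate([h_v, agg]) @ W           # Steps 2-3: concat + transform
h_v_new = np.maximum(0, z)                   # Step 4: ReLU
print(np.round(h_v_new, 3))                  # → [0.683 2.533]
```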

Example 3: GAT attention computation

Node v with h_v = [1, 1] and neighbor u with h_u = [2, 0]. For simplicity, take the shared transform W to be the identity.

Attention vector: a = [0.2, -0.1, 0.3, 0.1] (dimension 4 because we concatenate two 2D vectors).

Concatenation: [h_v \| h_u] = [1, 1, 2, 0].

Raw attention score:

e_{vu} = \text{LeakyReLU}(a^T [h_v \| h_u]) = \text{LeakyReLU}(0.2(1) + (-0.1)(1) + 0.3(2) + 0.1(0))
= \text{LeakyReLU}(0.2 - 0.1 + 0.6 + 0) = \text{LeakyReLU}(0.7) = 0.7

(Since 0.7 > 0, LeakyReLU has no effect.)

Now suppose we also have a second neighbor w with raw score e_{vw} = 0.2. Softmax normalization over the neighbors of v:

\alpha_{vu} = \frac{\exp(0.7)}{\exp(0.7) + \exp(0.2)} = \frac{2.014}{2.014 + 1.221} = \frac{2.014}{3.235} = 0.622
\alpha_{vw} = \frac{\exp(0.2)}{\exp(0.7) + \exp(0.2)} = \frac{1.221}{3.235} = 0.378

Neighbor u gets 62% of the attention weight and neighbor w gets 38%. The attention mechanism learned that u is more relevant to v than w is. The final aggregation would be:

h_v' = \sigma(0.622 \cdot W h_u + 0.378 \cdot W h_w)
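The softmax step is easy to check numerically, using the raw score e_vu = 0.7 computed above and e_vw = 0.2 for the second neighbor:

```python
import numpy as np

e = np.array([0.7, 0.2])                  # raw scores e_vu, e_vw
alpha = np.exp(e) / np.exp(e).sum()       # softmax over v's neighbors
print(np.round(alpha, 3))                 # → [0.622 0.378]
```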

Readout: from node representations to predictions

Different tasks need different outputs:

Node classification: predict a label for each node (e.g., classify users in a social network). Use the final node representations directly with a classifier on top.

Link prediction: predict whether an edge exists between two nodes. Compute a score from pairs of node representations, e.g., dot product: \text{score}(u, v) = h_u^T h_v.

Graph classification: predict a label for the entire graph (e.g., predict whether a molecule is toxic). You need a readout function that aggregates all node representations into a single graph-level vector:

h_G = \text{READOUT}(\{h_v : v \in V\})

Common readout functions: mean pooling, sum pooling, or hierarchical pooling that coarsens the graph in stages.
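A minimal sketch of the three simple readouts and a dot-product link score (helper names are illustrative):

```python
import numpy as np

def readout(H, mode="mean"):
    """Pool all node representations H (n, d) into one graph-level vector."""
    return {"mean": H.mean(axis=0),
            "sum": H.sum(axis=0),
            "max": H.max(axis=0)}[mode]

def link_score(h_u, h_v):
    """Dot-product score for link prediction on a candidate edge (u, v)."""
    return float(h_u @ h_v)

H = np.array([[1.0, 2.0], [3.0, 4.0]])
print(readout(H, "mean"))                 # → [2. 3.]
print(link_score(H[0], H[1]))             # → 11.0
```

Mean readout is invariant to graph size; sum readout preserves size information, which is one reason GIN-style models prefer it for graph classification.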

GNN variants comparison

| Name | Aggregation | Attention | Scalable | Edge features | Best for |
| --- | --- | --- | --- | --- | --- |
| GCN | Normalized mean | — | Moderate | — | Semi-supervised node classification |
| GraphSAGE | Sampled mean/max/LSTM | — | ✓ | — | Large-scale inductive tasks |
| GAT | Attention-weighted sum | ✓ | Moderate | With modification | Tasks needing neighbor importance |
| GIN | Sum (injective) | — | Moderate | — | Graph classification (most expressive) |
| MPNN | Flexible (learned) | Optional | Varies | ✓ | Molecular property prediction |
| SchNet | Continuous filter | — | Moderate | ✓ (distance) | 3D molecular geometry |

Applications

GNNs have found use across many domains:

  • Chemistry: predicting molecular properties, drug discovery, reaction prediction
  • Social networks: recommendation systems, community detection, influence modeling
  • Computer vision: scene graphs, point cloud processing, human pose estimation
  • NLP: knowledge graph reasoning, semantic parsing, dependency parsing
  • Physics: particle simulations, fluid dynamics, material science
  • Infrastructure: traffic prediction, power grid analysis

[Figure: Node classification accuracy vs GNN depth]

Known limitations

  1. Over-smoothing: as you stack more layers, node representations converge to the same vector. After too many message passing rounds, every node looks the same. This limits GNNs to relatively shallow architectures (2-4 layers typically).
  2. Over-squashing: information from distant nodes gets compressed into fixed-size vectors, losing information. Think of it as a bottleneck in long-range message passing.
  3. Expressiveness: the Weisfeiler-Lehman (WL) test sets an upper bound on what message passing GNNs can distinguish. Standard GNNs cannot distinguish certain non-isomorphic graphs that the 1-WL test also fails on.
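Over-smoothing is easy to demonstrate in miniature. Repeatedly applying plain neighborhood averaging (no weights, no nonlinearity, a toy stand-in for a very deep GCN) to the 4-node graph from Example 1 drives every node toward the same vector:

```python
import numpy as np

A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
A_hat = A + np.eye(4)
P = A_hat / A_hat.sum(axis=1, keepdims=True)    # row-normalized propagation

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
for _ in range(50):                             # 50 message passing rounds
    H = P @ H
spread = H.max(axis=0) - H.min(axis=0)          # per-feature spread over nodes
print(spread.max() < 1e-6)                      # → True: all nodes identical
```

Because P is row-stochastic on a connected graph with self-loops, repeated application converges to a rank-one matrix, so every row of H becomes the same vector. Nonlinearities and learned weights slow this collapse but do not remove the underlying tendency.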

What comes next

With architectures for grids (CNNs), sequences (RNNs, Transformers), and graphs (GNNs) under your belt, you have the building blocks for nearly any deep learning system. The final piece is knowing how to make it all work in practice. Practical deep learning: debugging and tuning covers the skills that turn theoretical knowledge into working models.
