
Network compression and efficient inference

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

A ResNet-50 has 25 million parameters and needs 4 billion FLOPs for one forward pass. That’s fine on a server with a GPU. It’s not fine on a phone, a drone, or an IoT sensor. Network compression makes large models small and fast enough to run where they actually need to run: at the edge, under latency constraints, with limited memory and power.

Prerequisites

You should be comfortable with convolutional neural networks, transfer learning, and matrix decompositions (especially SVD). Understanding how training works will help with the fine-tuning steps that most compression methods require.

Why compression matters

Three practical reasons drive the need for smaller models:

  1. Latency: a self-driving car can’t wait 200ms for a prediction. Real-time applications need inference in single-digit milliseconds.
  2. Memory: mobile devices have limited RAM. A 400MB model won’t fit alongside the rest of an app.
  3. Energy: every FLOP costs energy. On battery-powered devices, fewer FLOPs means longer battery life. In data centers, less compute means lower electricity bills.

The good news: most neural networks are heavily overparameterized. They contain far more capacity than they need for the task. Compression exploits this redundancy.

Pruning: removing unnecessary weights

Pruning removes weights (or entire neurons/filters) that contribute little to the output. The simplest approach: remove weights with the smallest magnitude.

Unstructured vs structured pruning

Unstructured pruning zeroes out individual weights anywhere in the network. You get a sparse weight matrix. The compression ratio can be very high (90%+ of weights removed), but sparse matrix operations are not well supported on most hardware. You need special sparse libraries or hardware to see speedups.

Structured pruning removes entire filters, channels, or layers. The resulting network is a regular, smaller dense network that runs faster on standard hardware without special support. The compression ratio is usually lower, but the speedup is real and immediate.

graph LR
  A[Train full model] --> B[Rank weights by magnitude]
  B --> C[Remove smallest weights/filters]
  C --> D[Fine-tune to recover accuracy]
  D --> E{Accuracy acceptable?}
  E -->|No| B
  E -->|Yes| F[Deploy compressed model]

Example 1: Magnitude pruning

Weight matrix before pruning:

$$W = \begin{bmatrix} 0.80 & -0.10 & 0.50 \\ 0.02 & -0.70 & 0.30 \\ 0.15 & 0.60 & -0.04 \end{bmatrix}$$

Threshold: prune all weights with $|w| < 0.2$.

Check each weight:

  • $|0.80| = 0.80 \geq 0.2$ ✓ keep
  • $|-0.10| = 0.10 < 0.2$ ✗ prune
  • $|0.50| = 0.50 \geq 0.2$ ✓ keep
  • $|0.02| = 0.02 < 0.2$ ✗ prune
  • $|-0.70| = 0.70 \geq 0.2$ ✓ keep
  • $|0.30| = 0.30 \geq 0.2$ ✓ keep
  • $|0.15| = 0.15 < 0.2$ ✗ prune
  • $|0.60| = 0.60 \geq 0.2$ ✓ keep
  • $|-0.04| = 0.04 < 0.2$ ✗ prune

Sparse matrix after pruning:

$$W_{\text{pruned}} = \begin{bmatrix} 0.80 & 0 & 0.50 \\ 0 & -0.70 & 0.30 \\ 0 & 0.60 & 0 \end{bmatrix}$$

We removed 4 out of 9 weights. Compression ratio: $9/5 = 1.8\times$. If we stored only non-zero values plus indices, we'd need $5 \times (\text{value} + \text{index})$ instead of $9 \times \text{value}$.

In practice, you’d fine-tune the remaining weights for a few epochs to recover accuracy lost from pruning.
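The magnitude-pruning step above can be sketched in a few lines of NumPy. This is a minimal illustration of thresholding the example matrix, not a production pruning routine:

```python
import numpy as np

# Weight matrix from the worked example above.
W = np.array([
    [0.80, -0.10,  0.50],
    [0.02, -0.70,  0.30],
    [0.15,  0.60, -0.04],
])

threshold = 0.2
mask = np.abs(W) >= threshold   # True where a weight survives pruning
W_pruned = W * mask             # zero out the pruned weights

print(W_pruned)
print("weights kept:", int(mask.sum()), "of", W.size)      # 5 of 9
print("compression ratio:", W.size / int(mask.sum()))       # 1.8
```

In a real pipeline the mask would be kept fixed while the surviving weights are fine-tuned, so pruned weights stay at zero during the recovery epochs.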

The lottery ticket hypothesis

Frankle and Carbin (2019) proposed a striking idea: within a randomly initialized dense network, there exists a sparse subnetwork (the “winning ticket”) that, when trained in isolation from the same initialization, reaches the same accuracy as the full network.

The practical implication: you can find small networks that work just as well as large ones. The catch is that finding the winning ticket currently requires training the full network first, then pruning, then rewinding to the original initialization and retraining. This is expensive, but it tells us something deep about overparameterization.

Quantization: fewer bits per weight

Standard neural networks use 32-bit floating point (float32) for weights and activations. Quantization reduces this to 16-bit, 8-bit, or even lower. Fewer bits means less memory, faster computation, and lower energy.

Post-training quantization (PTQ): take a trained float32 model and convert weights to int8. No retraining needed. Simple but can lose accuracy, especially at very low bit widths.

The mapping from float32 to int8:

$$q = \text{round}\left(\frac{x - x_{\min}}{x_{\max} - x_{\min}} \times 255\right)$$

$$\hat{x} = \frac{q}{255} \times (x_{\max} - x_{\min}) + x_{\min}$$
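These two formulas can be sketched directly in NumPy. This is a minimal affine (min-max) quantization demo using the unsigned 0-255 range exactly as written above; the function names are illustrative, and real toolkits also quantize per-channel and calibrate activation ranges:

```python
import numpy as np

def quantize_uint8(x):
    """Map float values onto the integer grid 0..255 (affine min-max quantization)."""
    x_min, x_max = float(x.min()), float(x.max())
    q = np.round((x - x_min) / (x_max - x_min) * 255.0).astype(np.uint8)
    return q, x_min, x_max

def dequantize_uint8(q, x_min, x_max):
    """Recover approximate float values from the 8-bit codes."""
    return q.astype(np.float32) / 255.0 * (x_max - x_min) + x_min

w = np.array([-0.70, -0.10, 0.02, 0.30, 0.80], dtype=np.float32)
q, lo, hi = quantize_uint8(w)
w_hat = dequantize_uint8(q, lo, hi)

# Rounding error is at most half a quantization step: (hi - lo) / 255 / 2
print("max reconstruction error:", np.max(np.abs(w - w_hat)))
```

The endpoints $x_{\min}$ and $x_{\max}$ are reconstructed exactly (they map to codes 0 and 255); everything in between is off by at most half a step.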

Quantization-aware training (QAT): simulate quantization during training. Forward passes use quantized values; backward passes use the straight-through estimator (gradients flow through the rounding operation as if it were the identity). This gives the network a chance to adapt to the quantization noise.

Mixed-precision: use lower precision (float16 or int8) where it doesn’t hurt and full precision where it does. Typically, the first and last layers are kept at higher precision because they handle raw inputs and final logits.

Key numbers to remember:

  • float32 to float16: 2x memory reduction, minimal accuracy loss
  • float32 to int8: 4x memory reduction, usually < 1% accuracy loss with QAT
  • float32 to int4: 8x memory reduction, requires careful calibration

Knowledge distillation: teacher-student learning

Knowledge distillation trains a small “student” network to mimic a large “teacher” network. The key insight from Hinton et al. (2015): the teacher’s soft probability outputs contain more information than hard labels.

When a teacher classifies an image of a cat, its softmax output might be [0.7, 0.2, 0.1] for [cat, dog, car]. The hard label is just “cat.” But the soft output tells you that this image looks somewhat like a dog and not at all like a car. This “dark knowledge” helps the student learn better representations.

graph TD
  Input["Input x"] --> Teacher["Teacher (large model)"]
  Input --> Student["Student (small model)"]
  Teacher --> SoftT["Soft targets (temperature T)"]
  Student --> SoftS["Soft predictions (temperature T)"]
  SoftT --> KL["KL divergence loss"]
  SoftS --> KL
  Student --> HardS["Hard predictions"]
  Labels["True labels"] --> CE["Cross-entropy loss"]
  HardS --> CE
  KL --> Total["Total loss = α·KL + (1-α)·CE"]
  CE --> Total

The temperature parameter $T$ controls how soft the distributions are. At $T = 1$, you get the standard softmax. At higher $T$, the distribution becomes smoother, revealing more information about relative similarities:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Example 2: Knowledge distillation with temperature

Teacher logits: $z^T = [2.8, 0.8, 0.4]$

At $T = 1$ (standard softmax):

$$p^T = \frac{[\exp(2.8), \exp(0.8), \exp(0.4)]}{\exp(2.8) + \exp(0.8) + \exp(0.4)} = \frac{[16.44, 2.23, 1.49]}{20.16} = [0.816, 0.111, 0.074]$$

At $T = 4$ (soft targets):

$$z^T / 4 = [0.70, 0.20, 0.10]$$

$$p^T_{T=4} = \frac{[\exp(0.70), \exp(0.20), \exp(0.10)]}{\exp(0.70) + \exp(0.20) + \exp(0.10)} = \frac{[2.014, 1.221, 1.105]}{4.340} = [0.464, 0.281, 0.255]$$

Student logits: $z^S = [2.1, 0.8, 0.3]$

At $T = 4$:

$$z^S / 4 = [0.525, 0.200, 0.075]$$

$$p^S_{T=4} = \frac{[\exp(0.525), \exp(0.200), \exp(0.075)]}{\text{sum}} = \frac{[1.691, 1.221, 1.078]}{3.990} = [0.424, 0.306, 0.270]$$

KL divergence of the student's soft predictions from the teacher's (at $T = 4$):

$$D_{KL}(p^T \| p^S) = \sum_i p^T_i \log\frac{p^T_i}{p^S_i}$$

$$= 0.464 \log\frac{0.464}{0.424} + 0.281 \log\frac{0.281}{0.306} + 0.255 \log\frac{0.255}{0.270}$$

$$= 0.464 \times 0.090 + 0.281 \times (-0.085) + 0.255 \times (-0.057)$$

$$= 0.0418 - 0.0239 - 0.0145 = 0.0034$$

The KL divergence is small (0.0034), meaning the student's soft predictions are close to the teacher's. Notice how $T = 4$ spreads the probability mass, making the teacher's “dark knowledge” visible. The student can learn that class 2 (0.281) is more similar to the input than class 3 (0.255).
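The temperature softmax and the KL term of Example 2 can be reproduced in a few lines of NumPy. A minimal sketch (the small discrepancy in the last digit versus the hand calculation comes from rounding the probabilities to three decimals above):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Softmax with temperature: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def kl(p, q):
    """D_KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

z_teacher = [2.8, 0.8, 0.4]
z_student = [2.1, 0.8, 0.3]

p_T = softmax_T(z_teacher, T=4.0)
p_S = softmax_T(z_student, T=4.0)

print(np.round(p_T, 3))              # [0.464 0.281 0.255]
print(round(kl(p_T, p_S), 4))        # ≈ 0.003, matching the worked example
```

In an actual distillation loss, this KL term (scaled by $T^2$ in Hinton et al.'s formulation, so its gradient magnitude stays comparable across temperatures) is mixed with the hard-label cross-entropy.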

Low-rank factorization

A weight matrix $W \in \mathbb{R}^{m \times n}$ can be approximated by two smaller matrices using SVD:

$$W \approx U_r \Sigma_r V_r^T$$

where $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r \in \mathbb{R}^{r \times r}$, $V_r \in \mathbb{R}^{n \times r}$, and $r \ll \min(m, n)$.

This replaces one layer with two smaller layers. The original layer computes $Wx$ at cost $mn$. The factorized version computes $U_r(\Sigma_r(V_r^T x))$ at cost $nr + r + mr = (m + n)r + r$.

Example 3: Low-rank approximation savings

Consider a $4 \times 4$ weight matrix with singular values $[5.0, 3.0, 0.1, 0.05]$.

The first two singular values (5.0 and 3.0) capture most of the energy. The last two (0.1 and 0.05) are tiny. Keeping rank $r = 2$:

  • Original parameters: $4 \times 4 = 16$
  • Factorized: $U_r$ is $4 \times 2$, $\Sigma_r$ is $2 \times 2$ (diagonal, so 2 values), $V_r$ is $4 \times 2$. Total: $8 + 2 + 8 = 18$.

For this tiny matrix, factorization actually uses more parameters. The savings come with larger matrices.

Scaling to a realistic layer: $W$ is $100 \times 100$, rank $r = 10$.

  • Original: $100 \times 100 = 10{,}000$ parameters
  • Factorized: $100 \times 10 + 10 + 10 \times 100 = 1{,}000 + 10 + 1{,}000 = 2{,}010$ parameters
  • Compression ratio: $10{,}000 / 2{,}010 \approx 4.98\times$
  • Energy captured: $(5.0^2 + 3.0^2) / (5.0^2 + 3.0^2 + 0.1^2 + 0.05^2) = 34.0 / 34.0125 = 99.96\%$

So we keep 99.96% of the information with 5x fewer parameters. In practice, you'd merge $\Sigma_r$ into $U_r$ or $V_r$ to avoid the extra diagonal matrix.
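This factorization can be sketched with NumPy's SVD. The example below builds a synthetic, approximately rank-10 $100 \times 100$ matrix (the matrix and noise level are illustrative) and folds the singular values into $U_r$, as suggested above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100
r = 10

# Synthetic weight matrix: a rank-10 product plus a little noise.
W = rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) + 0.01 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # singular values in descending order
A = U[:, :r] * s[:r]    # m x r, with Sigma_r merged into U_r
B = Vt[:r, :]           # r x n
W_approx = A @ B

params_full = m * n
params_factored = A.size + B.size
print("compression ratio:", params_full / params_factored)   # 5.0 with Sigma merged
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```

Because $\Sigma_r$ is absorbed into $A$, the factorized layer stores $2{,}000$ parameters rather than $2{,}010$, and the forward pass is just two matrix multiplies: `A @ (B @ x)`.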

Mobile architectures: depthwise separable convolutions

Instead of compressing an existing model, you can design efficient architectures from scratch. MobileNet uses depthwise separable convolutions, which factor a standard convolution into two steps:

  1. Depthwise convolution: apply one filter per input channel (no cross-channel mixing)
  2. Pointwise convolution: 1x1 convolution to mix channels

A standard convolution with kernel $k \times k$, $C_{in}$ input channels, and $C_{out}$ output channels costs:

$$k^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W \text{ FLOPs}$$

Depthwise separable convolution costs:

$$k^2 \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W$$

The ratio:

$$\frac{k^2 \cdot C_{in} + C_{in} \cdot C_{out}}{k^2 \cdot C_{in} \cdot C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^2}$$

For $k = 3$ and $C_{out} = 256$: reduction factor $\approx 1/256 + 1/9 \approx 0.115$. That's roughly 8-9x fewer FLOPs.
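The two cost formulas can be checked numerically. A small sketch (the layer dimensions $C_{in} = 128$ and $H = W = 56$ are illustrative; the ratio depends only on $k$ and $C_{out}$):

```python
def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulate count of a standard k x k convolution."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_flops(k, c_in, c_out, h, w):
    """Depthwise (one k x k filter per channel) plus pointwise (1x1) convolution."""
    return k * k * c_in * h * w + c_in * c_out * h * w

k, c_in, c_out, h, w = 3, 128, 256, 56, 56
standard = conv_flops(k, c_in, c_out, h, w)
separable = depthwise_separable_flops(k, c_in, c_out, h, w)

print(separable / standard)    # 1/256 + 1/9 ≈ 0.115
print(standard / separable)    # ≈ 8.7x fewer FLOPs
```

Note how $H$, $W$, and $C_{in}$ cancel out of the ratio, which is why the $1/C_{out} + 1/k^2$ reduction holds at every spatial resolution.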

[Figure: model size vs. accuracy after compression]

Compression methods comparison

| Method | Compression ratio | Accuracy drop | Hardware friendly | Training needed |
| --- | --- | --- | --- | --- |
| Unstructured pruning | 5-20x | 0.5-2% | ✗ Needs sparse support | Fine-tuning |
| Structured pruning | 2-5x | 1-3% | ✓ Standard dense ops | Fine-tuning |
| PTQ (int8) | 4x | 0.5-1% | ✓ Widely supported | None |
| QAT (int8) | 4x | < 0.5% | ✓ Widely supported | Full retraining |
| Knowledge distillation | 3-10x (model dependent) | 1-3% | ✓ Student is standard | Full training |
| Low-rank factorization | 2-5x | 1-2% | ✓ Standard dense ops | Fine-tuning |
| Depthwise separable | 8-9x FLOPs | Architecture dependent | ✓ Optimized on mobile | Full training |

Combining methods

These methods are not mutually exclusive. A common pipeline:

  1. Start with a large, accurate teacher model
  2. Use NAS or manual design for a small student architecture (perhaps with depthwise separable convolutions)
  3. Train the student with knowledge distillation
  4. Apply quantization-aware training
  5. Optionally prune and fine-tune

Each step gives an independent compression factor. A 4x from distillation, 4x from quantization, and 2x from pruning gives 32x total compression.

What comes next

With efficient models ready for deployment, we can tackle a different class of problems: data that lives on graphs rather than grids. Graph neural networks extend the ideas from CNNs and attention mechanisms to irregular, non-Euclidean structures like social networks, molecules, and knowledge graphs.
