Network compression and efficient inference
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
A ResNet-50 has 25 million parameters and needs 4 billion FLOPs for one forward pass. That’s fine on a server with a GPU. It’s not fine on a phone, a drone, or an IoT sensor. Network compression makes large models small and fast enough to run where they actually need to run: at the edge, under latency constraints, with limited memory and power.
Prerequisites
You should be comfortable with convolutional neural networks, transfer learning, and matrix decompositions (especially SVD). Understanding how training works will help with the fine-tuning steps that most compression methods require.
Why compression matters
Three practical reasons drive the need for smaller models:
- Latency: a self-driving car can’t wait 200ms for a prediction. Real-time applications need inference in single-digit milliseconds.
- Memory: mobile devices have limited RAM. A 400MB model won’t fit alongside the rest of an app.
- Energy: every FLOP costs energy. On battery-powered devices, fewer FLOPs means longer battery life. In data centers, less compute means lower electricity bills.
The good news: most neural networks are heavily overparameterized. They contain far more capacity than they need for the task. Compression exploits this redundancy.
Pruning: removing unnecessary weights
Pruning removes weights (or entire neurons/filters) that contribute little to the output. The simplest approach: remove weights with the smallest magnitude.
Unstructured vs structured pruning
Unstructured pruning zeroes out individual weights anywhere in the network. You get a sparse weight matrix. The compression ratio can be very high (90%+ of weights removed), but sparse matrix operations are not well supported on most hardware. You need special sparse libraries or hardware to see speedups.
Structured pruning removes entire filters, channels, or layers. The resulting network is a regular, smaller dense network that runs faster on standard hardware without special support. The compression ratio is usually lower, but the speedup is real and immediate.
```mermaid
graph LR
    A[Train full model] --> B[Rank weights by magnitude]
    B --> C[Remove smallest weights/filters]
    C --> D[Fine-tune to recover accuracy]
    D --> E{Accuracy acceptable?}
    E -->|No| B
    E -->|Yes| F[Deploy compressed model]
```
Example 1: Magnitude pruning
Weight matrix before pruning (illustrative values):

$$W = \begin{bmatrix} 0.8 & -0.1 & 0.5 \\ 0.05 & 0.9 & -0.6 \\ 0.02 & 0.4 & -0.08 \end{bmatrix}$$

Threshold: prune all weights with $|w| < 0.2$.

Check each weight:
- $|0.8| \geq 0.2$ ✓ keep
- $|-0.1| < 0.2$ ✗ prune
- $|0.5| \geq 0.2$ ✓ keep
- $|0.05| < 0.2$ ✗ prune
- $|0.9| \geq 0.2$ ✓ keep
- $|-0.6| \geq 0.2$ ✓ keep
- $|0.02| < 0.2$ ✗ prune
- $|0.4| \geq 0.2$ ✓ keep
- $|-0.08| < 0.2$ ✗ prune

Sparse matrix after pruning:

$$W' = \begin{bmatrix} 0.8 & 0 & 0.5 \\ 0 & 0.9 & -0.6 \\ 0 & 0.4 & 0 \end{bmatrix}$$

We removed 4 out of 9 weights. Compression ratio: $9/5 = 1.8\times$. If we stored only non-zero values plus 4-bit position indices, we'd need $5 \times (32 + 4) = 180$ bits instead of $9 \times 32 = 288$ bits.
In practice, you’d fine-tune the remaining weights for a few epochs to recover accuracy lost from pruning.
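The pruning step above is a one-liner in practice. Here is a minimal sketch of magnitude pruning with NumPy, using a target sparsity instead of a fixed threshold (the matrix size and sparsity level are illustrative):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)

kept = np.count_nonzero(w_pruned)
print(f"kept {kept}/{w.size} weights")   # ~10% of weights survive
```

In a real pipeline you would apply the mask layer by layer, then fine-tune the surviving weights (with the mask held fixed) to recover accuracy.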
The lottery ticket hypothesis
Frankle and Carbin (2019) proposed a striking idea: within a randomly initialized dense network, there exists a sparse subnetwork (the “winning ticket”) that, when trained in isolation from the same initialization, reaches the same accuracy as the full network.
The practical implication: you can find small networks that work just as well as large ones. The catch is that finding the winning ticket currently requires training the full network first, then pruning, then rewinding to the original initialization and retraining. This is expensive, but it tells us something deep about overparameterization.
Quantization: fewer bits per weight
Standard neural networks use 32-bit floating point (float32) for weights and activations. Quantization reduces this to 16-bit, 8-bit, or even lower. Fewer bits means less memory, faster computation, and lower energy.
Post-training quantization (PTQ): take a trained float32 model and convert weights to int8. No retraining needed. Simple but can lose accuracy, especially at very low bit widths.
The mapping from float32 to int8 (affine quantization):

$$x_q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; -128,\; 127\right), \qquad s = \frac{x_{\max} - x_{\min}}{255}$$

where $s$ is the scale and $z$ is the zero point, the integer that the real value 0 maps to.
Quantization-aware training (QAT): simulate quantization during training. Forward passes use quantized values; backward passes use the straight-through estimator (gradients flow through the rounding operation as if it were the identity). This gives the network a chance to adapt to the quantization noise.
Mixed-precision: use lower precision (float16 or int8) where it doesn’t hurt and full precision where it does. Typically, the first and last layers are kept at higher precision because they handle raw inputs and final logits.
Key numbers to remember:
- float32 to float16: 2x memory reduction, minimal accuracy loss
- float32 to int8: 4x memory reduction, usually < 1% accuracy loss with QAT
- float32 to int4: 8x memory reduction, requires careful calibration
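To make the int8 case concrete, here is a minimal sketch of symmetric per-tensor post-training quantization (a single scale for the whole tensor, zero point fixed at 0; the tensor shape and distribution are illustrative):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric PTQ: map float32 values to int8 with one scale per tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
print("max abs error:", np.abs(w - w_hat).max())
print("bytes: float32 =", w.nbytes, ", int8 =", q.nbytes)   # 4x smaller
```

Real frameworks refine this with per-channel scales and calibration data to pick clipping ranges, but the core mapping is exactly this round-and-rescale.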
Knowledge distillation: teacher-student learning
Knowledge distillation trains a small “student” network to mimic a large “teacher” network. The key insight from Hinton et al. (2015): the teacher’s soft probability outputs contain more information than hard labels.
When a teacher classifies an image of a cat, its softmax output might be [0.7, 0.2, 0.1] for [cat, dog, car]. The hard label is just “cat.” But the soft output tells you that this image looks somewhat like a dog and not at all like a car. This “dark knowledge” helps the student learn better representations.
```mermaid
graph TD
    Input["Input x"] --> Teacher["Teacher (large model)"]
    Input --> Student["Student (small model)"]
    Teacher --> SoftT["Soft targets (temperature T)"]
    Student --> SoftS["Soft predictions (temperature T)"]
    SoftT --> KL["KL divergence loss"]
    SoftS --> KL
    Student --> HardS["Hard predictions"]
    Labels["True labels"] --> CE["Cross-entropy loss"]
    HardS --> CE
    KL --> Total["Total loss = α·KL + (1-α)·CE"]
    CE --> Total
```
The temperature parameter $T$ controls how soft the distributions are. At $T = 1$, you get the standard softmax. At higher $T$, the distribution becomes smoother, revealing more information about relative similarities:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Example 2: Knowledge distillation with temperature
Teacher logits: $z^t = [3.0, 1.0, 0.6]$

At $T = 1$ (standard softmax):

$$p = [0.816, 0.110, 0.074]$$

At $T = 4$ (soft targets):

$$p = [0.464, 0.281, 0.255]$$

Student logits: $z^s = [2.5, 1.3, 0.7]$

At $T = 4$:

$$q = [0.420, 0.311, 0.268]$$

KL divergence from student to teacher (at $T = 4$):

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} \approx 0.004$$

The KL divergence is small (≈0.004), meaning the student’s soft predictions are close to the teacher’s. Notice how $T = 4$ spreads the probability mass, making the teacher’s “dark knowledge” visible. The student can learn that class 2 (0.281) is more similar to the input than class 3 (0.255).
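The same computation can be sketched in a few lines of NumPy (the teacher/student logits and temperature here are illustrative):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature T; higher T gives a smoother distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [3.0, 1.0, 0.6]
student_logits = [2.5, 1.3, 0.7]
T = 4.0

p = softmax_T(teacher_logits, T)      # teacher soft targets
q = softmax_T(student_logits, T)      # student soft predictions
kl = np.sum(p * np.log(p / q))        # KL(p || q)

print(np.round(p, 3))   # ≈ [0.464, 0.281, 0.255]
print(np.round(q, 3))   # ≈ [0.420, 0.311, 0.268]
print(round(kl, 4))     # ≈ 0.004
```

In a full distillation loss this KL term is combined with the ordinary cross-entropy on hard labels, weighted by a mixing coefficient α (and conventionally scaled by $T^2$ to keep gradient magnitudes comparable across temperatures).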
Low-rank factorization
A weight matrix $W \in \mathbb{R}^{m \times n}$ can be approximated by smaller matrices using a truncated SVD:

$$W \approx U_r \Sigma_r V_r^\top$$

where $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r \in \mathbb{R}^{r \times r}$, $V_r \in \mathbb{R}^{n \times r}$, and $r \ll \min(m, n)$.

This replaces one layer with two smaller layers. The original layer computes $y = Wx$ at cost $O(mn)$. The factorized version computes $y = U_r (\Sigma_r (V_r^\top x))$ at cost $O(r(m + n))$.
Example 3: Low-rank approximation savings
Consider a $4 \times 4$ weight matrix $W$ with singular values $\sigma = (5.0, 3.0, 0.1, 0.05)$.

The first two singular values (5.0 and 3.0) capture most of the energy. The last two (0.1 and 0.05) are tiny. Keeping rank $r = 2$:

- Original parameters: $4 \times 4 = 16$
- Factorized: $U_r$ is $4 \times 2$ (8 values), $\Sigma_r$ is $2 \times 2$ (diagonal, so 2 values), $V_r$ is $4 \times 2$ (8 values). Total: $8 + 2 + 8 = 18$.

For this tiny matrix, factorization actually uses more parameters. The savings come with larger matrices.

Scaling to a realistic layer: $W$ is $1024 \times 1024$, rank $r = 100$.

- Original: $1024 \times 1024 = 1{,}048{,}576$ parameters
- Factorized: $100 \times 1024 + 100 + 100 \times 1024 = 204{,}900$ parameters
- Compression ratio: $1{,}048{,}576 / 204{,}900 \approx 5.1\times$
- Energy captured: $\sum_{i=1}^{100} \sigma_i^2 \,/\, \sum_{i=1}^{1024} \sigma_i^2 = 99.96\%$

So we keep 99.96% of the information with 5x fewer parameters. In practice, you’d merge $\Sigma_r$ into $U_r$ or $V_r$ to avoid the extra diagonal matrix.
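The whole procedure, including the merge of $\Sigma_r$ into one factor, fits in a short NumPy sketch. The matrix below is synthetic, built with an exponentially decaying spectrum to mimic the redundancy of a trained layer (the dimensions, rank, and decay rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 1024, 1024, 100

# Build a synthetic weight matrix W = U0 diag(s) V0^T with decaying spectrum.
U0, _ = np.linalg.qr(rng.normal(size=(m, m)))
V0, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.exp(-np.arange(n) / 50.0)          # decaying singular values
W = (U0 * s) @ V0.T

# Truncated SVD: keep the top-r components, merge Sigma into the left factor.
U, sv, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * sv[:r]                     # m x r
B = Vt[:r, :]                             # r x n

params_full = m * n
params_fact = A.size + B.size
energy = (sv[:r] ** 2).sum() / (sv ** 2).sum()

print("compression:", params_full / params_fact)   # ≈ 5.1x
print("energy kept:", energy)                      # ≈ 0.98 for this spectrum
print("rel. error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

The one-time SVD replaces the layer with two dense matmuls (`x @ B.T @ A.T` instead of `x @ W.T`), so the compressed layer runs on standard dense kernels; a few epochs of fine-tuning on `A` and `B` usually recovers any accuracy lost to truncation.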
Mobile architectures: depthwise separable convolutions
Instead of compressing an existing model, you can design efficient architectures from scratch. MobileNet uses depthwise separable convolutions, which factor a standard convolution into two steps:
- Depthwise convolution: apply one filter per input channel (no cross-channel mixing)
- Pointwise convolution: 1x1 convolution to mix channels
A standard convolution with a $K \times K$ kernel, $C_{in}$ input channels, $C_{out}$ output channels, and an $H \times W$ output costs:

$$K \cdot K \cdot C_{in} \cdot C_{out} \cdot H \cdot W$$

Depthwise separable convolution costs:

$$K \cdot K \cdot C_{in} \cdot H \cdot W \;+\; C_{in} \cdot C_{out} \cdot H \cdot W$$

The ratio:

$$\frac{K^2 C_{in} H W + C_{in} C_{out} H W}{K^2 C_{in} C_{out} H W} = \frac{1}{C_{out}} + \frac{1}{K^2}$$

For $K = 3$ and $C_{out} = 256$: $1/256 + 1/9 \approx 0.115$, a reduction factor of about 8.7. That’s roughly 8-9x fewer FLOPs.
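These cost formulas are easy to check with a small counting function; the layer dimensions below (a 3x3 convolution from 128 to 256 channels over a 56x56 feature map) are illustrative:

```python
def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulates for a standard KxK convolution over an HxW output."""
    return k * k * c_in * c_out * h * w

def dw_separable_flops(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w       # one KxK filter per input channel
    pointwise = c_in * c_out * h * w       # 1x1 convolution mixes channels
    return depthwise + pointwise

k, c_in, c_out, h, w = 3, 128, 256, 56, 56
standard = conv_flops(k, c_in, c_out, h, w)
separable = dw_separable_flops(k, c_in, c_out, h, w)

print(f"standard:  {standard:,}")                    # 924,844,032
print(f"separable: {separable:,}")                   # 106,373,120
print(f"reduction: {standard / separable:.1f}x")     # 8.7x
```

Note that the reduction factor matches $1/C_{out} + 1/K^2$ exactly, independent of the spatial size, since $H \cdot W$ cancels in the ratio.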
Model size vs accuracy after compression
Compression methods comparison
| Method | Compression ratio | Accuracy drop | Hardware friendly | Training needed |
|---|---|---|---|---|
| Unstructured pruning | 5-20x | 0.5-2% | ✗ Needs sparse support | Fine-tuning |
| Structured pruning | 2-5x | 1-3% | ✓ Standard dense ops | Fine-tuning |
| PTQ (int8) | 4x | 0.5-1% | ✓ Widely supported | None |
| QAT (int8) | 4x | < 0.5% | ✓ Widely supported | Full retraining |
| Knowledge distillation | 3-10x (model dependent) | 1-3% | ✓ Student is standard | Full training |
| Low-rank factorization | 2-5x | 1-2% | ✓ Standard dense ops | Fine-tuning |
| Depthwise separable | 8-9x FLOPs | Architecture dependent | ✓ Optimized on mobile | Full training |
Combining methods
These methods are not mutually exclusive. A common pipeline:
- Start with a large, accurate teacher model
- Use NAS or manual design for a small student architecture (perhaps with depthwise separable convolutions)
- Train the student with knowledge distillation
- Apply quantization-aware training
- Optionally prune and fine-tune
Each step gives an independent compression factor. A 4x from distillation, 4x from quantization, and 2x from pruning gives 32x total compression.
What comes next
With efficient models ready for deployment, we can tackle a different class of problems: data that lives on graphs rather than grids. Graph neural networks extend the ideas from CNNs and attention mechanisms to irregular, non-Euclidean structures like social networks, molecules, and knowledge graphs.