Neural architecture search
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Designing neural network architectures is one of the most time-consuming parts of deep learning research. People spend months trying different layer configurations, connection patterns, and operation choices. Neural architecture search (NAS) automates this: you define a search space of possible architectures, and an algorithm finds a good one for you.
Prerequisites
You should understand hyperparameter optimization methods (especially Bayesian optimization and Hyperband) and convolutional neural networks. Familiarity with attention mechanisms helps for understanding some modern search spaces.
The search space
Before searching, you need to define what you are searching over. A NAS search space has three components:
Operations: what each layer can do. Common choices include convolution with different kernel sizes (3x3, 5x5, 7x7), dilated convolutions, depthwise separable convolutions, pooling (max, average), skip connections (identity), and zero (no connection).
Connections: how layers connect to each other. This includes which previous layers feed into each new layer. Skip connections, dense connections, and branching patterns all live here.
Macrostructure: how many cells to stack, channel widths at each stage, and downsampling locations. Often the macrostructure is fixed by hand and only the cell structure is searched.
Most modern NAS methods search for a small repeating unit called a cell, then stack copies of that cell to build the full network. This reduces the search space dramatically while still producing diverse architectures.
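A toy encoding makes this concrete. The sketch below samples a random cell from a cell-based search space; the operation list, node count, and two-inputs-per-node convention are illustrative assumptions, not any specific paper's exact space:

```python
import random

# Toy cell-based search space (illustrative assumptions: 7 ops, 4 nodes,
# 2 cell inputs, each node picks 2 inputs from earlier nodes).
OPS = ["conv3x3", "conv5x5", "conv7x7", "dilconv3x3",
       "maxpool3x3", "avgpool3x3", "skip"]

def sample_cell(num_nodes=4, num_cell_inputs=2):
    """Each node picks 2 inputs from earlier nodes, with one op per edge."""
    cell = []
    for node in range(num_nodes):
        available = num_cell_inputs + node   # cell inputs + earlier nodes
        inputs = random.sample(range(available), 2)
        cell.append([(i, random.choice(OPS)) for i in inputs])
    return cell

cell = sample_cell()  # e.g. [[(0, 'skip'), (1, 'conv3x3')], ...]
```

The full network is then built by stacking copies of this cell, with the macrostructure (stage widths, downsampling points) fixed by hand.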
RL-based NAS: controller as RNN
The original NAS paper by Zoph and Le (2017) used reinforcement learning. A controller network (an RNN) generates architecture descriptions token by token. Each token specifies an operation or connection. The generated architecture is trained from scratch, and its validation accuracy becomes the reward signal for the controller.
```mermaid
graph LR
    A[Controller RNN samples architecture] --> B[Train child network]
    B --> C[Evaluate on validation set]
    C --> D[Use accuracy as reward]
    D --> E[Update controller with REINFORCE]
    E --> A
```
The controller learns a policy: a probability distribution over architectures. REINFORCE updates push probability toward architectures that achieved high validation accuracy.
The problem: this is incredibly expensive. The original NAS paper used 800 GPUs for 28 days. Each sampled architecture had to be trained from scratch. That’s roughly 22,400 GPU-days to find one architecture.
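The loop itself is simple; the cost comes entirely from training each child. Here is a minimal sketch where everything expensive is replaced by a stand-in: a 3-slot architecture, 4 candidate ops, and a fake reward function instead of child-network training (all of these are illustrative assumptions):

```python
import numpy as np

# Toy REINFORCE controller. The 3-slot architecture, 4 ops, and the
# fake reward are stand-ins for the real (expensive) pipeline.
rng = np.random.default_rng(0)
num_slots, num_ops, lr = 3, 4, 0.1
logits = np.zeros((num_slots, num_ops))  # controller's policy parameters

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fake_reward(arch):
    # Stand-in for "train child network, measure validation accuracy":
    # here op 2 is simply best in every slot.
    return sum(1.0 if op == 2 else 0.2 for op in arch) / len(arch)

for _ in range(1000):
    probs = softmax(logits)
    arch = [rng.choice(num_ops, p=probs[s]) for s in range(num_slots)]
    reward = fake_reward(arch)
    for s, a in enumerate(arch):
        grad = -probs[s]          # gradient of log pi(a) w.r.t. logits
        grad[a] += 1.0
        logits[s] += lr * reward * grad  # REINFORCE update

best = softmax(logits).argmax(axis=1)  # drifts toward op 2 in each slot
```

Swap `fake_reward` for "train a child network from scratch and measure validation accuracy" and you have the original method's cost problem: thousands of full training runs.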
Evolutionary NAS
Instead of RL, you can use evolutionary algorithms. Start with a population of random architectures. Evaluate each one. Select the best (tournament selection), apply mutations (add a layer, change an operation, rewire a connection), and repeat.
Mutations in NAS might include:
- Replacing one operation with another (e.g., conv3x3 becomes conv5x5)
- Adding or removing a connection between nodes
- Changing the input to a node
Real et al. (2019) showed that evolutionary NAS can match or beat RL-based NAS with similar computational budgets. Evolution has the advantage of maintaining diversity in the population, which helps avoid getting stuck in local optima.
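The loop is short enough to sketch directly. This is a toy aging-evolution loop in the style described above; `fitness` is a fake stand-in for validation accuracy that simply rewards one operation, and all sizes are illustrative:

```python
import random

# Toy aging-evolution loop. "fitness" is a fake stand-in for validation
# accuracy that simply rewards conv3x3 ops.
OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def fitness(arch):
    return sum(op == "conv3x3" for op in arch) / len(arch)

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)  # swap one op
    return child

random.seed(0)
population = [[random.choice(OPS) for _ in range(6)] for _ in range(20)]
for _ in range(300):
    contenders = random.sample(population, 5)    # tournament selection
    parent = max(contenders, key=fitness)
    population.append(mutate(parent))            # child joins the population
    population.pop(0)                            # oldest individual ages out
best = max(population, key=fitness)
```

The `pop(0)` line is the "aging" trick: individuals die by age rather than by fitness, which keeps the population from collapsing onto one early winner.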
DARTS: continuous relaxation
Differentiable Architecture Search (Liu et al., 2019) made NAS dramatically cheaper by making the search differentiable. Instead of sampling discrete architectures, DARTS puts all candidate operations in parallel and learns continuous weights over them.
For each edge in the cell, instead of picking one operation, you compute a weighted sum of all operations:

$$\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})} \, o(x)$$

where $\mathcal{O}$ is the set of candidate operations and $\alpha_o$ are architecture parameters. The softmax converts the raw $\alpha$ values into weights that sum to 1.
```mermaid
graph TD
    Input["Input x"] --> Conv3["conv 3×3"]
    Input --> Conv5["conv 5×5"]
    Input --> Skip["skip connect"]
    Conv3 --> Mix["Weighted sum (softmax of α)"]
    Conv5 --> Mix
    Skip --> Mix
    Mix --> Output["Output"]
```
Training alternates between two sets of parameters:
- Update network weights $w$ by gradient descent on the training loss
- Update architecture parameters $\alpha$ by gradient descent on the validation loss
After search, you discretize: for each edge, keep only the operation with the highest $\alpha$. Then retrain the discrete architecture from scratch.
DARTS reduces search cost from thousands of GPU-days to about 1-4 GPU-days. The tradeoff is that the continuous relaxation is an approximation: the best architecture in the relaxed space might not be the best after discretization.
Example 1: DARTS mixed operation
A cell edge has 3 candidate operations: conv3x3, conv5x5, and skip-connect.
Architecture parameters: $\alpha = (0.2,\ 0.0,\ -0.1)$.
Softmax weights:
$$w = \mathrm{softmax}(\alpha) \approx (0.391,\ 0.320,\ 0.289)$$
Now suppose input $x$ produces:
- conv3x3 output: 0.8
- conv5x5 output: 0.7
- skip output: 1.0
Mixed output:
$$\bar{o}(x) \approx 0.391 \times 0.8 + 0.320 \times 0.7 + 0.289 \times 1.0 \approx 0.826$$
After search, conv3x3 has the highest weight (0.391), so we’d select it as the final operation for this edge.
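The arithmetic is easy to check in NumPy. The architecture parameters below are assumed example values, chosen to reproduce the 0.391 conv3x3 weight quoted in the example:

```python
import numpy as np

# Assumed example parameters (chosen to reproduce the 0.391 weight
# for conv3x3 from the worked example).
alpha = np.array([0.2, 0.0, -0.1])     # conv3x3, conv5x5, skip
outputs = np.array([0.8, 0.7, 1.0])    # per-operation outputs for this input

weights = np.exp(alpha) / np.exp(alpha).sum()   # softmax over alpha
mixed = float((weights * outputs).sum())        # DARTS mixed-op output

print(weights.round(3))   # approx [0.391, 0.320, 0.289]
print(round(mixed, 3))    # approx 0.826
```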
One-shot NAS and weight sharing
One-shot NAS trains a single “supernet” that contains all possible architectures as subnetworks. During search, you evaluate architectures by extracting their subnetwork from the supernet, avoiding retraining from scratch.
The key idea is weight sharing: all architectures share the same weight parameters. When you evaluate architecture A, you use the same conv3x3 weights that architecture B would use. This makes evaluation nearly free (just a forward pass) but introduces a bias: weights trained for the full supernet may not be optimal for any individual subnetwork.
After finding a good architecture, you retrain it from scratch with fresh weights. The supernet weights are only used during the search phase.
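A minimal sketch of the sharing mechanism, using untrained random matrices purely for illustration (a real supernet's shared weights would be trained; nothing here is):

```python
import numpy as np

# Weight-sharing sketch: one shared weight matrix per (layer, op).
# Evaluating any architecture just indexes into the shared pool.
rng = np.random.default_rng(0)
num_layers, num_ops, dim = 3, 4, 8
shared = rng.normal(size=(num_layers, num_ops, dim, dim))  # supernet weights

def evaluate(arch, x):
    # arch: one op index per layer; every architecture reuses 'shared'
    for layer, op in enumerate(arch):
        x = np.tanh(shared[layer, op] @ x)
    return x

x = rng.normal(size=dim)
out_a = evaluate([0, 2, 1], x)   # architecture A: just a forward pass
out_b = evaluate([3, 2, 0], x)   # architecture B reuses layer-1 op-2 weights
```

Both evaluations cost one forward pass each, which is the whole point; the bias is that `shared[1, 2]` was never trained specifically for either architecture.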
Example 2: Search space size
Consider a cell with 4 intermediate nodes. Each node takes input from 2 of the previous nodes (including the cell inputs). For each chosen pair of inputs, there are 7 possible operations.
Node 1 can choose from 2 input nodes (the 2 cell inputs), picking 2 from 2: $\binom{2}{2} = 1$ way. With 7 operations per input edge: $7^2 = 49$ choices.
Node 2 can choose 2 inputs from 3 available nodes: $\binom{3}{2} = 3$ ways. Operations: $7^2 = 49$. Total: $3 \times 49 = 147$.
Node 3: $\binom{4}{2} = 6$ input pairs, $49$ operations each. Total: $6 \times 49 = 294$.
Node 4: $\binom{5}{2} = 10$ input pairs, $49$ operations each. Total: $10 \times 49 = 490$.
Total architectures for one cell:
$$49 \times 147 \times 294 \times 490 = 1{,}037{,}664{,}180 \approx 1.04 \times 10^9$$
That’s over a billion possible cell architectures. And this is just for one cell type. If you search for both a normal cell and a reduction cell independently, the space squares to roughly $10^{18}$. This is why exhaustive search is impossible and why smart search strategies matter.
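The count is quick to verify programmatically:

```python
import math

# Recompute the cell count: 4 nodes, node i picks 2 inputs from (2 + i)
# earlier nodes, 7 candidate ops per chosen edge.
total = 1
for node in range(4):
    total *= math.comb(2 + node, 2) * 7 ** 2

print(total)       # 1037664180 -- just over a billion for one cell
print(total ** 2)  # ~1.1e18 if normal and reduction cells are independent
```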
Efficiency constraints: FLOPs and latency
Real deployment cares about more than accuracy. You need architectures that fit within a latency budget on target hardware. Modern NAS methods add hardware constraints directly into the search.
FLOPs-aware NAS: add a penalty term for computational cost. The loss becomes:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \text{FLOPs}(a)$$

where $\lambda$ controls the accuracy-efficiency tradeoff.
Latency-aware NAS: build a lookup table mapping each operation to its measured latency on target hardware. The total latency of an architecture is the sum of operation latencies along the critical path.
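For a sequential network, the lookup-table estimate is just a dictionary and a sum. The per-operation latencies below are made-up numbers for illustration; in practice they come from benchmarking each op on the target device:

```python
# Latency lookup table (per-op numbers in ms are made up for illustration;
# real tables are measured on the target hardware).
latency_ms = {"conv3x3": 1.8, "conv5x5": 3.1, "sepconv3x3": 0.9,
              "maxpool3x3": 0.4, "skip": 0.0}

arch = ["conv3x3", "sepconv3x3", "maxpool3x3", "conv3x3", "skip"]
total_latency = sum(latency_ms[op] for op in arch)  # layers run sequentially

print(total_latency)  # ~4.9 ms
```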
Multi-objective NAS: use Pareto optimization. Instead of a single best architecture, find the Pareto front of architectures that trade off accuracy against efficiency. No architecture on the front can improve one metric without worsening the other.
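Computing the front is straightforward. Here is a sketch over a handful of made-up (accuracy, latency) candidates:

```python
# Pareto front over (accuracy, latency): keep non-dominated architectures.
# Candidate numbers are made up for illustration.
candidates = [
    ("A", 0.76, 15.0),   # (name, accuracy, latency in ms)
    ("B", 0.74, 8.0),
    ("C", 0.73, 12.0),   # dominated by B: less accurate AND slower
    ("D", 0.78, 30.0),
]

def dominated(x, y):
    # y dominates x: at least as accurate and as fast, strictly better in one
    return y[1] >= x[1] and y[2] <= x[2] and (y[1] > x[1] or y[2] < x[2])

front = [x for x in candidates
         if not any(dominated(x, y) for y in candidates)]
print([name for name, _, _ in front])   # ['A', 'B', 'D'] -- C is dominated
```

Every architecture on the front is a defensible choice; which one you deploy depends on your latency budget, not on the search algorithm.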
Results: NASNet and EfficientNet
Architecture search results: top-5 candidates by validation accuracy
NASNet (Zoph et al., 2018) searched for cells on CIFAR-10, then transferred the cell design to ImageNet by stacking more copies. The discovered architecture matched hand-designed networks while demonstrating that transferable cells work across datasets.
EfficientNet (Tan and Le, 2019) combined NAS with a compound scaling method. The idea: instead of scaling only depth, only width, or only resolution, scale all three together using a fixed ratio.
Given a baseline network with depth $d_0$, width $w_0$, and resolution $r_0$, EfficientNet scales them as:

$$d = \alpha^\phi \cdot d_0, \quad w = \beta^\phi \cdot w_0, \quad r = \gamma^\phi \cdot r_0$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$ (so FLOPs roughly double with each increment of $\phi$).
Example 3: EfficientNet scaling
Baseline: EfficientNet-B0. Scaling coefficients found by small grid search: $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$.
Check the constraint: $1.2 \times 1.1^2 \times 1.15^2 = 1.2 \times 1.21 \times 1.3225 \approx 1.92 \approx 2$. ✓
For $\phi = 2$:
$$d = 1.2^2 = 1.44, \quad w = 1.1^2 = 1.21, \quad r = 1.15^2 \approx 1.32$$
So at $\phi = 2$, the network is 1.44 times deeper, 1.21 times wider, and uses 1.32 times higher resolution input. FLOPs scale by approximately $2^2 = 4$ times the baseline.
In practice, the depth gets rounded to whole numbers of layers and the width to multiples of 8 (for hardware efficiency).
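The scaling arithmetic can be reproduced in a few lines, using the coefficients $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ reported in the EfficientNet paper:

```python
# Compound scaling with the EfficientNet paper's coefficients.
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution

def scale(phi):
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale(2)
print(round(d, 2), round(w, 2), round(r, 2))   # 1.44 1.21 1.32

# FLOPs grow as d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi:
flops_factor = (alpha * beta**2 * gamma**2) ** 2
print(round(flops_factor, 2))   # ~3.69, close to the nominal 2**2 = 4
```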
NAS methods comparison
| Method | Search strategy | Cost (GPU-days) | Retraining needed | Handles constraints | Key strength |
|---|---|---|---|---|---|
| NAS (RL) | REINFORCE | ~22,400 | ✓ | ✗ | First to work |
| NASNet | RL + cell transfer | ~1,800 | ✓ | ✗ | Transferable cells |
| AmoebaNet | Evolution | ~3,150 | ✓ | ✗ | Robust search |
| DARTS | Gradient-based | ~1-4 | ✓ | With penalty | Very fast |
| One-shot | Weight sharing | ~1-2 | ✓ | ✓ | Cheapest search |
| MnasNet | RL + latency | ~40,000 | ✓ | ✓ Latency | Hardware-aware |
| EfficientNAS | RL + compound | ~3,800 | ✓ | ✓ FLOPs | Best accuracy/FLOPs |
Limitations and open problems
NAS has known issues you should be aware of:
- Reproducibility: many NAS papers show high variance across runs. Small changes in the search procedure can lead to very different architectures.
- Search space bias: the search space is designed by humans. If the optimal architecture isn’t in your search space, NAS can’t find it.
- Supernet gap: in one-shot methods, the ranking of architectures in the supernet doesn’t always match the ranking after retraining.
- Transfer across tasks: cells found on CIFAR-10 don’t always transfer well to different domains like medical imaging or NLP.
Despite these issues, NAS has produced genuinely useful architectures. EfficientNet was the ImageNet state of the art for its FLOPs budget. NAS-discovered cells are used in production systems at Google and other companies.
What comes next
Finding the best architecture is one challenge. Making that architecture small and fast enough to deploy on real hardware is another. Network compression and efficient inference covers pruning, quantization, knowledge distillation, and other techniques for shrinking models without losing too much accuracy.