Neural architecture search
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Designing neural network architectures is one of the most time-consuming parts of deep learning research. People spend months trying different layer configurations, connection patterns, and operation choices. Neural architecture search (NAS) automates this: you define a search space of possible architectures, and an algorithm finds a good one for you.
Prerequisites
You should understand hyperparameter optimization methods (especially Bayesian optimization and Hyperband) and convolutional neural networks. Familiarity with attention mechanisms helps for understanding some modern search spaces.
The search space
Before searching, you need to define what you are searching over. A NAS search space has three components:
Operations: what each layer can do. Common choices include convolution with different kernel sizes (3x3, 5x5, 7x7), dilated convolutions, depthwise separable convolutions, pooling (max, average), skip connections (identity), and zero (no connection).
Connections: how layers connect to each other. This includes which previous layers feed into each new layer. Skip connections, dense connections, and branching patterns all live here.
Macrostructure: how many cells to stack, channel widths at each stage, and downsampling locations. Often the macrostructure is fixed by hand and only the cell structure is searched.
Most modern NAS methods search for a small repeating unit called a cell, then stack copies of that cell to build the full network. This reduces the search space dramatically while still producing diverse architectures.
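A toy encoding makes this concrete. The sketch below samples a random cell from a cell-based search space; the operation list, node count, and two-inputs-per-node convention are illustrative assumptions, not any specific paper's exact space:

```python
import random

# Toy cell-based search space (illustrative assumptions: 7 ops, 4 nodes,
# 2 cell inputs, each node picks 2 inputs from earlier nodes).
OPS = ["conv3x3", "conv5x5", "conv7x7", "dilconv3x3",
       "maxpool3x3", "avgpool3x3", "skip"]

def sample_cell(num_nodes=4, num_cell_inputs=2):
    """Each node picks 2 inputs from earlier nodes, with one op per edge."""
    cell = []
    for node in range(num_nodes):
        available = num_cell_inputs + node   # cell inputs + earlier nodes
        inputs = random.sample(range(available), 2)
        cell.append([(i, random.choice(OPS)) for i in inputs])
    return cell

cell = sample_cell()  # e.g. [[(0, 'skip'), (1, 'conv3x3')], ...]
```

The full network is then built by stacking copies of this cell, with the macrostructure (stage widths, downsampling points) fixed by hand.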
RL-based NAS: controller as RNN
The original NAS paper by Zoph and Le (2017) used reinforcement learning. A controller network (an RNN) generates architecture descriptions token by token. Each token specifies an operation or connection. The generated architecture is trained from scratch, and its validation accuracy becomes the reward signal for the controller.
```mermaid
graph LR
    A[Controller RNN samples architecture] --> B[Train child network]
    B --> C[Evaluate on validation set]
    C --> D[Use accuracy as reward]
    D --> E[Update controller with REINFORCE]
    E --> A
```
The controller learns a policy: a probability distribution over architectures. REINFORCE updates push probability toward architectures that achieved high validation accuracy.
The problem: this is incredibly expensive. The original NAS paper used 800 GPUs for 28 days. Each sampled architecture had to be trained from scratch. That’s roughly 22,400 GPU-days to find one architecture.
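The loop itself is simple; the cost comes entirely from training each child. Here is a minimal sketch where everything expensive is replaced by a stand-in: a 3-slot architecture, 4 candidate ops, and a fake reward function instead of child-network training (all of these are illustrative assumptions):

```python
import numpy as np

# Toy REINFORCE controller. The 3-slot architecture, 4 ops, and the
# fake reward are stand-ins for the real (expensive) pipeline.
rng = np.random.default_rng(0)
num_slots, num_ops, lr = 3, 4, 0.1
logits = np.zeros((num_slots, num_ops))  # controller's policy parameters

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fake_reward(arch):
    # Stand-in for "train child network, measure validation accuracy":
    # here op 2 is simply best in every slot.
    return sum(1.0 if op == 2 else 0.2 for op in arch) / len(arch)

for _ in range(1000):
    probs = softmax(logits)
    arch = [rng.choice(num_ops, p=probs[s]) for s in range(num_slots)]
    reward = fake_reward(arch)
    for s, a in enumerate(arch):
        grad = -probs[s]          # gradient of log pi(a) w.r.t. logits
        grad[a] += 1.0
        logits[s] += lr * reward * grad  # REINFORCE update

best = softmax(logits).argmax(axis=1)  # drifts toward op 2 in each slot
```

Swap `fake_reward` for "train a child network from scratch and measure validation accuracy" and you have the original method's cost problem: thousands of full training runs.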
Evolutionary NAS
Instead of RL, you can use evolutionary algorithms. Start with a population of random architectures. Evaluate each one. Select the best (tournament selection), apply mutations (add a layer, change an operation, rewire a connection), and repeat.
Mutations in NAS might include:
- Replacing one operation with another (e.g., conv3x3 becomes conv5x5)
- Adding or removing a connection between nodes
- Changing the input to a node
Real et al. (2019) showed that evolutionary NAS can match or beat RL-based NAS with similar computational budgets. Evolution has the advantage of maintaining diversity in the population, which helps avoid getting stuck in local optima.
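The loop is short enough to sketch directly. This is a toy aging-evolution loop in the style described above; `fitness` is a fake stand-in for validation accuracy that simply rewards one operation, and all sizes are illustrative:

```python
import random

# Toy aging-evolution loop. "fitness" is a fake stand-in for validation
# accuracy that simply rewards conv3x3 ops.
OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def fitness(arch):
    return sum(op == "conv3x3" for op in arch) / len(arch)

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)  # swap one op
    return child

random.seed(0)
population = [[random.choice(OPS) for _ in range(6)] for _ in range(20)]
for _ in range(300):
    contenders = random.sample(population, 5)    # tournament selection
    parent = max(contenders, key=fitness)
    population.append(mutate(parent))            # child joins the population
    population.pop(0)                            # oldest individual ages out
best = max(population, key=fitness)
```

The `pop(0)` line is the "aging" trick: individuals die by age rather than by fitness, which keeps the population from collapsing onto one early winner.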
DARTS: continuous relaxation
Differentiable Architecture Search (Liu et al., 2019) made NAS dramatically cheaper by making the search differentiable. Instead of sampling discrete architectures, DARTS puts all candidate operations in parallel and learns continuous weights over them.
For each edge in the cell, instead of picking one operation, you compute a weighted sum of all operations:

$$\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})} \, o(x)$$

where $\mathcal{O}$ is the set of candidate operations and $\alpha_o$ are architecture parameters. The softmax converts the raw $\alpha$ values into weights that sum to 1.
```mermaid
graph TD
    Input["Input x"] --> Conv3["conv 3×3"]
    Input --> Conv5["conv 5×5"]
    Input --> Skip["skip connect"]
    Conv3 --> Mix["Weighted sum (softmax of α)"]
    Conv5 --> Mix
    Skip --> Mix
    Mix --> Output["Output"]
```
Training alternates between two sets of parameters:
- Update network weights $w$ by gradient descent on the training loss
- Update architecture parameters $\alpha$ by gradient descent on the validation loss
After search, you discretize: for each edge, keep only the operation with the highest $\alpha$. Then retrain the discrete architecture from scratch.
DARTS reduces search cost from thousands of GPU-days to about 1-4 GPU-days. The tradeoff is that the continuous relaxation is an approximation: the best architecture in the relaxed space might not be the best after discretization.
Example 1: DARTS mixed operation
A cell edge has 3 candidate operations: conv3x3, conv5x5, and skip-connect.
Architecture parameters: $\alpha = (0.2,\ 0.0,\ -0.1)$.
Softmax weights:
$$w = \mathrm{softmax}(\alpha) \approx (0.391,\ 0.320,\ 0.289)$$
Now suppose input $x$ produces:
- conv3x3 output: 0.8
- conv5x5 output: 0.7
- skip output: 1.0
Mixed output:
$$\bar{o}(x) \approx 0.391 \times 0.8 + 0.320 \times 0.7 + 0.289 \times 1.0 \approx 0.826$$
After search, conv3x3 has the highest weight (0.391), so we’d select it as the final operation for this edge.
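The arithmetic is easy to check in NumPy. The architecture parameters below are assumed example values, chosen to reproduce the 0.391 conv3x3 weight quoted in the example:

```python
import numpy as np

# Assumed example parameters (chosen to reproduce the 0.391 weight
# for conv3x3 from the worked example).
alpha = np.array([0.2, 0.0, -0.1])     # conv3x3, conv5x5, skip
outputs = np.array([0.8, 0.7, 1.0])    # per-operation outputs for this input

weights = np.exp(alpha) / np.exp(alpha).sum()   # softmax over alpha
mixed = float((weights * outputs).sum())        # DARTS mixed-op output

print(weights.round(3))   # approx [0.391, 0.320, 0.289]
print(round(mixed, 3))    # approx 0.826
```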
One-shot NAS and weight sharing
One-shot NAS trains a single “supernet” that contains all possible architectures as subnetworks. During search, you evaluate architectures by extracting their subnetwork from the supernet, avoiding retraining from scratch.
The key idea is weight sharing: all architectures share the same weight parameters. When you evaluate architecture A, you use the same conv3x3 weights that architecture B would use. This makes evaluation nearly free (just a forward pass) but introduces a bias: weights trained for the full supernet may not be optimal for any individual subnetwork.
After finding a good architecture, you retrain it from scratch with fresh weights. The supernet weights are only used during the search phase.
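A minimal sketch of the sharing mechanism, using untrained random matrices purely for illustration (a real supernet's shared weights would be trained; nothing here is):

```python
import numpy as np

# Weight-sharing sketch: one shared weight matrix per (layer, op).
# Evaluating any architecture just indexes into the shared pool.
rng = np.random.default_rng(0)
num_layers, num_ops, dim = 3, 4, 8
shared = rng.normal(size=(num_layers, num_ops, dim, dim))  # supernet weights

def evaluate(arch, x):
    # arch: one op index per layer; every architecture reuses 'shared'
    for layer, op in enumerate(arch):
        x = np.tanh(shared[layer, op] @ x)
    return x

x = rng.normal(size=dim)
out_a = evaluate([0, 2, 1], x)   # architecture A: just a forward pass
out_b = evaluate([3, 2, 0], x)   # architecture B reuses layer-1 op-2 weights
```

Both evaluations cost one forward pass each, which is the whole point; the bias is that `shared[1, 2]` was never trained specifically for either architecture.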
Example 2: Search space size
Consider a cell with 4 intermediate nodes. Each node takes input from 2 of the previous nodes (including the cell inputs). For each chosen pair of inputs, there are 7 possible operations.
Node 1 can choose from 2 input nodes (the 2 cell inputs), picking 2 from 2: $\binom{2}{2} = 1$ way. With 7 operations per input edge: $7^2 = 49$ choices.
Node 2 can choose 2 inputs from 3 available nodes: $\binom{3}{2} = 3$ ways. Operations: $7^2 = 49$. Total: $3 \times 49 = 147$.
Node 3: $\binom{4}{2} = 6$ input pairs, $49$ operations each. Total: $6 \times 49 = 294$.
Node 4: $\binom{5}{2} = 10$ input pairs, $49$ operations each. Total: $10 \times 49 = 490$.
Total architectures for one cell:
$$49 \times 147 \times 294 \times 490 = 1{,}037{,}664{,}180 \approx 1.04 \times 10^9$$
That’s over a billion possible cell architectures. And this is just for one cell type. If you search for both a normal cell and a reduction cell independently, the space squares to roughly $10^{18}$. This is why exhaustive search is impossible and why smart search strategies matter.
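The count is quick to verify programmatically:

```python
import math

# Recompute the cell count: 4 nodes, node i picks 2 inputs from (2 + i)
# earlier nodes, 7 candidate ops per chosen edge.
total = 1
for node in range(4):
    total *= math.comb(2 + node, 2) * 7 ** 2

print(total)       # 1037664180 -- just over a billion for one cell
print(total ** 2)  # ~1.1e18 if normal and reduction cells are independent
```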
Efficiency constraints: FLOPs and latency
Real deployment cares about more than accuracy. You need architectures that fit within a latency budget on target hardware. Modern NAS methods add hardware constraints directly into the search.
FLOPs-aware NAS: add a penalty term for computational cost. The loss becomes:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \text{FLOPs}(a)$$

where $\lambda$ controls the accuracy-efficiency tradeoff.
Latency-aware NAS: build a lookup table mapping each operation to its measured latency on target hardware. The total latency of an architecture is the sum of operation latencies along the critical path.
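For a sequential network, the lookup-table estimate is just a dictionary and a sum. The per-operation latencies below are made-up numbers for illustration; in practice they come from benchmarking each op on the target device:

```python
# Latency lookup table (per-op numbers in ms are made up for illustration;
# real tables are measured on the target hardware).
latency_ms = {"conv3x3": 1.8, "conv5x5": 3.1, "sepconv3x3": 0.9,
              "maxpool3x3": 0.4, "skip": 0.0}

arch = ["conv3x3", "sepconv3x3", "maxpool3x3", "conv3x3", "skip"]
total_latency = sum(latency_ms[op] for op in arch)  # layers run sequentially

print(total_latency)  # ~4.9 ms
```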
Multi-objective NAS: use Pareto optimization. Instead of a single best architecture, find the Pareto front of architectures that trade off accuracy against efficiency. No architecture on the front can improve one metric without worsening the other.
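Computing the front is straightforward. Here is a sketch over a handful of made-up (accuracy, latency) candidates:

```python
# Pareto front over (accuracy, latency): keep non-dominated architectures.
# Candidate numbers are made up for illustration.
candidates = [
    ("A", 0.76, 15.0),   # (name, accuracy, latency in ms)
    ("B", 0.74, 8.0),
    ("C", 0.73, 12.0),   # dominated by B: less accurate AND slower
    ("D", 0.78, 30.0),
]

def dominated(x, y):
    # y dominates x: at least as accurate and as fast, strictly better in one
    return y[1] >= x[1] and y[2] <= x[2] and (y[1] > x[1] or y[2] < x[2])

front = [x for x in candidates
         if not any(dominated(x, y) for y in candidates)]
print([name for name, _, _ in front])   # ['A', 'B', 'D'] -- C is dominated
```

Every architecture on the front is a defensible choice; which one you deploy depends on your latency budget, not on the search algorithm.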
Results: NASNet and EfficientNet
Architecture search results: top-5 candidates by validation accuracy
NASNet (Zoph et al., 2018) searched for cells on CIFAR-10, then transferred the cell design to ImageNet by stacking more copies. The discovered architecture matched hand-designed networks while demonstrating that transferable cells work across datasets.
EfficientNet (Tan and Le, 2019) combined NAS with a compound scaling method. The idea: instead of scaling only depth, only width, or only resolution, scale all three together using a fixed ratio.
Given a baseline network with depth $d_0$, width $w_0$, and resolution $r_0$, EfficientNet scales them as:

$$d = \alpha^\phi \cdot d_0, \quad w = \beta^\phi \cdot w_0, \quad r = \gamma^\phi \cdot r_0$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$ (so FLOPs roughly double with each increment of $\phi$).
Example 3: EfficientNet scaling
Baseline: EfficientNet-B0. Scaling coefficients found by small grid search: $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$.
Check the constraint: $1.2 \times 1.1^2 \times 1.15^2 = 1.2 \times 1.21 \times 1.3225 \approx 1.92 \approx 2$. ✓
For $\phi = 2$:
$$d = 1.2^2 = 1.44, \quad w = 1.1^2 = 1.21, \quad r = 1.15^2 \approx 1.32$$
So at $\phi = 2$, the network is 1.44 times deeper, 1.21 times wider, and uses 1.32 times higher resolution input. FLOPs scale by approximately $2^2 = 4$ times the baseline.
In practice, the depth gets rounded to whole numbers of layers and the width to multiples of 8 (for hardware efficiency).
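The scaling arithmetic can be reproduced in a few lines, using the coefficients $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ reported in the EfficientNet paper:

```python
# Compound scaling with the EfficientNet paper's coefficients.
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution

def scale(phi):
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale(2)
print(round(d, 2), round(w, 2), round(r, 2))   # 1.44 1.21 1.32

# FLOPs grow as d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi:
flops_factor = (alpha * beta**2 * gamma**2) ** 2
print(round(flops_factor, 2))   # ~3.69, close to the nominal 2**2 = 4
```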
NAS methods comparison
| Method | Search strategy | Cost (GPU-days) | Retraining needed | Handles constraints | Key strength |
|---|---|---|---|---|---|
| NAS (RL) | REINFORCE | ~22,400 | ✓ | ✗ | First to work |
| NASNet | RL + cell transfer | ~1,800 | ✓ | ✗ | Transferable cells |
| AmoebaNet | Evolution | ~3,150 | ✓ | ✗ | Robust search |
| DARTS | Gradient-based | ~1-4 | ✓ | With penalty | Very fast |
| One-shot | Weight sharing | ~1-2 | ✓ | ✓ | Cheapest search |
| MnasNet | RL + latency | ~40,000 | ✓ | ✓ Latency | Hardware-aware |
| EfficientNAS | RL + compound | ~3,800 | ✓ | ✓ FLOPs | Best accuracy/FLOPs |
Limitations and open problems
NAS has known issues you should be aware of:
- Reproducibility: many NAS papers show high variance across runs. Small changes in the search procedure can lead to very different architectures.
- Search space bias: the search space is designed by humans. If the optimal architecture isn’t in your search space, NAS can’t find it.
- Supernet gap: in one-shot methods, the ranking of architectures in the supernet doesn’t always match the ranking after retraining.
- Transfer across tasks: cells found on CIFAR-10 don’t always transfer well to different domains like medical imaging or NLP.
Despite these issues, NAS has produced genuinely useful architectures. EfficientNet was the ImageNet state of the art for its FLOPs budget. NAS-discovered cells are used in production systems at Google and other companies.
What comes next
Finding the best architecture is one challenge. Making that architecture small and fast enough to deploy on real hardware is another. Network compression and efficient inference covers pruning, quantization, knowledge distillation, and other techniques for shrinking models without losing too much accuracy.