
AutoML and hyperparameter optimization

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Training a neural network involves two kinds of decisions. The model learns weights through gradient descent. But learning rate, batch size, number of layers, dropout rate, weight decay: you set these before training starts. These are hyperparameters, and picking them well can mean the difference between a model that barely works and one that reaches state of the art.

Prerequisites

You should be familiar with training neural networks and model selection with cross-validation. Understanding gradient descent and SGD variants helps with the intuition behind search strategies.

The tuning problem

You have a model that works, but there are dozens of settings to tune: learning rate, batch size, number of layers, dropout rate, weight decay. Each setting interacts with the others. A learning rate that works with batch size 32 might fail with batch size 256. Manual tuning is slow, unreliable, and does not scale.

Hyperparameter | Typical Range | Why It Matters
Learning rate | 0.0001 to 0.1 | Too high causes divergence, too low wastes time
Batch size | 16 to 512 | Affects gradient noise and generalization
Number of layers | 2 to 50+ | Controls model capacity and depth of feature hierarchy
Dropout rate | 0.0 to 0.5 | Prevents overfitting but hurts capacity if too high
Weight decay | 0.0001 to 0.01 | Shrinks weights to reduce overfitting

With 5 hyperparameters and 10 candidate values each, the search space has 10^5 = 100,000 combinations. You cannot try them all. AutoML methods search this space intelligently.

The AutoML loop

graph TD
  A["Define Search Space"] --> B["Pick Configuration"]
  B --> C["Train and Evaluate"]
  C --> D["Record Result"]
  D --> E["Choose Next Configuration"]
  E --> B

  style A fill:#4a9eff,color:#fff
  style C fill:#ff9,stroke:#333,color:#000

Every AutoML method follows this loop. They differ in how they choose the next configuration. Grid search tries every point on a grid. Random search samples randomly. Bayesian optimization uses past results to pick promising candidates.
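The loop above fits in a few lines of code: the only part that changes between methods is how the next configuration is proposed. A minimal sketch (the function names and signatures here are illustrative, not any particular library's API):

```python
def automl_loop(search_space, propose, evaluate, n_trials):
    """Generic AutoML loop. `propose` picks the next configuration
    from the space and the history; swapping it turns the same loop
    into grid, random, or Bayesian search."""
    history = []
    for _ in range(n_trials):
        config = propose(search_space, history)  # grid / random / Bayesian
        loss = evaluate(config)                  # train, measure validation loss
        history.append((config, loss))           # record the result
    return min(history, key=lambda t: t[1])      # best (config, loss) pair
```

Everything that follows in this article is a different implementation of `propose` (and, for Hyperband, a different budget policy inside `evaluate`).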

Now let’s formalize each approach.

What hyperparameters are and why they matter

A hyperparameter is any setting you choose before training. Weights are learned; hyperparameters are chosen. Common examples:

  • Learning rate: too high and training diverges, too low and it crawls
  • Batch size: affects gradient noise and generalization
  • Architecture choices: number of layers, hidden units, kernel sizes
  • Regularization: dropout rate, weight decay strength
  • Optimizer settings: momentum, beta values for Adam

The problem is that these settings interact in complex ways, so tuning them one at a time rarely finds the best combination. AutoML methods automate this search.

Grid search: exhaustive but expensive

The simplest approach. You define a set of values for each hyperparameter, then evaluate every combination.

Setup: pick a discrete set for each hyperparameter and train on every point in the Cartesian product.

For two hyperparameters with 5 values each, you get 5 × 5 = 25 evaluations. For three hyperparameters with 5 values each, that's 5^3 = 125. The cost grows exponentially with the number of hyperparameters.

Grid search has one advantage: it is thorough within its grid. But it wastes enormous budget on unimportant hyperparameters, and it misses values between grid points.
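Grid search is short to implement with `itertools.product`. A sketch, where the search space and the `evaluate` callback (which would train a model and return validation loss) are hypothetical stand-ins:

```python
from itertools import product

def grid_search(space, evaluate):
    """Evaluate every combination in the Cartesian product of the grid.
    `evaluate` returns validation loss (lower is better)."""
    keys = list(space)
    best_loss, best_config = float("inf"), None
    for values in product(*space.values()):
        config = dict(zip(keys, values))
        loss = evaluate(config)
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_config, best_loss

# 4 learning rates x 2 batch sizes = 8 training runs
space = {"lr": [1e-4, 1e-3, 1e-2, 1e-1], "batch_size": [32, 128]}
```

Note that adding a third hyperparameter with 4 values multiplies the run count from 8 to 32: this is the exponential growth described above.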

Random search: why it beats grid

Bergstra and Bengio (2012) showed a simple but powerful result: random search finds good hyperparameters faster than grid search in most practical settings.

The reason is that hyperparameters usually have different importance. One might matter a lot (say, learning rate) while another barely affects performance (say, a specific regularization coefficient). Grid search allocates equal budget to every dimension. Random search, by sampling randomly, explores more unique values of the important dimension.

Example 1: Random vs grid efficiency

Suppose you have 2 hyperparameters and a budget of 25 evaluations.

Grid search: 5 values for param A, 5 values for param B. You get 25 evaluations, but only 5 distinct values of each parameter.

Random search: 25 random points in the 2D space. You get 25 distinct values of param A and 25 distinct values of param B.

If param A is the only one that really matters, grid search effectively tests 5 settings for it. Random search tests 25. With the same budget, random search covers 5 times more of the important dimension.

This gets worse as dimensions grow. With 5 hyperparameters on a grid of 3 values each, you need 3^5 = 243 evaluations but still only see 3 values per parameter. Random search with 243 evaluations sees 243 distinct values of each.

When to use random search: as a baseline, always. It is simple to implement, easy to parallelize, and surprisingly effective.
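Random search is also short enough to write by hand. Sampling the learning rate log-uniformly (rather than uniformly) is common practice, since its plausible values span several orders of magnitude; the search space below is illustrative:

```python
import random

def sample_config(rng):
    """Draw one configuration; every hyperparameter gets a fresh value."""
    return {
        "lr": 10 ** rng.uniform(-4, -1),  # log-uniform over [1e-4, 1e-1]
        "batch_size": rng.choice([16, 32, 64, 128, 256, 512]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(evaluate, budget=25, seed=0):
    """Draw `budget` independent configurations and keep the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(budget):
        config = sample_config(rng)
        trials.append((evaluate(config), config))
    return min(trials, key=lambda t: t[0])  # (best loss, best config)
```

Because every draw is independent, the trials can run in parallel with no coordination, and each one contributes a new distinct value of every hyperparameter, which is exactly the coverage advantage from Example 1.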

Comparing search strategies

graph TD
  subgraph Grid["Grid Search"]
      G1["Fixed grid of values"]
      G2["Tests every combination"]
      G3["Wastes budget on
unimportant dimensions"]
  end
  subgraph Random["Random Search"]
      R1["Random samples"]
      R2["More unique values
per dimension"]
      R3["Simple and parallelizable"]
  end
  subgraph Bayesian["Bayesian Optimization"]
      B1["Surrogate model of
objective function"]
      B2["Picks next point using
past results"]
      B3["Sample-efficient but
sequential"]
  end

  style Grid fill:#ffe6e6,stroke:#333,color:#000
  style Random fill:#fff3e6,stroke:#333,color:#000
  style Bayesian fill:#e6ffe6,stroke:#333,color:#000

Grid search is thorough but scales badly. Random search is the best simple baseline. Bayesian optimization squeezes the most from a small evaluation budget by learning from every past trial.

Bayesian optimization: surrogate model + acquisition function

[Figure: hyperparameter search results, one point per trial. Higher learning rates tend to perform worse, with a sweet spot around 0.001 to 0.01.]

Random search ignores what you have already learned. If you tried learning rate 0.01 and got loss 0.8, and tried 0.001 and got loss 0.5, it makes sense to explore near 0.001 next. Bayesian optimization does exactly this.

The idea has two parts:

  1. Surrogate model: a cheap-to-evaluate model that approximates the true objective (validation loss as a function of hyperparameters)
  2. Acquisition function: decides where to sample next, balancing exploration (trying uncertain regions) and exploitation (trying regions likely to be good)
graph LR
  A[Fit surrogate to
observed results] --> B[Optimize acquisition
function]
  B --> C[Evaluate true
objective at
suggested point]
  C --> D[Add result to
observation set]
  D --> A

Gaussian process surrogate

The most common surrogate is a Gaussian process (GP). A GP defines a distribution over functions. Given observed points {(x_1, y_1), …, (x_n, y_n)}, it provides a posterior mean μ(x) and uncertainty σ(x) at any new point x.

Key properties:

  • μ(x) is the predicted performance at x
  • σ(x) is high in unexplored regions and low near observed points
  • The GP naturally balances what we know and what we don't

The main limitation: GPs scale as O(n^3) in the number of observations, so they slow down after a few hundred evaluations. For larger budgets, tree-based surrogates (like TPE in Hyperopt) or random forests (like SMAC) are more practical.
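The GP posterior itself is only a few linear-algebra operations. A minimal 1-D sketch with an RBF kernel (the length scale and noise values here are arbitrary choices, not tuned):

```python
import numpy as np

def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean mu(x) and std sigma(x) at x_new, given observations."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))  # the O(n^3) solve lives here
    K_s = rbf(x_obs, x_new)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    cov = rbf(x_new, x_new) - K_s.T @ np.linalg.solve(K, K_s)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Evaluating this at a grid of points shows the behavior described above: σ(x) collapses to near zero at observed points and grows back toward the prior scale away from them, which is exactly what the acquisition function exploits.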

Bayesian optimization: learn, suggest, evaluate, repeat

graph LR
  FIT["Fit surrogate model
to all observations"] --> ACQ["Find point that
maximizes acquisition
function"]
  ACQ --> EVAL["Train model with
those hyperparameters"]
  EVAL --> ADD["Record validation
performance"]
  ADD --> FIT

  style FIT fill:#4a9eff,color:#fff
  style ACQ fill:#ff9,stroke:#333,color:#000
  style EVAL fill:#ff6b6b,color:#fff
  style ADD fill:#51cf66,color:#fff

The surrogate predicts performance at untried configurations. The acquisition function balances exploring uncertain regions with exploiting regions that look promising. Each iteration adds one observation, making the surrogate more accurate.

Acquisition functions

Given μ(x) and σ(x) from the surrogate, how do you pick the next point?

Expected Improvement (EI): how much improvement over the current best f* do we expect?

EI(x) = (μ(x) − f*) Φ(Z) + σ(x) ϕ(Z),  Z = (μ(x) − f*) / σ(x)

where Φ is the standard normal CDF and ϕ is the standard normal PDF.

Upper Confidence Bound (UCB): pick the point with the best optimistic estimate:

UCB(x) = μ(x) + κ σ(x)

where κ controls exploration. Higher κ means more exploration.

Example 2: Computing Expected Improvement

We are minimizing loss, so we want μ(x) < f*. Let's flip the sign convention: define improvement as f* − μ(x) (higher is better when μ is lower than the best seen).

Given:

  • μ(x) = 0.6 (predicted loss at candidate point)
  • σ(x) = 0.3 (uncertainty)
  • f* = 0.7 (best loss seen so far, lower is better)

We want to improve on 0.7. The candidate predicts 0.6, which is already better.

Z = (f* − μ(x)) / σ(x) = (0.7 − 0.6) / 0.3 = 0.1 / 0.3 ≈ 0.333

Looking up standard normal values: Φ(0.333) ≈ 0.630, ϕ(0.333) ≈ 0.378.

EI(x) = (f* − μ(x)) Φ(Z) + σ(x) ϕ(Z) = 0.1 × 0.630 + 0.3 × 0.378 = 0.063 + 0.113 = 0.176

This means we expect an improvement of 0.176 at this point. The first term (0.063) comes from exploitation (the mean is already better than f*). The second term (0.113) comes from exploration (high uncertainty means there might be even more improvement).

Hyperband: bandit-based early stopping

Bayesian optimization is sequential: it evaluates one configuration at a time. Hyperband takes a different approach. Instead of being smart about which configurations to try, it is smart about how long to train them.

The insight: most bad configurations show they are bad early. You don’t need to train a terrible learning rate for 100 epochs to know it is terrible. Hyperband starts many configurations with a small budget, then keeps only the best ones and gives them more budget.

graph TD
  subgraph "Round 0"
      R0["81 configs × 1 epoch each"]
  end
  subgraph "Round 1"
      R1["27 configs × 3 epochs each"]
  end
  subgraph "Round 2"
      R2["9 configs × 9 epochs each"]
  end
  subgraph "Round 3"
      R3["3 configs × 27 epochs each"]
  end
  subgraph "Round 4"
      R4["1 config × 81 epochs"]
  end
  R0 -->|"Keep top 1/3"| R1
  R1 -->|"Keep top 1/3"| R2
  R2 -->|"Keep top 1/3"| R3
  R3 -->|"Keep top 1/3"| R4

Example 3: Hyperband bracket computation

Parameters: n = 81 initial configurations, η = 3 (reduction factor).

The maximum number of rounds is s_max = ⌊log_η(n)⌋ = ⌊log_3(81)⌋ = 4.

Each round keeps the top 1/η fraction and multiplies the budget by η:

Round | Configs remaining | Budget per config | Total budget used
0 | 81 | 1 | 81
1 | ⌊81/3⌋ = 27 | 1 × 3 = 3 | 81
2 | ⌊27/3⌋ = 9 | 3 × 3 = 9 | 81
3 | ⌊9/3⌋ = 3 | 9 × 3 = 27 | 81
4 | ⌊3/3⌋ = 1 | 27 × 3 = 81 | 81

Notice something elegant: each round uses the same total budget (81 units). The whole bracket uses 5 × 81 = 405 budget units to evaluate 81 configurations, with the best one getting the full 81 epochs.

Compare to training all 81 configurations for 81 epochs: that would cost 81 × 81 = 6,561 units. Hyperband uses about 16 times less compute.
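The bracket arithmetic is easy to reproduce. A sketch of the schedule computation, with budget measured in epochs:

```python
def bracket_schedule(n=81, eta=3, min_budget=1):
    """Successive-halving schedule: (configs, budget each, total) per round."""
    rounds = []
    budget = min_budget
    while True:
        rounds.append((n, budget, n * budget))
        if n <= 1:
            break
        n //= eta      # keep only the top 1/eta of configurations
        budget *= eta  # give each survivor eta times more budget
    return rounds
```

With the defaults this reproduces the table above: five rounds of 81 budget units each, 405 in total. Full Hyperband additionally runs several such brackets with different trade-offs between `n` and `min_budget`, to hedge against configurations that only look good late in training.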

Early stopping in hyperparameter search: cut losses early

graph TD
  START["Start 81 configurations
with minimal budget"] --> EVAL1["Evaluate after 1 epoch"]
  EVAL1 --> KEEP1["Keep top 27"]
  EVAL1 --> DROP1["Stop bottom 54
(clearly bad)"]
  KEEP1 --> EVAL2["Train to 3 epochs"]
  EVAL2 --> KEEP2["Keep top 9"]
  EVAL2 --> DROP2["Stop next 18"]
  KEEP2 --> EVAL3["Train to full budget"]
  EVAL3 --> BEST["Best configuration"]

  style DROP1 fill:#ff6b6b,color:#fff
  style DROP2 fill:#ff6b6b,color:#fff
  style BEST fill:#51cf66,color:#fff

Most bad configurations reveal themselves early. A learning rate of 10.0 will diverge after one epoch. There is no need to train it for 100 epochs. Successive halving kills bad configurations fast and allocates saved compute to promising ones.

ASHA: asynchronous Hyperband

Standard Hyperband is synchronous: you must wait for all configurations in a round to finish before promoting the best. ASHA (Asynchronous Successive Halving Algorithm) fixes this. Workers pick up the next available task, and promotions happen as soon as enough results at a given rung are available.

This is critical for distributed training where different configurations take different amounts of wall-clock time. ASHA keeps all workers busy instead of waiting for stragglers.

PBT: population-based training

Population-Based Training combines hyperparameter search with training in a clever way. Instead of training separate runs to completion, PBT maintains a population of models training in parallel. Periodically, each model checks if another member is doing better. If so, it copies that member’s weights (exploit) and perturbs the hyperparameters (explore).

The key difference from other methods: PBT can change hyperparameters during training. A learning rate schedule emerges automatically. This is especially useful for settings where the optimal hyperparameters change over the course of training.
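The exploit/explore step can be sketched as a toy, assuming each population member is a dict carrying its weights, its learning rate, and its current validation score (all of these field names and the perturbation factors are illustrative):

```python
import random

def pbt_step(population, rng, factors=(0.8, 1.2)):
    """One PBT step: the bottom half copies the top half's weights
    (exploit), then perturbs its own learning rate (explore)."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    half = len(ranked) // 2
    for loser, winner in zip(ranked[half:], ranked[:half]):
        loser["weights"] = dict(winner["weights"])        # exploit: copy weights
        loser["lr"] = winner["lr"] * rng.choice(factors)  # explore: perturb LR
    return ranked
```

Calling this periodically between training intervals is what lets the learning rate change over the course of training: the sequence of surviving perturbations is the emergent schedule.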

HPO methods comparison

Method | Evaluations needed | Parallelizable | Handles conditional HPs | Best for | Limitation
Grid search | Exponential in dims | ✓ Fully | – | ≤3 hyperparams | Curse of dimensionality
Random search | Linear in budget | ✓ Fully | – | Quick baseline | Ignores past results
Bayesian (GP) | Low (10-100) | Limited | ✓ With encoding | Expensive evaluations | Scales poorly past ~1000 obs
TPE | Low-moderate | Limited | ✓ Natively | Mixed types, conditionals | Less sample-efficient than GP
Hyperband | Moderate | ✓ Within rounds | – | Large search spaces | Assumes early stopping signal
ASHA | Moderate | ✓ Fully async | – | Distributed clusters | Same assumption as Hyperband
PBT | Population size | ✓ Fully | – | Schedules, long training | Complex to implement

NAS as AutoML

Neural architecture search takes AutoML further. Instead of just tuning hyperparameters for a fixed architecture, NAS searches over the architecture itself: which operations to use, how to connect layers, how deep and wide to go. Many NAS methods use the same optimization techniques from this article (Bayesian optimization, evolutionary search, bandit methods) applied to a much larger search space.

Practical recommendations

  1. Start with random search. It is your baseline and often good enough.
  2. Use Bayesian optimization when each evaluation is expensive (hours or days per run) and you have fewer than ~100 evaluations to spend.
  3. Use Hyperband/ASHA when you can evaluate configurations cheaply at small budgets (early epochs) and you have a large search space.
  4. Use PBT for very long training runs where you suspect the optimal hyperparameters change during training.
  5. Log everything. Whatever method you use, save all results. Even failed runs contain useful information.

What comes next

Once you can find good hyperparameters, the natural next step is to search over the architecture itself. Neural architecture search applies these optimization ideas to design network structures automatically.
