AutoML and hyperparameter optimization
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Training a neural network involves two kinds of decisions. The model learns weights through gradient descent. But learning rate, batch size, number of layers, dropout rate, weight decay: you set these before training starts. These are hyperparameters, and picking them well can mean the difference between a model that barely works and one that reaches state of the art.
Prerequisites
You should be familiar with training neural networks and model selection with cross-validation. Understanding gradient descent and SGD variants helps with the intuition behind search strategies.
The tuning problem
You have a model that works, but there are dozens of settings to tune: learning rate, batch size, number of layers, dropout rate, weight decay. Each setting interacts with the others. A learning rate that works with batch size 32 might fail with batch size 256. Manual tuning is slow, unreliable, and does not scale.
| Hyperparameter | Typical Range | Why It Matters |
|---|---|---|
| Learning rate | 0.0001 to 0.1 | Too high causes divergence, too low wastes time |
| Batch size | 16 to 512 | Affects gradient noise and generalization |
| Number of layers | 2 to 50+ | Controls model capacity and depth of feature hierarchy |
| Dropout rate | 0.0 to 0.5 | Prevents overfitting but hurts capacity if too high |
| Weight decay | 0.0001 to 0.01 | Shrinks weights to reduce overfitting |
With 5 hyperparameters and 10 candidate values each, the search space has $10^5 = 100{,}000$ combinations. You cannot try them all. AutoML methods search this space intelligently.
The AutoML loop
graph TD
A["Define Search Space"] --> B["Pick Configuration"]
B --> C["Train and Evaluate"]
C --> D["Record Result"]
D --> E["Choose Next Configuration"]
E --> B
style A fill:#4a9eff,color:#fff
style C fill:#ff9,stroke:#333,color:#000
Every AutoML method follows this loop. They differ in how they choose the next configuration. Grid search tries every point on a grid. Random search samples randomly. Bayesian optimization uses past results to pick promising candidates.
Now let’s formalize each approach.
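Before formalizing, the loop above can be sketched in code. Everything here is a toy stand-in: `train_and_evaluate` fakes a training run, and `sample_config` plays the role of the "choose next configuration" step (random sampling here; the methods below swap in smarter choices).

```python
import random

def train_and_evaluate(config):
    # Stand-in for a real training run: returns a fake validation loss
    # that is lowest near lr=0.01, dropout=0.1.
    return abs(config["lr"] - 0.01) * 100 + abs(config["dropout"] - 0.1)

def sample_config():
    # "Pick Configuration" step: random sampling here; grid search or
    # Bayesian optimization would plug in a different strategy.
    return {"lr": 10 ** random.uniform(-4, -1),
            "dropout": random.uniform(0.0, 0.5)}

def automl_loop(budget):
    history = []                             # "Record Result"
    for _ in range(budget):
        config = sample_config()             # "Choose Next Configuration"
        loss = train_and_evaluate(config)    # "Train and Evaluate"
        history.append((loss, config))
    return min(history, key=lambda t: t[0])  # best (loss, config) pair

best_loss, best_config = automl_loop(budget=50)
```

The only piece that changes between methods is how the next configuration is chosen; the rest of the loop stays the same.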
What hyperparameters are and why they matter
A hyperparameter is any setting you choose before training. Weights are learned; hyperparameters are chosen. Common examples:
- Learning rate: too high and training diverges, too low and it crawls
- Batch size: affects gradient noise and generalization
- Architecture choices: number of layers, hidden units, kernel sizes
- Regularization: dropout rate, weight decay strength
- Optimizer settings: momentum, beta values for Adam
The problem is that these settings interact in complex ways, so each one's best value depends on the others. Manual tuning is slow, unreliable, and doesn't scale. AutoML methods automate this search.
Grid search: exhaustive but expensive
The simplest approach. You define a set of values for each hyperparameter, then evaluate every combination.
Setup: pick a discrete set for each hyperparameter and train on every point in the Cartesian product.
For two hyperparameters with 5 values each, you get $5^2 = 25$ evaluations. For three hyperparameters with 5 values each, that's $5^3 = 125$. The cost grows exponentially with the number of hyperparameters.
Grid search has one advantage: it is thorough within its grid. But it wastes enormous budget on unimportant hyperparameters, and it misses values between grid points.
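A minimal sketch of grid search over a hypothetical search space; `toy_loss` stands in for a real training-and-validation run.

```python
from itertools import product

# Hypothetical discrete value sets for each hyperparameter.
grid = {
    "lr": [1e-4, 1e-3, 1e-2, 0.1],
    "batch_size": [32, 128, 512],
    "dropout": [0.0, 0.25, 0.5],
}

def grid_search(grid, evaluate):
    best = None
    names = list(grid)
    # Evaluate every point in the Cartesian product: 4 * 3 * 3 = 36 runs.
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if best is None or score < best[0]:
            best = (score, config)
    return best

# Toy objective standing in for validation loss after training.
toy_loss = lambda c: abs(c["lr"] - 1e-3) + c["dropout"] * 0.1
best_loss, best_config = grid_search(grid, toy_loss)
```

Adding one more hyperparameter with 5 values multiplies the run count by 5, which is the exponential blow-up described above.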
Random search: why it beats grid
Bergstra and Bengio (2012) showed a simple but powerful result: random search finds good hyperparameters faster than grid search in most practical settings.
The reason is that hyperparameters usually have different importance. One might matter a lot (say, learning rate) while another barely affects performance (say, a specific regularization coefficient). Grid search allocates equal budget to every dimension. Random search, by sampling randomly, explores more unique values of the important dimension.
Example 1: Random vs grid efficiency
Suppose you have 2 hyperparameters and a budget of 25 evaluations.
Grid search: 5 values for param A, 5 values for param B. You get 25 evaluations, but only 5 distinct values of each parameter.
Random search: 25 random points in the 2D space. You get 25 distinct values of param A and 25 distinct values of param B.
If param A is the only one that really matters, grid search effectively tests 5 settings for it. Random search tests 25. With the same budget, random search covers 5 times more of the important dimension.
This gets worse as dimensions grow. With 5 hyperparameters on a grid of 3 values each, you need $3^5 = 243$ evaluations but still only see 3 values per parameter. Random search with 243 evaluations sees 243 distinct values of each.
When to use random search: as a baseline, always. It is simple to implement, easy to parallelize, and surprisingly effective.
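A random search baseline can be sketched in a few lines. Note the log-uniform sampling for the learning rate, so each order of magnitude is equally likely; `toy_loss` is again a stand-in for a real training run.

```python
import random

def sample():
    # Log-uniform learning rate: 1e-4 to 1e-1, uniform in the exponent.
    return {
        "lr": 10 ** random.uniform(-4, -1),
        "batch_size": random.choice([16, 32, 64, 128, 256, 512]),
        "dropout": random.uniform(0.0, 0.5),
    }

def random_search(budget, evaluate):
    # Each trial is an independent sample, so trials parallelize trivially.
    trials = [(evaluate(c), c) for c in (sample() for _ in range(budget))]
    return min(trials, key=lambda t: t[0])

toy_loss = lambda c: abs(c["lr"] - 3e-3)  # only lr matters here
best_loss, best_config = random_search(25, toy_loss)
```

With a budget of 25, this tests 25 distinct learning rates, versus 5 on a 5x5 grid; that is the Bergstra and Bengio argument in code.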
Comparing search strategies
graph TD
subgraph Grid["Grid Search"]
G1["Fixed grid of values"]
G2["Tests every combination"]
G3["Wastes budget on
unimportant dimensions"]
end
subgraph Random["Random Search"]
R1["Random samples"]
R2["More unique values
per dimension"]
R3["Simple and parallelizable"]
end
subgraph Bayesian["Bayesian Optimization"]
B1["Surrogate model of
objective function"]
B2["Picks next point using
past results"]
B3["Sample-efficient but
sequential"]
end
style Grid fill:#ffe6e6,stroke:#333,color:#000
style Random fill:#fff3e6,stroke:#333,color:#000
style Bayesian fill:#e6ffe6,stroke:#333,color:#000
Grid search is thorough but scales badly. Random search is the best simple baseline. Bayesian optimization squeezes the most from a small evaluation budget by learning from every past trial.
Bayesian optimization: surrogate model + acquisition function
Figure: hyperparameter search results, where each point is one trial. Higher learning rates tend to perform worse, with a sweet spot around 0.001 to 0.01.
Random search ignores what you have already learned. If you tried learning rate 0.01 and got loss 0.8, and tried 0.001 and got loss 0.5, it makes sense to explore near 0.001 next. Bayesian optimization does exactly this.
The idea has two parts:
- Surrogate model: a cheap-to-evaluate model that approximates the true objective (validation loss as a function of hyperparameters)
- Acquisition function: decides where to sample next, balancing exploration (trying uncertain regions) and exploitation (trying regions likely to be good)
graph LR
A[Fit surrogate to observed results] --> B[Optimize acquisition function]
B --> C[Evaluate true objective at suggested point]
C --> D[Add result to observation set]
D --> A
Gaussian process surrogate
The most common surrogate is a Gaussian process (GP). A GP defines a distribution over functions. Given observed points $(x_1, y_1), \dots, (x_n, y_n)$, it provides a posterior mean $\mu(x)$ and uncertainty $\sigma(x)$ at any new point $x$.
Key properties:
- $\mu(x)$ is the predicted performance at $x$
- $\sigma(x)$ is high in unexplored regions and low near observed points
- The GP naturally balances what we know and what we don't
The main limitation: GPs scale as $O(n^3)$ in the number of observations $n$, so they slow down after a few hundred evaluations. For larger budgets, tree-based surrogates (like TPE in Hyperopt) or random forests (like SMAC) are more practical.
Bayesian optimization: learn, suggest, evaluate, repeat
graph LR
FIT["Fit surrogate model to all observations"] --> ACQ["Find point that maximizes acquisition function"]
ACQ --> EVAL["Train model with those hyperparameters"]
EVAL --> ADD["Record validation performance"]
ADD --> FIT
style FIT fill:#4a9eff,color:#fff
style ACQ fill:#ff9,stroke:#333,color:#000
style EVAL fill:#ff6b6b,color:#fff
style ADD fill:#51cf66,color:#fff
The surrogate predicts performance at untried configurations. The acquisition function balances exploring uncertain regions with exploiting regions that look promising. Each iteration adds one observation, making the surrogate more accurate.
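The learn-suggest-evaluate loop can be made concrete with a deliberately simplified surrogate. This is not a Gaussian process: the toy surrogate below predicts the value of the nearest observed point and uses the distance to it as the uncertainty, just to show the loop's structure; `objective` stands in for an expensive training run.

```python
import random

def surrogate(history, x):
    # Toy stand-in for a GP posterior: mean = value of the nearest
    # observed point, uncertainty = distance to it.
    dist, y = min((abs(x - xi), yi) for xi, yi in history)
    return y, dist  # (mu, sigma)

def objective(x):
    # Stand-in for an expensive training run; minimum at x = 0.3.
    return (x - 0.3) ** 2

def bayes_opt(n_iter=20, kappa=2.0):
    history = [(x, objective(x)) for x in (0.0, 1.0)]  # initial observations
    for _ in range(n_iter):
        # Optimize the acquisition function over random candidates.
        # We minimize a loss, so use the lower confidence bound:
        # mu - kappa * sigma (low predicted loss OR high uncertainty wins).
        candidates = [random.random() for _ in range(100)]
        x_next = min(candidates,
                     key=lambda x: surrogate(history, x)[0]
                                   - kappa * surrogate(history, x)[1])
        history.append((x_next, objective(x_next)))  # evaluate and record
    return min(history, key=lambda p: p[1])

x_best, y_best = bayes_opt()
```

Early iterations land far from observed points (uncertainty dominates); later ones cluster near the best observed region. A real implementation would replace the toy surrogate with a GP or TPE.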
Acquisition functions
Given $\mu(x)$ and $\sigma(x)$ from the surrogate, how do you pick the next point?
Expected Improvement (EI): how much improvement over the current best $f_{\text{best}}$ do we expect? For minimization,

$$\mathrm{EI}(x) = \big(f_{\text{best}} - \mu(x)\big)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f_{\text{best}} - \mu(x)}{\sigma(x)}$$

where $\Phi$ is the standard normal CDF and $\phi$ is the standard normal PDF.
Upper Confidence Bound (UCB): pick the point with the best optimistic estimate:

$$\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$

where $\kappa$ controls exploration. Higher $\kappa$ means more exploration. (When minimizing a loss, use the lower confidence bound $\mu(x) - \kappa\,\sigma(x)$ and pick its smallest value.)
Example 2: Computing Expected Improvement
We are minimizing loss, so improvement means the predicted loss falls below the best seen: define $I(x) = \max\big(f_{\text{best}} - \mu(x),\, 0\big)$, which grows the further $\mu(x)$ drops below the best loss so far.
Given:
- $\mu(x) = 0.6$ (predicted loss at the candidate point)
- $\sigma(x) = 0.3$ (uncertainty)
- $f_{\text{best}} = 0.7$ (best loss seen so far, lower is better)
We want to improve on 0.7. The candidate predicts 0.6, which is already better.
First, $z = (0.7 - 0.6)/0.3 \approx 0.33$. Looking up standard normal values: $\Phi(0.33) \approx 0.63$, $\phi(0.33) \approx 0.377$. Then

$$\mathrm{EI} = 0.1 \times 0.63 + 0.3 \times 0.377 \approx 0.063 + 0.113 = 0.176$$

This means we expect an improvement of 0.176 at this point. The first term (0.063) comes from exploitation (the mean is already better than $f_{\text{best}}$). The second term (0.113) comes from exploration (high uncertainty means there might be even more improvement).
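The worked example can be checked directly. This sketch implements EI for minimization using only the standard library.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    # Standard normal PDF.
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def expected_improvement(mu, sigma, f_best):
    # EI for minimization: improvement means predicted loss below f_best.
    if sigma == 0.0:
        return max(f_best - mu, 0.0)  # no uncertainty: improvement is known
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm_cdf(z) + sigma * norm_pdf(z)

ei = expected_improvement(mu=0.6, sigma=0.3, f_best=0.7)
print(round(ei, 3))  # 0.176
```

The `sigma == 0.0` branch handles a fully confident surrogate, where EI collapses to the plain improvement.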
Hyperband: bandit-based early stopping
Bayesian optimization is sequential: it evaluates one configuration at a time. Hyperband takes a different approach. Instead of being smart about which configurations to try, it is smart about how long to train them.
The insight: most bad configurations show they are bad early. You don’t need to train a terrible learning rate for 100 epochs to know it is terrible. Hyperband starts many configurations with a small budget, then keeps only the best ones and gives them more budget.
graph TD
subgraph "Round 0"
R0["81 configs × 1 epoch each"]
end
subgraph "Round 1"
R1["27 configs × 3 epochs each"]
end
subgraph "Round 2"
R2["9 configs × 9 epochs each"]
end
subgraph "Round 3"
R3["3 configs × 27 epochs each"]
end
subgraph "Round 4"
R4["1 config × 81 epochs"]
end
R0 -->|"Keep top 1/3"| R1
R1 -->|"Keep top 1/3"| R2
R2 -->|"Keep top 1/3"| R3
R3 -->|"Keep top 1/3"| R4
Example 3: Hyperband bracket computation
Parameters: $n = 81$ initial configurations, $\eta = 3$ (reduction factor).
The maximum round index is $\log_\eta(n) = \log_3(81) = 4$, giving 5 rounds (0 through 4).
Each round keeps the top $1/\eta$ fraction and multiplies the budget by $\eta$:
| Round | Configs remaining | Budget per config | Total budget used |
|---|---|---|---|
| 0 | 81 | 1 | 81 |
| 1 | 27 | 3 | 81 |
| 2 | 9 | 9 | 81 |
| 3 | 3 | 27 | 81 |
| 4 | 1 | 81 | 81 |
Notice something elegant: each round uses the same total budget (81 units). The whole bracket uses $5 \times 81 = 405$ budget units to evaluate 81 configurations, with the best one getting the full 81 epochs.
Compare to training all 81 configurations for 81 epochs: that would cost $81 \times 81 = 6561$ units. Hyperband uses about 16 times less compute.
Early stopping in hyperparameter search: cut losses early
graph TD
START["Start 81 configurations with minimal budget"] --> EVAL1["Evaluate after 1 epoch"]
EVAL1 --> KEEP1["Keep top 27"]
EVAL1 --> DROP1["Stop bottom 54 (clearly bad)"]
KEEP1 --> EVAL2["Train to 3 epochs"]
EVAL2 --> KEEP2["Keep top 9"]
EVAL2 --> DROP2["Stop next 18"]
KEEP2 --> EVAL3["Train to full budget"]
EVAL3 --> BEST["Best configuration"]
style DROP1 fill:#ff6b6b,color:#fff
style DROP2 fill:#ff6b6b,color:#fff
style BEST fill:#51cf66,color:#fff
Most bad configurations reveal themselves early. A learning rate of 10.0 will diverge after one epoch. There is no need to train it for 100 epochs. Successive halving kills bad configurations fast and allocates saved compute to promising ones.
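Successive halving, the core of Hyperband, is short to sketch. Here `toy_evaluate` stands in for training a configuration for a given number of epochs and returning validation loss; a real implementation would checkpoint and resume survivors rather than re-evaluating from scratch.

```python
import random

def successive_halving(configs, evaluate, min_epochs=1, eta=3):
    # Each round: rank all survivors, keep the top 1/eta, and give
    # the survivors eta times more budget.
    epochs = min_epochs
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, epochs))
        configs = ranked[: max(1, len(configs) // eta)]
        epochs *= eta
    return configs[0]

# Toy objective: loss depends on lr quality plus a term that shrinks
# as training progresses.
def toy_evaluate(config, epochs):
    return abs(config["lr"] - 1e-2) + 1.0 / (1 + epochs)

configs = [{"lr": 10 ** random.uniform(-4, 0)} for _ in range(81)]
best = successive_halving(configs, toy_evaluate)
# Rounds: 81 configs x 1 epoch, 27 x 3, 9 x 9, 3 x 27, 1 x 81
```

With 81 starting configurations and eta=3, the survivor counts (81, 27, 9, 3, 1) match the bracket table above.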
ASHA: asynchronous Hyperband
Standard Hyperband is synchronous: you must wait for all configurations in a round to finish before promoting the best. ASHA (Asynchronous Successive Halving Algorithm) fixes this. Workers pick up the next available task, and promotions happen as soon as enough results at a given rung are available.
This is critical for distributed training where different configurations take different amounts of wall-clock time. ASHA keeps all workers busy instead of waiting for stragglers.
PBT: population-based training
Population-Based Training combines hyperparameter search with training in a clever way. Instead of training separate runs to completion, PBT maintains a population of models training in parallel. Periodically, each model checks if another member is doing better. If so, it copies that member’s weights (exploit) and perturbs the hyperparameters (explore).
The key difference from other methods: PBT can change hyperparameters during training. A learning rate schedule emerges automatically. This is especially useful for settings where the optimal hyperparameters change over the course of training.
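One exploit/explore step of PBT might be sketched as follows; the population members and their fields (`score`, `weights`, `lr`) are hypothetical stand-ins for real training state.

```python
import random

def pbt_step(population):
    # Each member compares itself against a random peer. If the peer is
    # doing better, copy its weights (exploit) and perturb its
    # hyperparameters (explore).
    for member in population:
        peer = random.choice(population)
        if peer["score"] > member["score"]:
            member["weights"] = dict(peer["weights"])            # exploit
            member["lr"] = peer["lr"] * random.choice([0.8, 1.25])  # explore
    return population

population = [
    {"score": random.random(),
     "weights": {"w": random.random()},
     "lr": 10 ** random.uniform(-4, -1)}
    for _ in range(8)
]
pbt_step(population)  # between steps, each member trains normally
```

Because perturbations compound across steps, each member's learning rate traces out a schedule over training, which is the emergent behavior described above.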
HPO methods comparison
| Method | Evaluations needed | Parallelizable | Handles conditional HPs | Best for | Limitation |
|---|---|---|---|---|---|
| Grid search | Exponential in dims | ✓ Fully | ✗ | ≤3 hyperparams | Curse of dimensionality |
| Random search | Linear in budget | ✓ Fully | ✗ | Quick baseline | Ignores past results |
| Bayesian (GP) | Low (10-100) | Limited | ✓ With encoding | Expensive evaluations | Scales poorly past ~1000 obs |
| TPE | Low-moderate | Limited | ✓ Natively | Mixed types, conditionals | Less sample-efficient than GP |
| Hyperband | Moderate | ✓ Within rounds | ✗ | Large search spaces | Assumes early stopping signal |
| ASHA | Moderate | ✓ Fully async | ✗ | Distributed clusters | Same assumption as Hyperband |
| PBT | Population size | ✓ Fully | ✓ | Schedules, long training | Complex to implement |
NAS as AutoML
Neural architecture search takes AutoML further. Instead of just tuning hyperparameters for a fixed architecture, NAS searches over the architecture itself: which operations to use, how to connect layers, how deep and wide to go. Many NAS methods use the same optimization techniques from this article (Bayesian optimization, evolutionary search, bandit methods) applied to a much larger search space.
Practical recommendations
- Start with random search. It is your baseline and often good enough.
- Use Bayesian optimization when each evaluation is expensive (hours or days per run) and you have fewer than ~100 evaluations to spend.
- Use Hyperband/ASHA when you can evaluate configurations cheaply at small budgets (early epochs) and you have a large search space.
- Use PBT for very long training runs where you suspect the optimal hyperparameters change during training.
- Log everything. Whatever method you use, save all results. Even failed runs contain useful information.
What comes next
Once you can find good hyperparameters, the natural next step is to search over the architecture itself. Neural architecture search applies these optimization ideas to design network structures automatically.