
AutoML and hyperparameter optimization

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Training a neural network involves two kinds of decisions. The model learns weights through gradient descent. But learning rate, batch size, number of layers, dropout rate, weight decay: you set these before training starts. These are hyperparameters, and picking them well can mean the difference between a model that barely works and one that reaches state of the art.

Prerequisites

You should be familiar with training neural networks and model selection with cross-validation. Understanding gradient descent and SGD variants helps with the intuition behind search strategies.

The tuning problem

You have a model that works, but there are dozens of settings to tune: learning rate, batch size, number of layers, dropout rate, weight decay. Each setting interacts with the others. A learning rate that works with batch size 32 might fail with batch size 256. Manual tuning is slow, unreliable, and does not scale.

Hyperparameter | Typical Range | Why It Matters
Learning rate | 0.0001 to 0.1 | Too high causes divergence, too low wastes time
Batch size | 16 to 512 | Affects gradient noise and generalization
Number of layers | 2 to 50+ | Controls model capacity and depth of feature hierarchy
Dropout rate | 0.0 to 0.5 | Prevents overfitting but hurts capacity if too high
Weight decay | 0.0001 to 0.01 | Shrinks weights to reduce overfitting

With 5 hyperparameters and 10 candidate values each, the search space has 10^5 = 100,000 combinations. You cannot try them all. AutoML methods search this space intelligently.

The AutoML loop

graph TD
  A["Define Search Space"] --> B["Pick Configuration"]
  B --> C["Train and Evaluate"]
  C --> D["Record Result"]
  D --> E["Choose Next Configuration"]
  E --> B

  style A fill:#4a9eff,color:#fff
  style C fill:#ff9,stroke:#333,color:#000

Every AutoML method follows this loop. They differ in how they choose the next configuration. Grid search tries every point on a grid. Random search samples randomly. Bayesian optimization uses past results to pick promising candidates.
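The loop above fits in a few lines of code: the only part that changes between methods is how the next configuration is proposed. A minimal sketch (the function names and signatures here are illustrative, not any particular library's API):

```python
def automl_loop(search_space, propose, evaluate, n_trials):
    """Generic AutoML loop. `propose` picks the next configuration
    from the space and the history; swapping it turns the same loop
    into grid, random, or Bayesian search."""
    history = []
    for _ in range(n_trials):
        config = propose(search_space, history)  # grid / random / Bayesian
        loss = evaluate(config)                  # train, measure validation loss
        history.append((config, loss))           # record the result
    return min(history, key=lambda t: t[1])      # best (config, loss) pair
```

Everything that follows in this article is a different implementation of `propose` (and, for Hyperband, a different budget policy inside `evaluate`).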

Now let’s formalize each approach.

What hyperparameters are and why they matter

A hyperparameter is any setting you choose before training. Weights are learned; hyperparameters are chosen. Common examples:

  • Learning rate: too high and training diverges, too low and it crawls
  • Batch size: affects gradient noise and generalization
  • Architecture choices: number of layers, hidden units, kernel sizes
  • Regularization: dropout rate, weight decay strength
  • Optimizer settings: momentum, beta values for Adam

The problem is that these settings interact in complex ways, so tuning them one at a time rarely finds the best combination. AutoML methods automate this search.

Grid search: exhaustive but expensive

The simplest approach. You define a set of values for each hyperparameter, then evaluate every combination.

Setup: pick a discrete set for each hyperparameter and train on every point in the Cartesian product.

For two hyperparameters with 5 values each, you get 5 × 5 = 25 evaluations. For three hyperparameters with 5 values each, that's 5^3 = 125. The cost grows exponentially with the number of hyperparameters.

Grid search has one advantage: it is thorough within its grid. But it wastes enormous budget on unimportant hyperparameters, and it misses values between grid points.
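Grid search is short to implement with `itertools.product`. A sketch, where the search space and the `evaluate` callback (which would train a model and return validation loss) are hypothetical stand-ins:

```python
from itertools import product

def grid_search(space, evaluate):
    """Evaluate every combination in the Cartesian product of the grid.
    `evaluate` returns validation loss (lower is better)."""
    keys = list(space)
    best_loss, best_config = float("inf"), None
    for values in product(*space.values()):
        config = dict(zip(keys, values))
        loss = evaluate(config)
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_config, best_loss

# 4 learning rates x 2 batch sizes = 8 training runs
space = {"lr": [1e-4, 1e-3, 1e-2, 1e-1], "batch_size": [32, 128]}
```

Note that adding a third hyperparameter with 4 values multiplies the run count from 8 to 32: this is the exponential growth described above.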

Random search: why it beats grid

Bergstra and Bengio (2012) showed a simple but powerful result: random search finds good hyperparameters faster than grid search in most practical settings.

The reason is that hyperparameters usually have different importance. One might matter a lot (say, learning rate) while another barely affects performance (say, a specific regularization coefficient). Grid search allocates equal budget to every dimension. Random search, by sampling randomly, explores more unique values of the important dimension.

Example 1: Random vs grid efficiency

Suppose you have 2 hyperparameters and a budget of 25 evaluations.

Grid search: 5 values for param A, 5 values for param B. You get 25 evaluations, but only 5 distinct values of each parameter.

Random search: 25 random points in the 2D space. You get 25 distinct values of param A and 25 distinct values of param B.

If param A is the only one that really matters, grid search effectively tests 5 settings for it. Random search tests 25. With the same budget, random search covers 5 times more of the important dimension.

This gets worse as dimensions grow. With 5 hyperparameters on a grid of 3 values each, you need 3^5 = 243 evaluations but still only see 3 values per parameter. Random search with 243 evaluations sees 243 distinct values of each.

When to use random search: as a baseline, always. It is simple to implement, easy to parallelize, and surprisingly effective.
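Random search is also short enough to write by hand. Sampling the learning rate log-uniformly (rather than uniformly) is common practice, since its plausible values span several orders of magnitude; the search space below is illustrative:

```python
import random

def sample_config(rng):
    """Draw one configuration; every hyperparameter gets a fresh value."""
    return {
        "lr": 10 ** rng.uniform(-4, -1),  # log-uniform over [1e-4, 1e-1]
        "batch_size": rng.choice([16, 32, 64, 128, 256, 512]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(evaluate, budget=25, seed=0):
    """Draw `budget` independent configurations and keep the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(budget):
        config = sample_config(rng)
        trials.append((evaluate(config), config))
    return min(trials, key=lambda t: t[0])  # (best loss, best config)
```

Because every draw is independent, the trials can run in parallel with no coordination, and each one contributes a new distinct value of every hyperparameter, which is exactly the coverage advantage from Example 1.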

Comparing search strategies

graph TD
  subgraph Grid["Grid Search"]
      G1["Fixed grid of values"]
      G2["Tests every combination"]
      G3["Wastes budget on
unimportant dimensions"]
  end
  subgraph Random["Random Search"]
      R1["Random samples"]
      R2["More unique values
per dimension"]
      R3["Simple and parallelizable"]
  end
  subgraph Bayesian["Bayesian Optimization"]
      B1["Surrogate model of
objective function"]
      B2["Picks next point using
past results"]
      B3["Sample-efficient but
sequential"]
  end

  style Grid fill:#ffe6e6,stroke:#333,color:#000
  style Random fill:#fff3e6,stroke:#333,color:#000
  style Bayesian fill:#e6ffe6,stroke:#333,color:#000

Grid search is thorough but scales badly. Random search is the best simple baseline. Bayesian optimization squeezes the most from a small evaluation budget by learning from every past trial.

Bayesian optimization: surrogate model + acquisition function

[Figure: hyperparameter search results, one point per trial. Higher learning rates tend to perform worse, with a sweet spot around 0.001 to 0.01.]

Random search ignores what you have already learned. If you tried learning rate 0.01 and got loss 0.8, and tried 0.001 and got loss 0.5, it makes sense to explore near 0.001 next. Bayesian optimization does exactly this.

The idea has two parts:

  1. Surrogate model: a cheap-to-evaluate model that approximates the true objective (validation loss as a function of hyperparameters)
  2. Acquisition function: decides where to sample next, balancing exploration (trying uncertain regions) and exploitation (trying regions likely to be good)
graph LR
  A[Fit surrogate to
observed results] --> B[Optimize acquisition
function]
  B --> C[Evaluate true
objective at
suggested point]
  C --> D[Add result to
observation set]
  D --> A

Gaussian process surrogate

The most common surrogate is a Gaussian process (GP). A GP defines a distribution over functions. Given observed points {(x_1, y_1), …, (x_n, y_n)}, it provides a posterior mean μ(x) and uncertainty σ(x) at any new point x.

Key properties:

  • μ(x) is the predicted performance at x
  • σ(x) is high in unexplored regions and low near observed points
  • The GP naturally balances what we know and what we don't

The main limitation: GPs scale as O(n^3) in the number of observations, so they slow down after a few hundred evaluations. For larger budgets, tree-based surrogates (like TPE in Hyperopt) or random forests (like SMAC) are more practical.
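The GP posterior itself is only a few linear-algebra operations. A minimal 1-D sketch with an RBF kernel (the length scale and noise values here are arbitrary choices, not tuned):

```python
import numpy as np

def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean mu(x) and std sigma(x) at x_new, given observations."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))  # the O(n^3) solve lives here
    K_s = rbf(x_obs, x_new)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    cov = rbf(x_new, x_new) - K_s.T @ np.linalg.solve(K, K_s)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Evaluating this at a grid of points shows the behavior described above: σ(x) collapses to near zero at observed points and grows back toward the prior scale away from them, which is exactly what the acquisition function exploits.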

Bayesian optimization: learn, suggest, evaluate, repeat

graph LR
  FIT["Fit surrogate model
to all observations"] --> ACQ["Find point that
maximizes acquisition
function"]
  ACQ --> EVAL["Train model with
those hyperparameters"]
  EVAL --> ADD["Record validation
performance"]
  ADD --> FIT

  style FIT fill:#4a9eff,color:#fff
  style ACQ fill:#ff9,stroke:#333,color:#000
  style EVAL fill:#ff6b6b,color:#fff
  style ADD fill:#51cf66,color:#fff

The surrogate predicts performance at untried configurations. The acquisition function balances exploring uncertain regions with exploiting regions that look promising. Each iteration adds one observation, making the surrogate more accurate.

Acquisition functions

Given μ(x) and σ(x) from the surrogate, how do you pick the next point?

Expected Improvement (EI): how much improvement over the current best f* do we expect?

EI(x) = (μ(x) − f*) Φ(Z) + σ(x) ϕ(Z),  Z = (μ(x) − f*) / σ(x)

where Φ is the standard normal CDF and ϕ is the standard normal PDF.

Upper Confidence Bound (UCB): pick the point with the best optimistic estimate:

UCB(x) = μ(x) + κ σ(x)

where κ controls exploration. Higher κ means more exploration.

Example 2: Computing Expected Improvement

We are minimizing loss, so we want μ(x) < f*. Let's flip the sign convention: define improvement as f* − μ(x) (higher is better when μ is lower than the best seen).

Given:

  • μ(x) = 0.6 (predicted loss at candidate point)
  • σ(x) = 0.3 (uncertainty)
  • f* = 0.7 (best loss seen so far, lower is better)

We want to improve on 0.7. The candidate predicts 0.6, which is already better.

Z = (f* − μ(x)) / σ(x) = (0.7 − 0.6) / 0.3 = 0.1 / 0.3 ≈ 0.333

Looking up standard normal values: Φ(0.333) ≈ 0.630, ϕ(0.333) ≈ 0.378.

EI(x) = (f* − μ(x)) Φ(Z) + σ(x) ϕ(Z) = 0.1 × 0.630 + 0.3 × 0.378 = 0.063 + 0.113 = 0.176

This means we expect an improvement of 0.176 at this point. The first term (0.063) comes from exploitation (the mean is already better than f*). The second term (0.113) comes from exploration (high uncertainty means there might be even more improvement).

Hyperband: bandit-based early stopping

Bayesian optimization is sequential: it evaluates one configuration at a time. Hyperband takes a different approach. Instead of being smart about which configurations to try, it is smart about how long to train them.

The insight: most bad configurations show they are bad early. You don’t need to train a terrible learning rate for 100 epochs to know it is terrible. Hyperband starts many configurations with a small budget, then keeps only the best ones and gives them more budget.

graph TD
  subgraph "Round 0"
      R0["81 configs × 1 epoch each"]
  end
  subgraph "Round 1"
      R1["27 configs × 3 epochs each"]
  end
  subgraph "Round 2"
      R2["9 configs × 9 epochs each"]
  end
  subgraph "Round 3"
      R3["3 configs × 27 epochs each"]
  end
  subgraph "Round 4"
      R4["1 config × 81 epochs"]
  end
  R0 -->|"Keep top 1/3"| R1
  R1 -->|"Keep top 1/3"| R2
  R2 -->|"Keep top 1/3"| R3
  R3 -->|"Keep top 1/3"| R4

Example 3: Hyperband bracket computation

Parameters: n = 81 initial configurations, η = 3 (reduction factor).

The maximum number of rounds is s_max = ⌊log_η(n)⌋ = ⌊log_3(81)⌋ = 4.

Each round keeps the top 1/η fraction and multiplies the budget by η:

Round | Configs remaining | Budget per config | Total budget used
0 | 81 | 1 | 81
1 | ⌊81/3⌋ = 27 | 1 × 3 = 3 | 81
2 | ⌊27/3⌋ = 9 | 3 × 3 = 9 | 81
3 | ⌊9/3⌋ = 3 | 9 × 3 = 27 | 81
4 | ⌊3/3⌋ = 1 | 27 × 3 = 81 | 81

Notice something elegant: each round uses the same total budget (81 units). The whole bracket uses 5 × 81 = 405 budget units to evaluate 81 configurations, with the best one getting the full 81 epochs.

Compare to training all 81 configurations for 81 epochs: that would cost 81 × 81 = 6,561 units. Hyperband uses about 16 times less compute.
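The bracket arithmetic is easy to reproduce. A sketch of the schedule computation, with budget measured in epochs:

```python
def bracket_schedule(n=81, eta=3, min_budget=1):
    """Successive-halving schedule: (configs, budget each, total) per round."""
    rounds = []
    budget = min_budget
    while True:
        rounds.append((n, budget, n * budget))
        if n <= 1:
            break
        n //= eta      # keep only the top 1/eta of configurations
        budget *= eta  # give each survivor eta times more budget
    return rounds
```

With the defaults this reproduces the table above: five rounds of 81 budget units each, 405 in total. Full Hyperband additionally runs several such brackets with different trade-offs between `n` and `min_budget`, to hedge against configurations that only look good late in training.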

Early stopping in hyperparameter search: cut losses early

graph TD
  START["Start 81 configurations
with minimal budget"] --> EVAL1["Evaluate after 1 epoch"]
  EVAL1 --> KEEP1["Keep top 27"]
  EVAL1 --> DROP1["Stop bottom 54
(clearly bad)"]
  KEEP1 --> EVAL2["Train to 3 epochs"]
  EVAL2 --> KEEP2["Keep top 9"]
  EVAL2 --> DROP2["Stop next 18"]
  KEEP2 --> EVAL3["Train to full budget"]
  EVAL3 --> BEST["Best configuration"]

  style DROP1 fill:#ff6b6b,color:#fff
  style DROP2 fill:#ff6b6b,color:#fff
  style BEST fill:#51cf66,color:#fff

Most bad configurations reveal themselves early. A learning rate of 10.0 will diverge after one epoch. There is no need to train it for 100 epochs. Successive halving kills bad configurations fast and allocates saved compute to promising ones.

ASHA: asynchronous Hyperband

Standard Hyperband is synchronous: you must wait for all configurations in a round to finish before promoting the best. ASHA (Asynchronous Successive Halving Algorithm) fixes this. Workers pick up the next available task, and promotions happen as soon as enough results at a given rung are available.

This is critical for distributed training where different configurations take different amounts of wall-clock time. ASHA keeps all workers busy instead of waiting for stragglers.

PBT: population-based training

Population-Based Training combines hyperparameter search with training in a clever way. Instead of training separate runs to completion, PBT maintains a population of models training in parallel. Periodically, each model checks if another member is doing better. If so, it copies that member’s weights (exploit) and perturbs the hyperparameters (explore).

The key difference from other methods: PBT can change hyperparameters during training. A learning rate schedule emerges automatically. This is especially useful for settings where the optimal hyperparameters change over the course of training.
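The exploit/explore step can be sketched as a toy, assuming each population member is a dict carrying its weights, its learning rate, and its current validation score (all of these field names and the perturbation factors are illustrative):

```python
import random

def pbt_step(population, rng, factors=(0.8, 1.2)):
    """One PBT step: the bottom half copies the top half's weights
    (exploit), then perturbs its own learning rate (explore)."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    half = len(ranked) // 2
    for loser, winner in zip(ranked[half:], ranked[:half]):
        loser["weights"] = dict(winner["weights"])        # exploit: copy weights
        loser["lr"] = winner["lr"] * rng.choice(factors)  # explore: perturb LR
    return ranked
```

Calling this periodically between training intervals is what lets the learning rate change over the course of training: the sequence of surviving perturbations is the emergent schedule.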

HPO methods comparison

Method | Evaluations needed | Parallelizable | Handles conditional HPs | Best for | Limitation
Grid search | Exponential in dims | ✓ Fully | – | ≤3 hyperparams | Curse of dimensionality
Random search | Linear in budget | ✓ Fully | – | Quick baseline | Ignores past results
Bayesian (GP) | Low (10-100) | Limited | ✓ With encoding | Expensive evaluations | Scales poorly past ~1000 obs
TPE | Low-moderate | Limited | ✓ Natively | Mixed types, conditionals | Less sample-efficient than GP
Hyperband | Moderate | ✓ Within rounds | – | Large search spaces | Assumes early stopping signal
ASHA | Moderate | ✓ Fully async | – | Distributed clusters | Same assumption as Hyperband
PBT | Population size | ✓ Fully | – | Schedules, long training | Complex to implement

NAS as AutoML

Neural architecture search takes AutoML further. Instead of just tuning hyperparameters for a fixed architecture, NAS searches over the architecture itself: which operations to use, how to connect layers, how deep and wide to go. Many NAS methods use the same optimization techniques from this article (Bayesian optimization, evolutionary search, bandit methods) applied to a much larger search space.

Practical recommendations

  1. Start with random search. It is your baseline and often good enough.
  2. Use Bayesian optimization when each evaluation is expensive (hours or days per run) and you have fewer than ~100 evaluations to spend.
  3. Use Hyperband/ASHA when you can evaluate configurations cheaply at small budgets (early epochs) and you have a large search space.
  4. Use PBT for very long training runs where you suspect the optimal hyperparameters change during training.
  5. Log everything. Whatever method you use, save all results. Even failed runs contain useful information.

What comes next

Once you can find good hyperparameters, the natural next step is to search over the architecture itself. Neural architecture search applies these optimization ideas to design network structures automatically.
