
Optimization in dynamic programming and optimal control

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should understand what an optimization problem is (see Part 1: What is optimization and why ML needs it) and basic probability, including expected values and conditional probabilities.


Sequential decisions as optimization

Most optimization problems we have studied so far are “one-shot”: pick a vector $x$, get a value $f(x)$, done. But many real problems involve making decisions over time, where each decision affects future options.

A robot navigating a maze. A trader buying and selling stocks. A doctor choosing treatments. In all these cases, the best action now depends on what might happen later. Dynamic programming (DP) is the framework for solving these problems optimally.

The core idea is deceptively simple: if you know the optimal strategy from every possible future state, you can figure out the optimal action right now by just looking one step ahead.


The Bellman equation

Consider a system with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, and a transition function. Taking action $a$ in state $s$ gives immediate reward $r(s, a)$ and moves you to a new state $s'$ according to a transition probability $P(s' \mid s, a)$.

The value function $V(s)$ is the maximum total expected reward starting from state $s$. It satisfies the Bellman equation:

$$V(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V(s') \right]$$

where $\gamma \in [0, 1)$ is the discount factor. The discount ensures the sum converges and models the preference for sooner rewards.

The Bellman equation says: the value of a state equals the best action’s immediate reward plus the discounted expected value of wherever you end up.

graph TD
  S["State s"] -->|"Action a₁"| S1["State s₁<br/>r(s, a₁) + γV(s₁)"]
  S -->|"Action a₂"| S2["State s₂<br/>r(s, a₂) + γV(s₂)"]
  S -->|"Action a₃"| S3["State s₃<br/>r(s, a₃) + γV(s₃)"]

Why this is optimization

The “max” in the Bellman equation is where the optimization happens. At every state, you solve a (usually small) optimization problem: which action gives the highest expected return? The remarkable thing about DP is that solving these local problems gives you the globally optimal strategy.
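That local optimization problem, the one-step lookahead, can be written down directly. A minimal sketch (the array layout and function name are my own, not from the article):

```python
import numpy as np

def bellman_backup(s, R, P, V, gamma=0.9):
    """One-step lookahead at state s.

    R[s, a]    : immediate reward for action a in state s
    P[s, a, t] : probability of moving from s to t under action a
    V[t]       : current value estimates
    Returns (best value, best action).
    """
    q = R[s] + gamma * P[s] @ V        # Q(s, a) for every action a at once
    return q.max(), int(q.argmax())

# Tiny 2-state, 2-action example with made-up numbers
R = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
V = np.zeros(2)
v, a = bellman_backup(0, R, P, V)      # with V = 0, only immediate reward matters
```

With $V \equiv 0$ the lookahead reduces to picking the largest immediate reward, which is the seed value iteration starts from.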


Value iteration

Value iteration is the simplest algorithm for solving the Bellman equation. Start with any initial guess $V_0(s)$ (say, all zeros) and iterate:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \right]$$

This update is a contraction mapping: $\|V_{k+1} - V^*\|_\infty \le \gamma \|V_k - V^*\|_\infty$. So value iteration converges geometrically to the true value function $V^*$, and convergence is faster when $\gamma$ is smaller.
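The geometric rate can be observed numerically. A sketch on a small random MDP (sizes and rewards are made up), tracking the sup-norm error against a near-exact $V^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9                       # made-up problem sizes
R = rng.uniform(0.0, 1.0, (nS, nA))             # random rewards
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'

def backup(V):
    # One sweep of the Bellman update over all states at once
    return (R + gamma * P @ V).max(axis=1)

# Iterate long enough that the result is numerically V*
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = backup(V_star)

# Track the sup-norm error of value iteration started from V0 = 0
V, errs = np.zeros(nS), []
for _ in range(20):
    V = backup(V)
    errs.append(np.abs(V - V_star).max())

ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
# Every ratio is at most gamma = 0.9, confirming the geometric rate
```

Plotting `errs` on a log scale would show a straight line with slope $\log \gamma$.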


Example 1: A 3-state MDP

Setup: Three states $\{A, B, C\}$. Two actions: “left” and “right.” Deterministic transitions (for simplicity). Discount factor $\gamma = 0.9$.

| State | Action | Next state | Reward |
|-------|--------|------------|--------|
| A     | left   | A          | 2      |
| A     | right  | B          | 5      |
| B     | left   | A          | 1      |
| B     | right  | C          | 3      |
| C     | left   | B          | 4      |
| C     | right  | C          | 1      |

Value iteration, starting with $V_0(A) = V_0(B) = V_0(C) = 0$:

Iteration 1:

$$V_1(A) = \max\{r(A, \text{left}) + 0.9\,V_0(A), \; r(A, \text{right}) + 0.9\,V_0(B)\} = \max\{2 + 0.9(0), \; 5 + 0.9(0)\} = \max\{2, 5\} = 5$$

Best action at $A$: right.

$$V_1(B) = \max\{1 + 0.9(0), \; 3 + 0.9(0)\} = \max\{1, 3\} = 3$$

Best action at $B$: right.

$$V_1(C) = \max\{4 + 0.9(0), \; 1 + 0.9(0)\} = \max\{4, 1\} = 4$$

Best action at $C$: left.

Iteration 2:

$$V_2(A) = \max\{2 + 0.9(5), \; 5 + 0.9(3)\} = \max\{6.5, 7.7\} = 7.7$$

Best action at $A$: right (go to $B$).

$$V_2(B) = \max\{1 + 0.9(5), \; 3 + 0.9(4)\} = \max\{5.5, 6.6\} = 6.6$$

Best action at $B$: right (go to $C$).

$$V_2(C) = \max\{4 + 0.9(3), \; 1 + 0.9(4)\} = \max\{6.7, 4.6\} = 6.7$$

Best action at $C$: left (go to $B$).

Iteration 3:

$$V_3(A) = \max\{2 + 0.9(7.7), \; 5 + 0.9(6.6)\} = \max\{8.93, 10.94\} = 10.94$$
$$V_3(B) = \max\{1 + 0.9(7.7), \; 3 + 0.9(6.7)\} = \max\{7.93, 9.03\} = 9.03$$
$$V_3(C) = \max\{4 + 0.9(6.6), \; 1 + 0.9(6.7)\} = \max\{9.94, 7.03\} = 9.94$$
| State | $V_0$ | $V_1$ | $V_2$ | $V_3$ |
|-------|-------|-------|-------|-------|
| A     | 0     | 5     | 7.7   | 10.94 |
| B     | 0     | 3     | 6.6   | 9.03  |
| C     | 0     | 4     | 6.7   | 9.94  |

The values are growing because rewards accumulate over the infinite horizon. The policy is stabilizing: $A \to \text{right}$, $B \to \text{right}$, $C \to \text{left}$. This means the optimal cycle is $A \to B \to C \to B \to C \to \cdots$, collecting reward 5 on the first step and then $3, 4, 3, 4, \dots$ forever.
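We can sanity-check this by summing the discounted rewards along that cycle directly; a truncated sum stands in for the infinite one, and its total is the limit that the $V_k(A)$ column is climbing toward:

```python
gamma = 0.9

# Discounted return of the cycle A -> B -> C -> B -> ... :
# reward 5 on the first step, then 3 and 4 alternating forever.
# Truncating at t = 500 is harmless since gamma**500 is negligible.
total = 5.0
for t in range(1, 500):
    total += gamma ** t * (3.0 if t % 2 == 1 else 4.0)

print(round(total, 2))   # 36.26
```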

Value function $V(s)$ after convergence.


Policy iteration

Value iteration updates values. Policy iteration alternates between two steps:

  1. Policy evaluation: Given a fixed policy $\pi$, compute $V^\pi(s)$ by solving the linear system:
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$

This is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns.

  2. Policy improvement: Update the policy:
$$\pi'(s) = \arg\max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^\pi(s') \right]$$

If $\pi' = \pi$, the policy is optimal. Otherwise, repeat with $\pi'$.

Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of possible policies), but in practice it converges in very few iterations, often 3 to 5.
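The two alternating steps can be sketched for deterministic transitions, solving the evaluation system exactly with numpy (the array layout mirrors the value-iteration code later in this article):

```python
import numpy as np

def policy_iteration(trans, rew, gamma=0.9):
    """trans[s, a] = next state (deterministic), rew[s, a] = reward."""
    n = trans.shape[0]
    pi = np.zeros(n, dtype=int)              # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P = np.zeros((n, n))
        P[np.arange(n), trans[np.arange(n), pi]] = 1.0
        V = np.linalg.solve(np.eye(n) - gamma * P, rew[np.arange(n), pi])
        # Policy improvement: greedy one-step lookahead
        Q = rew + gamma * V[trans]           # Q[s, a]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi                     # policy is stable, hence optimal
        pi = pi_new

# The 3-state MDP from Example 1: states A=0, B=1, C=2; actions left=0, right=1
trans = np.array([[0, 1], [0, 2], [1, 2]])
rew = np.array([[2, 5], [1, 3], [4, 1]])
V, pi = policy_iteration(trans, rew)
```

Starting from the all-left policy, this stabilizes after two improvement steps at right/right/left, matching Example 2 below.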


Example 2: Policy evaluation by solving a linear system

Using the MDP from Example 1, suppose the current policy is: $\pi(A) = \text{right}$, $\pi(B) = \text{right}$, $\pi(C) = \text{left}$.

The policy evaluation equations ($\gamma = 0.9$):

$$V(A) = 5 + 0.9\,V(B)$$
$$V(B) = 3 + 0.9\,V(C)$$
$$V(C) = 4 + 0.9\,V(B)$$

From equations 2 and 3, substitute:

$$V(B) = 3 + 0.9(4 + 0.9\,V(B)) = 3 + 3.6 + 0.81\,V(B) = 6.6 + 0.81\,V(B)$$
$$0.19\,V(B) = 6.6 \quad\Rightarrow\quad V(B) = 6.6 / 0.19 \approx 34.74$$
$$V(C) = 4 + 0.9(34.74) \approx 35.26$$
$$V(A) = 5 + 0.9(34.74) \approx 36.26$$

Now check policy improvement. For state $A$:

  • Left: $2 + 0.9(36.26) = 2 + 32.63 = 34.63$
  • Right: $5 + 0.9(34.74) = 5 + 31.26 = 36.26$

Right is still better. ✓

For state $B$:

  • Left: $1 + 0.9(36.26) = 1 + 32.63 = 33.63$
  • Right: $3 + 0.9(35.26) = 3 + 31.74 = 34.74$

Right is still better. ✓

For state $C$:

  • Left: $4 + 0.9(34.74) = 4 + 31.26 = 35.26$
  • Right: $1 + 0.9(35.26) = 1 + 31.74 = 32.74$

Left is still better. ✓

The policy is unchanged, so it is optimal. The optimal values are $V^*(A) = 36.26$, $V^*(B) = 34.74$, $V^*(C) = 35.26$.
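The hand substitution can be cross-checked by solving the $3 \times 3$ system in matrix form, writing the evaluation equations as $(I - \gamma P_\pi) V = r_\pi$:

```python
import numpy as np

gamma = 0.9
# Under pi: A -> B (reward 5), B -> C (reward 3), C -> B (reward 4)
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
r_pi = np.array([5.0, 3.0, 4.0])

# V = r + gamma * P V  <=>  (I - gamma * P) V = r
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.round(V, 2))   # [36.26 34.74 35.26]
```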


Example 3: Shortest path as a DP problem

Problem: Find the shortest path from node $S$ to node $T$ in this graph:

graph LR
  S -->|"4"| A
  S -->|"2"| B
  A -->|"5"| T
  A -->|"1"| B
  B -->|"8"| T
  B -->|"3"| A

Edge weights represent costs. This is a deterministic DP problem with $\gamma = 1$ (no discounting), and we minimize cost instead of maximizing reward.

Bellman equation (cost minimization):

$$V(s) = \min_{a} \left[ c(s, a) + V(\text{next}(s, a)) \right], \quad V(T) = 0$$

Value iteration (backwards from $T$):

$V(T) = 0$.

For nodes one step from $T$:

$$V(A) = \min\{c(A, T) + V(T), \; c(A, B) + V(B)\}$$

We do not know $V(B)$ yet, so let us iterate.

Iteration 0: $V_0(S) = V_0(A) = V_0(B) = \infty$, $V_0(T) = 0$.

Iteration 1:

$$V_1(A) = \min\{5 + 0, \; 1 + \infty\} = 5$$
$$V_1(B) = \min\{8 + 0, \; 3 + \infty\} = 8$$
$$V_1(S) = \min\{4 + \infty, \; 2 + \infty\} = \infty$$

With simultaneous (Jacobi-style) updates, finite values propagate only one edge per iteration, which is why $V_1(S)$ is still infinite. (A Gauss-Seidel sweep that reuses updated values within the same iteration would propagate faster.) Continuing with simultaneous updates:

Iteration 2:

$$V_2(A) = \min\{5 + 0, \; 1 + 8\} = \min\{5, 9\} = 5$$
$$V_2(B) = \min\{8 + 0, \; 3 + 5\} = \min\{8, 8\} = 8$$
$$V_2(S) = \min\{4 + 5, \; 2 + 8\} = \min\{9, 10\} = 9$$

Iteration 3:

$$V_3(A) = \min\{5, \; 1 + 8\} = 5$$
$$V_3(B) = \min\{8, \; 3 + 5\} = 8$$
$$V_3(S) = \min\{4 + 5, \; 2 + 8\} = 9$$

Converged. Optimal values: $V(S) = 9$, $V(A) = 5$, $V(B) = 8$.

Optimal path: From $S$, action “go to $A$” gives $4 + 5 = 9$ (better than “go to $B$,” which gives $2 + 8 = 10$). From $A$, action “go to $T$” gives $5 + 0 = 5$ (better than “go to $B$,” which gives $1 + 8 = 9$).

Shortest path: $S \to A \to T$ with cost 9.

Note: the path $S \to B \to A \to T$ has cost $2 + 3 + 5 = 10$. The path $S \to A \to B \to T$ has cost $4 + 1 + 8 = 13$. DP finds the global optimum without enumerating all paths.
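The same iteration can be sketched in a few lines of code, with min in place of max and $\gamma = 1$, using in-place (Gauss-Seidel) sweeps so finite values propagate quickly:

```python
INF = float("inf")

# edges[node] = list of (next_node, cost); T is the terminal
edges = {
    "S": [("A", 4), ("B", 2)],
    "A": [("T", 5), ("B", 1)],
    "B": [("T", 8), ("A", 3)],
    "T": [],
}

V = {n: INF for n in edges}
V["T"] = 0

for _ in range(len(edges)):          # |nodes| sweeps suffice on this graph
    for n, out in edges.items():
        if out:                      # Bellman update: best edge out of n
            V[n] = min(c + V[m] for m, c in out)

print(V["S"], V["A"], V["B"])        # 9 5 8
```

This is essentially the Bellman-Ford relaxation scheme, which works here because the graph has no negative-cost cycles.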


Optimal control

Optimal control is the continuous-time version of DP. Instead of discrete states and actions, you have a continuous state $x(t)$ evolving according to a differential equation:

$$\dot{x}(t) = f(x(t), u(t))$$

where $u(t)$ is the control input. The goal is to minimize a cost:

$$J = \int_0^T L(x(t), u(t)) \, dt + \Phi(x(T))$$

The continuous Bellman equation becomes the Hamilton-Jacobi-Bellman (HJB) equation:

$$-\frac{\partial V}{\partial t}(x, t) = \min_{u} \left[ L(x, u) + \nabla_x V(x, t)^T f(x, u) \right]$$

This PDE is generally hard to solve, but for linear dynamics with quadratic cost (the LQR problem), it reduces to a matrix Riccati equation with a clean closed-form solution.
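A discrete-time analogue makes this concrete: for dynamics $x_{k+1} = Ax_k + Bu_k$ and cost $\sum_k (x_k^T Q x_k + u_k^T R u_k)$, the Riccati recursion runs backward in time. A sketch with made-up matrices (a rough double integrator):

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # made-up discrete-time dynamics
B = np.array([[0.0], [0.1]])             # control pushes on the velocity
Q = np.eye(2)                             # state cost
R = np.array([[0.1]])                     # control cost

P = Q.copy()                              # terminal cost-to-go: x^T P x
for _ in range(500):                      # backward Riccati recursion
    # Optimal gain for this stage: u = -K x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

# After many steps P converges, giving a stationary linear feedback u_k = -K x_k
```

The fixed point of this recursion is the discrete algebraic Riccati equation; the continuous-time HJB version has the same structure with the recursion replaced by an ODE in $P(t)$.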


Connection to reinforcement learning

Reinforcement learning (RL) solves DP problems when you do not know the transition probabilities $P(s' \mid s, a)$ or the reward function $r(s, a)$. Instead of computing expectations analytically, you learn from experience.

| DP concept         | RL equivalent       |
|--------------------|---------------------|
| Value iteration    | Q-learning          |
| Policy evaluation  | TD learning         |
| Policy improvement | Policy gradient     |
| Known model        | Model-free learning |

Policy gradient as optimization

In policy gradient methods, you parameterize the policy as $\pi_\theta(a \mid s)$ and optimize the expected return:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]$$

The gradient of this objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t \nabla_\theta \ln \pi_\theta(a_t \mid s_t) \, G_t \right]$$

where $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$ is the return from time $t$. This is the REINFORCE estimator.

You then apply stochastic gradient ascent to improve θ\theta. Modern RL algorithms like PPO and SAC add various tricks (baselines, clipping, entropy regularization), but at their core they are doing stochastic optimization of the policy parameters.
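On a one-step bandit (a single state, so $G_t$ is just the immediate reward) the REINFORCE estimator can be checked against the exact gradient. A sketch with a softmax policy and made-up rewards and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 3.0, 2.0])          # made-up reward for each of 3 actions
theta = np.array([0.1, -0.2, 0.3])     # made-up policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) r(a), using grad log pi(a) = e_a - pi
exact = sum(pi[a] * r[a] * (np.eye(3)[a] - pi) for a in range(3))

# REINFORCE: sample actions from pi, average r(a) * grad log pi(a)
N = 200_000
actions = rng.choice(3, size=N, p=pi)
onehot = np.eye(3)[actions]            # one-hot encoding of sampled actions
estimate = (r[actions][:, None] * (onehot - pi)).mean(axis=0)
# estimate agrees with exact up to Monte Carlo noise
```

The estimator is unbiased but noisy, which is why practical algorithms subtract a baseline from the return before weighting the score function.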


Curse of dimensionality

The main limitation of exact DP is that the state space must be enumerable. With $n$ continuous state variables, each discretized into $k$ bins, you get $k^n$ states. This exponential blowup is the curse of dimensionality (a term coined by Richard Bellman himself).

Solutions:

  • Function approximation: Represent $V(s)$ with a neural network instead of a table. This is deep RL.
  • Linear function approximation: $V(s) \approx \phi(s)^T w$ for features $\phi(s)$.
  • Monte Carlo tree search: Only explore promising parts of the state space.
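A small sketch of the linear case: semi-gradient TD(0) evaluating the random policy on a 5-state random walk (a standard textbook setup, not from this article). States 1 through 5 sit between terminals 0 and 6, with reward +1 only for exiting on the right; the true values $V(s) = s/6$ are exactly linear in the single feature $\phi(s) = s$, so the best weight is $w = 1/6$:

```python
import numpy as np

rng = np.random.default_rng(1)

w, alpha, gamma = 0.0, 0.005, 1.0
w_tail = []

for episode in range(10_000):
    s = 3                                    # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])     # random policy: step left or right
        reward = 1.0 if s_next == 6 else 0.0
        v_next = 0.0 if s_next in (0, 6) else w * s_next
        # Semi-gradient TD(0): w += alpha * (TD error) * grad_w V(s), grad = phi(s) = s
        w += alpha * (reward + gamma * v_next - w * s) * s
        s = s_next
    if episode >= 5_000:                     # average the tail to smooth SGD noise
        w_tail.append(w)

w_avg = float(np.mean(w_tail))               # close to 1/6 ~ 0.167
```

One weight summarizes five values; in deep RL the same update runs with a neural network in place of $w \cdot s$.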

When to use DP

✓ Problems with natural sequential structure (routing, scheduling, inventory management).

✓ Small discrete state spaces where exact solutions are feasible.

✓ As the conceptual foundation for RL algorithms that handle large state spaces.

⚠ Exact DP is impractical for continuous or very large state spaces.

⚠ The model (transitions and rewards) must be known for exact DP. For unknown models, switch to RL.


Python: value iteration

import numpy as np

def value_iteration(n_states, n_actions, transitions, rewards, gamma=0.9, tol=1e-6):
    """
    transitions[s, a] = next_state (deterministic)
    rewards[s, a] = immediate reward
    """
    V = np.zeros(n_states)
    
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            values = []
            for a in range(n_actions):
                s_next = transitions[s, a]
                values.append(rewards[s, a] + gamma * V[s_next])
            V_new[s] = max(values)
        
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    
    # Extract policy
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        values = []
        for a in range(n_actions):
            s_next = transitions[s, a]
            values.append(rewards[s, a] + gamma * V[s_next])
        policy[s] = np.argmax(values)
    
    return V, policy

# Example: 3-state MDP from Example 1
# States: A=0, B=1, C=2. Actions: left=0, right=1.
trans = np.array([[0, 1], [0, 2], [1, 2]])  # transitions[s,a] = next state
rew = np.array([[2, 5], [1, 3], [4, 1]])    # rewards[s,a]

V, pi = value_iteration(3, 2, trans, rew, gamma=0.9)
actions = ["left", "right"]
for s, name in enumerate(["A", "B", "C"]):
    print(f"V({name}) = {V[s]:.2f}, policy: {actions[pi[s]]}")

What comes next

Policy gradient methods optimize the policy parameters using stochastic gradients. This connects DP to the world of stochastic gradient descent and its variants, where we will cover SGD, momentum, RMSProp, Adam, and learning rate schedules for training machine learning models.
