Optimization in dynamic programming and optimal control
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Prerequisites
You should understand what an optimization problem is (from *What is optimization and why ML needs it*) and basic probability, including expected values and conditional probabilities.
Sequential decisions as optimization
Most optimization problems we have studied so far are “one-shot”: pick a vector $x$, get a value $f(x)$, done. But many real problems involve making decisions over time, where each decision affects future options.
A robot navigating a maze. A trader buying and selling stocks. A doctor choosing treatments. In all these cases, the best action now depends on what might happen later. Dynamic programming (DP) is the framework for solving these problems optimally.
The core idea is deceptively simple: if you know the optimal strategy from every possible future state, you can figure out the optimal action right now by just looking one step ahead.
The Bellman equation
Consider a system with states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$. Taking action $a$ in state $s$ gives immediate reward $r(s, a)$ and moves you to a new state $s'$ according to a transition probability $P(s' \mid s, a)$.
The value function $V(s)$ is the maximum total expected discounted reward starting from state $s$. It satisfies the Bellman equation:

$$V(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \right]$$

where $\gamma \in [0, 1)$ is the discount factor. The discount ensures the infinite sum of rewards converges and models the preference for sooner rewards.
The Bellman equation says: the value of a state equals the best action’s immediate reward plus the discounted expected value of wherever you end up.
```mermaid
graph TD
    S["State s"] -->|"Action a₁"| S1["State s₁<br/>r(s, a₁) + γV(s₁)"]
    S -->|"Action a₂"| S2["State s₂<br/>r(s, a₂) + γV(s₂)"]
    S -->|"Action a₃"| S3["State s₃<br/>r(s, a₃) + γV(s₃)"]
```
Why this is optimization
The “max” in the Bellman equation is where the optimization happens. At every state, you solve a (usually small) optimization problem: which action gives the highest expected return? The remarkable thing about DP is that solving these local problems gives you the globally optimal strategy.
Value iteration
Value iteration is the simplest algorithm for solving the Bellman equation. Start with any initial guess $V_0$ (say, all zeros) and iterate:

$$V_{k+1}(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \right]$$

The Bellman update is a contraction mapping: $\|V_{k+1} - V^*\|_\infty \le \gamma \, \|V_k - V^*\|_\infty$. So value iteration converges geometrically to the true value function $V^*$, and convergence is faster when $\gamma$ is smaller.
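You can see the contraction rate numerically. The sketch below runs Bellman backups on the 3-state MDP used in Example 1 below, first solving to high precision to get a reference $V^*$, then tracking how the error shrinks; each ratio of successive errors is at most $\gamma$:

```python
import numpy as np

# The 3-state MDP from Example 1: next_state[s, a] and reward[s, a],
# states A=0, B=1, C=2, actions left=0, right=1.
next_state = np.array([[0, 1], [0, 2], [1, 2]])
reward = np.array([[2.0, 5.0], [1.0, 3.0], [4.0, 1.0]])
gamma = 0.9

def bellman_backup(V):
    # One synchronous Bellman update: max over actions of r + gamma * V(s')
    return np.max(reward + gamma * V[next_state], axis=1)

# Solve to high precision first to get V*, then watch the error contract.
V_star = np.zeros(3)
for _ in range(2000):
    V_star = bellman_backup(V_star)

V = np.zeros(3)
errors = []
for _ in range(10):
    V = bellman_backup(V)
    errors.append(np.max(np.abs(V - V_star)))

ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
print(ratios)  # every ratio is at most gamma = 0.9
```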
Example 1: A 3-state MDP
Setup: Three states $\{A, B, C\}$. Two actions: “left” and “right.” Deterministic transitions (for simplicity). Discount factor $\gamma = 0.9$.
| State | Action | Next state | Reward |
|---|---|---|---|
| A | left | A | 2 |
| A | right | B | 5 |
| B | left | A | 1 |
| B | right | C | 3 |
| C | left | B | 4 |
| C | right | C | 1 |
Value iteration, starting with $V_0(s) = 0$ for all states:
Iteration 1:
$V_1(A) = \max(2 + 0,\ 5 + 0) = 5$. Best action at $A$: right.
$V_1(B) = \max(1 + 0,\ 3 + 0) = 3$. Best action at $B$: right.
$V_1(C) = \max(4 + 0,\ 1 + 0) = 4$. Best action at $C$: left.
Iteration 2:
$V_2(A) = \max(2 + 0.9 \cdot 5,\ 5 + 0.9 \cdot 3) = \max(6.5,\ 7.7) = 7.7$. Best action at $A$: right (go to $B$).
$V_2(B) = \max(1 + 0.9 \cdot 5,\ 3 + 0.9 \cdot 4) = \max(5.5,\ 6.6) = 6.6$. Best action at $B$: right (go to $C$).
$V_2(C) = \max(4 + 0.9 \cdot 3,\ 1 + 0.9 \cdot 4) = \max(6.7,\ 4.6) = 6.7$. Best action at $C$: left (go to $B$).
Iteration 3:
$V_3(A) = \max(2 + 0.9 \cdot 7.7,\ 5 + 0.9 \cdot 6.6) = \max(8.93,\ 10.94) = 10.94$.
$V_3(B) = \max(1 + 0.9 \cdot 7.7,\ 3 + 0.9 \cdot 6.7) = \max(7.93,\ 9.03) = 9.03$.
$V_3(C) = \max(4 + 0.9 \cdot 6.6,\ 1 + 0.9 \cdot 6.7) = \max(9.94,\ 7.03) = 9.94$.

| State | $V_0$ | $V_1$ | $V_2$ | $V_3$ |
|---|---|---|---|---|
| A | 0 | 5 | 7.7 | 10.94 |
| B | 0 | 3 | 6.6 | 9.03 |
| C | 0 | 4 | 6.7 | 9.94 |
The values are growing because the rewards accumulate over the infinite horizon. The policy is stabilizing: $\pi(A) = \text{right}$, $\pi(B) = \text{right}$, $\pi(C) = \text{left}$. This means that after the first step, the optimal behavior is to cycle $B \to C \to B \to \cdots$, collecting rewards 3 and 4 alternately.
Value function $V(s)$ after convergence: $V(A) \approx 36.26$, $V(B) \approx 34.74$, $V(C) \approx 35.26$.
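The hand computation above can be checked in a few lines of numpy (a quick sketch; `next_state` and `reward` encode the transition table from the setup):

```python
import numpy as np

# Encode the table: rows = states A, B, C; columns = actions left, right.
next_state = np.array([[0, 1], [0, 2], [1, 2]])
reward = np.array([[2.0, 5.0], [1.0, 3.0], [4.0, 1.0]])
gamma = 0.9

V = np.zeros(3)
for k in range(1, 4):
    # Synchronous Bellman backup: best action's reward plus discounted next value
    V = np.max(reward + gamma * V[next_state], axis=1)
    print(f"V_{k} =", np.round(V, 2))
# V_3 matches the table: [10.94, 9.03, 9.94]
```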
Policy iteration
Value iteration updates values. Policy iteration alternates between two steps:
- Policy evaluation: Given a fixed policy $\pi$, compute $V^\pi$ by solving the linear system:

$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$

This is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns.
- Policy improvement: Update the policy:

$$\pi_{\text{new}}(s) = \arg\max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^\pi(s') \right]$$

If $\pi_{\text{new}} = \pi$, the policy is optimal. Otherwise, repeat with $\pi \leftarrow \pi_{\text{new}}$.
Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of possible deterministic policies), but in practice it converges in very few iterations, often 3 to 5.
Example 2: Policy evaluation by solving a linear system
Using the MDP from Example 1, suppose the current policy is: $\pi(A) = \text{right}$, $\pi(B) = \text{right}$, $\pi(C) = \text{left}$.
The policy evaluation equations (with $\gamma = 0.9$):

$$V(A) = 5 + 0.9\, V(B)$$
$$V(B) = 3 + 0.9\, V(C)$$
$$V(C) = 4 + 0.9\, V(B)$$

Substituting equation 3 into equation 2:

$$V(B) = 3 + 0.9\,(4 + 0.9\, V(B)) = 6.6 + 0.81\, V(B) \quad\Rightarrow\quad V(B) = \frac{6.6}{0.19} \approx 34.74$$

Then $V(C) = 4 + 0.9 \cdot 34.74 \approx 35.26$ and $V(A) = 5 + 0.9 \cdot 34.74 \approx 36.26$.
Now check policy improvement. For state $A$:
- Left: $2 + 0.9\, V(A) = 2 + 0.9 \cdot 36.26 \approx 34.63$
- Right: $5 + 0.9\, V(B) = 5 + 0.9 \cdot 34.74 \approx 36.26$
Right is still better. ✓
For state $B$:
- Left: $1 + 0.9\, V(A) = 1 + 0.9 \cdot 36.26 \approx 33.63$
- Right: $3 + 0.9\, V(C) = 3 + 0.9 \cdot 35.26 \approx 34.74$
Right is still better. ✓
For state $C$:
- Left: $4 + 0.9\, V(B) = 4 + 0.9 \cdot 34.74 \approx 35.26$
- Right: $1 + 0.9\, V(C) = 1 + 0.9 \cdot 35.26 \approx 32.74$
Left is still better. ✓
The policy is unchanged, so it is optimal. The optimal values are $V^*(A) \approx 36.26$, $V^*(B) \approx 34.74$, $V^*(C) \approx 35.26$.
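The evaluation step can also be done in one shot as the linear solve $(I - \gamma P_\pi) V = r_\pi$. A minimal numpy sketch for the policy above (transitions are deterministic, so $P_\pi$ is a 0/1 matrix):

```python
import numpy as np

gamma = 0.9
# Under the policy (A: right, B: right, C: left), the successor of each
# state is: A -> B, B -> C, C -> B.
P_pi = np.array([[0, 1, 0],
                 [0, 0, 1],
                 [0, 1, 0]], dtype=float)
r_pi = np.array([5.0, 3.0, 4.0])  # rewards along the policy

# Solve (I - gamma * P_pi) V = r_pi for the exact policy values.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.round(V, 2))  # [36.26 34.74 35.26]
```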
Example 3: Shortest path as a DP problem
Problem: Find the shortest path from node $S$ to node $T$ in this graph:
```mermaid
graph LR
    S -->|"4"| A
    S -->|"2"| B
    A -->|"5"| T
    A -->|"1"| B
    B -->|"8"| T
    B -->|"3"| A
```
Edge weights represent costs. This is a deterministic DP problem with $\gamma = 1$ (no discounting), and we minimize cost instead of maximizing reward.
Bellman equation (cost minimization):

$$V(s) = \min_{s' \in \text{succ}(s)} \left[ c(s, s') + V(s') \right], \qquad V(T) = 0$$
Value iteration (working backwards from $T$): set $V(T) = 0$, initialize all other values to $\infty$, and apply synchronous Bellman updates.
Iteration 0: $V(S) = V(A) = V(B) = \infty$, $V(T) = 0$.
Iteration 1: $V(A) = \min(5 + 0,\ 1 + \infty) = 5$; $V(B) = \min(8 + 0,\ 3 + \infty) = 8$; $V(S)$ stays $\infty$ because no finite values have propagated to its neighbors yet. (Updating values in place, Gauss-Seidel style, would propagate faster; synchronous updates reach the same answer.)
Iteration 2: $V(A) = \min(5,\ 1 + 8) = 5$; $V(B) = \min(8,\ 3 + 5) = 8$; $V(S) = \min(4 + 5,\ 2 + 8) = 9$.
Iteration 3: no value changes.
Converged. Optimal values: $V(S) = 9$, $V(A) = 5$, $V(B) = 8$.
Optimal path: From $S$, action “go to $A$” gives $4 + 5 = 9$ (better than “go to $B$,” which gives $2 + 8 = 10$). From $A$, action “go to $T$” gives $5 + 0 = 5$ (better than “go to $B$,” which gives $1 + 8 = 9$).
Shortest path: $S \to A \to T$ with cost 9.
Note: the path $S \to B \to T$ has cost $2 + 8 = 10$. The path $S \to B \to A \to T$ has cost $2 + 3 + 5 = 10$. DP finds the global optimum without enumerating all paths.
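The same computation can be sketched in a few lines of Python, with the graph encoded as a successor dictionary and `float("inf")` standing in for “not yet reachable”:

```python
# Deterministic shortest-path DP via value iteration (gamma = 1, minimizing cost).
edges = {
    "S": [("A", 4), ("B", 2)],
    "A": [("T", 5), ("B", 1)],
    "B": [("T", 8), ("A", 3)],
    "T": [],
}

V = {node: float("inf") for node in edges}
V["T"] = 0.0
for _ in range(len(edges)):  # |nodes| synchronous sweeps suffice here
    V = {node: min((cost + V[nxt] for nxt, cost in succ), default=0.0)
         for node, succ in edges.items()}

# Recover the path greedily by following the best one-step choice.
path, node = ["S"], "S"
while node != "T":
    node = min(edges[node], key=lambda e: e[1] + V[e[0]])[0]
    path.append(node)
print(V["S"], path)  # 9.0 ['S', 'A', 'T']
```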
Optimal control
Optimal control is the continuous-time version of DP. Instead of discrete states and actions, you have a continuous state $x(t)$ evolving according to a differential equation:

$$\dot{x}(t) = f(x(t), u(t))$$

where $u(t)$ is the control input. The goal is to minimize a cost:

$$J = \int_0^T \ell(x(t), u(t)) \, dt + \Phi(x(T))$$

The continuous analogue of the Bellman equation is the Hamilton-Jacobi-Bellman (HJB) equation:

$$-\frac{\partial V}{\partial t}(x, t) = \min_{u} \left[ \ell(x, u) + \nabla_x V(x, t)^\top f(x, u) \right]$$

This PDE is generally hard to solve, but for linear dynamics with quadratic cost (the LQR problem), it reduces to a matrix Riccati equation with a clean closed-form solution.
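As an illustration, here is a sketch of the infinite-horizon LQR case using scipy (assumed available): for dynamics $\dot{x} = Ax + Bu$ and cost $\int (x^\top Q x + u^\top R u)\, dt$, the optimal value function is quadratic, $V(x) = x^\top P x$, where $P$ solves the continuous algebraic Riccati equation, and the optimal control is the linear feedback $u = -R^{-1} B^\top P x$.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Double integrator: state = (position, velocity), input = force.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state cost
R = np.array([[1.0]])  # control cost

# Solve A'P + PA - P B R^{-1} B' P + Q = 0 for P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)  # optimal feedback gain: u = -K x

# Sanity check: the Riccati residual should be (numerically) zero.
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
print(np.round(K, 3))  # K = [[1, sqrt(3)]] for this system
```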
Connection to reinforcement learning
Reinforcement learning (RL) solves DP problems when you do not know the transition probabilities $P(s' \mid s, a)$ or the reward function $r(s, a)$. Instead of computing expectations analytically, you learn from experience.
| DP concept | RL equivalent |
|---|---|
| Value iteration | Q-learning |
| Policy evaluation | TD learning |
| Policy improvement | Policy gradient |
| Known model | Model-free learning |
Policy gradient as optimization
In policy gradient methods, you parameterize the policy as $\pi_\theta(a \mid s)$ and optimize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \gamma^t \, r(s_t, a_t) \right]$$

The gradient of this objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]$$

where $G_t$ is the discounted return from time $t$ onward. This is the REINFORCE estimator.
You then apply stochastic gradient ascent to improve $\theta$. Modern RL algorithms like PPO and SAC add various tricks (baselines, clipping, entropy regularization), but at their core they are doing stochastic optimization of the policy parameters.
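To make the estimator concrete, here is a minimal REINFORCE sketch on a hypothetical two-armed bandit (a single state, so $G_t$ is just the sampled reward). A softmax policy over two arms with mean rewards 1.0 and 0.0 should shift probability mass toward the first arm; the arm names and constants are illustrative, not from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_reward = np.array([1.0, 0.0])  # arm 0 is better
theta = np.zeros(2)                 # softmax policy parameters

def policy(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

learning_rate, n_samples = 0.1, 500
for _ in range(100):
    p = policy(theta)
    grad = np.zeros(2)
    for _ in range(n_samples):
        a = rng.choice(2, p=p)
        G = mean_reward[a] + 0.1 * rng.standard_normal()  # noisy return
        score = -p.copy()
        score[a] += 1.0  # gradient of log-softmax at the sampled arm
        grad += score * G
    theta += learning_rate * grad / n_samples  # stochastic gradient ascent
print(policy(theta))  # probability of arm 0 has grown toward 1
```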
Curse of dimensionality
The main limitation of exact DP is that the state space must be enumerable. With $d$ continuous state variables, each discretized into $k$ bins, you get $k^d$ states; even $d = 6$ variables at $k = 10$ bins each already gives a million states. This exponential blowup is the curse of dimensionality (a term coined by Richard Bellman himself).
Solutions:
- Function approximation: Represent $V(s)$ with a neural network instead of a table. This is deep RL.
- Linear function approximation: $V(s) \approx w^\top \phi(s)$ for features $\phi(s)$.
- Monte Carlo tree search: Only explore promising parts of the state space.
When to use DP
✓ Problems with natural sequential structure (routing, scheduling, inventory management).
✓ Small discrete state spaces where exact solutions are feasible.
✓ As the conceptual foundation for RL algorithms that handle large state spaces.
⚠ Exact DP is impractical for continuous or very large state spaces.
⚠ The model (transitions and rewards) must be known for exact DP. For unknown models, switch to RL.
Python: value iteration
```python
import numpy as np

def value_iteration(n_states, n_actions, transitions, rewards, gamma=0.9, tol=1e-6):
    """
    transitions[s, a] = next_state (deterministic)
    rewards[s, a] = immediate reward
    """
    V = np.zeros(n_states)
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            values = []
            for a in range(n_actions):
                s_next = transitions[s, a]
                values.append(rewards[s, a] + gamma * V[s_next])
            V_new[s] = max(values)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Extract the greedy policy with respect to the converged values
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        values = []
        for a in range(n_actions):
            s_next = transitions[s, a]
            values.append(rewards[s, a] + gamma * V[s_next])
        policy[s] = np.argmax(values)
    return V, policy

# Example: 3-state MDP from Example 1
# States: A=0, B=1, C=2. Actions: left=0, right=1.
trans = np.array([[0, 1], [0, 2], [1, 2]])  # transitions[s, a] = next state
rew = np.array([[2, 5], [1, 3], [4, 1]])    # rewards[s, a]

V, pi = value_iteration(3, 2, trans, rew, gamma=0.9)
actions = ["left", "right"]
for s, name in enumerate(["A", "B", "C"]):
    print(f"V({name}) = {V[s]:.2f}, policy: {actions[pi[s]]}")
```
What comes next
Policy gradient methods optimize the policy parameters using stochastic gradients. This connects DP to the world of stochastic gradient descent and its variants, where we will cover SGD, momentum, RMSProp, Adam, and learning rate schedules for training machine learning models.