
Optimization in dynamic programming and optimal control

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Prerequisites

You should understand what an optimization problem is (see Part 1: What is optimization and why ML needs it) and basic probability, including expected values and conditional probabilities.


Sequential decisions as optimization

Most optimization problems we have studied so far are “one-shot”: pick a vector $x$, get a value $f(x)$, done. But many real problems involve making decisions over time, where each decision affects future options.

A robot navigating a maze. A trader buying and selling stocks. A doctor choosing treatments. In all these cases, the best action now depends on what might happen later. Dynamic programming (DP) is the framework for solving these problems optimally.

The core idea is deceptively simple: if you know the optimal strategy from every possible future state, you can figure out the optimal action right now by just looking one step ahead.


The Bellman equation

Consider a system with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, and a transition function. Taking action $a$ in state $s$ gives immediate reward $r(s, a)$ and moves you to a new state $s'$ according to a transition probability $P(s' \mid s, a)$.

The value function $V(s)$ is the maximum total expected reward starting from state $s$. It satisfies the Bellman equation:

$$V(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V(s') \right]$$

where $\gamma \in [0, 1)$ is the discount factor. The discount ensures the sum converges and models the preference for sooner rewards.

The Bellman equation says: the value of a state equals the best action’s immediate reward plus the discounted expected value of wherever you end up.

graph TD
  S["State s"] -->|"Action a₁"| S1["State s₁<br/>r(s, a₁) + γV(s₁)"]
  S -->|"Action a₂"| S2["State s₂<br/>r(s, a₂) + γV(s₂)"]
  S -->|"Action a₃"| S3["State s₃<br/>r(s, a₃) + γV(s₃)"]

Why this is optimization

The “max” in the Bellman equation is where the optimization happens. At every state, you solve a (usually small) optimization problem: which action gives the highest expected return? The remarkable thing about DP is that solving these local problems gives you the globally optimal strategy.
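That local optimization problem, the one-step lookahead, can be written down directly. A minimal sketch (the array layout and function name are my own, not from the article):

```python
import numpy as np

def bellman_backup(s, R, P, V, gamma=0.9):
    """One-step lookahead at state s.

    R[s, a]    : immediate reward for action a in state s
    P[s, a, t] : probability of moving from s to t under action a
    V[t]       : current value estimates
    Returns (best value, best action).
    """
    q = R[s] + gamma * P[s] @ V        # Q(s, a) for every action a at once
    return q.max(), int(q.argmax())

# Tiny 2-state, 2-action example with made-up numbers
R = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
V = np.zeros(2)
v, a = bellman_backup(0, R, P, V)      # with V = 0, only immediate reward matters
```

With $V \equiv 0$ the lookahead reduces to picking the largest immediate reward, which is the seed value iteration starts from.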


Value iteration

Value iteration is the simplest algorithm for solving the Bellman equation. Start with any initial guess $V_0(s)$ (say, all zeros) and iterate:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \right]$$

This update is a contraction mapping: $\|V_{k+1} - V^*\|_\infty \le \gamma \|V_k - V^*\|_\infty$. So value iteration converges geometrically to the true value function $V^*$, and convergence is faster when $\gamma$ is smaller.
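The geometric rate can be observed numerically. A sketch on a small random MDP (sizes and rewards are made up), tracking the sup-norm error against a near-exact $V^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9                       # made-up problem sizes
R = rng.uniform(0.0, 1.0, (nS, nA))             # random rewards
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'

def backup(V):
    # One sweep of the Bellman update over all states at once
    return (R + gamma * P @ V).max(axis=1)

# Iterate long enough that the result is numerically V*
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = backup(V_star)

# Track the sup-norm error of value iteration started from V0 = 0
V, errs = np.zeros(nS), []
for _ in range(20):
    V = backup(V)
    errs.append(np.abs(V - V_star).max())

ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
# Every ratio is at most gamma = 0.9, confirming the geometric rate
```

Plotting `errs` on a log scale would show a straight line with slope $\log \gamma$.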


Example 1: A 3-state MDP

Setup: Three states $\{A, B, C\}$. Two actions: “left” and “right.” Deterministic transitions (for simplicity). Discount factor $\gamma = 0.9$.

| State | Action | Next state | Reward |
|-------|--------|------------|--------|
| A     | left   | A          | 2      |
| A     | right  | B          | 5      |
| B     | left   | A          | 1      |
| B     | right  | C          | 3      |
| C     | left   | B          | 4      |
| C     | right  | C          | 1      |

Value iteration, starting with $V_0(A) = V_0(B) = V_0(C) = 0$:

Iteration 1:

$$V_1(A) = \max\{r(A, \text{left}) + 0.9\,V_0(A), \; r(A, \text{right}) + 0.9\,V_0(B)\} = \max\{2 + 0.9(0), \; 5 + 0.9(0)\} = \max\{2, 5\} = 5$$

Best action at $A$: right.

$$V_1(B) = \max\{1 + 0.9(0), \; 3 + 0.9(0)\} = \max\{1, 3\} = 3$$

Best action at $B$: right.

$$V_1(C) = \max\{4 + 0.9(0), \; 1 + 0.9(0)\} = \max\{4, 1\} = 4$$

Best action at $C$: left.

Iteration 2:

$$V_2(A) = \max\{2 + 0.9(5), \; 5 + 0.9(3)\} = \max\{6.5, 7.7\} = 7.7$$

Best action at $A$: right (go to $B$).

$$V_2(B) = \max\{1 + 0.9(5), \; 3 + 0.9(4)\} = \max\{5.5, 6.6\} = 6.6$$

Best action at $B$: right (go to $C$).

$$V_2(C) = \max\{4 + 0.9(3), \; 1 + 0.9(4)\} = \max\{6.7, 4.6\} = 6.7$$

Best action at $C$: left (go to $B$).

Iteration 3:

$$V_3(A) = \max\{2 + 0.9(7.7), \; 5 + 0.9(6.6)\} = \max\{8.93, 10.94\} = 10.94$$
$$V_3(B) = \max\{1 + 0.9(7.7), \; 3 + 0.9(6.7)\} = \max\{7.93, 9.03\} = 9.03$$
$$V_3(C) = \max\{4 + 0.9(6.6), \; 1 + 0.9(6.7)\} = \max\{9.94, 7.03\} = 9.94$$
| State | $V_0$ | $V_1$ | $V_2$ | $V_3$ |
|-------|-------|-------|-------|-------|
| A     | 0     | 5     | 7.7   | 10.94 |
| B     | 0     | 3     | 6.6   | 9.03  |
| C     | 0     | 4     | 6.7   | 9.94  |

The values are growing because rewards accumulate over the infinite horizon. The policy is stabilizing: $A \to \text{right}$, $B \to \text{right}$, $C \to \text{left}$. This means the optimal cycle is $A \to B \to C \to B \to C \to \cdots$, collecting reward 5 on the first step and then $3, 4, 3, 4, \dots$ forever.
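We can sanity-check this by summing the discounted rewards along that cycle directly; a truncated sum stands in for the infinite one, and its total is the limit that the $V_k(A)$ column is climbing toward:

```python
gamma = 0.9

# Discounted return of the cycle A -> B -> C -> B -> ... :
# reward 5 on the first step, then 3 and 4 alternating forever.
# Truncating at t = 500 is harmless since gamma**500 is negligible.
total = 5.0
for t in range(1, 500):
    total += gamma ** t * (3.0 if t % 2 == 1 else 4.0)

print(round(total, 2))   # 36.26
```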

Value function $V(s)$ after convergence.


Policy iteration

Value iteration updates values. Policy iteration alternates between two steps:

  1. Policy evaluation: Given a fixed policy $\pi$, compute $V^\pi(s)$ by solving the linear system:
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$

This is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns.

  2. Policy improvement: Update the policy:
$$\pi'(s) = \arg\max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^\pi(s') \right]$$

If $\pi' = \pi$, the policy is optimal. Otherwise, repeat with $\pi'$.

Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of possible policies), but in practice it converges in very few iterations, often 3 to 5.
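The two alternating steps can be sketched for deterministic transitions, solving the evaluation system exactly with numpy (the array layout mirrors the value-iteration code later in this article):

```python
import numpy as np

def policy_iteration(trans, rew, gamma=0.9):
    """trans[s, a] = next state (deterministic), rew[s, a] = reward."""
    n = trans.shape[0]
    pi = np.zeros(n, dtype=int)              # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P = np.zeros((n, n))
        P[np.arange(n), trans[np.arange(n), pi]] = 1.0
        V = np.linalg.solve(np.eye(n) - gamma * P, rew[np.arange(n), pi])
        # Policy improvement: greedy one-step lookahead
        Q = rew + gamma * V[trans]           # Q[s, a]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi                     # policy is stable, hence optimal
        pi = pi_new

# The 3-state MDP from Example 1: states A=0, B=1, C=2; actions left=0, right=1
trans = np.array([[0, 1], [0, 2], [1, 2]])
rew = np.array([[2, 5], [1, 3], [4, 1]])
V, pi = policy_iteration(trans, rew)
```

Starting from the all-left policy, this stabilizes after two improvement steps at right/right/left, matching Example 2 below.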


Example 2: Policy evaluation by solving a linear system

Using the MDP from Example 1, suppose the current policy is: $\pi(A) = \text{right}$, $\pi(B) = \text{right}$, $\pi(C) = \text{left}$.

The policy evaluation equations ($\gamma = 0.9$):

$$V(A) = 5 + 0.9\,V(B)$$
$$V(B) = 3 + 0.9\,V(C)$$
$$V(C) = 4 + 0.9\,V(B)$$

From equations 2 and 3, substitute:

$$V(B) = 3 + 0.9(4 + 0.9\,V(B)) = 3 + 3.6 + 0.81\,V(B) = 6.6 + 0.81\,V(B)$$
$$0.19\,V(B) = 6.6 \quad\Rightarrow\quad V(B) = 6.6 / 0.19 \approx 34.74$$
$$V(C) = 4 + 0.9(34.74) \approx 35.26$$
$$V(A) = 5 + 0.9(34.74) \approx 36.26$$

Now check policy improvement. For state $A$:

  • Left: $2 + 0.9(36.26) = 2 + 32.63 = 34.63$
  • Right: $5 + 0.9(34.74) = 5 + 31.26 = 36.26$

Right is still better. ✓

For state $B$:

  • Left: $1 + 0.9(36.26) = 1 + 32.63 = 33.63$
  • Right: $3 + 0.9(35.26) = 3 + 31.74 = 34.74$

Right is still better. ✓

For state $C$:

  • Left: $4 + 0.9(34.74) = 4 + 31.26 = 35.26$
  • Right: $1 + 0.9(35.26) = 1 + 31.74 = 32.74$

Left is still better. ✓

The policy is unchanged, so it is optimal. The optimal values are $V^*(A) = 36.26$, $V^*(B) = 34.74$, $V^*(C) = 35.26$.
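The hand substitution can be cross-checked by solving the $3 \times 3$ system in matrix form, writing the evaluation equations as $(I - \gamma P_\pi) V = r_\pi$:

```python
import numpy as np

gamma = 0.9
# Under pi: A -> B (reward 5), B -> C (reward 3), C -> B (reward 4)
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
r_pi = np.array([5.0, 3.0, 4.0])

# V = r + gamma * P V  <=>  (I - gamma * P) V = r
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.round(V, 2))   # [36.26 34.74 35.26]
```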


Example 3: Shortest path as a DP problem

Problem: Find the shortest path from node $S$ to node $T$ in this graph:

graph LR
  S -->|"4"| A
  S -->|"2"| B
  A -->|"5"| T
  A -->|"1"| B
  B -->|"8"| T
  B -->|"3"| A

Edge weights represent costs. This is a deterministic DP problem with $\gamma = 1$ (no discounting), and we minimize cost instead of maximizing reward.

Bellman equation (cost minimization):

$$V(s) = \min_{a} \left[ c(s, a) + V(\text{next}(s, a)) \right], \quad V(T) = 0$$

Value iteration (backwards from $T$):

$V(T) = 0$.

For nodes one step from $T$:

$$V(A) = \min\{c(A, T) + V(T), \; c(A, B) + V(B)\}$$

We do not know $V(B)$ yet, so let us iterate.

Iteration 0: $V_0(S) = V_0(A) = V_0(B) = \infty$, $V_0(T) = 0$.

Iteration 1:

$$V_1(A) = \min\{5 + 0, \; 1 + \infty\} = 5$$
$$V_1(B) = \min\{8 + 0, \; 3 + \infty\} = 8$$
$$V_1(S) = \min\{4 + \infty, \; 2 + \infty\} = \infty$$

With simultaneous (Jacobi-style) updates, finite values propagate only one edge per iteration, which is why $V_1(S)$ is still infinite. (A Gauss-Seidel sweep that reuses updated values within the same iteration would propagate faster.) Continuing with simultaneous updates:

Iteration 2:

$$V_2(A) = \min\{5 + 0, \; 1 + 8\} = \min\{5, 9\} = 5$$
$$V_2(B) = \min\{8 + 0, \; 3 + 5\} = \min\{8, 8\} = 8$$
$$V_2(S) = \min\{4 + 5, \; 2 + 8\} = \min\{9, 10\} = 9$$

Iteration 3:

$$V_3(A) = \min\{5, \; 1 + 8\} = 5$$
$$V_3(B) = \min\{8, \; 3 + 5\} = 8$$
$$V_3(S) = \min\{4 + 5, \; 2 + 8\} = 9$$

Converged. Optimal values: $V(S) = 9$, $V(A) = 5$, $V(B) = 8$.

Optimal path: From $S$, action “go to $A$” gives $4 + 5 = 9$ (better than “go to $B$,” which gives $2 + 8 = 10$). From $A$, action “go to $T$” gives $5 + 0 = 5$ (better than “go to $B$,” which gives $1 + 8 = 9$).

Shortest path: $S \to A \to T$ with cost 9.

Note: the path $S \to B \to A \to T$ has cost $2 + 3 + 5 = 10$. The path $S \to A \to B \to T$ has cost $4 + 1 + 8 = 13$. DP finds the global optimum without enumerating all paths.
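The same iteration can be sketched in a few lines of code, with min in place of max and $\gamma = 1$, using in-place (Gauss-Seidel) sweeps so finite values propagate quickly:

```python
INF = float("inf")

# edges[node] = list of (next_node, cost); T is the terminal
edges = {
    "S": [("A", 4), ("B", 2)],
    "A": [("T", 5), ("B", 1)],
    "B": [("T", 8), ("A", 3)],
    "T": [],
}

V = {n: INF for n in edges}
V["T"] = 0

for _ in range(len(edges)):          # |nodes| sweeps suffice on this graph
    for n, out in edges.items():
        if out:                      # Bellman update: best edge out of n
            V[n] = min(c + V[m] for m, c in out)

print(V["S"], V["A"], V["B"])        # 9 5 8
```

This is essentially the Bellman-Ford relaxation scheme, which works here because the graph has no negative-cost cycles.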


Optimal control

Optimal control is the continuous-time version of DP. Instead of discrete states and actions, you have a continuous state $x(t)$ evolving according to a differential equation:

$$\dot{x}(t) = f(x(t), u(t))$$

where $u(t)$ is the control input. The goal is to minimize a cost:

$$J = \int_0^T L(x(t), u(t)) \, dt + \Phi(x(T))$$

The continuous Bellman equation becomes the Hamilton-Jacobi-Bellman (HJB) equation:

$$-\frac{\partial V}{\partial t}(x, t) = \min_{u} \left[ L(x, u) + \nabla_x V(x, t)^T f(x, u) \right]$$

This PDE is generally hard to solve, but for linear dynamics with quadratic cost (the LQR problem), it reduces to a matrix Riccati equation with a clean closed-form solution.
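A discrete-time analogue makes this concrete: for dynamics $x_{k+1} = Ax_k + Bu_k$ and cost $\sum_k (x_k^T Q x_k + u_k^T R u_k)$, the Riccati recursion runs backward in time. A sketch with made-up matrices (a rough double integrator):

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # made-up discrete-time dynamics
B = np.array([[0.0], [0.1]])             # control pushes on the velocity
Q = np.eye(2)                             # state cost
R = np.array([[0.1]])                     # control cost

P = Q.copy()                              # terminal cost-to-go: x^T P x
for _ in range(500):                      # backward Riccati recursion
    # Optimal gain for this stage: u = -K x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

# After many steps P converges, giving a stationary linear feedback u_k = -K x_k
```

The fixed point of this recursion is the discrete algebraic Riccati equation; the continuous-time HJB version has the same structure with the recursion replaced by an ODE in $P(t)$.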


Connection to reinforcement learning

Reinforcement learning (RL) solves DP problems when you do not know the transition probabilities $P(s' \mid s, a)$ or the reward function $r(s, a)$. Instead of computing expectations analytically, you learn from experience.

| DP concept         | RL equivalent       |
|--------------------|---------------------|
| Value iteration    | Q-learning          |
| Policy evaluation  | TD learning         |
| Policy improvement | Policy gradient     |
| Known model        | Model-free learning |

Policy gradient as optimization

In policy gradient methods, you parameterize the policy as $\pi_\theta(a \mid s)$ and optimize the expected return:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]$$

The gradient of this objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t \nabla_\theta \ln \pi_\theta(a_t \mid s_t) \, G_t \right]$$

where $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$ is the return from time $t$. This is the REINFORCE estimator.

You then apply stochastic gradient ascent to improve θ\theta. Modern RL algorithms like PPO and SAC add various tricks (baselines, clipping, entropy regularization), but at their core they are doing stochastic optimization of the policy parameters.
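On a one-step bandit (a single state, so $G_t$ is just the immediate reward) the REINFORCE estimator can be checked against the exact gradient. A sketch with a softmax policy and made-up rewards and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 3.0, 2.0])          # made-up reward for each of 3 actions
theta = np.array([0.1, -0.2, 0.3])     # made-up policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) r(a), using grad log pi(a) = e_a - pi
exact = sum(pi[a] * r[a] * (np.eye(3)[a] - pi) for a in range(3))

# REINFORCE: sample actions from pi, average r(a) * grad log pi(a)
N = 200_000
actions = rng.choice(3, size=N, p=pi)
onehot = np.eye(3)[actions]            # one-hot encoding of sampled actions
estimate = (r[actions][:, None] * (onehot - pi)).mean(axis=0)
# estimate agrees with exact up to Monte Carlo noise
```

The estimator is unbiased but noisy, which is why practical algorithms subtract a baseline from the return before weighting the score function.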


Curse of dimensionality

The main limitation of exact DP is that the state space must be enumerable. With $n$ continuous state variables, each discretized into $k$ bins, you get $k^n$ states. This exponential blowup is the curse of dimensionality (a term coined by Richard Bellman himself).

Solutions:

  • Function approximation: Represent $V(s)$ with a neural network instead of a table. This is deep RL.
  • Linear function approximation: $V(s) \approx \phi(s)^T w$ for features $\phi(s)$.
  • Monte Carlo tree search: Only explore promising parts of the state space.
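A small sketch of the linear case: semi-gradient TD(0) evaluating the random policy on a 5-state random walk (a standard textbook setup, not from this article). States 1 through 5 sit between terminals 0 and 6, with reward +1 only for exiting on the right; the true values $V(s) = s/6$ are exactly linear in the single feature $\phi(s) = s$, so the best weight is $w = 1/6$:

```python
import numpy as np

rng = np.random.default_rng(1)

w, alpha, gamma = 0.0, 0.005, 1.0
w_tail = []

for episode in range(10_000):
    s = 3                                    # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])     # random policy: step left or right
        reward = 1.0 if s_next == 6 else 0.0
        v_next = 0.0 if s_next in (0, 6) else w * s_next
        # Semi-gradient TD(0): w += alpha * (TD error) * grad_w V(s), grad = phi(s) = s
        w += alpha * (reward + gamma * v_next - w * s) * s
        s = s_next
    if episode >= 5_000:                     # average the tail to smooth SGD noise
        w_tail.append(w)

w_avg = float(np.mean(w_tail))               # close to 1/6 ~ 0.167
```

One weight summarizes five values; in deep RL the same update runs with a neural network in place of $w \cdot s$.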

When to use DP

✓ Problems with natural sequential structure (routing, scheduling, inventory management).

✓ Small discrete state spaces where exact solutions are feasible.

✓ As the conceptual foundation for RL algorithms that handle large state spaces.

⚠ Exact DP is impractical for continuous or very large state spaces.

⚠ The model (transitions and rewards) must be known for exact DP. For unknown models, switch to RL.


Python: value iteration

import numpy as np

def value_iteration(n_states, n_actions, transitions, rewards, gamma=0.9, tol=1e-6):
    """
    transitions[s, a] = next_state (deterministic)
    rewards[s, a] = immediate reward
    """
    V = np.zeros(n_states)
    
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            values = []
            for a in range(n_actions):
                s_next = transitions[s, a]
                values.append(rewards[s, a] + gamma * V[s_next])
            V_new[s] = max(values)
        
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    
    # Extract policy
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        values = []
        for a in range(n_actions):
            s_next = transitions[s, a]
            values.append(rewards[s, a] + gamma * V[s_next])
        policy[s] = np.argmax(values)
    
    return V, policy

# Example: 3-state MDP from Example 1
# States: A=0, B=1, C=2. Actions: left=0, right=1.
trans = np.array([[0, 1], [0, 2], [1, 2]])  # transitions[s,a] = next state
rew = np.array([[2, 5], [1, 3], [4, 1]])    # rewards[s,a]

V, pi = value_iteration(3, 2, trans, rew, gamma=0.9)
actions = ["left", "right"]
for s, name in enumerate(["A", "B", "C"]):
    print(f"V({name}) = {V[s]:.2f}, policy: {actions[pi[s]]}")

What comes next

Policy gradient methods optimize the policy parameters using stochastic gradients. This connects DP to the world of stochastic gradient descent and its variants, where we will cover SGD, momentum, RMSProp, Adam, and learning rate schedules for training machine learning models.
