# From Dynamic Programming to Reinforcement Learning

Overview

• Markov Process, Markov Reward Process, and Markov Decision Process
• Dynamic Programming
• Approximate Dynamic Programming
• Reinforcement Learning

## 1 Markov Process, Markov Reward Process, and Markov Decision Process

Five key elements:

1. state ($s_t\in S$)
2. action ($a_t\in A$)
3. state transition probability ($p(s_{t+1}|s_t,a_t)$; one step transition matrix $F$)
4. reward function ($r$)
5. discounting factor ($\beta$)

Markov Property of States:

• The probability of next state depends only on current state, i.e., the future is independent of the past given the present.

Markov Process:

• A countable set of states
• Probability Distribution of Start States: $S\to [0,1]$
• Markov Property of States
• Transition Probability $P$
• Default: Discrete-Time, Countable-States Stationary Markov Process

Markov Reward Process:

• A Markov Process, along with a time-indexed sequence of reward random variables $r_t$

Markov Decision Process:

• A mathematical model of a decision maker who is in state $s_t$ at time $t=1,\cdots,T$, and takes an action $a_t$ that determines the current payoff $r(s_t,a_t)$ and affects the distribution of next period’s state $s_{t+1}$ via a Markov transition probability $p(s_{t+1}|s_t,a_t)$.

Policy
A policy $\pi$ is a mapping: $S\to A$.

Value Function
$V^{\pi}$: value function for policy $\pi$

• For a policy $\pi$, $V^{\pi}: S\to R$ is such that $V^{\pi}(s)$ is the expected total future payoff for starting in state $s$ and executing policy $\pi$.

$V^{*}$ is the optimal value function, i.e., $V^{*}(s)=\underset{\pi}{\max}V^{\pi}(s)$

### Bellman equations (very important!)

According to the definition of $V^{\pi}(s)$,

$V^{\pi}(s)=E[r(s_0)+\beta r(s_1)+\beta^2 r(s_2)+\cdots|\pi, s_0=s]$
$\Longrightarrow V^{\pi}(s)=E[r(s_0)+\beta V^{\pi}(s_1)|\pi, s_0=s]$

Since $s_1\sim p(s_1|s_0, \pi)$, we get $V^{\pi}(s)=r(s)+\beta \sum_{s'}p(s'|s,\pi)V^{\pi}(s')$

• Given a policy $\pi$, we have a system of $|S|$ linear equations. Solving these equations, we can derive the value function $V^{\pi}(s)$.
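Since the system is linear, policy evaluation is a single linear solve. A minimal sketch, assuming a made-up 2-state chain (all numbers are illustrative):

```python
import numpy as np

# Policy evaluation as one linear solve: V = r + beta * P V, i.e.
# (I - beta * P) V = r. The 2-state chain below is made up for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])     # P[s, s'] = p(s' | s, pi(s)) under the policy
r = np.array([1.0, 0.0])       # r(s)
beta = 0.95

V = np.linalg.solve(np.eye(2) - beta * P, r)
```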

### Optimal Policy

The optimal policy is the policy that achieves the highest value for every state, $\pi^* = \underset{\pi}{\text{argmax}}\,V^{\pi}(s)$, and its value function is written as $V^*$.

• There are an exponential number of policies ($|A|^{|S|}$). Hence, we do not use this formulation to find the optimal policy.

Using the Bellman equation, we can directly define the optimal value function
$V^*(s)=r(s)+ \beta\underset{a}{\max}\sum_{s'}p(s'|s,a)V^*(s')$, and the optimal policy is simply the action that attains this max: $\pi^*(s)=\underset{a}{\text{argmax}}\sum_{s'}p(s'|s,a)V^*(s')$

## 2 Dynamic Programming

### 2.1 General Introduction

• Original use of DP term: MDP theory and solution methods
• Bellman referred to DP as the Principle of Optimality
• Later, the usage of the term DP diffused out to other algorithms
• In CS, it means “recursive algorithms with overlapping subproblems”
• In general,
• DP for prediction: compute the Value Function of an MP
• DP for control: compute the Optimal Value Function of an MDP

The core of DP is to compute the Value Function.

• Value Iteration
• Policy Iteration

#### 2.1.1 Backward Induction

For finite-horizon MDPs, we can simply walk backwards, setting the value function from the end of the horizon back to the start.
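A sketch of backward induction, assuming a made-up finite-horizon MDP (states, rewards, and transitions are all illustrative):

```python
import numpy as np

# Backward induction on a made-up finite-horizon MDP with
# |S| = 2 states and |A| = 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])    # P[a, s, s'] = p(s' | s, a)
r = np.array([1.0, 0.0])                     # state reward r(s)
beta, T = 0.9, 20

V = np.zeros(2)                              # V_T = 0 at the horizon end
policies = []
for t in reversed(range(T)):
    Q = r + beta * (P @ V)                   # Q[a, s] = r(s) + beta * E[V_{t+1}]
    policies.append(Q.argmax(axis=0))        # optimal action at stage t
    V = Q.max(axis=0)                        # V_t, walking backwards
policies.reverse()                           # policies[t] is the rule at time t
```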

#### 2.1.2 Value Iteration

Repeatedly update an estimate of the optimal value function according to Bellman optimality equation:

1. Initialize an estimate of the value function arbitrarily, e.g.:
$\hat{V}(s) \leftarrow 0, \forall s \in S$
2. Repeat the update:
$\hat{V}(s)\leftarrow r(s)+ \beta\underset{a}{\max}\sum_{s'}p(s'|s,a)\hat{V}(s')$
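The two steps above can be sketched as follows, assuming a made-up 2-state, 2-action MDP (all numbers illustrative):

```python
import numpy as np

# Value iteration on a toy MDP. P[a, s, s'] = p(s' | s, a); r[s] is the
# state reward, as in the text.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([1.0, 0.0])
beta = 0.9

V = np.zeros(2)                              # arbitrary initialization
for _ in range(500):
    # Bellman optimality update: V(s) = r(s) + beta * max_a sum_s' p(s'|s,a) V(s')
    V_new = r + beta * (P @ V).max(axis=0)
    if np.abs(V_new - V).max() < 1e-10:      # stop once the update has converged
        V = V_new
        break
    V = V_new
```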
##### Theorem of Convergence
• Theorem: Value iteration converges to optimal value: $\hat{V}\to V^*$.
• Proof: (key point: Bellman operator is a contraction mapping)
##### How many iterations will it take to find optimal policy?

Assume the reward per period $\in [0, R]$; then $V^*(s)\leq \sum_{t=0}^{\infty}\beta^{t}R=\frac{R}{1-\beta}.$

Then, letting $\hat{V}^k$ be the estimate after the $k$th iteration,
$\underset{s\in S}{\max}|\hat{V}^k(s)-V^*(s)|\leq \frac{\beta^k R}{1-\beta}.$

Thus, we have linear convergence to optimal value function

#### 2.1.3 Policy Iteration

1. Initialize policy $\hat{\pi}$ randomly.
2. Compute the value of the policy, $V^{\hat{\pi}}$.
3. Update $\hat{\pi}$ to be the greedy policy with respect to $V^{\hat{\pi}}$:
$\hat{\pi}(s)\leftarrow\underset{a}{\text{argmax}}\sum_{s'} p(s'|s,a)V^{\hat{\pi}}(s')$
4. If the policy $\hat{\pi}$ changed in the last iteration, return to step 2.
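Steps 1–4 can be sketched as follows, again assuming a made-up 2-state, 2-action MDP:

```python
import numpy as np

# Policy iteration on a toy MDP: evaluate (linear solve), then improve
# (greedy step), until the policy stops changing.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])    # P[a, s, s']
r = np.array([1.0, 0.0])
beta = 0.9

pi = np.zeros(2, dtype=int)                  # step 1: arbitrary initial policy
while True:
    # step 2: policy evaluation -- solve the linear Bellman system for V^pi
    P_pi = P[pi, np.arange(2)]               # P_pi[s, s'] = p(s' | s, pi(s))
    V = np.linalg.solve(np.eye(2) - beta * P_pi, r)
    # step 3: greedy improvement
    pi_new = (P @ V).argmax(axis=0)
    if np.array_equal(pi_new, pi):           # step 4: stop when policy is stable
        break
    pi = pi_new
```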
##### Theorem of Convergence
• Theorem: Policy iteration converges to optimal policy: $\hat{\pi}\to \pi^*$.
• Policy iteration typically takes fewer iterations to find the optimal policy than value iteration.

### 2.2 SingleAgent Models – Rust’s Model (Rust 1987)

• Nested Fixed Point Algorithm
• Hotz–Miller’s CCP Method
• Recursive CCP Estimation (NPL algorithm)
• Constrained Optimization

#### 2.2.1 The Set-Up

At each period, the decision maker decides whether to replace a bus. The decision maker chooses an optimal replacement policy to minimize the cumulative discounted cost.

State
State $x_t$, refers to the mileage of a bus.

• For computational reasons, Rust discretizes the state space into 90 intervals (an interval covers 5000 miles), i.e., the state space is discrete.

Action
Action $a_t\in \{0,1\}$, where

• $a_t=1$ means replacing the engine,
• $a_t=0$ means keeping the engine and performing normal maintenance.

State Transition Probability
The state transition probability:
$p(x_{t+1} | x_{t} , a_{t} , \theta_{3}) = \left\{ \begin{array}{l l} g(x_{t+1} - x_{t}, \theta_{3}) & \quad \text{if } a_t = 0\\ g(x_{t+1} - 0, \theta_{3}) & \quad \text{if } a_t = 1 \end{array} \right.$

• Rust found that the increase in the mileage of a bus within one period is no more than 2 intervals. Hence, we form the transition probability as
• $\begin{cases}p(x_{t+1}-x_t=0|x_t, a_t, \theta_3) = \theta_{30}\\p(x_{t+1}-x_t=1|x_t, a_t, \theta_3) = \theta_{31}\\p(x_{t+1}-x_t=2|x_t, a_t, \theta_3) = 1-\theta_{30}-\theta_{31}\end{cases}$
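As a sketch, the implied transition matrices for the keep and replace decisions can be assembled as follows (the $\theta_3$ values here are made up, not Rust's estimates):

```python
import numpy as np

# Rust-style mileage transitions on a 90-bin grid: the engine ages by
# 0, 1, or 2 bins per period; replacement resets mileage to bin 0.
N, th30, th31 = 90, 0.3, 0.6                 # illustrative theta_3 values
th32 = 1.0 - th30 - th31

# Keep (a = 0): move up by 0/1/2 bins from the current bin.
F0 = np.zeros((N, N))
for x in range(N):
    for jump, p in enumerate((th30, th31, th32)):
        F0[x, min(x + jump, N - 1)] += p     # mass piles up in the last bin

# Replace (a = 1): mileage resets to 0, then transitions as from bin 0.
F1 = np.tile(F0[0], (N, 1))
```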

Reward Function
The per-period reward function:
$\small r(x_{t}, a_t, \theta_{1}) = \left\{ \begin{array}{l l} \quad u(x_t,0, \theta_1) + \epsilon_t(0) &= -c(x_t, \theta_1) + \epsilon_t(0) & \quad \text{if } a_t = 0\\ u(x_t,1, \theta_1) + \epsilon_t(1) &= -RC -c(0, \theta_1) + \epsilon_t(1) & \quad \text{if } a_t = 1 \end{array} \right.$

• $RC$: the net costs of replacing an engine,

• $c(x_t, \theta_1)$: regular maintenance costs,

• $\epsilon_t$: payoff shocks.

• $x_t$ is observable to both agent and econometrician, but $\epsilon_t$ is only observable to the agent.

• $\epsilon$ is necessary for a coherent model, for sometimes we observe the agent making different decisions for the same value of $x$.

Decision Process
The agent must choose a sequence of decision rules to maximize expected discounted utility over an infinite horizon.
$V^*(x_t, \epsilon_t)=\underset{\pi}{\sup}\,E[\sum_{j=t}^{\infty}\beta^{j-t}r(x_j, a_j, \epsilon_j)|\pi, x_t, \epsilon_t]$ where the expectation is taken with respect to the stochastic process $\{(x_j,\epsilon_j):j=t,\cdots,\infty\}$.

Under certain regularity assumptions described in Rust (1987), the solution to this problem is given by a stationary decision rule, and $V_{\theta}^*$ is the unique solution to Bellman’s equation:
$V^*(x_t, \epsilon_t) = \underset{a_t\in A}{\max}\{r(x_{t}, a_t, \epsilon_t) + \beta E[V^*(x_{t+1},\epsilon_{t+1})|x_t, \epsilon_t,a_t]\} \hspace{1em}(1)$

where $E[V^*(x_{t+1},\epsilon_{t+1})|x_t, \epsilon_t,a_t]=\int_{x'}\int_{\epsilon'}V^*(x', \epsilon')p(x', \epsilon'|x_t, a_t,\epsilon_t)d\epsilon'dx',$ i.e., the expectation is taken over next period’s states $(x', \epsilon')$.

Solving Equation (1) is very hard. Thus, Rust made three assumptions:

• Additive separability ($\epsilon$ enters the payoff additively)
• Conditional independence
• Type 1 extreme value distribution

Let $EV^*(x,a)=E[V^*(x',\epsilon')|x,a]$. We can express the Bellman equation in the expected value function space. $EV^*(x,a)$ has lower dimension than $V^*(x,\epsilon)$, since it does not depend on $\epsilon$.
$EV^*(x,a)=\int_{x'}\int_{\epsilon'}V^*(x',\epsilon')q(\epsilon'|x')p(x'|x,a)dx'd\epsilon'$ where $V^*(x',\epsilon')=\underset{a'\in A}{\max}\{u(x',a')+\beta EV^*(x',a')+\epsilon'(a')\}$

Let $\tilde{V}^*(x)=E[V^*(x,\epsilon)|x]$, which denotes the integrated (or smooth) value function. We can express the Bellman equation in the integrated value function space. $\tilde{V}^*(x)$ has even lower dimension, since it depends on neither $\epsilon$ nor $a$.
$\tilde{V}^*(x)=\int_{\epsilon}V^*(x,\epsilon)q(\epsilon|x)d\epsilon$ where $V^*(x,\epsilon)=\underset{a \in A}{\max}\{u(x,a)+\epsilon(a)+\beta\int_{x'}\tilde{V}^*(x')p(x'|x,a)dx'\}$
Since $\epsilon\sim$ I.I.D. standard Gumbel, $\tilde{V}^*(x)$ can be expressed as the “log-sum”.
$\tilde{V}^*(x)=log\sum_{a\in A}exp(v^*(x,a))$ where $v^*(x,a)=u(x,a)+\beta EV^*(x,a)$.

Return to $EV^*$. Substituting the log-sum expression for $\int_{\epsilon'}V^*(x',\epsilon')q(\epsilon'|x')d\epsilon'$, we get $\color{red}EV^*(x,a)=\int_{x'}log\sum_{a'\in A}exp(u(x',a')+\beta EV^*(x',a'))p(x'|x,a)dx'$

The core task of Rust (1987) is to estimate the parameters in MDP with the assumption that people make optimal decisions according to this MDP, based on people’s decision data.

Then, the question is how to write optimal decisions with respect to the states?

#### 2.2.2 Nested Fixed Point Algorithm (NFXP)

NFXP solves the unconstrained optimization problem $\underset{\theta}{\max} L(\theta, EV_{\theta})$

• Likelihood function for bus $m$
$l_m(x_1,\cdots, x_T,a_1, \cdots, a_T|x_0, a_0, \theta) =\prod_{t=1}^T p(a_t|x_t,\theta) p(x_t|x_{t-1}, a_{t-1}, \theta_3)$

• Log likelihood function for bus $m$
$L_m(\theta, EV_{\theta}) = \sum_{t=1}^T log(p(a_t|x_t,\theta)) + \sum_{t=1}^T log(p(x_t|x_{t-1}, a_{t-1}, \theta_3))$

• where $p(a|x,\theta) =E[1\{a=\underset{a'\in A}{argmax}\{v^*(x,a')+\epsilon(a')\}\}|x, \theta] \\=\frac{exp(u(x,a,\theta_1) + \beta EV_{\theta}(x,a))}{\sum_{a' \in \{0,1\}} exp(u(x,a',\theta_1) + \beta EV_{\theta}(x,a'))}$
• and $EV_{\theta}^*(x,a) = \int_{x'} ln[\sum_{a' \in \{0,1\}} exp(u(x',a',\theta_1) + \beta EV_{\theta}^*(x',a'))]p(dx'|x,a,\theta_3)$

Outer loop:

• Given $\theta$, calculate $L(\theta, EV_{\theta}^*)$
• $EV_{\theta}^*$ is required to be solved (inner loop)
• Repeat until we find the maximum (BFGS algorithm)

Inner loop:

• Use fixed point algorithm to find $EV_{\theta}^*(x,a)$ according to the relationship
$EV_{\theta}^*(x,a) = \int_{x'} ln[\sum_{a' \in \{0,1\}} exp(u(x',a',\theta_1) + \beta EV_{\theta}^*(x',a'))]p(dx'|x,a,\theta_3)$
• It is feasible since $x$ is discrete, and $a$ is discrete.
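The inner fixed-point iteration can be sketched on a small made-up mileage grid (utilities, transitions, and $\beta$ below are all illustrative, not Rust's estimates):

```python
import numpy as np

# Iterate the contraction EV = Gamma(EV) to convergence on a toy grid.
N, beta, RC = 10, 0.95, 5.0
x = np.arange(N, dtype=float)
u = np.stack([-0.2 * x, np.full(N, -RC)])    # u[a, x], a in {keep, replace}

F0 = np.diag(np.full(N - 1, 0.5), 1) + 0.5 * np.eye(N)
F0[-1, -1] = 1.0                              # absorbing last mileage bin
F1 = np.tile(F0[0], (N, 1))                   # replacement resets mileage
F = np.stack([F0, F1])                        # F[a, x, x']

EV = np.zeros((2, N))
for _ in range(2000):
    # log-sum over next-period actions, then expectation over x'
    logsum = np.log(np.exp(u + beta * EV).sum(axis=0))   # shape (N,)
    EV_new = F @ logsum                                  # shape (2, N)
    if np.abs(EV_new - EV).max() < 1e-12:
        EV = EV_new
        break
    EV = EV_new
```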

#### 2.2.3 Hotz–Miller’s CCP Method (Miller 1993)

Motivation
A disadvantage of Rust’s approach is that solving for the value function (the inner fixed point, $EV^*$) can be computationally intensive.

• $\small EV^*(x,a)=\int_{x'}log\sum_{a'\in A}exp(u(x',a')+\beta EV^*(x',a'))p(x'|x,a)dx'$

Hotz and Miller’s idea is to use observable data to form an estimate of (differences in) the value function from conditional choice probabilities (CCPs). Below, I leave out the $^{*}$ notation for simplicity.

Mapping Normalized Value Function Differences to CCP
$P(a|x)$ is uniquely determined by the vector of normalized value function differences $\tilde{v}(x,a)=v(x,a)-v(x,1)$ (assuming $A:=\{1,2,\cdots,j,\cdots,|A|\}$; WLOG, excluding the base alternative $1$).

That is, there exists a vector mapping $Q_x(\cdot)$ such that $\{P(a|x):a>1\}=Q_x(\{\tilde{v}(x,a):a>1\})$.

• Hotz and Miller prove that under CI, $Q_x(\cdot)$ is invertible.

In general,
$Q_x^j(\{\tilde{v}(x,a):a>1\})=\frac{\partial S([0,\{\tilde{v}(x,a):a>1\}],x)}{\partial \tilde{v}(x,j)}$ where $S(\{v(x,a):a\in A\},x)=\int \underset{a}{\max}\{v(x,a)+\epsilon(a)\}q(\epsilon|x)d\epsilon$ is McFadden’s social surplus function.

• Reminder: $v(x,a)=u(x,a)+\beta EV(x,a)$

Under assumption (xx), $S(\{v(x,a):a\in A \},x)$ is the well-known “log-sum” formula (also denoted as $\tilde{V}(x)$ in the note of Rust 1987). The $j$th component of $Q_x$ takes the well known logistic form (similar to $P(a|x)$).
$Q_x^j(\{\tilde{v}(x,a)\})=\frac{exp(\tilde{v}(x,j))}{1+\sum_{a=2}^{|A|}exp(\tilde{v}(x, a))}$
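In this logit case the inversion is available in closed form, $\tilde{v}(x,j)=\ln P(j|x)-\ln P(1|x)$. A small round-trip sketch, assuming made-up value differences at a fixed state $x$:

```python
import numpy as np

# Value differences vtilde(x, a) for a = 2, ..., |A| (illustrative numbers).
vtilde = np.array([0.7, -0.3, 1.2])

# Forward map Q_x: differences -> CCPs (alternative 1 normalized to 0).
expv = np.exp(np.concatenate(([0.0], vtilde)))
P = expv / expv.sum()                # P[0] is P(1|x); P[j-1] is P(j|x)

# Hotz-Miller inversion: recover the differences from the CCPs.
vtilde_rec = np.log(P[1:]) - np.log(P[0])
```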

Mapping Choice Probabilities to Smooth Value Function

Smooth value function: $\tilde{V}(x)=E[V(x,\epsilon)|x]$.
Smooth Bellman equation:
$\tilde{V}(x)=\int_{\epsilon}V(x,\epsilon)q(\epsilon|x)d\epsilon$ where $V(x,\epsilon)=\underset{a \in A}{\max}\{u(x,a)+\epsilon(a)+\beta\int_{x'}\tilde{V}(x')p(x'|x,a)dx'\}$

Re-write the smooth Bellman equation as
$\tilde{V}(x)=\sum_{a\in A}P(a|x)\left(u(x,a)+E[\epsilon(a)|x,a]+\beta\int_{x'}\tilde{V}(x')p(x'|x,a)dx'\right)$ where $E[\epsilon(a)|x,a]$ is the expectation of unobservable $\epsilon$ conditional on the optimal choice alternative $a$.
$E[\epsilon(a)|x,a]=\frac{\int_{\epsilon} \epsilon(a) 1\{\tilde{v}(x,a)+\epsilon(a)>\tilde{v}(x,j)+\epsilon(j), j\in A(x)\}q(\epsilon|x)d\epsilon}{P(a,x)}$
Under assumption (xx), we have $E[\epsilon(a)|x,a]=\gamma-ln P(a|x)$ where $\gamma=$Euler’s constant (0.5772156649…)

Using compact matrix notation, we can write
$V=\sum_{a\in A}P(a)*\left(u(a) +e(a,P)+\beta F(a)V \right)$ where $V, P(a), u(a), e(a,P)$ are all $|X|\times 1$ vectors, $*$ denotes elementwise multiplication, and $F(a)$ is the $|X|\times|X|$ matrix of conditional transition probabilities.

Thus, $V$ can be explicitly expressed by $P$:
$V=\psi(P)=[I-\beta \sum_{a\in A} P(a)*F(a)]^{-1}\sum_{a\in A}P(a)*\left(u(a)+e(a,P)\right)$
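A sketch of computing $V=\psi(P)$ by this matrix inversion, assuming $|X|=3$, two actions, and made-up utilities, transitions, and CCPs:

```python
import numpy as np

# V = psi(P): solve [I - beta * sum_a P(a)*F(a)] V = sum_a P(a)*(u(a)+e(a,P)).
N, beta, gamma = 3, 0.9, 0.5772156649        # gamma: Euler's constant
u = {0: np.array([1.0, 0.5, 0.0]), 1: np.array([0.0, 0.2, 0.8])}
F = {0: np.full((N, N), 1.0 / N), 1: np.eye(N)}
P = {0: np.array([0.6, 0.5, 0.3])}
P[1] = 1.0 - P[0]

e = {a: gamma - np.log(P[a]) for a in (0, 1)}    # E[eps(a)|x,a] under logit

A = np.eye(N) - beta * sum(P[a][:, None] * F[a] for a in (0, 1))
b = sum(P[a] * (u[a] + e[a]) for a in (0, 1))
V = np.linalg.solve(A, b)
```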

Mapping Choice Probabilities to Choice Probabilities
Recall that
$\color{red}V=\psi(P)=[I-\beta \sum_{a\in A} P(a)F(a)]^{-1}\sum_{a\in A}P(a)\left(u(a)+e(a,P)\right)$
and

$P(a|x)=\int_{\epsilon} 1\{a=\underset{a'\in A}{argmax}\{v(x,a')+\epsilon(a')\}\}q(\epsilon|x)d\epsilon$ where $v(x,a')=u(x,a')+\beta \int_{x'}{\color{red}V(x')}p(x'|x,a')dx'$

Under assumptions (xx), $P(a)=\frac{exp(u(a)+F(a){\color{red}V})}{1^T(exp(u(a)+F(a){\color{red}V}))}$

Combining these two systems, we get $P=\Psi(P)=\Lambda(\psi(P))$

Hotz-Miller Approach (CCP Estimator)
They use a two-step estimator.

1. Estimate reduced form conditional choice probabilities (CCPs) using data on $x$ and $a$. Label them $\hat{p}(a|x)$.
2. Use the Hotz–Miller inversion to map $\hat{p}(a|x)$ to estimated value function differences, and then measure the continuation values that enter the sample criterion (e.g., the likelihood).

#### 2.2.4 Recursive CCP Estimation (NPL algorithm) (Aguirregabiria and Mira 2002)

Hotz-Miller Approach:

1. Estimate reduced-form conditional choice probabilities (CCPs) using data on $x$ and $a$. Label them $\hat{p}(a|x)$.
• E.g., Logit model
2. Use the Hotz–Miller inversion to map $\hat{P}$ to an estimate of $V$. Then use $V$ to calculate $P$, and thus derive the partial likelihood $\sum_i \ln P_i(a_i|x_i)$. Maximizing $\sum_i \ln P_i(a_i|x_i)$ gives the estimated parameters.

Further iteration:

1. Given the $P^k$ obtained in the $k$th iteration, maximizing $\sum_i \ln P_i^k(a_i|x_i)$ gives estimated parameters.
2. Based on estimated parameters $\theta^{k+1}$, update $P^{k+1}=\Psi_{\theta^{k+1}} (P^k)$

Repeat the above iteration.

#### 2.2.5 Constrained Optimization (Su, C. L., & Judd, K. L. 2012)

Su and Judd (2012) propose a constrained optimization approach to solve the above problem.

MPEC

$\underset{\theta, EV}{\max} L(\theta, EV)$
subject to $EV = T(EV, \theta)$

### 2.3 Dynamic Games

• There are multiple players in a game.
• Every period $t$ these players decide simultaneously a discrete action.
• At the beginning of period $t$, a player is characterized by two vectors of state variables which affect her current utility: $x_{it}$ and $\epsilon_{it}$.
• Variables in $x_{it}$ are common knowledge for all players in the game,
• but the vector $\epsilon_{it}$ is private information of player $i$.

Algorithms

• Two-Stage Methods
• Sequential Estimation
• Method Proposed by Lee and Wolpin (LW)

### 2.4 Competitive Equilibrium Models

#### Heckman et al. (1998)

E.g., skill prices is endogenously determined in the model as an equilibrium outcome.

## 3 Approximate Dynamic Programming

DP algorithms typically sweep through all states in each iteration. We cannot do this for large finite or infinite state spaces. This motivates Approximate Dynamic Programming.

### 3.1 Introduction

Here, we describe the system using slightly different concepts.

1. being in state $S_t$, from which we take an action $a_t$,
2. and then observe new information $W_{t+1}$ ,
3. which takes us to a new state $S_{t+1}$, i.e., $S_{t+1}=S^{M}(S_t, a_t, W_{t+1}).$

We make a decision at $t$ based on what we know up to and including $t$: $a_t=A^{\pi}(S_t).$

Theoretically, we can solve $\underset{\pi}{\max}\{E[\sum_{t=0}^T \beta^{t}R_t(S_t, A^{\pi}(S_t))]\}$ by recursively computing the optimality equations (mathematically equivalent to the Bellman equations) $V_t(S_t)=\underset{a_t}{\max}\{R_t(S_t, a_t)+\beta E[V_{t+1}(S_{t+1})|S_t, a_t]\}.$

Unfortunately, most of the time we cannot solve these equations exactly, since there are three curses of dimensionality.

1. Flat representation of the state $S_t$: if $S_t$ is a multidimensional vector of discrete (or continuous) variables, it is almost intractable to enumerate all possible combinations of the state in a single list.
2. Expectation: sometimes it is impossible to compute the expectation exactly.
3. Decision $a_t$: same as $S_t$.

Approximate dynamic programming helps solve the curses of dimensionality. The key idea of approximate dynamic programming is stepping forward through time.

• In period $t$, to get $V_t(S_t)$ and choose the optimal $a_t$, we must know something about $V_{t+1}(S_{t+1})$: we use an approximation of $V_{t+1}(S_{t+1})$.
• Moreover, we can fit the value function approximation around post-decision state $S^a_t$, instead of $S_t$, to further deal with curses of dimensionality.
• And, we generate a sample of random information $W_{t+1}$, and compute the next state $S_{t+1}$.
• We repeatedly generate sample paths $\omega^n$.

Post-Decision State
The post-decision state, represented by $S^a_t$ , is the state immediately after action $a_t$, but before the arrival of new information $W_{t+1}$.
$S^a_t = S^{M, a}(S_t, a_t)$

Example: Tic-tac-toe game

• $t=0$
• Pre-decision stage, $S_0$ is an empty board.
• We make our move: $a_0$.
• The board immediately after $a_0$ is post-decision state $S^a_0$ (deterministic process).
• Our opponent makes his move (random process), and next pre-decision stage is reached $S_1$.
• $t=1$
• $\cdots$
• $\cdots$

Rewriting Optimality Equations Using Post-Decision State
If we substitute
$V_{t-1}^a(S_{t-1}^a)=E[V_t(S_t)|S_{t-1}^a]\hspace{0.5cm}\textit{\scriptsize (using approximation)}$ into $V_t(S_t)=\underset{a_t}{\max}\{R_t(S_t, a_t)+\beta V_t^a(S_t^a)\},$ we obtain $\color{blue}V_{t-1}^a(S_{t-1}^a)=E[\underset{a_t}{\max}\{R_t(S_t, a_t)+\beta V_t^a(S_t^a)\}|S_{t-1}^a]\hspace{0.5cm}\textit{\scriptsize (no expectation inside the max)}.$

Post-decision state and Q-learning: the post-decision state is exactly the key concept used in Q-learning, which focuses on learning the value of being in state $S_t$ and taking action $a_t$, denoted $Q(S_t, a_t)$.

### 3.2 Forward Dynamic Programming Using the Post-Decision State Variable

Step 0. Initialization.

• Step 0a. Initialize $\bar{V}_t^{a,0}(S_t)$ for all states $S_t$ ($\color{green}\scriptsize \text{approximation of }V_t^a$).
• Step 0b. Set $n=1$.
• Step 0c. Choose an initial state $S^1_0$.

Step 1. Choose a sample path $\omega^n$.

• Step 2. Do for $t=0, 1, 2, \cdots, T$:
• Step 2a. Solve: $\hat{v}^n_t=\underset{a_t}{max}\{R_t(S_t^n, a_t)+\beta \bar{V}_t^{a,n-1}(S^{M,a}(S_t^n, a_t))\}$ and let $a_t^n$ be the action that solves the maximization problem.
• Step 2b. Update $\bar{V}_t^{a,n-1}$ using $\bar{V}_t^{a,n}(S_t)=\begin{cases}(1-\alpha_{n-1})\bar{V}_t^{a,n-1}(S_t^{n})+\alpha_{n-1}\hat{v}^n_t & S_t=S^{n}_t \\\bar{V}_t^{a,n-1}(S_t) & \text{otherwise}\end{cases}$ ($\color{green}\scriptsize \text{only updating the values of states that we actually visit}$)
• Step 2c. Compute $S^{a,n}_t=S^{M,a}(S_t^n, a_t^n)$ and $S^n_{t+1}=S^M(S_t^n, a_t^n, W_{t+1}(\omega^n)).$

Step 3. Let $n=n+1$. If $n\leq N$, go to step 1.
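Steps 0–3 can be sketched as follows, assuming a made-up chain problem (states, rewards, and the information process $W$ are all illustrative):

```python
import numpy as np

# Forward ADP with a lookup-table approximation: state s in {0,...,4},
# action a in {0,1}, post-decision state s^a = min(s+a, 4), then random
# information W in {-1, 0} perturbs it.
rng = np.random.default_rng(0)
NS, T, N, beta = 5, 10, 200, 0.9
R = lambda s, a: float(s) - 0.5 * a          # illustrative reward
Vbar = np.zeros((T + 1, NS))                 # Step 0a: initialize approximation

for n in range(1, N + 1):                    # Step 3: outer counter
    alpha = 1.0 / n                          # declining stepsize
    s = 0                                    # Step 0c: initial state
    for t in range(T):                       # Step 2: walk forward along omega^n
        # Step 2a: optimize with the approximation at post-decision states
        vals = [R(s, a) + beta * Vbar[t, min(s + a, NS - 1)] for a in (0, 1)]
        a = int(np.argmax(vals))
        # Step 2b: smooth the observed value into the approximation,
        # only at the state actually visited
        Vbar[t, s] = (1 - alpha) * Vbar[t, s] + alpha * max(vals)
        # Step 2c: post-decision state, then sample W to get the next state
        s_post = min(s + a, NS - 1)
        s = max(s_post + int(rng.choice([-1, 0])), 0)
```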

### 3.3 Fitting a Value Function Approximation

#### 3.3.1 Multilevel Aggregation

Assume $S_t$ is multidimensional, and may include discrete, continuous, and categorical elements.

Let $S$ be the original state space, and let $S^{(g)}$ be the state space at the $g$th level of aggregation. $G^g:S\to S^{(g)}$

• $g=0$ is the most disaggregate level, but even at this level, $S^{(0)}$ is a set of discrete values.

Estimate a single value for a discrete state $s$ at each level of aggregation. Also, only update states that we actually visit.

We may

• Choose a single level of aggregation.
• Which is the right level of aggregation?
• Use a weighted sum of estimates from different levels of aggregation.

#### 3.3.2 Basis Functions

$\bar{V}(S_t|\theta)=\sum_{f\in F}\theta_f \phi_f(S_t)$

• $f$ is referred to as a feature
• $\phi_f(s)$ is a function that is believed to provide explanatory power to value
• Linear in the parameters $\theta$

The new challenge here is estimating the parameter vector. A simple strategy is to use a stochastic gradient algorithm, given by $\theta^n = \theta^{n-1}-\alpha_{n-1}(\bar{V}_t(S^n_t|\theta^{n-1})-\hat{v}^n_t(S^n_t)) {\color{green}\bigtriangledown_{\theta}\bar{V}_t(S^n_t|\theta^{n-1})}$
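A sketch of this stochastic gradient update with an illustrative quadratic basis; the target coefficients and the noiseless observations $\hat{v}^n$ are made up purely to show the update converging:

```python
import numpy as np

# Linear-in-theta approximation Vbar(s | theta) = theta . phi(s), fit by
# the stochastic gradient step from the text.
rng = np.random.default_rng(0)
phi = lambda s: np.array([1.0, s, s * s])     # illustrative basis functions
true_theta = np.array([2.0, -1.0, 0.5])       # pretend ground truth (made up)
theta = np.zeros(3)
alpha = 0.1                                   # constant stepsize for the demo

for n in range(20000):
    s = rng.uniform(0.0, 1.0)
    vhat = true_theta @ phi(s)                # sampled observation vhat^n
    # theta^n = theta^{n-1} - alpha * (Vbar(s|theta) - vhat) * grad_theta Vbar
    theta -= alpha * (theta @ phi(s) - vhat) * phi(s)
```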

#### 3.3.3 Other Methods

E.g., Neural network, K-nearest neighbor.

#### 3.3.4 Example: Rust’s model for bus replacement

Using a linear basis function to approximate the value function for the pre-decision state

• States: $s=\{\text{mileage}\}$, i.e., a continuous state
• Approximation of the value function: $\bar{V}(s) = \alpha_0 + \alpha_1 s$

Algorithm:

• Initialize $s^1$
• For $n=1,\cdots, N$:
• $\hat{v}^n(s^n)=\underset{a}{max}\{R(s^n, a)+\beta E[\bar{V}^{n-1}(s')|s^n,a]\}$
• Calculate $E[\bar{V}^{n-1}]$ by random sampling
• Update $\bar{V}^n$, based on new data point: $(s^n, \hat{v}^n(s^n))$

Using a DNN to approximate the value function for the post-decision state
Approximation of Q value: $q(s,a) = \text{DNN}(s, a)$
Algorithm:

• Initialize $s^1$
• For $n=1,\cdots, N$:
• Decide action $a^n$ by $\epsilon$-greedy policy, i.e., best action with probability $1-\epsilon$, and random action with probability $\epsilon$
• Observe the next state $s'$
• Update $q(s,a)$ with the new data point $((s^n, a^n), q(s^n, a^n))$, where $q(s^n, a^n) = R(s^n, a^n) + \beta\underset{a'}{max}\,q(s',a')$
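The same update written out with a tabular $q(s,a)$ in place of the DNN, on a made-up 5-state chain (rewards and transitions are illustrative):

```python
import numpy as np

# Q-learning with epsilon-greedy exploration on a toy chain: a = 1 moves
# right, a = 0 moves left; reward only in the last state.
rng = np.random.default_rng(0)
NS, beta, alpha, eps = 5, 0.9, 0.1, 0.2
q = np.zeros((NS, 2))
R = lambda s, a: float(s == NS - 1)

s, steps_left = int(rng.integers(NS)), 20
for _ in range(50000):
    # epsilon-greedy: best action w.p. 1 - eps, random action w.p. eps
    a = int(rng.integers(2)) if rng.random() < eps else int(q[s].argmax())
    s2 = min(s + 1, NS - 1) if a == 1 else max(s - 1, 0)
    # move q(s, a) toward the target R + beta * max_a' q(s', a')
    target = R(s, a) + beta * q[s2].max()
    q[s, a] += alpha * (target - q[s, a])
    s = s2
    steps_left -= 1
    if steps_left == 0:                      # episodic restarts so every
        s, steps_left = int(rng.integers(NS)), 20   # state gets visited
```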

## 4 Reinforcement Learning

DP and ADP assume that we know the probability model. However, we often do not have access to these probabilities in the real world. Even if we know the probability model, it can be very challenging to use, since the real-world model is too large or too complex.

Thus, we need to interact with the actual (or simulated) environment.

• That is, we obtain individual experiences of next state and reward.

### 4.1 Introduction

• RL learns Value Functions from a stream of individual experiences.
• In general,
• RL for prediction: estimate the value function of an MDP for a given policy $\pi$.
• RL for control: learn optimal value function.
• RL algorithms are mostly founded on the Bellman Equations
• RL control is based on Generalized Policy Iteration
• The key is incrementally updating Q-value function from experiences
• Q-value: the value of the combination of a state and an action, $(s,a)$.
• We need to find an appropriate approximation of Q-value function.

Value Function-based and Policy-based RL

• Value Function-based
• Learn the Value Function (with a function approximation); the policy is implicit, readily derived from the Value Function (e.g., $\epsilon$-greedy)
• Policy-based
• Learn Policy (with a function approximation)
• No need to learn a Value Function
• Actor-Critic
• Learn Policy (Actor)
• Learn Value Function (Critic)

### 4.2 Interaction With Environment

• Monte Carlo (MC)
• Trace experience
• Temporal-Difference (TD)
• Atomic experience

### 4.3 Key: Policy Gradient

• In policy iteration, we need to improve the policy by $\pi(s)=\underset{a}{\text{argmax}}\,Q(s,a)$
• How do we do argmax when action space is large or continuous?
• Idea: Do Policy Improvement step with a Gradient Ascent instead.
• Objective: expected return from the present, $J(\theta)$.
• Assuming we know the Q-values, we can sample experiences to estimate the expectation with respect to the random information.
• We can estimate the gradient of $J(\theta)$: $\nabla_{\theta}J(\theta)$

Policy gradient has many equivalent forms, and thus leads to multiple algorithms:

• REINFORCE
• Action-Value (Q) Actor-Critic
• TD Actor-Critic

Action-Value (Q) Actor-Critic
Approximation of Q value: $q(s,a;\theta_q) = \text{DNN}(s, a;\theta_q)$

• Update $\theta_q$ by TD (Temporal Difference)

Approximation of policy: $\pi(s;\theta_{\pi}) = \text{DNN}(s;\theta_{\pi})$

• Update $\theta_{\pi}$ by policy gradient
• Parametrize the policy by Gaussian distribution

Algorithm:

• Initialize $s, \theta_q, \theta_{\pi}$
• Sample $a$ from $\pi(s;\theta_{\pi})$
• For each step:
• Get reward $R(s, a)$, and transition $s'$
• Sample action $a'$ from $\pi(s';\theta_{\pi})$
• Update $\theta_q$: $\Delta \theta_q = \text{learning rate}\cdot(R(s, a)+\beta q(s',a';\theta_q)-q(s,a;\theta_q))\cdot \nabla_{\theta_q}q(s,a;\theta_q)$
• Update $\theta_{\pi}$: $\Delta \theta_{\pi} = \text{learning rate}\cdot (\nabla_{\theta_{\pi}} \log\pi(a|s;\theta_{\pi}))\cdot q(s,a;\theta_q)$
• $s \leftarrow s'$
• $a \leftarrow a'$

• Finds the best Stochastic Policy (an Optimal Deterministic Policy, produced by other RL algorithms, can be unsuitable for POMDPs)
• Naturally explores due to the Stochastic Policy representation
• Effective in high-dimensional or continuous action spaces
• Small changes in $\theta$ ⇒ small changes in $\pi$, and in the state distribution
• This avoids the convergence issues seen in argmax-based algorithms