From Dynamic Programming to Reinforcement Learning


1 Markov Process, Markov Reward Process, and Markov Decision Process

Five key elements:

  1. state ($s_t\in S$)
  2. action ($a_t\in A$)
  3. state transition probability ($p(s_{t+1}|s_t,a_t)$; one-step transition matrix $F$)
  4. reward function ($r$)
  5. discount factor ($\beta$)

Markov Property of States:

Markov Process:

Markov Reward Process:

Markov Decision Process:

A policy $\pi$ is a mapping $\pi: S\to A$.

Value Function
$V^{\pi}$: value function for policy $\pi$

$V^{*}$ is the optimal value function, i.e., $V^{*}(s)=\max_{\pi}V^{\pi}(s)$.

Bellman equations (very important!)

According to the definition of $V^{\pi}(s)$,

$$V^{\pi}(s)=E[r(s_0)+\beta r(s_1)+\beta^2 r(s_2)+\cdots\,|\,\pi, s_0=s]$$ (if the sequence of states is determined and known)

$$\Longrightarrow V^{\pi}(s)=E[r(s_0)+\beta V^{\pi}(s_1)\,|\,\pi, s_0=s]$$

Since $s_1\sim P(s_1|s_0, \pi)$, we get $V^{\pi}(s)=r(s)+\beta \sum_{s'}p(s'|s,\pi)V^{\pi}(s')$.

Optimal Policy

The optimal policy is the policy that achieves the highest value for every state, $\pi^* = \text{argmax}_{\pi}\,V^{\pi}(s)$, and its value function is written as $V^*$.

Using the Bellman equation, we can directly characterize the optimal value function
$$V^*(s)=r(s)+ \beta\max_{a}\sum_{s'}p(s'|s,a)V^*(s')$$
and the optimal policy is simply the action that attains this max:
$$\pi^*(s)=\text{argmax}_{a}\sum_{s'}p(s'|s,a)V^*(s')$$

2 Dynamic Programming

2.1 General Introduction

The core of DP is to compute the value function.

2.1.1 Backward Induction

When facing finite-horizon MDPs, we can simply walk backwards, setting the value function from the horizon end back to the start.
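As a minimal sketch of backward induction (assuming a hypothetical toy MDP with state-dependent reward `r[s]` and transition matrices `P[a][s, s']`; the array layout and example numbers are illustrative, not from the text):

```python
import numpy as np

def backward_induction(r, P, beta, T):
    """Finite-horizon DP: set V_T = r at the horizon end, then walk backwards.

    r: (S,) per-period reward; P: (A, S, S) transitions P[a][s, s']; T: horizon.
    Returns V[t][s] for t = 0..T and the greedy policy for t = 0..T-1.
    """
    S = len(r)
    V = np.zeros((T + 1, S))
    policy = np.zeros((T, S), dtype=int)
    V[T] = r                              # terminal values
    for t in range(T - 1, -1, -1):        # walk backwards from the horizon end
        Q = r + beta * (P @ V[t + 1])     # Q[a, s] = r(s) + beta * E[V_{t+1} | s, a]
        V[t] = Q.max(axis=0)
        policy[t] = Q.argmax(axis=0)
    return V, policy

# Toy example: 2 states, action 0 stays put, action 1 swaps states
r = np.array([0.0, 1.0])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
V, policy = backward_induction(r, P, beta=0.9, T=2)
```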

2.1.2 Value Iteration

Repeatedly update an estimate of the optimal value function according to the Bellman optimality equation:

  1. Initialize an estimate of the value function arbitrarily:
    $\hat{V}(s) \leftarrow 0, \forall s \in S$
  2. Repeat the update:
    $\hat{V}(s) \leftarrow r(s)+ \beta\max_{a}\sum_{s'}p(s'|s,a)\hat{V}(s')$
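The two steps above can be sketched as follows (a sketch under the same hypothetical array layout as before, `r[s]` and `P[a][s, s']`, not a definitive implementation):

```python
import numpy as np

def value_iteration(r, P, beta, tol=1e-8):
    """Iterate V(s) <- r(s) + beta * max_a sum_s' p(s'|s,a) V(s') to a fixed point.

    r: (S,) reward vector; P: (A, S, S) transition matrices; beta: discount.
    """
    V = np.zeros(len(r))                        # arbitrary initialization
    while True:
        V_new = r + beta * np.max(P @ V, axis=0)  # P @ V has shape (A, S)
        if np.max(np.abs(V_new - V)) < tol:       # sup-norm stopping rule
            return V_new
        V = V_new

# Same toy MDP: optimal play moves to state 1 and stays there
r = np.array([0.0, 1.0])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
V = value_iteration(r, P, beta=0.9)   # fixed point: V = [9, 10]
```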
Theorem of Convergence
How many iterations will it take to find the optimal policy?

Assume the per-period reward lies in $[0, R]$; then $V^*(s)\leq \sum_{t=0}^{\infty}\beta^{t}R=\frac{R}{1-\beta}.$

Then, letting $\hat{V}^k$ be the value after the $k$th iteration,
$$\max_{s\in S}|\hat{V}^k(s)-V^*(s)|\leq \frac{\beta^k R}{1-\beta}.$$

Thus, we have linear convergence to the optimal value function (the error contracts by a factor of $\beta$ each iteration).

2.1.3 Policy Iteration

  1. Initialize policy $\hat{\pi}$ randomly.
  2. Compute the value of the policy, $V^{\hat{\pi}}$.
  3. Update $\hat{\pi}$ to be the greedy policy with respect to $V^{\hat{\pi}}$:
    $\hat{\pi}(s)\leftarrow\text{argmax}_{a}\sum_{s'} P(s'|s,a)V^{\hat{\pi}}(s')$
  4. If the policy $\hat{\pi}$ changed in the last iteration, return to step 2.
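The loop above can be sketched directly, with step 2 done as an exact linear solve (same hypothetical `r`/`P` array layout as before; a sketch, not a definitive implementation):

```python
import numpy as np

def policy_iteration(r, P, beta):
    """Alternate exact policy evaluation and greedy improvement until stable.

    r: (S,) state rewards; P: (A, S, S) transitions P[a][s, s'].
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - beta * P_pi) V = r exactly
        P_pi = P[pi, np.arange(S), :]       # (S, S) transitions under pi
        V = np.linalg.solve(np.eye(S) - beta * P_pi, r)
        # Greedy improvement with respect to V
        pi_new = np.argmax(P @ V, axis=0)
        if np.array_equal(pi_new, pi):      # policy unchanged: done
            return pi, V
        pi = pi_new

# Same toy MDP as in the value-iteration sketch
r = np.array([0.0, 1.0])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
pi, V = policy_iteration(r, P, beta=0.9)
```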
Theorem of Convergence

2.2 Single-Agent Models – Rust’s Model (Rust 1987)

2.2.1 The Set-Up

At each period, the decision maker decides whether to replace a bus. The decision maker chooses an optimal replacement policy to minimize the cumulative discounted cost.

State: $x_t$ refers to the mileage of a bus.

Action: $a_t\in \{0,1\}$, where $a_t=1$ means replacing the engine and $a_t=0$ means keeping it.

State Transition Probability
The state transition probability:
$$p(x_{t+1} | x_{t} , a_{t} , \theta_{3}) = \begin{cases} g(x_{t+1} - x_{t}, \theta_{3}) & \text{if } a_t = 0\\ g(x_{t+1} - 0, \theta_{3}) & \text{if } a_t = 1 \end{cases}$$

Reward Function
The per-period reward function:
$$r(x_{t}, a_t, \theta_{1}) = \begin{cases} u(x_t,0, \theta_1) + \epsilon_t(0) = -c(x_t, \theta_1) + \epsilon_t(0) & \text{if } a_t = 0\\ u(x_t,1, \theta_1) + \epsilon_t(1) = -RC -c(0, \theta_1) + \epsilon_t(1) & \text{if } a_t = 1 \end{cases}$$

Decision Process
The agent must choose a sequence of decision rules to maximize expected discounted utility over an infinite horizon.
$$V^*(x_t, \epsilon_t)=\sup_{\pi}\,E\Big[\sum_{j=t}^{\infty}\beta^{j-t}r(x_j, a_j)\,\Big|\,\pi, x_t, \epsilon_t\Big]$$
where the expectation is taken with respect to the stochastic process $\{(x_j,\epsilon_j):j=t,\cdots,\infty\}$.

Under certain regularity assumptions described in Rust (1987), the solution to this problem is given by a stationary decision rule, and $V_{\theta}^*$ is the unique solution to Bellman’s equation:
$$V^*(x_t, \epsilon_t) = \max_{a_t\in A_t}\{r(x_{t}, a_t, \epsilon_t) + \beta E[V^*(x_{t+1},\epsilon_{t+1})|x_t, \epsilon_t,a_t]\} \quad (1)$$

where $E[V^*(x_{t+1},\epsilon_{t+1})|x_t, \epsilon_t,a_t]=\int_{x'}\int_{\epsilon'}V^*(x', \epsilon')p(x', \epsilon'|x_t, a_t,\epsilon_t)\,dx'\,d\epsilon'$, i.e., the expectation is taken over the next-period values of the states $(x', \epsilon')$.

Solving Equation (1) directly is very hard. Thus, Rust made three assumptions.

Let $EV^*(x,a)=E[V^*(x',\epsilon')|x,a]$. We can express the Bellman equation in the expected-value-function space. $EV^*(x,a)$ has lower dimension than $V^*(x,\epsilon)$, since it does not depend on $\epsilon$.
$$EV^*(x,a)=\int_{x'}\int_{\epsilon'}V^*(x',\epsilon')q(\epsilon'|x')p(x'|x,a)\,dx'\,d\epsilon'$$
where $V^*(x',\epsilon')=\max_{a'\in A}\{u(x',a')+\beta EV^*(x',a')+\epsilon'(a')\}$.

Let $\tilde{V}^*(x)=E[V^*(x,\epsilon)|x]$, which denotes the integrated (or smooth) value function. We can express the Bellman equation in the integrated-value-function space. $\tilde{V}^*(x)$ is of even lower dimension, since it depends on neither $\epsilon$ nor $a$.
$$\tilde{V}^*(x)=\int_{\epsilon}V^*(x,\epsilon)q(\epsilon|x)\,d\epsilon$$
where $V^*(x,\epsilon)=\max_{a \in A}\{u(x,a)+\epsilon(a)+\beta\int_{x'}\tilde{V}^*(x')p(x'|x,a)\,dx'\}$.
Since $\epsilon$ is i.i.d. standard Gumbel, $\tilde{V}^*(x)$ can be expressed as the “log-sum”:
$$\tilde{V}^*(x)=\log\sum_{a\in A}\exp(v^*(x,a))$$
where $v^*(x,a)=u(x,a)+\beta EV^*(x,a)$.

Returning to $EV^*$: substituting the log-sum for $\int_{\epsilon'}V^*(x',\epsilon')q(\epsilon'|x')\,d\epsilon'$, we get
$$\color{red}EV^*(x,a)=\int_{x'}\log\sum_{a'\in A}\exp\big(u(x',a')+\beta EV^*(x',a')\big)\,p(x'|x,a)\,dx'$$
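This fixed point can be solved by successive approximation (a contraction for $\beta<1$). A minimal sketch on a discretized mileage grid; the grid size, flow utilities, and transition matrices below are hypothetical stand-ins for Rust's specification:

```python
import numpy as np

def solve_ev(u, P, beta, tol=1e-10):
    """Iterate EV(x,a) = sum_x' p(x'|x,a) * log(sum_a' exp(u(x',a') + beta*EV(x',a'))).

    u: (X, A) flow utilities u(x,a); P: (A, X, X) transitions P[a][x, x'].
    """
    X, A = u.shape
    EV = np.zeros((X, A))
    while True:
        v = u + beta * EV                         # choice-specific values v(x', a')
        m = v.max(axis=1, keepdims=True)          # stabilized log-sum-exp over a'
        logsum = (m + np.log(np.exp(v - m).sum(axis=1, keepdims=True))).ravel()
        EV_new = np.stack([P[a] @ logsum for a in range(A)], axis=1)
        if np.max(np.abs(EV_new - EV)) < tol:
            return EV_new
        EV = EV_new

# Hypothetical 3-point mileage grid: a=0 keep (mileage drifts up), a=1 replace (reset)
u = np.array([[ 0.0, -4.0],
              [-1.0, -4.0],
              [-2.0, -4.0]])                      # -c(x) if keep, -RC - c(0) if replace
P = np.array([[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
              [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
EV = solve_ev(u, P, beta=0.95)
```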

The core task of Rust (1987) is to estimate the parameters of the MDP from people’s decision data, under the assumption that people make optimal decisions according to this MDP.

Then, the question is how to express optimal decisions as a function of the states.

2.2.2 Nested Fixed Point Algorithm (NFXP)

NFXP solves the unconstrained optimization problem $\max_{\theta} L(\theta, EV_{\theta})$.

Outer loop:

Inner loop:

2.2.3 Hotz–Miller’s CCP Method (Miller 1993)

A disadvantage of Rust’s approach is that solving for the value functions (the inner fixed point, $EV^*$) can be computationally intensive.

Hotz and Miller’s idea is to use observable data to form an estimate of (differences in) the value function from conditional choice probabilities (CCPs). Below, I omit the ${}^{*}$ notation for simplicity.

Mapping Normalized Value Function Differences to CCP
$P(a|x)$ is uniquely determined by the vector of normalized value-function differences $\tilde{v}(x,a)=v(x,a)-v(x,1)$ (assuming $A:=\{1,2,\cdots,j,\cdots,|A|\}$; WLOG, alternative $1$ is the excluded base alternative).

That is, there exists a vector mapping $Q_x(\cdot)$ such that $\{P(a|x):a>1\}=Q_x(\{\tilde{v}(x,a):a>1\})$.

In general,
$$Q_x^j(\{\tilde{v}(x,a):a>1\})=\frac{\partial S([0,\{\tilde{v}(x,a):a>1\}],x)}{\partial \tilde{v}(x,j)}$$
where $S(\{v(x,a):a\in A \},x)=\int \max_a\{v(x,a)+\epsilon(a)\}q(\epsilon|x)\,d\epsilon$ is McFadden’s social surplus function.

Under assumption (xx), $S(\{v(x,a):a\in A \},x)$ is the well-known “log-sum” formula (also denoted $\tilde{V}(x)$ in the notes on Rust 1987). The $j$th component of $Q_x$ takes the well-known logistic form (similar to $P(a|x)$):
$$Q_x^j(\{\tilde{v}(x,a)\})=\frac{\exp(\tilde{v}(x,j))}{1+\sum_{a=2}^{|A|}\exp(\tilde{v}(x, a))}$$
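The logistic form and its inverse (the Hotz–Miller inversion $\tilde{v}(x,a)=\ln P(a|x)-\ln P(1|x)$, which follows from the formula above) can be illustrated numerically; the helper names below are my own, not from the paper:

```python
import numpy as np

def ccp_from_diffs(v_tilde):
    """CCPs at one state x from normalized differences v~(x,a), a = 2..|A|.

    Alternative 1 is the base with v~ = 0, under the Gumbel assumption.
    Returns the full (|A|,) probability vector, alternative 1 first.
    """
    e = np.exp(np.concatenate(([0.0], v_tilde)))
    return e / e.sum()

def diffs_from_ccp(P):
    """Hotz-Miller inversion: v~(x,a) = ln P(a|x) - ln P(1|x)."""
    return np.log(P[1:]) - np.log(P[0])

v_tilde = np.array([0.5, -1.2])      # hypothetical differences at one state x
P = ccp_from_diffs(v_tilde)          # choice probabilities for alternatives 1..3
```

Round-tripping through the two functions recovers the differences exactly, which is the content of the invertibility claim.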

Mapping Choice Probabilities to Smooth Value Function

Smooth value function: $\tilde{V}(x)=E[V(x,\epsilon)|x]$.
Smooth Bellman equation:
$$\tilde{V}(x)=\int_{\epsilon}V(x,\epsilon)q(\epsilon|x)\,d\epsilon$$
where $V(x,\epsilon)=\max_{a \in A}\{u(x,a)+\epsilon(a)+\beta\int_{x'}\tilde{V}(x')p(x'|x,a)\,dx'\}$.

Re-write the smooth Bellman equation as
$$\tilde{V}(x)=\sum_{a\in A}P(a|x)\left(u(x,a)+E[\epsilon(a)|x,a]+\beta\int_{x'}\tilde{V}(x')p(x'|x,a)\,dx'\right)$$
where $E[\epsilon(a)|x,a]$ is the expectation of the unobservable $\epsilon(a)$ conditional on $a$ being the optimal choice:
$$E[\epsilon(a)|x,a]=\frac{\int_{\epsilon} \epsilon(a)\, 1\{\tilde{v}(x,a)+\epsilon(a)>\tilde{v}(x,j)+\epsilon(j),\ \forall j\in A(x)\}\,q(\epsilon|x)\,d\epsilon}{P(a|x)}$$
Under assumption (xx), we have $E[\epsilon(a)|x,a]=\gamma-\ln P(a|x)$, where $\gamma$ is Euler’s constant (0.5772156649…).

Using compact matrix notation, we can write
$$V=\sum_{a\in A}P(a)*\left(u(a) +e(a,P)+\beta F(a)V \right)$$
where $V, P(a), u(a), e(a,P)$ are all $|X|\times 1$ vectors, $*$ denotes elementwise multiplication, and $F(a)$ is the $|X|\times|X|$ matrix of conditional transition probabilities.

Thus, $V$ can be explicitly expressed in terms of $P$:
$$V=\psi(P)=\Big[I-\beta \sum_{a\in A} P(a)*F(a)\Big]^{-1}\sum_{a\in A}P(a)*\left(u(a)+e(a,P)\right)$$
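A sketch of computing $\psi(P)$ by a single linear solve, assuming the Gumbel case so that $e(a,P)=\gamma-\ln P(a|x)$; the array shapes and the single-action sanity example are my own assumptions:

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def psi(P, u, F, beta):
    """V = [I - beta * sum_a P(a)*F(a)]^{-1} sum_a P(a)*(u(a) + e(a,P)).

    P, u: (A, X) arrays of CCPs and flow utilities; F: (A, X, X) transitions;
    '*' is elementwise across states; e(a,P) = gamma - ln P(a|x) (Gumbel case).
    """
    A, X = P.shape
    M = np.eye(X) - beta * sum(P[a][:, None] * F[a] for a in range(A))
    b = sum(P[a] * (u[a] + EULER_GAMMA - np.log(P[a])) for a in range(A))
    return np.linalg.solve(M, b)

# Sanity check: with a single action (P = 1), V should solve V = u + gamma + beta*F*V
u = np.array([[1.0, 2.0]])
F = np.array([[[0.3, 0.7], [0.6, 0.4]]])
P = np.ones((1, 2))
V = psi(P, u, F, beta=0.9)
```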

Mapping Choice Probabilities to Choice Probabilities
Recall that
$$\color{red}V=\psi(P)=\Big[I-\beta \sum_{a\in A} P(a)*F(a)\Big]^{-1}\sum_{a\in A}P(a)*\left(u(a)+e(a,P)\right)$$

$$P(a|x)=\int_{\epsilon} 1\Big\{a=\text{argmax}_{a'\in A}\{v(x,a')+\epsilon(a')\}\Big\}\,q(\epsilon|x)\,d\epsilon$$
where $v(x,a')=u(x,a')+\beta \int_{x'}{\color{red}\tilde{V}(x')}p(x'|x,a')\,dx'$

Under assumption (xx),
$$P(a)=\Lambda(V)(a)=\frac{\exp(u(a)+\beta F(a){\color{red}V})}{\sum_{a'\in A}\exp(u(a')+\beta F(a'){\color{red}V})}$$
where the exponentials and the division are elementwise across states.

Combining these two systems, we get $P=\Psi(P)=\Lambda(\psi(P))$.

Hotz-Miller Approach (CCP Estimator)
They use a two-step estimator.

  1. Estimate reduced-form conditional choice probabilities (CCPs) using data on $x$ and $a$. Label them $\hat{p}(a|x)$.
  2. Use the Hotz–Miller inversion to map $\hat{p}(a|x)$ to estimated value-function differences, and then construct the continuation values that enter the sample criterion (e.g., the likelihood).

2.2.4 Recursive CCP Estimation (NPL algorithm) (Aguirregabiria and Mira 2002)

Hotz-Miller Approach:

  1. Estimate reduced-form conditional choice probabilities (CCPs) using data on $x$ and $a$. Label them $\hat{p}(a|x)$.
  2. Use the Hotz–Miller inversion to map $\hat{P}$ to an estimate of $V$. Then use $V$ to calculate $P$, and thus the partial likelihood $\sum \ln P_i(a_i|x_i)$. Maximizing $\sum \ln P_i(a_i|x_i)$ gives the estimated parameters.

Further iteration:

  1. Given the $P^k$ obtained in the $k$th iteration, maximizing $\sum \ln P_i^k(a_i|x_i)$ gives estimated parameters $\theta^{k+1}$.
  2. Based on $\theta^{k+1}$, update $P^{k+1}=\Psi_{\theta^{k+1}} (P^k)$.

Repeat the above iteration.

2.2.5 Constrained Optimization (Su, C. L., & Judd, K. L. 2012)

Su and Judd (2012) propose a constrained optimization approach to solve the above problem:

$$\max_{\theta, EV}\ L(\theta, EV)$$
$$\text{subject to } EV = T(EV, \theta)$$

2.3 Dynamic Games


2.4 Competitive Equilibrium Models

Heckman et al. (1998)

E.g., skill prices are endogenously determined in the model as an equilibrium outcome.

3 Approximate Dynamic Programming

DP algorithms typically sweep through all states in each iteration. We cannot do this for large finite state spaces or infinite spaces. This motivates approximate dynamic programming.

3.1 Introduction

Here, we describe the system using slightly different concepts.

  1. being in state $S_t$, from which we take an action $a_t$,
  2. and then observe new information $W_{t+1}$,
  3. which takes us to a new state $S_{t+1}$, i.e., $S_{t+1}=S^{M}(S_t, a_t, W_{t+1})$.

We make a decision at time $t$ based on what we know up to and including $t$: $a_t=A^{\pi}(S_t)$.

Theoretically, we can solve $\max_{\pi}\,E\big[\sum_{t=0}^T \beta^{t}R_t(S_t, A^{\pi}(S_t))\big]$ by recursively computing the optimality equations (mathematically equivalent to the Bellman equations): $V_t(S_t)=\max_{a_t}\{R_t(S_t, a_t)+\beta E[V_{t+1}(S_{t+1})|S_t, a_t]\}$.

Unfortunately, most of the time we cannot solve these equations exactly, because of the three curses of dimensionality.

  1. Flat representation of the state $S_t$: if $S_t$ is a multidimensional vector of discrete (or continuous) variables, it is almost intractable to enumerate all possible combinations of states in a single list.
  2. Expectation: sometimes it is impossible to compute the expectation exactly.
  3. Decision $a_t$: the same issue as for $S_t$.

Approximate dynamic programming helps solve the curses of dimensionality. The key idea of approximate dynamic programming is stepping forward through time.

Post-Decision State
The post-decision state, denoted $S^a_t$, is the state immediately after action $a_t$ but before the arrival of new information $W_{t+1}$:
$$S^a_t = S^{M, a}(S_t, a_t)$$

Example: the tic-tac-toe game

Rewriting Optimality Equations Using Post-Decision State
If we substitute
$$V_{t-1}^a(S_{t-1}^a)=E[V_t(S_t)|S_{t-1}^a]\quad\textit{\scriptsize (using approximation)}$$
into
$$V_t(S_t)=\max_{a_t}\{R_t(S_t, a_t)+\beta V_t^a(S_t^a)\},$$
we obtain
$$\color{blue}V_{t-1}^a(S_{t-1}^a)=E\big[\max_{a_t}\{R_t(S_t, a_t)+\beta V_t^a(S_t^a)\}\big]\quad\textit{\scriptsize (no expectation inside the max)}.$$

Post-decision state and Q-learning: the post-decision state is exactly the key concept used in Q-learning, which focuses on learning the value of being in state $S_t$ and taking action $a_t$, denoted $Q(S_t, a_t)$.

3.2 Forward Dynamic Programming Using the Post-Decision State Variable

Step 0. Initialization.

Step 1. Choose a sample path ωn\omega^n.

Step 3. Let n=n+1n=n+1. If nNn\leq N, go to step 1.

3.3 Fitting a Value Function Approximation

3.3.1 Multilevel Aggregation

Assume $S_t$ is multidimensional and may include discrete, continuous, and categorical elements.

Let $S$ be the original state space, and let $S^{(g)}$ be the state space at the $g$th level of aggregation, with aggregation mapping $G^g: S\to S^{(g)}$.

Estimate a single value for each discrete state $s$ at each level of aggregation, and only update states that we actually visit.

We may

3.3.2 Basis Functions

$$\bar{V}(S_t|\theta)=\sum_{f\in F}\theta_f \phi_f(S_t)$$

The new challenge here is estimating the parameter vector. A simple strategy is to use a stochastic gradient algorithm, given by
$$\theta^n = \theta^{n-1}-\alpha_{n-1}\big(\bar{V}_t(S^n_t|\theta^{n-1})-\hat{v}^n_t(S^n_t)\big) {\color{green}\nabla_{\theta}\bar{V}_t(S^n_t|\theta^{n-1})}$$
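For the linear architecture above, $\nabla_{\theta}\bar{V} = \phi(S)$, so one stochastic gradient step is a one-liner. A sketch with hypothetical numbers (the feature vector and sampled value estimate $\hat{v}$ are made up):

```python
import numpy as np

def sgd_update(theta, phi, v_hat, alpha):
    """One stochastic gradient step for the linear architecture
    V_bar(s|theta) = theta . phi(s), whose gradient in theta is just phi(s).

    phi: (F,) features of the visited state; v_hat: sampled value estimate.
    """
    v_bar = theta @ phi                              # current approximation
    return theta - alpha * (v_bar - v_hat) * phi     # move against the error

theta = np.zeros(2)                                  # initial coefficients
theta = sgd_update(theta, phi=np.array([1.0, 2.0]), v_hat=1.0, alpha=0.1)
```

Since the initial prediction is 0 and the target is 1, the step moves $\theta$ in the direction of $\phi$ scaled by $\alpha$.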

3.3.3 Other Methods

E.g., Neural network, K-nearest neighbor.

3.3.4 Example: Rust’s model for bus replacement

Using a linear combination of basis functions to approximate the value function for the pre-decision state


Using a DNN to approximate the value function for the post-decision state
Approximation of the Q value: $q(s,a) = \text{DNN}(s, a)$

4 Reinforcement Learning

DP and ADP assume that we know the probability model. However, we often do not have access to these probabilities in the real world. Even if we know the probability model, it can be very challenging to work with, since the real-world model may be too large or too complex.

Thus, we need to interact with the actual (or simulated) environment.

4.1 Introduction

Value Function-based and Policy-based RL

4.2 Interaction With Environment

4.3 Key: Policy Gradient

Policy gradient has many equivalent forms, and thus leads to multiple algorithms:

Action-Value (Q) Actor-Critic
Approximation of the Q value: $q(s,a;\theta_q) = \text{DNN}(s, a;\theta_q)$

Approximation of the policy: $\pi(s;\theta_{\pi}) = \text{DNN}(s;\theta_{\pi})$