From Dynamic Programming to Reinforcement Learning
- Markov Process, Markov Reward Process, and Markov Decision Process
- Dynamic Programming
- Approximate Dynamic Programming
- Reinforcement Learning
1 Markov Process, Markov Reward Process, and Markov Decision Process
Five key elements:
- state ($s \in \mathcal{S}$)
- action ($a \in \mathcal{A}$)
- state transition probability ($P(s' \mid s, a)$; one-step transition matrix $P$)
- reward function ($R(s, a)$)
- discount factor ($\gamma \in [0, 1)$)
Markov Property of States:
- The probability of the next state depends only on the current state, i.e., the future is independent of the past given the present: $P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)$.
- A countable set of states $\mathcal{S}$
- Probability Distribution of Start States: $\mu_0(s)$
- Markov Property of States
- Transition Probability $P(s' \mid s)$
- Default: Discrete-Time, Countable-State, Stationary Markov Process
Markov Reward Process:
- A Markov Process, along with a time-indexed sequence of reward random variables $R_t$
Markov Decision Process:
- A mathematical model of a decision maker who is in state $s_t$ at time $t$, and takes an action $a_t$ that determines the current payoff $R(s_t, a_t)$ and affects the distribution of next period's state via a Markov transition probability $P(s_{t+1} \mid s_t, a_t)$.
A policy $\pi$ is a mapping: $\pi : \mathcal{S} \to \mathcal{A}$.
$V^\pi$: value function for policy $\pi$
- For a policy $\pi$, $V^\pi(s)$ is the expected total payoff in the future from starting in state $s$ and executing policy $\pi$: $V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \mid s_0 = s \right]$.
$V^*$ is the optimal value function, i.e., $V^*(s) = \max_\pi V^\pi(s)$.
Bellman equations (very important!)
According to the definition of $V^\pi$,
$V^\pi(s_0) = R(s_0, \pi(s_0)) + \gamma R(s_1, \pi(s_1)) + \gamma^2 R(s_2, \pi(s_2)) + \cdots$ (if the sequence of states is determined and known)
Since $s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))$, then we get
$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$.
- Given a policy $\pi$, we have a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns. Solving these equations, we can derive the value function $V^\pi$.
The optimal policy $\pi^*$ is the policy that achieves the highest value for every state, and its value function is written as $V^*$.
- There are an exponential number of policies ($|\mathcal{A}|^{|\mathcal{S}|}$). Hence, we do not use this formulation to find the optimal policy.
Using the Bellman equation, we can directly define the optimal value function
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right],$
and the optimal policy is simply the action that attains this max: $\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]$.
2 Dynamic Programming
2.1 General Introduction
- Original use of DP term: MDP theory and solution methods
- Bellman referred to DP as the Principle of Optimality
- Later, the usage of the term DP diffused out to other algorithms
- In CS, it means “recursive algorithms with overlapping subproblems”
- In general,
- DP for prediction: compute the Value Function of an MP
- DP for control: compute the Optimal Value Function of an MDP
The core of DP is to compute the Value Function.
- Value Iteration
- Policy Iteration
2.1.1 Backward Induction
When we face finite-horizon MDPs, we can simply walk backwards, setting the value function from the horizon end to the start: $V_T(s) = 0$, then $V_t(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_{t+1}(s') \right]$ for $t = T-1, \ldots, 0$.
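The backward walk can be sketched in a few lines. This is a minimal illustration, not from the original: the array names `R[s, a]` (rewards) and `P[a, s, s']` (per-action transition matrices) and the toy sizes are all assumed.

```python
import numpy as np

def backward_induction(R, P, T, gamma=1.0):
    """Backward induction for a finite-horizon MDP.

    R: (S, A) reward array; P: (A, S, S) transition matrices;
    T: horizon length. Returns values V[t] and greedy policies pi[t]
    for t = 0..T-1, with V[T] = 0 at the horizon end.
    """
    S, A = R.shape
    V = np.zeros((T + 1, S))           # V[T] = 0: no payoff after the horizon
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):     # walk backwards from the horizon
        Q = R + gamma * (P @ V[t + 1]).T   # Q[s, a] = R(s,a) + gamma E[V_{t+1}]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```

With deterministic self-loop transitions and horizon 2, the values are simply the best per-period rewards accumulated over the remaining periods.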
2.1.2 Value Iteration
Repeatedly update an estimate of the optimal value function according to Bellman optimality equation:
- Initialize an estimate for the value function arbitrarily: $\hat{V}(s) \leftarrow 0, \ \forall s \in \mathcal{S}$
- Repeat, update: $\hat{V}(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \hat{V}(s') \right], \ \forall s \in \mathcal{S}$
Theorem of Convergence
- Theorem: Value iteration converges to the optimal value: $\hat{V} \to V^*$.
- Proof: (key point: the Bellman operator is a contraction mapping in the sup norm, with modulus $\gamma$)
How many iterations will it take to find the optimal policy?
Assume the reward per period satisfies $|R(s, a)| \le R_{\max}$; then $\|V^*\|_\infty \le \sum_t \gamma^t R_{\max} = R_{\max} / (1 - \gamma)$.
Then letting $\hat{V}_k$ be the value after the $k$th iteration,
$\|\hat{V}_k - V^*\|_\infty \le \gamma^k \|\hat{V}_0 - V^*\|_\infty.$
Thus, we have linear convergence (at rate $\gamma$) to the optimal value function.
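The update above can be sketched directly; array shapes follow the earlier backward-induction convention (`R[s, a]`, `P[a, s, s']`), and the stopping tolerance is an assumed illustration choice.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """Value iteration via the Bellman optimality operator.

    The operator is a gamma-contraction in the sup norm, so the
    iterates converge linearly (at rate gamma) to V*.
    """
    S, A = R.shape
    V = np.zeros(S)                    # arbitrary initial estimate
    while True:
        Q = R + gamma * (P @ V).T      # Q[s, a] = R(s,a) + gamma E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

On a toy MDP paying reward 1 every period regardless of state and action, the fixed point is $1/(1-\gamma) = 10$.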
2.1.3 Policy Iteration
- Initialize policy $\pi$ randomly.
- Compute the value of the policy, $V^\pi$ (by solving the linear system above).
- Update $\pi$ to be the greedy policy with respect to $V^\pi$: $\pi(s) \leftarrow \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^\pi(s') \right]$
- If the policy changed in the last iteration, return to step 2.
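The loop can be sketched as follows (same assumed array conventions as before; the initial policy is fixed rather than random, for reproducibility):

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """Policy iteration: exact evaluation plus greedy improvement.

    Evaluation solves the linear system (I - gamma P_pi) V = R_pi;
    improvement takes the greedy action under the resulting V.
    """
    S, A = R.shape
    pi = np.zeros(S, dtype=int)        # fixed initial policy (text: random)
    while True:
        P_pi = P[pi, np.arange(S)]     # (S, S) transitions under pi
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # evaluate
        pi_new = (R + gamma * (P @ V).T).argmax(axis=1)       # improve
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new
```

On a toy MDP where each state has one rewarding action and self-loop dynamics, the greedy step immediately finds the rewarding action.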
Theorem of Convergence
- Theorem: Policy iteration converges to the optimal policy: $\pi \to \pi^*$.
- Policy iteration typically takes fewer iterations than value iteration to find the optimal policy (though each iteration is more expensive).
2.2 Single-Agent Models – Rust's Model (Rust 1987)
- Nested Fixed Point Algorithm
- Hotz–Miller’s CCP Method
- Recursive CCP Estimation (NPL algorithm)
- Constrained Optimization
2.2.1 The Set-Up
At each period, the decision maker decides whether to replace a bus. The decision maker chooses an optimal replacement policy to minimize the cumulative discounted cost.
State $x_t$ refers to the mileage of a bus.
- For computational reasons, Rust discretizes the state space into 90 intervals (an interval covers 5000 miles), i.e., the state space is discrete.
Action $i_t \in \{0, 1\}$, where
- $i_t = 1$ means replacing the engine,
- $i_t = 0$ means keeping the engine and performing normal maintenance.
State Transition Probability
The state transition probability: $p(x_{t+1} \mid x_t, i_t)$.
- Rust found that the increase in the mileage of a bus within one period is no more than 2 intervals. Hence, we can form the transition probability as $\Pr(x_{t+1} = x_t + j \mid x_t, i_t = 0) = \theta_j$ for $j \in \{0, 1, 2\}$ (after a replacement, mileage is reset to 0 before the increment).
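The banded structure of the keep-decision transition matrix can be sketched as below. The particular probabilities are placeholders; in Rust's estimation the $\theta_j$ are parameters recovered from data, and the boundary treatment (absorbing the last interval) is an assumed simplification.

```python
import numpy as np

def rust_transition(n_states=90, probs=(0.35, 0.60, 0.05)):
    """Banded transition matrix for the keep decision (i_t = 0).

    probs[j] is the (assumed, to-be-estimated) probability that mileage
    advances j intervals in one period; probability mass that would
    leave the 90-interval grid is absorbed in the last state.
    """
    F = np.zeros((n_states, n_states))
    for s in range(n_states):
        for j, p in enumerate(probs):
            F[s, min(s + j, n_states - 1)] += p
    return F
```

Each row sums to one, and only the three diagonals (plus the absorbing corner) carry mass.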
The per-period reward function:
$u(x_t, i_t, \epsilon_t; \theta) = \begin{cases} -RC - c(0; \theta) + \epsilon_t(1) & \text{if } i_t = 1 \\ -c(x_t; \theta) + \epsilon_t(0) & \text{if } i_t = 0 \end{cases}$
- $RC$: the net cost of replacing an engine,
- $c(x_t; \theta)$: regular maintenance costs,
- $\epsilon_t$: payoff shocks.
$x_t$ is observable to both the agent and the econometrician, but $\epsilon_t$ is only observable to the agent.
$\epsilon_t$ is necessary for a coherent model, for sometimes we observe the agent making different decisions at the same value of $x_t$.
The agent must choose a sequence of decision rules $\{f_t\}$ to maximize expected discounted utility over an infinite horizon:
$\max_{\{f_t\}} \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t u(x_t, i_t, \epsilon_t; \theta) \right],$
where the expectation is taken with respect to the stochastic process $\{x_t, \epsilon_t\}$.
Under certain regularity assumptions described in Rust (1987), the solution to this problem is given by a stationary decision rule. $V_\theta$ is the unique solution to Bellman's equation, given by
$V_\theta(x, \epsilon) = \max_i \left[ u(x, i, \epsilon; \theta) + \beta \, \mathbb{E}\{ V_\theta(x', \epsilon') \mid x, i \} \right], \quad (1)$
where the expectation is taken over the next period's values of the states $(x', \epsilon')$.
Solving Equation (1) is very hard. Thus, Rust made three assumptions
- Additive separability (the shock $\epsilon_t(i)$ enters utility additively)
- Conditional independence
- Type 1 extreme value (Gumbel) distribution of $\epsilon$
Let $EV_\theta(x, i) = \mathbb{E}\{ V_\theta(x', \epsilon') \mid x, i \}$. We can express the Bellman equation in the expected value function space. $EV_\theta$ has lower dimension than $V_\theta$, since it does not depend on $\epsilon$.
Let $\bar{V}_\theta(x) = \mathbb{E}_\epsilon\{ V_\theta(x, \epsilon) \}$, which denotes the integrated (or smooth) value function. We can express the Bellman equation in the integrated value function space. $\bar{V}_\theta$ is of even lower dimension, since it does not depend on $\epsilon$ and $i$.
Since the shocks are i.i.d. standard Gumbel, $\bar{V}_\theta$ can be expressed as the "log-sum":
$\bar{V}_\theta(x) = \log \sum_i \exp\left( u(x, i; \theta) + \beta \, \mathbb{E}[\bar{V}_\theta(x') \mid x, i] \right)$ (up to an additive Euler's-constant term).
Returning to the Bellman equation in this space: we update $\bar{V}_\theta$ by iterating this contraction mapping, and get its unique fixed point.
The core task of Rust (1987) is to estimate the parameters in MDP with the assumption that people make optimal decisions according to this MDP, based on people’s decision data.
Then, the question is how to write the optimal decisions as a function of the states.
2.2.2 Nested Fixed Point Algorithm (NFXP)
NFXP solves the unconstrained optimization problem $\max_\theta \sum_b \ell_b(\theta)$.
Likelihood function for bus $b$:
$L_b(\theta) = \prod_t P(i_t \mid x_t; \theta) \, p(x_{t+1} \mid x_t, i_t; \theta)$
Log-likelihood function for bus $b$:
$\ell_b(\theta) = \sum_t \left[ \log P(i_t \mid x_t; \theta) + \log p(x_{t+1} \mid x_t, i_t; \theta) \right]$
- Given $\theta$, calculate $P(i \mid x; \theta)$
- $EV_\theta$ is required to be solved (inner loop)
- Repeat until we find the maximum of the log-likelihood (outer loop; e.g., the BFGS algorithm)
- Use a fixed point algorithm to find $EV_\theta$ according to the relationship $EV_\theta = T_\theta(EV_\theta)$
- It is feasible since $x$ is discrete and $i$ is discrete.
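The inner fixed point can be sketched in the integrated value function space. This is an illustrative skeleton, not Rust's code: `u[s, a]` are flow utilities, `F[a, s, s']` are the per-action transition matrices, and the Euler-constant term of the log-sum is normalized away.

```python
import numpy as np

def solve_ev(u, F, beta=0.95, tol=1e-10):
    """Inner fixed point of NFXP under i.i.d. type 1 extreme value shocks.

    Iterates V(x) = log sum_i exp(u(x,i) + beta * E[V(x') | x, i]),
    a contraction, so successive approximation converges; returns the
    fixed point and the choice-specific values v(x, i).
    """
    S, A = u.shape
    V = np.zeros(S)
    while True:
        v = u + beta * (F @ V).T                 # choice-specific values
        m = v.max(axis=1)                        # stabilize the log-sum
        V_new = m + np.log(np.exp(v - m[:, None]).sum(axis=1))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, v

        V = V_new

def ccp(v):
    """Logit conditional choice probabilities from choice-specific values."""
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

With zero utilities and two identical alternatives, the fixed point is $\log 2 / (1-\beta)$ and the CCPs are one half each.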
2.2.3 Hotz–Miller's CCP Method (Hotz and Miller 1993)
A disadvantage of Rust's approach is that it can be computationally intensive: the value function (the inner fixed point $EV_\theta$) must be solved at every trial value of $\theta$.
Hotz and Miller's idea is to use observable data to form an estimate of (differences in) the value function from conditional choice probabilities (CCPs). Below, I leave out the notation $\theta$ for simplicity.
Mapping Normalized Value Function Differences to CCP
The vector of CCPs $P(\cdot \mid x)$ is uniquely determined by the vector of normalized value function differences $\tilde{v}(x) = \left( v(x, i) - v(x, 0) \right)_{i \neq 0}$ (WLOG, excluding the reference alternative $i = 0$).
That is, there exists a vector mapping $Q$ such that $P(\cdot \mid x) = Q(\tilde{v}(x))$.
- Hotz and Miller prove that under CI, $Q$ is invertible.
Here $v(x, i)$ denotes the choice-specific value, and $G(v(x, \cdot)) = \mathbb{E}_\epsilon\left[ \max_i \left( v(x, i) + \epsilon(i) \right) \right]$ is McFadden's social surplus function.
Under the type 1 extreme value assumption, $G$ is the well-known "log-sum" formula (also denoted $\bar{V}$ in the notes on Rust 1987 above). The $i$th component of $Q$ takes the well-known logistic form.
Mapping Choice Probabilities to Smooth Value Function
Smooth value function: $\bar{V}(x) = \mathbb{E}_\epsilon\left[ \max_i \left( v(x, i) + \epsilon(i) \right) \right]$.
Smooth Bellman equation:
$\bar{V}(x) = \mathbb{E}_\epsilon \max_i \left[ u(x, i) + \epsilon(i) + \beta \, \mathbb{E}[\bar{V}(x') \mid x, i] \right]$
Re-write the smooth Bellman equation as
$\bar{V}(x) = \sum_i P(i \mid x) \left[ u(x, i) + e(x, i) + \beta \, \mathbb{E}[\bar{V}(x') \mid x, i] \right],$
where $e(x, i)$ is the expectation of the unobservable $\epsilon(i)$ conditional on alternative $i$ being the optimal choice.
Under the type 1 extreme value assumption, we have $e(x, i) = \gamma_E - \ln P(i \mid x)$, where $\gamma_E$ is Euler's constant (0.5772156649…).
Using compact matrix notation, we can write
$\bar{V} = \sum_i P_i * (u_i + e_i) + \beta F^P \bar{V},$
where $\bar{V}$, $u_i$, $e_i$, $P_i$ are all $|X| \times 1$ vectors ($*$ denotes element-wise multiplication), and $F^P = \sum_i \mathrm{diag}(P_i) F_i$ is the matrix of conditional transition probabilities.
Thus, $\bar{V}$ can be explicitly expressed by $P$:
$\bar{V} = (I - \beta F^P)^{-1} \sum_i P_i * (u_i + e_i).$
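The closed-form inversion can be sketched as a single linear solve. This is an illustrative sketch with assumed array conventions (`P_ccp[s, a]` CCPs, `u[s, a]` flow utilities, `F[a, s, s']` transitions), not Hotz and Miller's own code.

```python
import numpy as np

EULER = 0.5772156649015329   # Euler's constant

def hm_value(P_ccp, u, F, beta=0.95):
    """Smooth value function from CCPs via the Hotz-Miller inversion.

    Under the logit assumption e(x, i) = EULER - ln P(i|x), and
        V = (I - beta * F_P)^(-1) sum_i P_i * (u_i + e_i),
    where F_P[s, s'] = sum_i P(i|s) F_i[s, s'].
    """
    S = u.shape[0]
    e = EULER - np.log(P_ccp)                   # E[eps_i | i optimal]
    b = (P_ccp * (u + e)).sum(axis=1)           # expected flow per state
    F_P = np.einsum('sa,ast->st', P_ccp, F)     # CCP-weighted transitions
    return np.linalg.solve(np.eye(S) - beta * F_P, b)
```

With zero utilities and equal CCPs of one half, every row of the solve reduces to $(\gamma_E + \log 2)/(1-\beta)$.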
Mapping Choice Probabilities to Choice Probabilities
Under the CI and extreme value assumptions, the CCPs are in turn a logit function of the choice-specific values implied by $\bar{V}$: $P(i \mid x) = \dfrac{\exp\left( u(x, i) + \beta \, \mathbb{E}[\bar{V}(x') \mid x, i] \right)}{\sum_{i'} \exp\left( u(x, i') + \beta \, \mathbb{E}[\bar{V}(x') \mid x, i'] \right)}$.
Combining these two systems, we get a fixed-point mapping in probability space: $P = \Psi(P)$.
Hotz-Miller Approach (CCP Estimator)
They use a two-step estimator.
- Estimate reduced-form conditional choice probabilities (CCPs) using data on $x$ and $i$. Label them $\hat{P}$.
- Use the Hotz–Miller inversion to map $\hat{P}$ to estimated value function differences, and then measure the continuation values that enter the sample criterion (e.g., the likelihood).
2.2.4 Recursive CCP Estimation (NPL algorithm) (Aguirregabiria and Mira 2002)
- Estimate reduced-form conditional choice probabilities (CCPs) using data on $x$ and $i$. Label them $\hat{P}^0$.
- Use the Hotz–Miller inversion to map $\hat{P}^0$ to an estimate of $\bar{V}$. Then use $\bar{V}$ to calculate $\Psi_\theta(\hat{P}^0)$, and thus derive the partial likelihood. Maximizing it gives estimated parameters $\hat{\theta}^1$.
- Given the $\hat{P}^k$ we get in the $k$th iteration, maximizing the partial likelihood gives estimated parameters $\hat{\theta}^{k+1}$.
- Based on the estimated parameters $\hat{\theta}^{k+1}$, update $\hat{P}^{k+1} = \Psi_{\hat{\theta}^{k+1}}(\hat{P}^k)$.
Repeat the above iteration.
2.2.5 Constrained Optimization (Su, C. L., & Judd, K. L. 2012)
Su and Judd (2012) propose a constrained optimization approach (MPEC) to solve the above problem: maximize the likelihood subject to the Bellman equation as a constraint, $\max_{\theta, EV} \sum_b \ell_b(\theta, EV)$ s.t. $EV = T_\theta(EV)$, which avoids the nested inner loop.
2.3 Dynamic Games
- There are multiple players in a game.
- Every period these players decide simultaneously a discrete action.
- At the beginning of period $t$, a player $i$ is characterized by two vectors of state variables which affect her current utility: $x_t$ and $\epsilon_{it}$.
- Variables in $x_t$ are common knowledge for all players in the game,
- but the vector $\epsilon_{it}$ is private information of player $i$.
- Two-Stage Methods
- Sequential Estimation
- Method Proposed by Lee and Wolpin (LW)
2.4 Competitive Equilibrium Models
Heckman et al. (1998)
E.g., skill prices are endogenously determined in the model as an equilibrium outcome.
3 Approximate Dynamic Programming
DP algorithms typically sweep through all states in each iteration. We cannot do this for large finite state spaces or infinite spaces. This motivates Approximate Dynamic Programming.
Here, we describe the system using slightly different concepts:
- being in state $S_t$, from which we take an action $a_t$,
- and then observe new information $W_{t+1}$,
- which takes us to a new state $S_{t+1}$, i.e., $S_{t+1} = S^M(S_t, a_t, W_{t+1})$
We make a decision at time $t$ based only on what we know up to and including time $t$.
Theoretically, we can solve the problem by recursively computing the optimality equations (mathematically equivalent to the Bellman equations)
$V_t(S_t) = \max_{a_t} \left( C(S_t, a_t) + \gamma \, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \} \right).$
Unfortunately, most of the time we cannot solve these equations exactly, since there are three curses of dimensionality.
- Flat representation of the state $S_t$: if $S_t$ is a multidimensional vector of discrete (or continuous) variables, it is almost intractable to enumerate all possible states in a single list.
- Expectation: sometimes it is impossible to compute the expectation over $W_{t+1}$ exactly.
- Decision $a_t$: the same problem as the state, when the action is a multidimensional vector.
Approximate dynamic programming helps solve the curses of dimensionality. The key idea of approximate dynamic programming is stepping forward through time.
- In period $t$, to get $V_t$ and choose the optimal $a_t$, we must know something about $V_{t+1}$: we use an approximation $\bar{V}_{t+1}$.
- Moreover, we can fit the value function approximation around the post-decision state $S_t^a$, instead of $S_t$, to further deal with the curses of dimensionality.
- And, we generate a sample of the random information $W_{t+1}$, and compute the next state $S_{t+1}$.
- Repeating this generates a sample path $(S_0, a_0, W_1, S_1, a_1, W_2, \ldots)$.
The post-decision state, represented by $S_t^a$, is the state immediately after action $a_t$, but before the arrival of the new information $W_{t+1}$.
Example: Tic-tac-toe game
- Pre-decision state: $S_0$ is an empty board.
- We make our move: $a_t$.
- The board immediately after $a_t$ is the post-decision state $S_t^a$ (a deterministic process).
- Our opponent makes his move (a random process), and the next pre-decision state $S_{t+1}$ is reached.
Rewriting Optimality Equations Using Post-Decision State
If we substitute
$V_t(S_t) = \max_{a_t} \left( C(S_t, a_t) + \gamma V_t^a(S_t^a) \right)$
into $V_t^a(S_t^a) = \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t^a \}$, we obtain
$V_{t-1}^a(S_{t-1}^a) = \mathbb{E}\left\{ \max_{a_t} \left( C(S_t, a_t) + \gamma V_t^a(S_t^a) \right) \mid S_{t-1}^a \right\}.$
Post-decision state and Q-learning: the post-decision state is exactly the key concept used in Q-learning, which focuses on learning the value of being in state $s$ and taking an action $a$, denoted as $Q(s, a)$.
3.2 Forward Dynamic Programming Using the Post-Decision State Variable
Step 0. Initialization.
- Step 0a. Initialize $\bar{V}_t^{a,0}(S_t^a)$ for all post-decision states $S_t^a$ (e.g., to 0).
- Step 0b. Set $n = 1$.
- Step 0c. Choose an initial state $S_0^1$.
Step 1. Choose a sample path $\omega^n$.
- Step 2. Do for $t = 0, 1, \ldots, T$:
- Step 2a. Solve: $\hat{v}_t^n = \max_{a_t} \left( C(S_t^n, a_t) + \gamma \bar{V}_t^{a,n-1}\left( S^{M,a}(S_t^n, a_t) \right) \right)$ and let $a_t^n$ be the action that solves the maximization problem.
- Step 2b. Update $\bar{V}_{t-1}^{a,n-1}$ using $\bar{V}_{t-1}^{a,n}(S_{t-1}^{a,n}) = (1 - \alpha_{n-1}) \, \bar{V}_{t-1}^{a,n-1}(S_{t-1}^{a,n}) + \alpha_{n-1} \hat{v}_t^n$
- Step 2c. Compute $S_t^{a,n} = S^{M,a}(S_t^n, a_t^n)$ and $S_{t+1}^n = S^M(S_t^n, a_t^n, W_{t+1}(\omega^n))$
Step 3. Let $n = n + 1$. If $n \le N$, go to step 1.
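The forward pass can be sketched on a toy problem. Everything problem-specific here is assumed for illustration: a wear-and-replacement setting where keeping costs the wear level, replacing costs 5 and resets wear to 0, and the random information adds 0 or 1 unit of wear. Because these are costs, the one-step problem takes a min rather than the max in the text.

```python
import numpy as np

def forward_adp(n_states=10, n_iters=500, beta=0.9, alpha=0.1, seed=0):
    """Forward ADP with a lookup-table value around the post-decision state.

    Vbar[s_post] approximates the cost-to-go immediately after acting,
    before the random wear increment W arrives.
    """
    rng = np.random.default_rng(seed)
    Vbar = np.zeros(n_states)
    s, s_post_prev = 0, 0
    for _ in range(n_iters):
        # Step 2a: solve the one-step problem using Vbar at the
        # candidate post-decision states (keep -> s, replace -> 0).
        q_keep = s + beta * Vbar[s]
        q_replace = 5.0 + beta * Vbar[0]
        vhat = min(q_keep, q_replace)
        # Step 2b: smooth the observation into the PREVIOUS post-decision
        # state's estimate (that state led to the current pre-decision s).
        Vbar[s_post_prev] = (1 - alpha) * Vbar[s_post_prev] + alpha * vhat
        # Step 2c: record the new post-decision state, sample W, step on.
        s_post = s if q_keep <= q_replace else 0
        w = int(rng.integers(0, 2))
        s = min(s_post + w, n_states - 1)
        s_post_prev = s_post
    return Vbar
```

Only states on the simulated sample path get updated, which is exactly the point of stepping forward through time.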
3.3 Fitting a Value Function Approximation
3.3.1 Multilevel Aggregation
Assume the state $S$ is multidimensional, and may include discrete, continuous and categorical elements.
Let $\mathcal{S}$ be the original state space, and let $\mathcal{S}^{(g)}$ be the state space at the $g$th level of aggregation.
- $\mathcal{S}^{(0)}$ is the most disaggregate level, but even at this level, $\mathcal{S}^{(0)}$ is a set of discrete values.
Estimate a single value for a discrete state at each level of aggregation. Also, only update states that we actually visit.
- Choose a single level of aggregation.
- Which is the right level of aggregation?
- Use a weighted sum of estimates from different levels of aggregation: $\bar{V}(s) = \sum_g w^{(g)}(s) \, \bar{V}^{(g)}(s)$, with the weights summing to one.
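A two-level version can be sketched as follows. The grouping rule (consecutive states in blocks) and the visit-count weighting heuristic are both assumptions for illustration; they are not the bias-variance optimal weights discussed in the ADP literature.

```python
import numpy as np

class MultilevelEstimate:
    """Value estimates at two aggregation levels, combined by a simple
    visit-count weight: lean on the aggregate level until the
    disaggregate state has been visited often."""

    def __init__(self, n_states, group=5, alpha=0.2):
        n_groups = (n_states + group - 1) // group
        self.group, self.alpha = group, alpha
        self.v = [np.zeros(n_states), np.zeros(n_groups)]   # estimates
        self.n = [np.zeros(n_states), np.zeros(n_groups)]   # visit counts

    def update(self, s, v_hat):
        # smooth each observation into both levels' estimates
        for level, idx in enumerate((s, s // self.group)):
            self.v[level][idx] += self.alpha * (v_hat - self.v[level][idx])
            self.n[level][idx] += 1

    def value(self, s):
        g = s // self.group
        w = self.n[0][s] / (self.n[0][s] + 1.0)   # weight on disaggregate
        return w * self.v[0][s] + (1 - w) * self.v[1][g]
```

An unvisited state borrows entirely from its group's aggregate estimate, which is how aggregation generalizes across states that were never visited.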
3.3.2 Basis Functions
- $f \in \mathcal{F}$ is referred to as a feature
- $\phi_f(s)$ is a function that is believed to provide explanatory power to the value
- Linear in the parameters: $\bar{V}(s \mid \theta) = \sum_{f \in \mathcal{F}} \theta_f \phi_f(s)$
The new challenge here is estimating the parameter vector $\theta$. A simple strategy is to use a stochastic gradient algorithm, given by
$\theta^n = \theta^{n-1} - \alpha_{n-1} \left( \bar{V}(s^n \mid \theta^{n-1}) - \hat{v}^n \right) \phi(s^n).$
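One step of that update can be sketched directly; the feature vector and target below are illustrative placeholders.

```python
import numpy as np

def sgd_update(theta, phi_s, v_hat, alpha=0.05):
    """One stochastic gradient step for a linear value approximation.

    With Vbar(s) = theta . phi(s) and a sampled observation v_hat,
    minimize (Vbar(s) - v_hat)^2 / 2; the gradient with respect to
    theta is (Vbar(s) - v_hat) * phi(s).
    """
    error = theta @ phi_s - v_hat
    return theta - alpha * error * phi_s
```

Repeated updates at a fixed state drive the prediction to the observed target, since the error shrinks by a constant factor each step.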
3.3.3 Other Methods
E.g., Neural network, K-nearest neighbor.
3.3.4 Example: Rust’s model for bus replacement
Using a linear combination of basis functions to approximate the value function for the pre-decision state
- States: $x \in [0, \bar{x}]$, i.e., a continuous state
- Approximation of the value function: $\bar{V}(x \mid \theta) = \theta_0 + \theta_1 x + \theta_2 x^2$ (a low-order polynomial basis)
- For $n = 1, \ldots, N$:
- Calculate $\hat{v}^n$ by random sampling
- Update $\theta$ based on the new data point: $\theta^n = \theta^{n-1} - \alpha_{n-1} \left( \bar{V}(x^n \mid \theta^{n-1}) - \hat{v}^n \right) \phi(x^n)$
Using a DNN to approximate the value function for the post-decision state
Approximation of the Q value: $\bar{Q}(x, i \mid \theta)$, a deep neural network
- For $n = 1, \ldots, N$:
- Decide the action $i^n$ by the $\epsilon$-greedy policy, i.e., the best action with probability $1 - \epsilon$, and a random action with probability $\epsilon$
- Observe the next state $x^{n+1}$
- Update $\theta$ with the new data point $(x^n, i^n, r^n, x^{n+1})$, where the target is $\hat{q}^n = r^n + \beta \max_{i'} \bar{Q}(x^{n+1}, i' \mid \theta)$
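The same loop can be sketched with a lookup table standing in for the DNN. The environment here is an assumed toy chain, not the bus problem: action 1 moves one state to the right, action 0 stays put, and reward 1 is earned on reaching the last (terminal) state.

```python
import numpy as np

def q_learning(n_states=5, n_actions=2, episodes=2000, gamma=0.9,
               alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy on a toy chain."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            # epsilon-greedy: random action with probability eps
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next = min(s + a, n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            bootstrap = 0.0 if s_next == n_states - 1 else Q[s_next].max()
            Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])  # TD update
            s = s_next
    return Q
```

The learned values decay geometrically with distance from the goal, and "move right" dominates "stay" in every state.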
4 Reinforcement Learning
DP and ADP assume a known probability model. However, we often do not have access to these probabilities in the real world. Even if we know the probability model, it can be very challenging to compute the updates, since the real-world model is too large or too complex.
Thus, we need to interact with the actual (or simulated) environment.
- That is, we obtain individual experiences of next state and reward.
- RL learns Value Functions from a stream of individual experiences.
- In general,
- RL for prediction: estimate the MRP value function for a policy $\pi$.
- RL for control: learn optimal value function.
- RL algorithms are mostly founded on the Bellman Equations
- RL control is based on Generalized Policy Iteration
- The key is incrementally updating Q-value function from experiences
- Q-value: the value of the combination of a state and an action, $Q(s, a)$.
- We need to find an appropriate approximation of Q-value function.
Value Function-based and Policy-based RL
- Value Function-based
- Learn the Value Function (with a function approximation)
- The policy is implicit: readily derived from the Value Function (e.g., $\epsilon$-greedy)
- Policy-based
- Learn the Policy (with a function approximation)
- No need to learn a Value Function
- Actor-Critic
- Learn the Policy (Actor)
- Learn the Value Function (Critic)
4.2 Interaction With Environment
- Monte Carlo (MC)
- Temporal-Difference (TD)
4.3 Key: Policy Gradient
- In policy iteration, we need to improve the policy by $\pi(s) \leftarrow \arg\max_a Q(s, a)$
- How do we do the argmax when the action space is large or continuous?
- Idea: do the Policy Improvement step with Gradient Ascent instead.
- Objective: the expected return from the present, $J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_t \gamma^t r_t \right]$.
- Assuming we know the Q-values, we can sample experiences to estimate the expectation over the random information.
- We can estimate the gradient of $J(\theta)$ (the policy gradient theorem):
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right].$
Policy gradient has many equivalent forms, and thus leads to multiple algorithms:
- Action-Value (Q) Actor-Critic
- Advantage Actor-Critic
- TD Actor-Critic
Action-Value (Q) Actor-Critic
Approximation of the Q value: $Q_w(s, a)$
- Update $w$ by TD (Temporal Difference)
Approximation of the policy: $\pi_\theta(a \mid s)$
- Update $\theta$ by the policy gradient
- Parametrize the policy by a Gaussian distribution: $a \sim \mathcal{N}(\mu_\theta(s), \sigma^2)$
- Sample $a_0$ from $\pi_\theta(\cdot \mid s_0)$
- For each step $t$:
- Get the reward $r_t$, and the transition to $s_{t+1}$
- Sample the action $a_{t+1}$ from $\pi_\theta(\cdot \mid s_{t+1})$
- Update $\theta$: $\theta \leftarrow \theta + \alpha_\theta \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q_w(s_t, a_t)$
- Update $w$: $w \leftarrow w + \alpha_w \left( r_t + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right) \nabla_w Q_w(s_t, a_t)$
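The Gaussian-policy update can be sketched on a one-state continuous bandit. This is a deliberate simplification of the full actor-critic above: with a single state and one-step episodes, Q reduces to the immediate reward and the critic reduces to a running-average baseline. The environment (reward $-(a-2)^2$, so the best mean action is 2) is assumed for illustration.

```python
import numpy as np

def gaussian_actor_critic(steps=5000, alpha_pi=0.01, alpha_v=0.1, seed=0):
    """One-state (bandit) simplification of the actor-critic updates.

    Actor: Gaussian policy a ~ N(mu, sigma^2); the score function is
    d log pi / d mu = (a - mu) / sigma^2. Critic: a running-average
    baseline b for the expected reward.
    """
    rng = np.random.default_rng(seed)
    mu, sigma, b = 0.0, 1.0, 0.0
    for _ in range(steps):
        a = rng.normal(mu, sigma)              # sample from the policy
        r = -(a - 2.0) ** 2                    # environment reward
        advantage = r - b                      # centered by the critic
        mu += alpha_pi * advantage * (a - mu) / sigma**2  # policy gradient
        b += alpha_v * (r - b)                 # critic update
    return mu
```

The mean drifts toward the reward-maximizing action while the policy keeps exploring through its fixed variance, which illustrates the "naturally explores" point below.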
- Finds the best Stochastic Policy (an Optimal Deterministic Policy, as produced by other RL algorithms, can be unsuitable for POMDPs)
- Naturally explores due to Stochastic Policy representation
- Effective in high-dimensional or continuous action spaces
- Small changes in $\theta$ ⇒ small changes in $\pi$, and in the state distribution
- This avoids the convergence issues seen in argmax-based algorithms
- Typically converge to a local optimum rather than a global optimum
- Policy Evaluation is typically inefficient and has high variance
- Policy Improvement happens in small steps ⇒ slow convergence
References
- Aguirregabiria, V., & Mira, P. (2010). Dynamic discrete choice structural models: A survey. Journal of Econometrics, 156(1), 38-67.
- Heckman, J. J., Lochner, L., & Taber, C. (1998). Explaining rising wage inequality: Explorations with a dynamic general equilibrium model of labor earnings with heterogeneous agents. Review of Economic Dynamics, 1(1), 1-58.
- Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (Vol. 703). John Wiley & Sons.