This week
- Watch Generalization.
- The readings are Gordon (1995) and Baird (1995).
- When there are many states, we can use features to represent them (e.g., grouping states that share common features, or using some base states to represent the other states).
- Inductive bias: refers to the way an algorithm prefers one solution over another.
- The representation can stay unchanged even when the states change(?)
- In RL, the goal is to learn some input-output mapping.
- e.g., a policy maps states -> actions
- a value function maps states (or state-action pairs) -> expected return
- a model maps states and actions -> transitions and rewards (T and R)
- Generalization can happen on all these levels
- Most researchers are focused on value function approximation.
- Q is represented by a function approximator F(w_a, f(s)), where w_a is a vector of weights for action a and f(s) turns a state s into a vector of features.
- An experience tuple <s, a, r, s'> comes in, and we do an update.
- TD error = the difference between the target (reward + discounted value of the next state) and the value of the current state. The TD error tells us the direction to move the parameters w (whether the prediction is too high, too low, or just right).
- α is the learning rate.
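A minimal sketch of this update loop in Python; q_value and grad_q are placeholders for F(w_a, f(s)) and its gradient with respect to w_a, and all parameter values are illustrative rather than from the lecture:

```python
import numpy as np

def td_update(w, s, a, r, s_next, q_value, grad_q, actions, alpha=0.1, gamma=0.9):
    """One Q-learning-style update from an experience tuple <s, a, r, s'>.

    w:       dict mapping each action to its weight vector w_a
    q_value: function (w_a, s) -> estimate of F(w_a, f(s))   (placeholder)
    grad_q:  function (w_a, s) -> dF/dw_a                    (placeholder)
    """
    target = r + gamma * max(q_value(w[b], s_next) for b in actions)
    td_error = target - q_value(w[a], s)               # prediction too low (+) or too high (-)
    w[a] = w[a] + alpha * td_error * grad_q(w[a], s)   # move w_a in the direction of the error
    return w
```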
Linear Value Function Approximation
- The Q-value for an action is represented by features and a set of weights: Q(s, a) = Σ_i w_ai * f_i(s)
- The w_ai are the weights on the features f_i(s); they represent the importance of each feature.
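In the linear case the gradient of Q with respect to w_a is just f(s), so the update sketched above reduces to w_a += α * δ * f(s). A tiny sketch with made-up features and weights:

```python
import numpy as np

f_s = np.array([1.0, 0.5, 0.0])           # hypothetical features f(s)
w = {"left":  np.array([0.2, -0.1, 0.4]), # one weight vector per action
     "right": np.array([0.0,  0.4, 0.1])}

# Q(s, a) = sum_i w_ai * f_i(s); a larger |w_ai| means feature i matters more for action a.
q_left  = np.dot(w["left"],  f_s)         # 0.2*1.0 - 0.1*0.5 + 0.4*0.0 = 0.15
q_right = np.dot(w["right"], f_s)         # 0.0*1.0 + 0.4*0.5 + 0.1*0.0 = 0.20
```

These plug into the td_update sketch above with q_value = lambda w_a, s: np.dot(w_a, f(s)) and grad_q = lambda w_a, s: f(s).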
Success and failure stories
- 3-layer backprop network: decision making in Jeopardy
- CMAC (Cerebellar Model Articulation Controller; see Wikipedia)
- V(s') = V(7) = 8; V(s) = V(1) = 3; the derivative of the value with respect to w0 is 1, so Δw0 = 0.1 * (0 + 0.9 * 8 - 3) * 1 = 0.42.
- Δw0 is 0.42, and the Δwi for i = 0..6 are all positive, so those weights increase with each update.
- Δw7 is negative when we update on the tuple <7, 0, 7>, but w7 will still increase, because its original value is much greater than Δw7.
- So, even though the algorithm is doing the right thing locally, the updates will never converge.
- If instead all the wi are 0, the temporal differences are all zero, so no updates (changes) will happen; the weights are stuck ("sticky").
- In general, once all the values are zero, the system cannot escape that situation unless some reward r is non-zero. If the rewards are all zero too, then the all-zero values are in fact the solution of the MDP.
- This holds when the MDP is deterministic.
- The lesson: shared weights can fail to converge! (A small simulation of this is sketched below.)
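A tiny simulation of this "locally right, globally divergent" behaviour. This is the classic two-state example where V(s1) = w and V(s2) = 2w share a single weight (it is not the exact MDP from the lecture): all rewards are zero, so the true solution is w = 0, yet repeatedly updating on the transition s1 -> s2 makes w grow without bound:

```python
alpha, gamma = 0.1, 0.9
w = 1.0                                    # shared weight; the true values are all 0
for step in range(20):
    v_s, v_next = 1.0 * w, 2.0 * w         # V(s1) = w and V(s2) = 2w share the weight
    td_error = 0.0 + gamma * v_next - v_s  # r + gamma*V(s') - V(s) = (2*gamma - 1)*w
    w += alpha * td_error * 1.0            # gradient of V(s1) with respect to w is 1
print(w)                                   # w has grown every step: no convergence
```

Each update nudges V(s1) toward r + γ V(s2), which looks locally correct, but because V(s2) shares the same weight it gets dragged up too, so the target keeps running away.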
Averagers
- Problem: given the values at certain anchor points, estimate the underlying function (e.g., a line). How do we estimate the points in between?
- Naturally: take a weighted average of the neighbours, i.e., a convex combination of the anchor points.
- For points that are far from the anchor points, a good estimate is the plain average of the anchor points.
- More anchor points lead to less error in estimating the MDP.
- Using base states, we can represent the original MDP with an MDP over S_bs, the set of base states.
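A minimal sketch of an averager in this spirit: the value of any state is a convex combination of the values at a few anchor (base) states. The anchor states, their values, and the inverse-distance weighting are illustrative choices, not the lecture's:

```python
import numpy as np

anchor_states = np.array([0.0, 5.0, 10.0])   # hypothetical base states S_bs
anchor_values = np.array([3.0, 8.0, 4.0])    # their (assumed known) values

def averager_value(s):
    d = np.abs(anchor_states - s)
    if np.any(d == 0):                       # exactly on an anchor point
        return float(anchor_values[np.argmin(d)])
    weights = 1.0 / d                        # closer anchors count more
    weights /= weights.sum()                 # non-negative and sum to 1: a convex combination
    return float(np.dot(weights, anchor_values))

print(averager_value(2.5))   # in between: a weighted average of the nearby anchors
print(averager_value(100))   # far away: roughly the plain average of the anchors
```

Because every value is a fixed convex combination of the base-state values, solving the approximate problem amounts to solving a smaller MDP over the base states, which is why averagers can themselves be viewed as MDPs.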
What have we learned
- Need for generalization: some problems have zillions of states. Method: apply supervised learning methods.
- Methods: linear function approximation.
- What can be approximated: 1) value functions; 2) policies; 3) models.
- Successes: TD-Gammon, Atari, Fitted Q-Iteration (but there are still cases where these need not work).
- Problem cases: even linear approximation need not work (Baird's counterexample).
- Averagers: can be viewed as an MDP themselves; use anchor points to represent the whole MDP.
- LSPI: Least-Squares Policy Iteration (a rough sketch follows below).
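A rough sketch of LSTDQ, the policy-evaluation step used inside LSPI; phi, policy, and samples are placeholders the caller would supply, and this is only an illustration of the idea, not the lecture's code:

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.9):
    """Least-squares fit of a linear Q-function for a fixed policy.

    samples: list of (s, a, r, s_next) experience tuples
    phi:     feature function phi(s, a) -> vector of length k   (placeholder)
    policy:  function s -> a, the policy being evaluated        (placeholder)
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.pinv(A) @ b   # weights w of Q(s, a) ~ w . phi(s, a)

# LSPI alternates: w = lstdq(...), then make the policy greedy with respect
# to Q_w, and repeat until w stops changing.
```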
2015-10-13 first draft, up to quiz 1
2015-10-17 completed
2015-12-04 reviewed and revisited