Conge 精进

Reinforcement Learning 第九周课程笔记

本文 2728 字,阅读全文约需 8 分钟

This week

  • Watch Generalization
  • The readings are Gordon (1995) and Baird (1995).



  • When there are a lot of states, we can use features to represent states (e.g. grouping states using some common features or using some base states to represent other stats)
  • Inductive bias: refer to the way algorithm prefer one solution against another.
  • representation can stay unchanged even when states changed(?)

Generalization idea

  • In RL, the goal is to learn some input-output mapping.
    • e.g. Policy maps state -> actions
    • value function maps state/actions to expected return
    • Model maps T/R (transition and reward)
    • Generalization can happen on all these levels
  • Most researchers are focused on value function approximation.

Basic Update Rule

  • Q is represented by function approximator F(wa, f(s)). wa is some kind of weights, f(s) takes a state s and generate a series of features.
  • experience tuple <s,a,r,s’> comes in, and do an update.
  • TD error= the difference between predicted value (reward + discounted value of next state) and value of current state. TD error tells us the direction to move the parameters w. (the prediction is high, low or just right).
  • α is learning rate.

Linear Value Function Approximation

Linear value function

  • Q value for an action is represented by features and a set of weights
  • wai are weights of features fi(s), which represent importance of features

Quiz 1

Success and fail stories

Does it work?

  • 3-layer back prop: decision making in Jeopardy
  • CMAC(wikipedia) :

Baird's Counter example

Quiz 2: Bad Update Sequence

  • V(s’) = V(7) =8; V(s) = V(1) =3, derivative is w =1; so Δw0 = 0.1( 0 + 0.9 * 8 -3)*1 = 0.42;
  • Δw0 is 0.42, and Δwi, i = [0, 6] are all positive, so weight will increase with each update
  • Δw7 is negative when we update <7,0,7>, but w7 will still increase because it’s original value is » Δw7
  • So, even the algorithm is doing the right thing, the updates will never converge

Quiz 2: What if we initialize all the weights as 0

  • in this case, the temporal differences are all zeroes because wi is 0, which means no updates (changes) are going to happen. Sticky
  • If at any circumstance, when all values are zero, it could not escape the circumstance unless reward (r) is non-zero. If reward is also zero, we can call the circumstance the solution of the MDP.
  • this holds when MDP is deterministic.
  • the lesson: Shared weights can “not converge”!

Averagers: looks like  Linear Value Function Approximation but is non-linear

  • Problem: given a certain points of value, estimate the function of line. How to estimates the points in-between?
  • Naturally, weighted average of neighbours and get convex combinations of anchor points
  • for points that are far from anchor points, are good estimate will be the average of the anchor points.
  • more anchor points lead to less error in MDP estimation.

quiz 3

Connection to MDPs

  • Using base states, we can respresent the original MDP with MDP over Sbs

What have we learned


  • Need for generalization: some problem have Zillions of states; Method: applying supervised learning method .
  • Methods: Linear function approximation.
    • 1) value functions 2) policies; 3) models;
  • Success: TD gamma, Atari, Fitted Q-iteration (but still have need not work cases)
  • Problem cases: even linear need not work (baird counter problem)
  • Averagers: can be viewed as MDP itself. Use anchor points to represent the whole MDP.
  • LSPI: Least Squared policy iteration
2015-10-13 初稿 upto quiz 1
2015-10-17 完成
2015-12-04 reviewed and revisted