Advanced Algorithmic Analysis
Value iteration

1 tells us that VI converges in a reasonable amount of time t* (finite). We know t* exists, but we don’t know when that will happen.

2 Gives us an bound of difference between the policy now and optimal policy. We can take this bound and test when it is a decent time to stop VI and take the current policy.

Both 1 and 2 encourage us choosing a small γ. (γ tells us how further in the future we should look). The effective horizon is H ~ 1/(1γ).

3 applying Bellman Operator K times to Q functions will shrink the distance between Q functions.
Linear Programming

Only one way to solve MDP in polynomile time：solving Bellman equation through Linear Programming. A linear programming problem may be defined as the problem of maximizing or minimizing a linear function subject to linear constraints. ( Defination extracted form this PDF)

Bellman equation has one part that is not linear, the max function. But it can be expressed by a series of linear function and a objective function
min
.
 in the primal, the objective function is the minimum of sum of all the V_{s};
 in linear programming, we can change constraints to variables and variables to constraints, and the resulting linear program is equivalent to the old one.
 Dual is a new linear program comes from the old primal version of linear program (没有推导过程）
 q_{sa} is “Policy flow”, maximize the expected rewards of all states.
 For each possible next state, we wanted it to be true that the amount of policy flows going through the next state should be equal to the number of the state it has been visited.
Policy Iteration
 Initialize the first step Q value of all the states to be 0; improve the policy at time t; Apply the policy to calculate the Q value of t+1 step.
 Convergence time is an open question (but it is finite): >= linear, <= exponential
 For every state, if the value of it follows π_{1} always equals or is larger than the value when it follows π_{2}, we say π_{1} dominates π_{2}.
 if π_{1} dominates π_{2} and there exits states that V^{π1}(s) > V^{π2}(s), we say π_{1} strict dominates π_{2}.
 if for every state, the distance between the value following policy π and following the optimal policy π* is no larger than ε, then π is εoptimal.
 B_{1} makes the update follows π_{1} and B_{2} makes the update follows π_{2}
 Applying B operator to two Values will not make them further apart, if the two values are the same, applying B will not make them closer.
 Theorem: if V_{1} dominates V_{2}, applying B will keep the ordering: BV_{1} >= BV_{2}.
 Q_{1} is the fix point of B_{1};
 π_{2} is greedy policy with respect to Q_{1}
 B_{2} is the Bellman operator of π_{2}
 Applying B_{2} (the greedy function wrt Q_{1}) on </sub>Q_{1} will add a bound to Q_{1}, thus getting a better Q_{1}. This is called Value Improvement.
 Value improvement (or value non deprovement): for each state, value will improved until it could not get better anymore.
 Monotonicity: Value can only get better after each iteration.
 Transitivity: a >= b, b >=c, so a >=c.
 Fixed point. If we apply B_{2} over and over again on Q1, we will reach the fixed point of B_{2},which is Q_{2}.
 In Value iteration, greedy policy converges in finite step. This does not necessarily mean value function will converge.
20150923 初稿
20150926 完成
20151204 reviewed and revised.