Conge 精进

AI 笔记 Week 15 Planning under uncertainty

本文 3835 字,阅读全文约需 11 分钟

Link to the videos Link to the video transcripts


This lecture will focus on marrying planning and uncertainty together to drive robots in actual physical roles and find good plans for these robots to execute.

Planning Under Uncertainty MDP

  • Planning + uncertainty: MDPs (Markov decision process) and POMDPs (partially observed MDP)
  • Planning + uncertainty + learning: Reinforcement Learning

Methods categorized based on the characteristics of the world (observability and certainty).

  • MDP deals with situations where the world is stochastic and fully observable.


  • MDP has states, actions and state’s transition matrix and a reward function. (it can attach to state, or action or transition?)
  • it becomes Markov if the outcomes of actions are somewhat random.
  • planning problem is depended on actions to each possible state. But the operational goal becomes maximizing total reward.

Robot Tour Guide Examples

All these robots need to deal with uncertainties and observabilities to do their jobs (tour guide or mine explorer).

MDP Grid World


Absorbing states: search will end if the agent is at the absorbing states. Policy assign action based on the state the agent is in.

Problems With Conventional Planning 1

Policy Question

Question: what is the best action to take when an agent is in states a1, c1, c4 and b3?

MDP And Costs

The reason that the agent should be avoiding the b4 state is the cost.

  • The goal is to find the best path wich cost less with finite steps.
  • The discount fact can bound the max result.

Value Iteration

  • value function can determine the policy based on the total reword.

Intuition of VI functions

  • The value function is a potential function that leads from the goal location all the way into the space so that hill climbing in this potential function leads you on the shortest path to the goal.
  • The algorithm is a recursive algorithm and it converges, and you have a grayscale value that really corresponds to the best way of getting to the goal. Bellman equation


Determistic question:

I Quiz: calculate Value when an agent was at each state.

Stochastic Question :

Calculation is complicated here because the reward of each action must be evaluated.

Value Iterations And Policy

The policy can be defined by the value function after the value of each cell is calculated. The action policy is to choose the action which leads to the highest path reward.

If the cost of each state is positive, the policy will encourage action to stay in the current state.

If the cost is too low, the value of each state might become so low that the agent will try to end the search as soon as possible without looking for an optimal solution

MDP Conclusion



  • MDP is for fully observable environments and planning in stochastic environments.
  • Partially observable environments, or POMDPs address problems of optimal exploration versus exploitation, where some of the actions might be information-gathering actions; whereas others might be goal-driven actions.


conventional planning

  • When the world observable and the action is deterministic, Conventional planning will work and plan the best path to the goal state.


  • When the world observable and the action is stochastic, MDP will work and find the best path to the goal state.


  • When the world partially observable (e.g. we don’t know where the absorbing states are) and the action is stochastic, POMDP will need to gather information and then work to find the best path to the goal state.

POMDP wouldn't work

POMDP would not work when there are two worlds So here’s a solution that doesn’t work: Obviously, the agent might be in 2 different worlds and it does not know. Solving the problem for both of these cases and then put these solutions together will not work because the average result will never let it go south to gather information.

POMDP will work

POMDP on belief states will work. If the agent goes south and reaches the sign. 50% chance it will go the right-side belief state. if MDP was performed, then it will reach the +100 state. the same will happen if it goes to left-side belief state (50% chance).

Readings on Planning under Uncertainty

AIMA: Chapter 17

Further Study

Charles Isbell and Michael Littmann’s ML course:

Peter Norvig and Sebastian Thrun’s AI course:

2018-12-01 First draft