9. Planning under Uncertainty -- Nov 8, 2011

Video 9.1. Introduction
put together material from past classes to make decisions under uncertainty

Video 9.2. Planning Under Uncertainty MDP
see image: 09class_venn1.jpg 09class_venn2.jpg
combine planning and uncertainty: MDP -- Markov Decision Processes, POMDP -- Partially Observable Markov Decision Processes
combine planning, uncertainty, and learning: Reinforcement Learning
note: remember that Partially Observable means you need memory...
What is an MDP? -- a state transition graph where the outcomes of actions are somewhat random
see image: 09class_MDPstates.jpg
states: S1 ... Sn
actions: A1 ... Ak
state transition matrix: T( S, A, S' ) (state -> action -> next state), which is "just about the same thing as" P( S' | A, S ), the probability that state S' is the correct posterior state after executing action A in state S
the missing thing is the objective for the MDP -- what does one achieve? for that we often define a reward function R( S ); for this lecture each state has a reward that says how good it is...
the planning problem is to assign an action to each possible state so that we maximize our total reward

Video 9.3. Robot Tour Guide Examples
fun ok... environments are stochastic

Video 9.4. MDP Grid World
absorbing states -- goals
simple grid world with a start and two goals, one good, one bad
80% chance of going to the intended square on a move; if at a wall, 80% chance of bouncing back to the same cell; 10% each of going at 90 degrees from the original cell (not diagonal to the target cell!! but the story changes in vid 9.6) (and he only shows the 80% case for up-motions)
note that in a corner it's a 90% chance of bouncing back: 80% from the wall in the intended direction and 10% from the wall to the side
we want a planning method that assigns an answer no matter where we end up (given the stochastic moves); that's a Policy: pi( S ) -> A, which assigns a desirable action to any/all states
see image: 09class_gridworld.jpg
red arrows are the Policy
the planning problem is to find the optimal Policy

Video 9.5. Problems with Conventional Planning 1
see image: 09class_branchingFactor.jpg
with stochastic environments the Branching Factor of action outcomes gets large
with conventional planning you have to follow more than one possible path

Video 9.6. Branching Factor Question
see image: 09class_branchingFactor.jpg
Quiz: how many states can you reach under any possible action from B3? -- 8
NOT the number of action -> outcome pairs but the number of cells you can reach
also note that he changes the end states for the 10%'s from beside (in vid 9.4) to diagonal here...!!!??? and then changes back in vid 9.10...???!!!
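To make the grid world dynamics concrete, here is a minimal Python sketch of the 80/10/10 transition model from Videos 9.4 and 9.6 (the right-angle slip version). The exact layout -- rows A-C, columns 1-4, goals at A4/B4, a blocked cell at B2 -- and all of the names are my own assumptions pieced together from the worked examples later in these notes, not code from the lecture.

```python
# Sketch of the stochastic grid world transitions (assumed layout: rows A-C, columns 1-4).
ROWS = "ABC"                     # row A is the top row
N_COLS = 4                       # columns 1..4
ABSORBING = {"A4", "B4"}         # the +100 and -100 goal states
BLOCKED = {"B2"}                 # assumed obstacle cell, implied by the B3 examples in vid 9.20

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
SLIPS = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}   # the two 90-degree slip directions

def step(state, direction):
    """One deterministic move; bounce back off the grid edge or the blocked cell."""
    r, c = ROWS.index(state[0]), int(state[1])
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    target = f"{ROWS[nr]}{nc}" if 0 <= nr < len(ROWS) and 1 <= nc <= N_COLS else None
    return state if target is None or target in BLOCKED else target

def transition(state, action):
    """Return P(S' | S, A) as a dict mapping successor state -> probability."""
    if state in ABSORBING:
        return {state: 1.0}      # absorbing: stay put
    probs = {}
    for direction, p in [(action, 0.8)] + [(slip, 0.1) for slip in SLIPS[action]]:
        s2 = step(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

# transition("A3", "E") -> {"A4": 0.8, "A3": 0.1, "B3": 0.1}
# transition("B3", "N") -> {"A3": 0.8, "B4": 0.1, "B3": 0.1}   (the West slip bounces off B2)
```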
Video 9.7. Problems with Conventional Planning 2
the Branching Factor is large
Trees can be very deep due to infinite loops of "failed" actions
Many states reoccur in the search and are visited more than once; it doesn't matter how you got to a state, but it may still be expanded more than once
These problems are overcome by using Policies

Video 9.8. Policy Question 1
Quiz: in the grid world, find a policy that will maximize the probability of reaching the +100 goal state, given the stochastic NEWS actions
what is the optimal action from spot A1? Go East, to get closer to the goal

Video 9.9. Policy Question 2
what is the optimal action from spot C1? Go North, to get closer to the goal
there are two equal-length paths, but going East risks falling into the -100 goal

Video 9.10. Policy Question 3
what is the optimal action from spot C4? (no cost for steps, just maximize the probability of getting to the +100 goal)
Go South, to avoid bouncing into -100 on a failure of East, and to have a 10% chance of going West "by accident"... (except of course this goes back to right-angle failures rather than diagonals...)

Video 9.11. Policy Answer 3, Question 4
what is the optimal action from spot B3? Go West, to avoid bouncing into -100 on a failure of North

Video 9.12. MDP and Costs
see image: 09class_MDPcost.jpg
Would normally have a cost factor as a Reward function over every possible state, where reaching the state gives:
R( S ) -> +100 in state A4, -100 in state B4, -3 in all other states
so taking a step somewhere costs -3, which gives an incentive to shorten the final action sequence
The objective of the MDP is to maximize the sum of all future rewards, where t is a time step:
E[ sum(t) R(t) ] -> max
Maximize the Expectation of the sum of Rewards over the time steps; find the Policy that maximizes this expression...
Sometimes a Discount Factor gamma (written Y here) -- say 0.9 -- is put in, raised to the power of the time step, over future rewards:
E[ sum(t) Y^t * R(t) ] -> max
which decays future rewards relative to immediate rewards.
This is an alternative way to minimize the number of steps (instead of having a step cost like "-3"), because it discounts the value of finally getting the reward by the number of steps. It also keeps the Expectation bounded, because it is "easy to show" that it is at most 1/(1-Y) times the maximum absolute reward (in this case +100):
E[ sum(t) Y^t * R(t) ] <= ( 1 / (1 - Y) ) * |Rmax|

Video 9.13. Value Iteration 1
see image: 09class_MDPvalue.jpg
The definition of the Expectation of (possibly discounted) rewards lets me define a Value Function.
For each state S, the value of the state is the Expected sum of discounted rewards, provided that I start in state S and execute Policy pi:
V^pi( S ) = E_pi[ sum(t) Y^t * R(t) | S0 = S ]
For any policy you can simulate the agent and compute the average reward received until you hit a goal state.
There is a well-defined Expectation over every possible execution of the policy, specific to each state and policy, called the Value Function.
Planning is iterating and computing Value Functions, which will also find better Policies.

Video 9.14. Value Iteration 2
The Value Function leads you on a path such that Hill Climbing leads you to the goal.
The algorithm is recursive and spreads value through the space; after a number of iterations (in the animation) it makes a gray-scale value map that leads you to the goal... follow the gradient...

Video 9.15. Value Iteration 3
recursively calculate the Value Function until you arrive at the optimal one
start with 0 in all cells except the goals, where we know the value
recalculate the current value -- is it good?
example: in A3, going East we have an 80% chance of getting the +100 reward, a .1 chance of slipping North (bouncing back into A3, value 0), and a .1 chance of slipping South into B3 (value 0); the -3 is the reward R(S) of the current state and is added once, outside the sum (not multiplied by the .1's), so:
V( A3, E ) = 0.8 * 100 + 0.1 * 0 + 0.1 * 0 - 3 = 77
then propagate backwards to the other states further from the goal
To get V(S) you go through all possible successor states S' (from a lookup table) and Sum the probability of getting to S' times the Value of S':
sum(S') P( S' | S,A ) * V( S' )
discount it by gamma, add the Reward (or cost) of the current state, and maximize over the possible actions.
This Backup Equation converges to the Bellman equation (in terminal states the value is R(S) itself):
V( S ) <- max(A) [ Y * sum(S') P( S' | S,A ) * V( S' ) ] + R( S )
see image: 09class_MDPvalueIter.jpg
the values represent the optimal future reward for each state
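And a minimal value iteration sketch of the backup equation above, assuming the transition(), ABSORBING, and BLOCKED helpers from the earlier sketch are in scope; GAMMA, COST, and the stopping threshold are illustrative values matching the -3-cost grid world.

```python
# Value iteration for the grid world sketched after Video 9.6 (same assumptions as there).
GAMMA = 1.0                                   # Y, the discount factor
COST = -3.0                                   # R(S) for non-absorbing states
REWARD = {"A4": 100.0, "B4": -100.0}          # R(S) for the absorbing goals
ACTIONS = "NESW"
STATES = [f"{r}{c}" for r in "ABC" for c in range(1, 5) if f"{r}{c}" not in BLOCKED]

def value_iteration(eps=1e-6):
    """Repeat the backup V(S) <- max_A [ Y * sum_S' P(S'|S,A) V(S') ] + R(S) until it settles."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in ABSORBING:
                new_v = REWARD[s]             # terminal value is R(S) itself
            else:
                new_v = COST + max(
                    GAMMA * sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                      # in-place (asynchronous) update
        if delta < eps:
            return V

# usage: V = value_iteration(); V["A3"], V["B3"], ...
```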
Video 9.16. Deterministic Question 1
same grid world as always, deterministic transitions, gamma = 1, cost for each non-absorbing state = -3
Quiz: value of A3 = 97
easy to see that East is the maximum action -> +100; we have a single successor state, P(S') = 1, and we add the reward/cost -3 of the current state

Video 9.17. Deterministic Question 2
Quiz: value of B3 = 94
the optimal action (because deterministic) is North, for a value of 97, then add the -3 cost

Video 9.18. Deterministic Question 3
after doing all the shuffling, the values step down 97 - 94 - 91 - 88 -> 85 along either path
Quiz: value of C1 = 85
see image: 09class_detValue.jpg
For deterministic actions the value of a state is the goal reward minus the distance to the absorbing state times the 3-per-step cost.

Video 9.19. Stochastic Question 1
same grid world as always, stochastic transitions (0.8 goes where it should, 0.1 to each side), gamma = 1, cost for each non-absorbing state = -3
Quiz: value of A3 = ( 0.8 * 100 ) - 3 = 77
"easy to see" that East is the maximum action -> +100, with an 80% chance that the East action reaches the goal
we have three successor states, P(S') = .8, .1, .1, and the two non-goal ones have 0 value to start with here
e.g., going West has a value of -3 on this first iteration
then add the Cost -3 of the current state

Video 9.20. Stochastic Question 2
after calculating a3 = 77, the other states are still 0
Quiz: the very first value of B3 = 0.8 * 77 + 0.1 * -100 + 0.1 * 0 - 3 = 48.6, where North is the maximum action...
going North: 80% chance of getting a3 = 77, 10% of getting b4 = -100, 10% of bouncing back to b3 = 0, plus Cost = -3:
0.8 * 77 + 0.1 * -100 + 0.1 * 0 + -3 == 48.6
going West: 80% chance of bouncing back to b3 = 0, 10% of getting a3 = 77, 10% of getting c3 = 0, plus Cost = -3:
0.8 * 0 + 0.1 * 77 + 0.1 * 0 + -3 == 4.7
"At this point going North is _vastly_ superior, because we had a great value of 77 going North and the initial value in b3 was 0"
see image: 09class_stocValue.jpg

Video 9.21. Value Iterations and Policy 1
The Value Backup Function defines the optimal Policy: once values have been backed up (iterated enough times), the action to pick in each state is the one that maximizes the VBF.
For any state S we can define a Policy which picks the Action A that maximizes the VBF; for that maximization we can safely drop gamma and R(S):
pi( S ) = argmax(A) sum(S') P( S' | S,A ) * V( S' )

Video 9.22. Value Iterations and Policy 2
same grid world as always, stochastic transitions (0.8 goes where it should, 0.1 to each side), gamma = 1, cost for each non-absorbing state = -3
see image: 09class_policyValue.jpg
Values after convergence in blue and Policy actions in red -- they point to higher-value states.
This is a situation where the risk of falling into the -100 goal is balanced against the time spent going around, because of the Cost = -3; i.e., c4 goes West with a small chance of diverting into b4 = -100.
But if the Cost for each non-absorbing state = 0, things change... each state value after Backup is 100, and we can guarantee that we get to the +100 goal
see image: 09class_policy0Value.jpg
where the Policy is the one described before: c4 -> S, b3 -> W, to avoid falling into the -100 state (however it isn't described how we know that from the VBF....)
Setting the Cost to -200 you get a situation where the agent tries to end the game as fast as possible, so as to not overspend...
see image: 09class_policy-200Value.jpg
(...setting a cost such that even negative death is better than living...)
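A minimal sketch of reading the policy off the converged values, using the argmax rule from Video 9.21 (gamma and R(S) dropped); it assumes value_iteration() and the other helpers from the sketches above are in scope.

```python
def extract_policy(V):
    """pi(S) = argmax_A sum_S' P(S'|S,A) * V(S') for every non-absorbing state."""
    pi = {}
    for s in STATES:
        if s in ABSORBING:
            continue                          # no action needed in a goal state
        pi[s] = max(
            ACTIONS,
            key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()),
        )
    return pi

# usage: pi = extract_policy(value_iteration())
# re-running with COST = 0 or COST = -200 should reproduce the qualitative
# policy changes described above (safe detours vs. ending the game quickly).
```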
Video 9.23. MDP Conclusion
see image: 09class_MDPsummary.jpg

Video 9.24. Partial Observability Introduction
Not going into the full depth of Partially Observable planning, just giving a feeling...

Video 9.25. POMDP vs MDP
POMDPs address the problem of optimal exploration versus exploitation, where some of the actions might be for information gathering and others might be goal driven
with MDPs the space is fully Observable, so no information gathering is needed

Video 9.26. POMDP
for a simple maze:
see image: 09class_mazeFOD.jpg
fully observable and deterministic -- has an optimal plan that cuts the corners as closely as possible
see image: 09class_mazeFOS.jpg
fully observable and stochastic -- has a lot of possible control policies
see image: 09class_mazePOS.jpg
partially observable and stochastic -- in this case the agent position is observable but the goal rewards are unknown at the start; a sign at the other end of the maze tells which goal is which, so the policy is to move South to the sign first, even though it would normally just go North to the exits
Can we devise a plan that understands that an information-gathering detour is necessary?
A solution that doesn't work... solve for the agent in each of the two possible worlds (reward on the left or on the right) and average -- that can't solve both
see image: 09class_POMDPintro.jpg
What works: Information Space or Belief Space -- do the planning in what you know about the space; moving around changes your Belief State
before reading the sign you have a 50% chance of each route
so do the MDP Value Backup Function trick in belief space: "pour water" into each arm to make gradients, which flow back through the state transition at the sign and create a gradient for the agent to go South first

Video 9.27. Planning Under Uncertainty Conclusion
now we know everything there is to know about PUU... can use MDPs for robot motion planning, and all kinds of neat stuff...
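Endnote, referring back to the belief-space idea in Video 9.26: a tiny sketch of how the belief over "which exit is +100" behaves. The 50/50 prior and the +/-100 exits come from the lecture; the observation strings and function names are illustrative assumptions, not the lecture's maze.

```python
def update_belief(belief, observation):
    """belief = P(left exit is +100); it collapses once the sign has been read."""
    if observation == "sign: left is +100":
        return 1.0
    if observation == "sign: right is +100":
        return 0.0
    return belief                              # no information gathered, belief unchanged

def best_exit_value(belief):
    """Expected reward of committing to the better exit under the current belief."""
    left = belief * 100 + (1 - belief) * (-100)
    return max(left, -left)                    # the right exit is the mirror image

# Before the detour: best_exit_value(0.5) == 0, so neither exit looks good.
# After reading the sign the belief is 0 or 1 and best_exit_value(...) == 100,
# which is why the value backup in belief space pulls the agent South to the sign first.
```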