9. Planning under Uncertainty -- Nov 8, 2011

Video 9.1. Introduction
put together material from past classes to make decisions under uncertainty

Video 9.2. Planning Under Uncertainty MDP
see image: 09class_venn1.jpg 09class_venn2.jpg
combine planning and uncertainty: MDP -- Markov Decision Processes, POMDP -- Partially Observable Markov Decision Processes
combine planning, uncertainty, and learning: Reinforcement Learning
note: remember that Partially Observable means you need memory...
What is an MDP? -- a state transition graph where the outcomes of actions are somewhat random
see image: 09class_MDPstates.jpg
states: S1 ... Sn
actions: A1 ... Ak
state transition matrix: T( S, A, S' ) (state -> action -> next state), which is "just about the same thing as" P( S' | A, S ), the probability that state S' is the correct posterior state after executing action A in state S
the missing thing is the objective for the MDP -- what does one achieve? for that we often define a reward function R( S ); for this lecture each state has a reward that says how good it is...
the planning problem is to assign an action to each possible state so that we maximize our total reward

Video 9.3. Robot Tour Guide Examples
fun ok... environments are stochastic

Video 9.4. MDP Grid World
absorbing states -- goals
simple grid world with a start and two goals, one good, one bad
80% chance of going to the intended square on a move; if at a wall, 80% chance of bouncing back to the same cell; 10% each of going at 90 degrees from the original cell (not diagonal to the target cell!! but the story changes in vid 9.6) (and he only shows the 80% case for up-motions)
note that in a corner it's a 90% chance of bouncing back: 80% from the wall in the intended direction and 10% from the wall to the side
we want a planning method that assigns an answer no matter where we end up (given the stochastic moves); that's a Policy: pi( S ) -> A, which assigns a desirable action to any/all states
see image: 09class_gridworld.jpg
red arrows are the Policy
the planning problem is to find the optimal Policy

Video 9.5. Problems with Conventional Planning 1
see image: 09class_branchingFactor.jpg
with stochastic environments the Branching Factor of action outcomes gets large
with conventional planning you have to follow more than one possible path

Video 9.6. Branching Factor Question
see image: 09class_branchingFactor.jpg
Quiz: how many states can you reach under any possible action from B3? -- 8
NOT the number of action -> outcome pairs but the number of cells you can reach
also note that he changes the end states for the 10%'s from beside (in vid 9.4) to diagonal here...!!!??? and then changes back in vid 9.10...???!!!
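To make the grid world dynamics concrete, here is a minimal Python sketch of the 80/10/10 transition model from Videos 9.4 and 9.6 (the right-angle slip version). The exact layout -- rows A-C, columns 1-4, goals at A4/B4, a blocked cell at B2 -- and all of the names are my own assumptions pieced together from the worked examples later in these notes, not code from the lecture.

```python
# Sketch of the stochastic grid world transitions (assumed layout: rows A-C, columns 1-4).
ROWS = "ABC"                     # row A is the top row
N_COLS = 4                       # columns 1..4
ABSORBING = {"A4", "B4"}         # the +100 and -100 goal states
BLOCKED = {"B2"}                 # assumed obstacle cell, implied by the B3 examples in vid 9.20

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
SLIPS = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}   # the two 90-degree slip directions

def step(state, direction):
    """One deterministic move; bounce back off the grid edge or the blocked cell."""
    r, c = ROWS.index(state[0]), int(state[1])
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    target = f"{ROWS[nr]}{nc}" if 0 <= nr < len(ROWS) and 1 <= nc <= N_COLS else None
    return state if target is None or target in BLOCKED else target

def transition(state, action):
    """Return P(S' | S, A) as a dict mapping successor state -> probability."""
    if state in ABSORBING:
        return {state: 1.0}      # absorbing: stay put
    probs = {}
    for direction, p in [(action, 0.8)] + [(slip, 0.1) for slip in SLIPS[action]]:
        s2 = step(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

# transition("A3", "E") -> {"A4": 0.8, "A3": 0.1, "B3": 0.1}
# transition("B3", "N") -> {"A3": 0.8, "B4": 0.1, "B3": 0.1}   (the West slip bounces off B2)
```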
Video 9.7. Problems with Conventional Planning 2
the Branching Factor is large
Trees can be very deep due to infinite loops of "failed" actions
Many states reoccur in the search and are visited more than once; it doesn't matter how you got to a state, but it may still be expanded more than once
These problems are overcome by using Policies

Video 9.8. Policy Question 1
Quiz: in the grid world, find a policy that will maximize the probability of reaching the +100 goal state, given the stochastic NEWS actions
what is the optimal action from spot A1? Go East, to get closer to the goal

Video 9.9. Policy Question 2
what is the optimal action from spot C1? Go North, to get closer to the goal
there are two equal-length paths, but going East risks falling into the -100 goal

Video 9.10. Policy Question 3
what is the optimal action from spot C4? (no cost for steps, just maximize the probability of getting to the +100 goal)
Go South, to avoid bouncing into -100 on a failure of East, and to have a 10% chance of going West "by accident"... (except of course this goes back to right-angle failures rather than diagonals...)

Video 9.11. Policy Answer 3, Question 4
what is the optimal action from spot B3? Go West, to avoid bouncing into -100 on a failure of North

Video 9.12. MDP and Costs
see image: 09class_MDPcost.jpg
Would normally have a cost factor as a Reward function over every possible state, where reaching the state gives:
R( S ) -> +100 in state A4, -100 in state B4, -3 in all other states
so taking a step somewhere costs -3, which gives an incentive to shorten the final action sequence
The objective of the MDP is to maximize the sum of all future rewards, where t is a time step:
E[ sum(t) R(t) ] -> max
Maximize the Expectation of the sum of Rewards over the time steps; find the Policy that maximizes this expression...
Sometimes a Discount Factor gamma (written Y here) -- say 0.9 -- is put in, raised to the power of the time step, over future rewards:
E[ sum(t) Y^t * R(t) ] -> max
which decays future rewards relative to immediate rewards.
This is an alternative way to minimize the number of steps (instead of having a step cost like "-3"), because it discounts the value of finally getting the reward by the number of steps. It also keeps the Expectation bounded, because it is "easy to show" that it is at most 1/(1-Y) times the maximum absolute reward (in this case +100):
E[ sum(t) Y^t * R(t) ] <= ( 1 / (1 - Y) ) * |Rmax|

Video 9.13. Value Iteration 1
see image: 09class_MDPvalue.jpg
The definition of the Expectation of (possibly discounted) rewards lets me define a Value Function.
For each state S, the value of the state is the Expected sum of discounted rewards, provided that I start in state S and execute Policy pi:
V^pi( S ) = E_pi[ sum(t) Y^t * R(t) | S0 = S ]
For any policy you can simulate the agent and compute the average reward received until you hit a goal state.
There is a well-defined Expectation over every possible execution of the policy, specific to each state and policy, called the Value Function.
Planning is iterating and computing Value Functions, which will also find better Policies.

Video 9.14. Value Iteration 2
The Value Function leads you on a path such that Hill Climbing leads you to the goal.
The algorithm is recursive and spreads value through the space; after a number of iterations (in the animation) it makes a gray-scale value map that leads you to the goal... follow the gradient...

Video 9.15. Value Iteration 3
recursively calculate the Value Function until you arrive at the optimal one
start with 0 in all cells except the goals, where we know the value
recalculate the current value -- is it good?
example: in A3, going East we have an 80% chance of getting the +100 reward, a .1 chance of slipping North (bouncing back into A3, value 0), and a .1 chance of slipping South into B3 (value 0); the -3 is the reward R(S) of the current state and is added once, outside the sum (not multiplied by the .1's), so:
V( A3, E ) = 0.8 * 100 + 0.1 * 0 + 0.1 * 0 - 3 = 77
then propagate backwards to the other states further from the goal
To get V(S) you go through all possible successor states S' (from a lookup table) and Sum the probability of getting to S' times the Value of S':
sum(S') P( S' | S,A ) * V( S' )
discount it by gamma, add the Reward (or cost) of the current state, and maximize over the possible actions.
This Backup Equation converges to the Bellman equation (in terminal states the value is R(S) itself):
V( S ) <- max(A) [ Y * sum(S') P( S' | S,A ) * V( S' ) ] + R( S )
see image: 09class_MDPvalueIter.jpg
the values represent the optimal future reward for each state
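And a minimal value iteration sketch of the backup equation above, assuming the transition(), ABSORBING, and BLOCKED helpers from the earlier sketch are in scope; GAMMA, COST, and the stopping threshold are illustrative values matching the -3-cost grid world.

```python
# Value iteration for the grid world sketched after Video 9.6 (same assumptions as there).
GAMMA = 1.0                                   # Y, the discount factor
COST = -3.0                                   # R(S) for non-absorbing states
REWARD = {"A4": 100.0, "B4": -100.0}          # R(S) for the absorbing goals
ACTIONS = "NESW"
STATES = [f"{r}{c}" for r in "ABC" for c in range(1, 5) if f"{r}{c}" not in BLOCKED]

def value_iteration(eps=1e-6):
    """Repeat the backup V(S) <- max_A [ Y * sum_S' P(S'|S,A) V(S') ] + R(S) until it settles."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in ABSORBING:
                new_v = REWARD[s]             # terminal value is R(S) itself
            else:
                new_v = COST + max(
                    GAMMA * sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                      # in-place (asynchronous) update
        if delta < eps:
            return V

# usage: V = value_iteration(); V["A3"], V["B3"], ...
```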
Video 9.16. Deterministic Question 1
same grid world as always, deterministic transitions, gamma = 1, cost for each non-absorbing state = -3
Quiz: value of A3 = 97
easy to see that East is the maximum action -> +100; we have a single successor state, P(S') = 1, and we add the reward/cost -3 of the current state

Video 9.17. Deterministic Question 2
Quiz: value of B3 = 94
the optimal action (because deterministic) is North, for a value of 97, then add the -3 cost

Video 9.18. Deterministic Question 3
after doing all the shuffling, the values step down 97 - 94 - 91 - 88 -> 85 along either path
Quiz: value of C1 = 85
see image: 09class_detValue.jpg
For deterministic actions the value of a state is the goal reward minus the distance to the absorbing state times the 3-per-step cost.

Video 9.19. Stochastic Question 1
same grid world as always, stochastic transitions (0.8 goes where it should, 0.1 to each side), gamma = 1, cost for each non-absorbing state = -3
Quiz: value of A3 = ( 0.8 * 100 ) - 3 = 77
"easy to see" that East is the maximum action -> +100, with an 80% chance that the East action reaches the goal
we have three successor states, P(S') = .8, .1, .1, and the two non-goal ones have 0 value to start with here
e.g., going West has a value of -3 on this first iteration
then add the Cost -3 of the current state

Video 9.20. Stochastic Question 2
after calculating a3 = 77, the other states are still 0
Quiz: the very first value of B3 = 0.8 * 77 + 0.1 * -100 + 0.1 * 0 - 3 = 48.6, where North is the maximum action...
going North: 80% chance of getting a3 = 77, 10% of getting b4 = -100, 10% of bouncing back to b3 = 0, plus Cost = -3:
0.8 * 77 + 0.1 * -100 + 0.1 * 0 + -3 == 48.6
going West: 80% chance of bouncing back to b3 = 0, 10% of getting a3 = 77, 10% of getting c3 = 0, plus Cost = -3:
0.8 * 0 + 0.1 * 77 + 0.1 * 0 + -3 == 4.7
"At this point going North is _vastly_ superior, because we had a great value of 77 going North and the initial value in b3 was 0"
see image: 09class_stocValue.jpg

Video 9.21. Value Iterations and Policy 1
The Value Backup Function defines the optimal Policy: once values have been backed up (iterated enough times), the action to pick in each state is the one that maximizes the VBF.
For any state S we can define a Policy which picks the Action A that maximizes the VBF; for that maximization we can safely drop gamma and R(S):
pi( S ) = argmax(A) sum(S') P( S' | S,A ) * V( S' )

Video 9.22. Value Iterations and Policy 2
same grid world as always, stochastic transitions (0.8 goes where it should, 0.1 to each side), gamma = 1, cost for each non-absorbing state = -3
see image: 09class_policyValue.jpg
Values after convergence in blue and Policy actions in red -- they point to higher-value states.
This is a situation where the risk of falling into the -100 goal is balanced against the time spent going around, because of the Cost = -3; i.e., c4 goes West with a small chance of diverting into b4 = -100.
But if the Cost for each non-absorbing state = 0, things change... each state value after Backup is 100, and we can guarantee that we get to the +100 goal
see image: 09class_policy0Value.jpg
where the Policy is the one described before: c4 -> S, b3 -> W, to avoid falling into the -100 state (however it isn't described how we know that from the VBF....)
Setting the Cost to -200 you get a situation where the agent tries to end the game as fast as possible, so as to not overspend...
see image: 09class_policy-200Value.jpg
(...setting a cost such that even negative death is better than living...)
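A minimal sketch of reading the policy off the converged values, using the argmax rule from Video 9.21 (gamma and R(S) dropped); it assumes value_iteration() and the other helpers from the sketches above are in scope.

```python
def extract_policy(V):
    """pi(S) = argmax_A sum_S' P(S'|S,A) * V(S') for every non-absorbing state."""
    pi = {}
    for s in STATES:
        if s in ABSORBING:
            continue                          # no action needed in a goal state
        pi[s] = max(
            ACTIONS,
            key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()),
        )
    return pi

# usage: pi = extract_policy(value_iteration())
# re-running with COST = 0 or COST = -200 should reproduce the qualitative
# policy changes described above (safe detours vs. ending the game quickly).
```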
Video 9.23. MDP Conclusion
see image: 09class_MDPsummary.jpg

Video 9.24. Partial Observability Introduction
Not going into the full depth of Partially Observable planning, just giving a feeling...

Video 9.25. POMDP vs MDP
POMDPs address the problem of optimal exploration versus exploitation, where some of the actions might be for information gathering and others might be goal driven
with MDPs the space is fully Observable, so no information gathering is needed

Video 9.26. POMDP
for a simple maze:
see image: 09class_mazeFOD.jpg
fully observable and deterministic -- has an optimal plan that cuts the corners as closely as possible
see image: 09class_mazeFOS.jpg
fully observable and stochastic -- has a lot of possible control policies
see image: 09class_mazePOS.jpg
partially observable and stochastic -- in this case the agent position is observable but the goal rewards are unknown at the start; a sign at the other end of the maze tells which goal is which, so the policy is to move South to the sign first, even though it would normally just go North to the exits
Can we devise a plan that understands that an information-gathering detour is necessary?
A solution that doesn't work... solve for the agent in each of the two possible worlds (reward on the left or on the right) and average -- that can't solve both
see image: 09class_POMDPintro.jpg
What works: Information Space or Belief Space -- do the planning in what you know about the space; moving around changes your Belief State
before reading the sign you have a 50% chance of each route
so do the MDP Value Backup Function trick in belief space: "pour water" into each arm to make gradients, which flow back through the state transition at the sign and create a gradient for the agent to go South first

Video 9.27. Planning Under Uncertainty Conclusion
now we know everything there is to know about PUU... can use MDPs for robot motion planning, and all kinds of neat stuff...
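Endnote, referring back to the belief-space idea in Video 9.26: a tiny sketch of how the belief over "which exit is +100" behaves. The 50/50 prior and the +/-100 exits come from the lecture; the observation strings and function names are illustrative assumptions, not the lecture's maze.

```python
def update_belief(belief, observation):
    """belief = P(left exit is +100); it collapses once the sign has been read."""
    if observation == "sign: left is +100":
        return 1.0
    if observation == "sign: right is +100":
        return 0.0
    return belief                              # no information gathered, belief unchanged

def best_exit_value(belief):
    """Expected reward of committing to the better exit under the current belief."""
    left = belief * 100 + (1 - belief) * (-100)
    return max(left, -left)                    # the right exit is the mirror image

# Before the detour: best_exit_value(0.5) == 0, so neither exit looks good.
# After reading the sign the belief is 0 or 1 and best_exit_value(...) == 100,
# which is why the value backup in belief space pulls the agent South to the sign first.
```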