12. MDP Review   Nov 15, 2011

Video 12.1. Deterministic Question

see image: 12class_MDPreview_1.jpg

Setup: a 4x2 grid with +100 and -100 goals in the bottom corners. The agent can move in the NEWS directions, but actions may fail at random: with probability P the move succeeds, and with probability 1-P it reverses (goes in the opposite direction). Bouncing into a wall keeps you where you are.

Quiz: With P = 1 (fully deterministic), cost = -4, and gamma (γ) = 1, fill in the final values after running Value Iteration to completion.

Via the ProbabilityDemo program, after hacking out the original kludges (note that the demo's 1,2 row counters are reversed from ST's a,b labels):

  creating an MDP to represent the 4 X 2 world
  Beginning Value Iteration
  Utility of (1 , 1 )  -100.0    (b,1)
  Utility of (1 , 2 )    92.0    (b,2)
  Utility of (1 , 3 )    96.0    (b,3)
  Utility of (1 , 4 )   100.0    (b,4)
  Utility of (2 , 1 )    84.0    (a,1)
  Utility of (2 , 2 )    88.0    (a,2)
  Utility of (2 , 3 )    92.0    (a,3)
  Utility of (2 , 4 )    96.0    (a,4)

Basically you subtract the -4 cost for each square-move away from the +100 goal, because there is no discounting or stochasticity.

Video 12.2. Single Backup Question

see image: 12class_MDPreview_3.jpg (same layout, but a single cycle)

Quiz: With P = 0.8, cost = -4, and gamma (γ) = 1, fill in the value at the top-right square (a,4) after a single Value Iteration backup.

  (P * goalval) + ((1-P) * bounceval) + cost
  = (0.8 * 100) + (0.2 * 0) + -4
  = 76

The value is maximized by the South action: with a 0.8 chance it succeeds and reaches the +100 goal, and with a 0.2 chance it reverses, bounces off the north wall, and stays in place, where the current value is still 0. (Note: you need to SUM over all the possible action outcomes, then add (or subtract) the cost of moving.)

Video 12.3. Convergence Question

see image: 12class_MDPreview_3.jpg

Quiz: With the same params, P = 0.8, cost = -4, and gamma (γ) = 1, what is the value of (a,4) after convergence?

  !! 95 !!

By the Bellman equation, at convergence each new iteration no longer changes the value. "You kind of know" the optimal policy is to go South, so just keep trying it. Then we can set the converged (a,4) value to X and get:

  X = (0.8 * 100) + (0.2 * X) + -4

Solving for X:

  X - 0.2X = 80 - 4 = 76
  0.8X = 76
  X = 76 / 0.8 = 95

Note: I also managed to hack the hacked DEMO kludge to get 95, with the right setup of getTransitionProbability() and determineDirectionOfActualMovement(). (A quick arithmetic check and a standalone value-iteration sketch follow at the end of this section.)

Video 12.4. Optimal Policy Question

see image: 12class_MDPreview_4.jpg

Quiz: With the same params, P = 0.8, cost = -4, and gamma (γ) = 1, what is the correct policy (NEWS) for each space? (Only one correct answer for each space...)

Via the hacked DEMO program:

  Policy Iteration Demo
  creating an MDP to represent the 4 X 2 world
  Reccomended Action for (1 , 1 ) = null     (b,1) noop
  Reccomended Action for (1 , 2 ) = up       (b,2) N
  Reccomended Action for (1 , 3 ) = right    (b,3) E
  Reccomended Action for (1 , 4 ) = null     (b,4) noop
  Reccomended Action for (2 , 1 ) = right    (a,1) E
  Reccomended Action for (2 , 2 ) = right    (a,2) E
  Reccomended Action for (2 , 3 ) = right    (a,3) E
  Reccomended Action for (2 , 4 ) = down     (a,4) S

They are all "easy to see" except (b,2), which goes North: going East would risk a 0.2 reversal into the -100 goal at (b,1), while a failed North move just bounces off the wall and stays put.
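
The two (a,4) numbers from videos 12.2 and 12.3 (76 after one backup, 95 at convergence) can be checked with a few lines of plain Java. This is a minimal sketch of the arithmetic only; the class name A4Check is made up and is not part of the aima-java demos.

  // A4Check.java -- hypothetical standalone check, not part of the aima-java demos.
  public class A4Check {
      public static void main(String[] args) {
          double P = 0.8, cost = -4.0;

          // Video 12.2: one backup from all-zero values at (a,4).
          // Best action is South: 0.8 -> the +100 goal below, 0.2 -> reverse into the
          // north wall and stay put (current value still 0).
          double oneBackup = P * 100 + (1 - P) * 0 + cost;
          System.out.println("a4 after one backup  = " + oneBackup);   // 76.0

          // Video 12.3: the converged value is the fixed point of
          // X = 0.8*100 + 0.2*X - 4, i.e. X = (0.8*100 - 4) / 0.8.
          double converged = (P * 100 + cost) / P;
          System.out.println("a4 after convergence = " + converged);   // 95.0
      }
  }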
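
For the full 4x2 world, here is a from-scratch value-iteration sketch rather than another patch of the aima-java ProbabilityDemo. The class name ValueIterSketch, the step()/q() helpers, and the row/column encoding (row 0 = a on top, row 1 = b on the bottom) are all my own assumptions, not the demo's. With P = 1 it reproduces the deterministic values from video 12.1; with P = 0.8 it converges to 95 at (a,4), and the greedy policy it prints matches video 12.4, including North at (b,2).

  // ValueIterSketch.java -- a from-scratch sketch of the quiz MDP, NOT the aima-java demo.
  // Rows: index 0 = a (top), 1 = b (bottom); columns 0..3 = squares 1..4.
  // Terminals: (b,1) = -100 and (b,4) = +100. An action succeeds with probability P and
  // reverses with probability 1-P; running into a wall leaves the agent where it is.
  public class ValueIterSketch {
      static final int ROWS = 2, COLS = 4;
      static final double P = 0.8, COST = -4.0, GAMMA = 1.0;   // quiz parameters
      static final double[][] TERMINAL = { {0, 0, 0, 0}, {-100, 0, 0, 100} };
      static final String[] DIR_NAMES = {"N", "E", "S", "W"};

      static boolean isTerminal(int r, int c) { return r == 1 && (c == 0 || c == 3); }

      // Result of moving one step from (r,c) in direction d (0=N,1=E,2=S,3=W).
      static int[] step(int r, int c, int d) {
          int nr = r + (d == 0 ? -1 : d == 2 ? 1 : 0);
          int nc = c + (d == 1 ? 1 : d == 3 ? -1 : 0);
          if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) return new int[]{r, c};  // bounce
          return new int[]{nr, nc};
      }

      // Expected backed-up value of taking action d in (r,c) under the values V.
      static double q(double[][] V, int r, int c, int d) {
          int[] fwd = step(r, c, d), back = step(r, c, (d + 2) % 4);
          return COST + GAMMA * (P * V[fwd[0]][fwd[1]] + (1 - P) * V[back[0]][back[1]]);
      }

      public static void main(String[] args) {
          double[][] V = new double[ROWS][COLS];
          for (int r = 0; r < ROWS; r++)
              for (int c = 0; c < COLS; c++) V[r][c] = TERMINAL[r][c];

          for (int sweep = 0; sweep < 100; sweep++) {          // plenty for this tiny world
              double[][] next = new double[ROWS][COLS];
              for (int r = 0; r < ROWS; r++)
                  for (int c = 0; c < COLS; c++) {
                      if (isTerminal(r, c)) { next[r][c] = TERMINAL[r][c]; continue; }
                      double best = Double.NEGATIVE_INFINITY;
                      for (int d = 0; d < 4; d++) best = Math.max(best, q(V, r, c, d));
                      next[r][c] = best;
                  }
              V = next;
          }

          // Print converged values and the greedy policy extracted from them.
          for (int r = 0; r < ROWS; r++)
              for (int c = 0; c < COLS; c++) {
                  String square = (r == 0 ? "a" : "b") + "," + (c + 1);
                  String act = "noop";
                  if (!isTerminal(r, c)) {
                      int bestD = 0;
                      for (int d = 1; d < 4; d++)
                          if (q(V, r, c, d) > q(V, r, c, bestD)) bestD = d;
                      act = DIR_NAMES[bestD];
                  }
                  System.out.printf("(%s)  V = %7.2f   policy = %s%n", square, V[r][c], act);
              }
      }
  }

Setting P = 1.0 in the constants reproduces the 84/88/92/96 ladder from video 12.1, since with no discounting and no stochasticity each square is just 100 minus 4 per move to the +100 goal.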