12. MDP Review   Nov 15, 2011

Video 12.1. Deterministic Question

see image: 12class_MDPreview_1.jpg

Setup: a 4x2 grid with +100 and -100 goals in the bottom corners. The agent can move in the NEWS directions, but actions may fail at random: with probability P the move succeeds, and with probability 1-P it reverses (goes in the opposite direction). Bouncing into a wall keeps you where you are.

Quiz: With P = 1 (fully deterministic), cost = -4, and gamma (γ) = 1, fill in the final values after running Value Iteration to completion.

Via the ProbabilityDemo program, after hacking out the original kludges (note that the demo's 1,2 row counters are reversed from ST's a,b labels):

  creating an MDP to represent the 4 X 2 world
  Beginning Value Iteration
  Utility of (1 , 1 )  -100.0    (b,1)
  Utility of (1 , 2 )    92.0    (b,2)
  Utility of (1 , 3 )    96.0    (b,3)
  Utility of (1 , 4 )   100.0    (b,4)
  Utility of (2 , 1 )    84.0    (a,1)
  Utility of (2 , 2 )    88.0    (a,2)
  Utility of (2 , 3 )    92.0    (a,3)
  Utility of (2 , 4 )    96.0    (a,4)

Basically you subtract the -4 cost for each square-move away from the +100 goal, because there is no discounting or stochasticity.

Video 12.2. Single Backup Question

see image: 12class_MDPreview_3.jpg (same layout, but a single cycle)

Quiz: With P = 0.8, cost = -4, and gamma (γ) = 1, fill in the value at the top-right square (a,4) after a single Value Iteration backup.

  (P * goalval) + ((1-P) * bounceval) + cost
  = (0.8 * 100) + (0.2 * 0) + -4
  = 76

The value is maximized by the South action: with a 0.8 chance it succeeds and reaches the +100 goal, and with a 0.2 chance it reverses, bounces off the north wall, and stays in place, where the current value is still 0. (Note: you need to SUM over all the possible action outcomes, then add (or subtract) the cost of moving.)

Video 12.3. Convergence Question

see image: 12class_MDPreview_3.jpg

Quiz: With the same params, P = 0.8, cost = -4, and gamma (γ) = 1, what is the value of (a,4) after convergence?

  !! 95 !!

By the Bellman equation, at convergence each new iteration no longer changes the value. "You kind of know" the optimal policy is to go South, so just keep trying it. Then we can set the converged (a,4) value to X and get:

  X = (0.8 * 100) + (0.2 * X) + -4

Solving for X:

  X - 0.2X = 80 - 4 = 76
  0.8X = 76
  X = 76 / 0.8 = 95

Note: I also managed to hack the hacked DEMO kludge to get 95, with the right setup of getTransitionProbability() and determineDirectionOfActualMovement(). (A quick arithmetic check and a standalone value-iteration sketch follow at the end of this section.)

Video 12.4. Optimal Policy Question

see image: 12class_MDPreview_4.jpg

Quiz: With the same params, P = 0.8, cost = -4, and gamma (γ) = 1, what is the correct policy (NEWS) for each space? (Only one correct answer for each space...)

Via the hacked DEMO program:

  Policy Iteration Demo
  creating an MDP to represent the 4 X 2 world
  Reccomended Action for (1 , 1 ) = null     (b,1) noop
  Reccomended Action for (1 , 2 ) = up       (b,2) N
  Reccomended Action for (1 , 3 ) = right    (b,3) E
  Reccomended Action for (1 , 4 ) = null     (b,4) noop
  Reccomended Action for (2 , 1 ) = right    (a,1) E
  Reccomended Action for (2 , 2 ) = right    (a,2) E
  Reccomended Action for (2 , 3 ) = right    (a,3) E
  Reccomended Action for (2 , 4 ) = down     (a,4) S

They are all "easy to see" except (b,2), which goes North: going East would risk a 0.2 reversal into the -100 goal at (b,1), while a failed North move just bounces off the wall and stays put.
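
The two (a,4) numbers from videos 12.2 and 12.3 (76 after one backup, 95 at convergence) can be checked with a few lines of plain Java. This is a minimal sketch of the arithmetic only; the class name A4Check is made up and is not part of the aima-java demos.

  // A4Check.java -- hypothetical standalone check, not part of the aima-java demos.
  public class A4Check {
      public static void main(String[] args) {
          double P = 0.8, cost = -4.0;

          // Video 12.2: one backup from all-zero values at (a,4).
          // Best action is South: 0.8 -> the +100 goal below, 0.2 -> reverse into the
          // north wall and stay put (current value still 0).
          double oneBackup = P * 100 + (1 - P) * 0 + cost;
          System.out.println("a4 after one backup  = " + oneBackup);   // 76.0

          // Video 12.3: the converged value is the fixed point of
          // X = 0.8*100 + 0.2*X - 4, i.e. X = (0.8*100 - 4) / 0.8.
          double converged = (P * 100 + cost) / P;
          System.out.println("a4 after convergence = " + converged);   // 95.0
      }
  }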
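
For the full 4x2 world, here is a from-scratch value-iteration sketch rather than another patch of the aima-java ProbabilityDemo. The class name ValueIterSketch, the step()/q() helpers, and the row/column encoding (row 0 = a on top, row 1 = b on the bottom) are all my own assumptions, not the demo's. With P = 1 it reproduces the deterministic values from video 12.1; with P = 0.8 it converges to 95 at (a,4), and the greedy policy it prints matches video 12.4, including North at (b,2).

  // ValueIterSketch.java -- a from-scratch sketch of the quiz MDP, NOT the aima-java demo.
  // Rows: index 0 = a (top), 1 = b (bottom); columns 0..3 = squares 1..4.
  // Terminals: (b,1) = -100 and (b,4) = +100. An action succeeds with probability P and
  // reverses with probability 1-P; running into a wall leaves the agent where it is.
  public class ValueIterSketch {
      static final int ROWS = 2, COLS = 4;
      static final double P = 0.8, COST = -4.0, GAMMA = 1.0;   // quiz parameters
      static final double[][] TERMINAL = { {0, 0, 0, 0}, {-100, 0, 0, 100} };
      static final String[] DIR_NAMES = {"N", "E", "S", "W"};

      static boolean isTerminal(int r, int c) { return r == 1 && (c == 0 || c == 3); }

      // Result of moving one step from (r,c) in direction d (0=N,1=E,2=S,3=W).
      static int[] step(int r, int c, int d) {
          int nr = r + (d == 0 ? -1 : d == 2 ? 1 : 0);
          int nc = c + (d == 1 ? 1 : d == 3 ? -1 : 0);
          if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) return new int[]{r, c};  // bounce
          return new int[]{nr, nc};
      }

      // Expected backed-up value of taking action d in (r,c) under the values V.
      static double q(double[][] V, int r, int c, int d) {
          int[] fwd = step(r, c, d), back = step(r, c, (d + 2) % 4);
          return COST + GAMMA * (P * V[fwd[0]][fwd[1]] + (1 - P) * V[back[0]][back[1]]);
      }

      public static void main(String[] args) {
          double[][] V = new double[ROWS][COLS];
          for (int r = 0; r < ROWS; r++)
              for (int c = 0; c < COLS; c++) V[r][c] = TERMINAL[r][c];

          for (int sweep = 0; sweep < 100; sweep++) {          // plenty for this tiny world
              double[][] next = new double[ROWS][COLS];
              for (int r = 0; r < ROWS; r++)
                  for (int c = 0; c < COLS; c++) {
                      if (isTerminal(r, c)) { next[r][c] = TERMINAL[r][c]; continue; }
                      double best = Double.NEGATIVE_INFINITY;
                      for (int d = 0; d < 4; d++) best = Math.max(best, q(V, r, c, d));
                      next[r][c] = best;
                  }
              V = next;
          }

          // Print converged values and the greedy policy extracted from them.
          for (int r = 0; r < ROWS; r++)
              for (int c = 0; c < COLS; c++) {
                  String square = (r == 0 ? "a" : "b") + "," + (c + 1);
                  String act = "noop";
                  if (!isTerminal(r, c)) {
                      int bestD = 0;
                      for (int d = 1; d < 4; d++)
                          if (q(V, r, c, d) > q(V, r, c, bestD)) bestD = d;
                      act = DIR_NAMES[bestD];
                  }
                  System.out.printf("(%s)  V = %7.2f   policy = %s%n", square, V[r][c], act);
              }
      }
  }

Setting P = 1.0 in the constants reproduces the 84/88/92/96 ladder from video 12.1, since with no discounting and no stochasticity each square is just 100 minus 4 per move to the +100 goal.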