Week 5 Homework -- Nov 8, 2011

see images: 05week_HW_Q*.jpg

1. Q-Learning

The agent starts in location 3,3 and takes a NORTH action, but actions are stochastic, so it arrives at 4,3, the +100 terminal state. How should the Q-values be updated using the formula below (the SARSA version of Q-learning), with learning rate alpha = 1/2, discount rate gamma = 0.9, and all rewards 0 except for the +100 terminal state?

    Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * Q(s',a') - Q(s,a))

Q(s',a') refers to what happens in the next state. We started in 3,3 and ended in 4,3; no matter what action is taken from 4,3, its Q-value is always 100, so Q(s',a') is always 100. The update for the 3,3 box is therefore 0 + 1/2 * (0 + 0.9 * 100 - 0) = 45 (a worked sketch of this update appears at the end of these notes).

addenda: Note: the SARSA version of Q-learning compares two consecutive state/action pairs, so Q(s',a') refers to the second state and action; since that second state is the terminal state, it is always 100.

Update the 3,3 box:
    N -- 45
    E -- 0
    W -- 0
    S -- 0

2. Function Generalization in Reinforcement Learning

We operate in a one-dimensional environment of squares:

    A B G _ _

Consider a state-generalization function that takes a state and condenses it into a few features representing that state. The first function is

    F = < f1: distance from the Agent (A) to the Goal (G),
          f2: distance to the closest Bad guy (B) >

Consider also the function G, which has the same features f1 and f2 and adds a third feature (the minimum distance over all possible Bad guys).

Say which of the states below have the same value as the state above under F and under G (a small sketch of the F features appears at the end of these notes):

                        F   G   neither
    a. _ B A B G   --   X   X             (same order: A B G)
    b. B B A _ G   --   X                 (G distinguishes the order of A and B)
    c. A _ G B _   --           X         (the A-to-B distance differs under both F and G)
    (assuming distance is an absolute value: AB == BA)

In this world, Agents and Bad guys can move one square at a time, and the Agent tries to reach the Goal without encountering a Bad guy. For the agent to do that, which is the more useful generalization function over these states?

                        F   G   neither
    d.             --       X             (because of b. above)

3. Passive RL Agent

See the image for the description.

addenda under the video: The actions are move North, West, East, and South. All actions are stochastic: 80% of the time the agent moves as intended, and 10% of the time each it moves 90 degrees to the right or to the left. Part c) of the policy, "moving back immediately", means that on the next turn the agent takes an action that (if it goes in the intended direction) brings it back to the grey square closest to its position; if there are several such squares, the one closest to the goal. If two road squares are equally close to the goal, the agent heads toward the one on the branch of the road it was originally traveling on when it first left the road.

-- Answer: all of the lower two rows, A1-A8 and B1-B8.

(Specifically, from the last clarification: if the agent reaches c8 by "mistake" it will head N to the Goal; if it then reaches c7 by mistake it will head N to d7, because d7 is on the "road it was originally traveling", rather than trying to return to c8.)
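
For question 1, here is a minimal Python sketch of the single SARSA-style update, assuming Q(3,3,N) starts at 0 and Q(s',a') = 100 for the terminal state as described above; the function name sarsa_update is just for illustration.

    def sarsa_update(q_sa, reward, q_next, alpha=0.5, gamma=0.9):
        """One SARSA-style update: Q(s,a) <- Q(s,a) + alpha*(R(s) + gamma*Q(s',a') - Q(s,a))."""
        return q_sa + alpha * (reward + gamma * q_next - q_sa)

    # Question 1: start at 3,3, take NORTH, land in the +100 terminal state 4,3.
    # All rewards are 0 except the terminal state, and Q(s',a') is always 100 there.
    q_33_north = sarsa_update(q_sa=0.0, reward=0.0, q_next=100.0)
    print(q_33_north)  # 0 + 0.5 * (0 + 0.9*100 - 0) = 45.0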
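
For question 2, here is a small sketch of the feature function F, assuming states are written as strings over 'A', 'B', 'G', '_' and that distance is the absolute difference of square indices (as assumed in the answers above); the helper name features_f is hypothetical. G's third feature is not spelled out precisely enough in these notes to code, so only F is checked.

    def features_f(state):
        """F = <f1: distance from Agent to Goal, f2: distance to the closest Bad guy>."""
        a = state.index('A')
        g = state.index('G')
        bad_guys = [i for i, sq in enumerate(state) if sq == 'B']
        f1 = abs(g - a)
        f2 = min(abs(b - a) for b in bad_guys)
        return (f1, f2)

    original = "ABG__"
    for name, s in [("a", "_BABG"), ("b", "BBA_G"), ("c", "A_GB_")]:
        same = features_f(s) == features_f(original)
        print(name, features_f(s), "same under F" if same else "different under F")
    # Original: (2, 1); a: (2, 1) same; b: (2, 1) same; c: (2, 3) different -- matching the answers above.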