Week 5 Homework -- Nov 8, 2011

see images: 05week_HW_Q*.jpg

1. Q-Learning

The agent starts in location 3,3 and takes a NORTH action, but actions are stochastic, so it arrives at 4,3, the +100 terminal state. How should the Q-values be updated using the formula below (the SARSA version of Q-learning), with learning rate alpha = 1/2, discount rate gamma = 0.9, and all rewards 0 except for the +100 terminal state?

    Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * Q(s',a') - Q(s,a))

Q(s',a') refers to what happens in the next state. We started in 3,3 and ended in 4,3; no matter what action is taken from 4,3, its Q-value is always 100, so Q(s',a') is always 100. The update for the 3,3 box is therefore 0 + 1/2 * (0 + 0.9 * 100 - 0) = 45 (a worked sketch of this update appears at the end of these notes).

addenda: Note: the SARSA version of Q-learning compares two consecutive state/action pairs, so Q(s',a') refers to the second state and action; since that second state is the terminal state, it is always 100.

Update the 3,3 box:
    N -- 45
    E -- 0
    W -- 0
    S -- 0

2. Function Generalization in Reinforcement Learning

We operate in a one-dimensional environment of squares:

    A B G _ _

Consider a state-generalization function that takes a state and condenses it into a few features representing that state. The first function is

    F = < f1: distance from the Agent (A) to the Goal (G),
          f2: distance to the closest Bad guy (B) >

Consider also the function G, which has the same features f1 and f2 and adds a third feature (the minimum distance over all possible Bad guys).

Say which of the states below have the same value as the state above under F and under G (a small sketch of the F features appears at the end of these notes):

                        F   G   neither
    a. _ B A B G   --   X   X             (same order: A B G)
    b. B B A _ G   --   X                 (G distinguishes the order of A and B)
    c. A _ G B _   --           X         (the A-to-B distance differs under both F and G)
    (assuming distance is an absolute value: AB == BA)

In this world, Agents and Bad guys can move one square at a time, and the Agent tries to reach the Goal without encountering a Bad guy. For the agent to do that, which is the more useful generalization function over these states?

                        F   G   neither
    d.             --       X             (because of b. above)

3. Passive RL Agent

See the image for the description.

addenda under the video: The actions are move North, West, East, and South. All actions are stochastic: 80% of the time the agent moves as intended, and 10% of the time each it moves 90 degrees to the right or to the left. Part c) of the policy, "moving back immediately", means that on the next turn the agent takes an action that (if it goes in the intended direction) brings it back to the grey square closest to its position; if there are several such squares, the one closest to the goal. If two road squares are equally close to the goal, the agent heads toward the one on the branch of the road it was originally traveling on when it first left the road.

-- Answer: all of the lower two rows, A1-A8 and B1-B8.

(Specifically, from the last clarification: if the agent reaches c8 by "mistake" it will head N to the Goal; if it then reaches c7 by mistake it will head N to d7, because d7 is on the "road it was originally traveling", rather than trying to return to c8.)
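
For question 1, here is a minimal Python sketch of the single SARSA-style update, assuming Q(3,3,N) starts at 0 and Q(s',a') = 100 for the terminal state as described above; the function name sarsa_update is just for illustration.

    def sarsa_update(q_sa, reward, q_next, alpha=0.5, gamma=0.9):
        """One SARSA-style update: Q(s,a) <- Q(s,a) + alpha*(R(s) + gamma*Q(s',a') - Q(s,a))."""
        return q_sa + alpha * (reward + gamma * q_next - q_sa)

    # Question 1: start at 3,3, take NORTH, land in the +100 terminal state 4,3.
    # All rewards are 0 except the terminal state, and Q(s',a') is always 100 there.
    q_33_north = sarsa_update(q_sa=0.0, reward=0.0, q_next=100.0)
    print(q_33_north)  # 0 + 0.5 * (0 + 0.9*100 - 0) = 45.0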
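
For question 2, here is a small sketch of the feature function F, assuming states are written as strings over 'A', 'B', 'G', '_' and that distance is the absolute difference of square indices (as assumed in the answers above); the helper name features_f is hypothetical. G's third feature is not spelled out precisely enough in these notes to code, so only F is checked.

    def features_f(state):
        """F = <f1: distance from Agent to Goal, f2: distance to the closest Bad guy>."""
        a = state.index('A')
        g = state.index('G')
        bad_guys = [i for i, sq in enumerate(state) if sq == 'B']
        f1 = abs(g - a)
        f2 = min(abs(b - a) for b in bad_guys)
        return (f1, f2)

    original = "ABG__"
    for name, s in [("a", "_BABG"), ("b", "BBA_G"), ("c", "A_GB_")]:
        same = features_f(s) == features_f(original)
        print(name, features_f(s), "same under F" if same else "different under F")
    # Original: (2, 1); a: (2, 1) same; b: (2, 1) same; c: (2, 3) different -- matching the answers above.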