10. Reinforcement Learning -- Nov 8, 2011

Video 10.1. Introduction
For an MDP, an agent can successfully navigate an environment if it knows the correct model and where the rewards and penalties are. Reinforcement Learning can guide an agent to an optimal policy even when the rewards are unknown. The agent can explore the territory, find the rewards, and then learn an optimal policy.

Video 10.2. Successes
Backgammon: tried first with expert games, then by having programs play each other.
Helicopter flight: recording expert pilots and then extrapolating from their behavior.

Video 10.3. Forms of Learning
1. Supervised -- training set includes outputs: (X1, Y1) ... learn Y = f(X)
2. Unsupervised -- just data with no outputs, find patterns ... P(X = x)
3. Reinforcement -- state-action transitions (s, a) with rewards; learn an optimal policy

Video 10.4. Forms of Learning Question
Quiz: is each problem best suited to supervised, unsupervised, or reinforcement learning?
1. Speech recognition given speech examples and written texts, trying to learn a model of language. -- Supervised: has input/output pairs.
2. Analyzing spectral emissions of stars and trying to find clusters of similar types that may be of interest to astronomers; the data is the emission frequencies for each star. -- Unsupervised: clustering the data and discovering labels.
3. A rat presses a lever when certain conditions are met. -- Reinforcement: classic behaviorism.
4. Elevator controller -- a policy that decides which elevator goes where, given a sequence of button presses and the resulting wait times. -- Reinforcement: the input is state-action transitions (button pushes), but the wait time is a penalty rather than a target output... hmmm...!? "all we get is the amount we are trying to minimize"

Video 10.5. MDP Review
see image: 10class_MDPsummary.jpg
a set of states: s in S (s is an element of the set of states S)
a start state: s0
a set of actions from each state: a in Actions(s) (a is an element of the set of actions available in state s)
a state transition function: P(s' | s, a), also written T(s, a, s') (the probability that we reach state s' given that we apply action a in state s)
a reward function: R(s, a, s') (over the whole triplet -- the reward for starting in s, taking action a, and arriving in s') or R(s') (over only the result state s', e.g. in gridworld we don't care how we got to a goal state, just that we got there)

Video 10.6. Solving an MDP
see image: 10class_MDPsummary2.jpg
To solve an MDP we are trying to find a policy pi(s) that maximizes the discounted total reward:
    max_pi sum_t ( Y^t * R(s_t, pi(s_t), s_t+1) )
i.e. maximize the sum over all time t of gamma (Y) raised to the power t (the discount over time) times the state-transition rewards. Future rewards count less than current rewards, which also bounds a total that could otherwise sum to infinity.
The utility of a state:
    U(s) = max_a sum_s' ( P(s' | s, a) * U(s') )
the maximum expected value over all possible actions a in s that result in a new state s', where the expected value is the probability of reaching that state times its utility. (Taking the argmax over a instead of the max gives the policy; a small code sketch of this computation appears after the Video 10.7 notes below.)

Video 10.7. Agents of Reinforcement Learning
see image: 10class_agentTypes.jpg
What if you don't know the reward function R or the transition model P? Then you can't solve the MDP directly, but with RL you can learn R and P, or substitutes for them. The choices depend on what we know, what we want to learn, and what we use once we've learned it.
A utility-based agent might know P and need to learn R so it can calculate the utility U for use in solving the MDP.
A Q-learning agent doesn't have to know P or R; it learns a value function Q(s, a), a type of utility over state-action pairs -- for any given state and action, what is the utility of the result? It can then use Q directly (instead of the utility in the MDP), so it never has to learn the transition model.
A reflex agent doesn't have to know P or R either; it directly learns a policy, which it then uses. It's called a reflex agent because it is pure stimulus-response: it doesn't have to think about modeling the world, it just takes actions.
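Referring back to the MDP solution in Videos 10.5-10.6, here is a minimal value-iteration sketch, assuming the model is supplied as plain Python functions; the names (states, actions, T, R, gamma, tol) are illustrative, not from the lecture. It uses the full Bellman backup including the transition reward R(s, a, s') and the discount, a slightly more complete form than the abbreviated utility equation in the notes above.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """states: list of states
    actions(s): list of actions available in s (empty for terminal states)
    T(s, a): list of (probability, s_next) pairs, i.e. P(s'|s,a)
    R(s, a, s_next): reward for that transition
    Returns a dict of utilities U(s)."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                       # terminal: keep its value as-is
                continue
            # U(s) = max_a sum_s' P(s'|s,a) * (R(s,a,s') + gamma * U(s'))
            best = max(sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in T(s, a))
                       for a in acts)
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < tol:
            return U


def greedy_policy(states, actions, T, R, U, gamma=0.9):
    """Read off pi(s) = argmax_a of the expected value, once U is known."""
    pi = {}
    for s in states:
        acts = actions(s)
        if acts:
            pi[s] = max(acts, key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2])
                                                for p, s2 in T(s, a)))
    return pi
```

Once U has converged, greedy_policy reads off pi(s) as the argmax described in Video 10.6.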
Video 10.8. Passive vs Active
How adventurous is the agent?
A passive RL agent has a fixed policy but learns about R and P. e.g.: you're on a ship in uncharted waters, the captain has a fixed policy, and you need to learn the rewards as things proceed.
An active RL agent changes its policy as it goes. e.g.: the captain modifies the policy based on what you've learned along the way. This lets you "cash in" on the learning so far and explore further as needed, using actions that might allow for more learning.

Video 10.9. Passive Temporal Difference Learning (TD)
see image: 10class_RLTD.jpg
Move from one state to the next using a fixed policy pi, look at the difference between states, learn that, and then "back up" the values to previous states.
Keep a table of utilities U(s) for each state, which starts with null values, and a table N(s) of the number of visits to each state, which starts at zero.
Run the policy until you reach a goal, keeping track of utilities and visit counts, then run it again, updating the tables as we go with the update rule:
    U(s) <- U(s) + alpha(N(s)) * ( R(s) + Y * U(s') - U(s) )
The new utility for a state is the old U plus alpha (the learning rate, a function of the visit count N(s)) times the TD error: the reward in state s, plus gamma (Y, the discount) times the utility of the new state, minus the utility of this state. (A code sketch of this update appears after the Video 10.11 notes below.)
We want an alpha that starts out big and then shrinks as we visit a state more often, because we get more confident in the values as we get more samples -- something like alpha = 1/(N+1).
For this system the reward R is 0 for all but the goal states, and gamma is 1, so there is no discounting.
In the image, the first row of red numbers is the calculation for the state to the left of the +1 (called c3 in the old videos, (3, something) in this one) at the time the goal is reached. We only calculate this pre-goal state because all the previous utilities are still 0, so updating them wouldn't change anything anyway.
Then we start over with the same policy: when we reach the previously non-zero state we can update it (old-video c2) -- see the second line in red (we've visited the c2 state twice, so alpha = 1/(N+1) = 1/3) -- and on reaching the goal again we re-update c3 -- the third line in red.
The reward from the goal state propagates slowly backwards, but it takes one trial for each additional state that gets updated. The negative goal also propagates backwards as we explore that path. Eventually this converges to the correct utilities for this policy.

Video 10.10. Passive Agent Results
see image: 10class_RLTDerr.jpg
On the right is the average error in utility over all states versus the number of trials; after about 60 trials it gets down to about 0.05.
On the left are the actual utility estimates for individual states over the trials; it takes a while to converge to something close to the true values.

Video 10.11. Weaknesses Question
Quiz: true or false, which of these are weaknesses of passive TD learning?
1. Long convergence time? -- T
2. Limited by the policy? -- T: with a fixed policy we may deviate because the environment is stochastic, but not because of different choices.
3. Could there be states that never get visited and have no utility estimates? -- T
4. Could there be states that have low visit counts and poor utility estimates? -- T
All are possibilities, but they are environment-dependent and result from executing only one policy. e.g. if the policy is always "go up and right", you only reach the lower states when actions fail in a stochastic environment.
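Referring back to the update rule in Video 10.9, here is a minimal sketch of a passive TD learner, assuming observed transitions are fed in one at a time; the class and method names are illustrative, not from the lecture.

```python
from collections import defaultdict

class PassiveTDLearner:
    """Follows a fixed policy elsewhere; here we only do the utility bookkeeping."""
    def __init__(self, gamma=1.0):              # gamma = 1 in the lecture's gridworld
        self.gamma = gamma
        self.U = defaultdict(float)             # utility estimates, start at 0
        self.N = defaultdict(int)               # visit counts, start at 0

    def alpha(self, s):
        # Learning rate that shrinks as a state is visited more, e.g. 1/(N+1)
        return 1.0 / (self.N[s] + 1)

    def update(self, s, r, s_next):
        """Observe one transition s -> s_next, with reward r received in s."""
        self.N[s] += 1
        td_error = r + self.gamma * self.U[s_next] - self.U[s]
        self.U[s] += self.alpha(s) * td_error
```

With gamma = 1 and rewards of 0 everywhere except the goal states, repeated calls to update() along trial runs reproduce the slow backward propagation described above.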
Video 10.12. Active Reinforcement Learning
Greedy RL -- the same passive algorithm, but after a few passes of estimation we recompute a new policy, which is the result of solving the MDP using our new utility estimates.
If the initial policy was flawed, the greedy algorithm will tend to move towards a better policy.

Video 10.13. Greedy Agent Results
see image: 10class_GreedyErr.jpg
The error is high for about 40 trials but then drops fast. "Policy loss" is the difference between the optimal policy and the estimate, so if the estimate were perfect it would be 0.
It missed a move at 1,2 (b1), where it goes down instead of up, because it was greedy: once it found one good policy it never deviated from it.

Video 10.14. Balancing Policy
The question is how to get the learner out of the OK-but-not-optimal rut. We have to stop always using a good policy -- stop Exploiting the best found so far -- and start Exploring to see if there's a better policy.
Exploring could lead us astray and waste time, so we have to figure out the right trade-off: weigh the long-term benefit of something that hurts in the short term.
One possibility is random exploration: follow the best policy most of the time, but once in a while randomly take an action that isn't part of the best policy. This tends to be slow. To do better we have to really understand what's going on with Exploration vs Exploitation.

Video 10.15. Errors in Utility Questions
What are we doing? Keeping track of the best policy so far and updating it with new ones. Keeping track of the utilities of states and updating them with new estimates. Keeping track of the number of visits to each state.
What can go wrong? We may not have sampled enough to get good utility estimates: the N values are too low and the U we have is mostly noise. Or we could get bad utilities because the policy was wrong: the policy took non-optimal actions, so the utility wasn't as high as it could have been.
Quiz: for sampling error and policy error, true or false:
1. Could the error make the utility estimates too low? -- sampling: T, policy: T
2. Could the error make the utility estimates too high? -- sampling: T, policy: F (with too few samples the estimate can be off in either direction -- a lucky run of transitions overstates U -- but a bad policy can't make the estimate better than the optimal policy's value)
3. Could it be improved with higher N (more samples)? -- sampling: T, policy: F (for a bad policy, more samples will decrease the variance but won't improve the mean values)

Video 10.16. Exploration Agents
This suggests a design for an exploration agent that is proactive about exploring the world while it is uncertain and falls back on exploitation as it gets more certain.
We make the initial utility estimate some large reward, maybe the largest we can expect (+1 in this example):
    U(s) = +R                   whenever the number of visits is below some exploration threshold e, i.e. N(s) < e
    U(s) = the learned estimate once we've visited the state e times
So we explore from new states until we have a good estimate of the true utility, then stop exploring and just go with that utility. (A small sketch of this appears after the Video 10.17 notes below.)

Video 10.17. Exploration Agent Results
see image: 10class_ExpAgentErr.jpg
Much better results: the error starts high, but after only about 20 trials it has come down to a perfect policy (it learned the "up" arrow at 1,2).
But you can have a perfect policy and still not have quite correct utilities.
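Referring back to the exploration agent in Video 10.16, here is a minimal sketch of the optimistic-utility idea, assuming utilities and visit counts are kept in plain dicts; the function and parameter names (optimistic_reward, explore_threshold) are illustrative, not from the lecture.

```python
def exploratory_utility(s, U, N, optimistic_reward=1.0, explore_threshold=5):
    """U: dict of learned utility estimates, N: dict of visit counts.
    Return the optimistic value +R while N(s) < e, the learned U(s) afterwards."""
    if N.get(s, 0) < explore_threshold:
        return optimistic_reward      # pretend the state is worth the biggest reward
    return U.get(s, 0.0)              # enough visits: trust the learned estimate
```

An agent that ranks actions by this value is pulled toward under-visited states and reverts to ordinary exploitation once the states it cares about have each been visited e times.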
Video 10.18. Q Learning 1
see image: 10class_QLearn1.jpg
Once we've learned the policy and the utilities (shown in the gridbox image), if we haven't learned (or been given) the transition-probability model we still can't apply the policy, even though we know the utilities.
But with Q-learning we don't need the model or the utilities: we learn a direct mapping from states and actions to values, Q(s, a), and once we've learned Q we can determine the optimal policy by taking the maximum Q over all possible actions.

Video 10.19. Q Learning 2
see image: 10class_QLearnTable.jpg
Start with the Q table, where each cell is the value of performing that action from that state.
Use the same update formula, but only update the value for the (state, action) pair actually taken. (A code sketch of this update, and of the feature-based version from Video 10.21, appears after the Conclusion below.)

Video 10.20. Pacman 1
This doesn't work that well for larger problems because there are too many states to update and no way to relate similar states.
We want to generalize between similar states, just as supervised learning generalizes.

Video 10.21. Pacman 2 -- Function Generalization
see image: 10class_Pakman2.jpg
Represent the state by a set of important features instead of positions: s = [ F1, F2, F3, ... ] -- the distance to the nearest ghost, to the nearest food, or whatever matters.
Then states with similar features get the same value. This is bad if dissimilar states get the same value because we chose the wrong features, i.e. we need to represent the right things -- e.g. if a feature is "in a tunnel", does the tunnel have an outlet?
We learn how important each feature is using a weight Wi, so the Q value is the sum of each feature times its weight:
    Q(s, a) = sum_i ( Wi * Fi )
We can update the feature weights Wi as we make Q(s, a) updates, both driven by the amount of error: if we're off by a lot, make big changes.
This is the same process as looking for features in supervised learning (classification).

Video 10.22. Conclusion
We can solve known MDPs. If we don't know the MDP, we can estimate it and solve it with a fixed policy pi, or estimate the Q values with an exploration policy.
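Referring back to the Q-table update in Videos 10.18-10.19, here is a minimal tabular Q-learning sketch, assuming a small discrete set of states and actions; the class layout and the 1/(N+1) learning-rate schedule are illustrative assumptions, not code from the lecture.

```python
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma=0.9):
        self.actions = list(actions)           # all possible actions
        self.gamma = gamma
        self.Q = defaultdict(float)            # Q[(s, a)], starts at 0
        self.N = defaultdict(int)              # visit count per (s, a) pair

    def update(self, s, a, r, s_next):
        """After taking action a in state s, receiving reward r, landing in s_next."""
        self.N[(s, a)] += 1
        alpha = 1.0 / (self.N[(s, a)] + 1)     # learning rate decays with visits
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        td_error = r + self.gamma * best_next - self.Q[(s, a)]
        self.Q[(s, a)] += alpha * td_error

    def policy(self, s):
        # Optimal action = the one with the maximum Q(s, a); no model P needed.
        return max(self.actions, key=lambda a: self.Q[(s, a)])
```

Note that only the (s, a) pair actually taken is updated, and the policy is read off directly from the table without ever consulting a transition model.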
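Referring back to the feature-based generalization in Video 10.21, here is a minimal sketch of the weighted-feature Q value and its weight update, assuming features are supplied as a dict of name -> value; the feature names, the fixed alpha, and the helper names are hypothetical, not from the lecture.

```python
def q_value(weights, features):
    """features: dict feature_name -> F_i(s, a); weights: dict feature_name -> W_i."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def update_weights(weights, features, r, best_next_q, q_current, alpha=0.05, gamma=0.9):
    """One approximate Q-learning step: nudge each weight in proportion to the
    TD error times its feature value (off by a lot -> make big changes)."""
    td_error = (r + gamma * best_next_q) - q_current
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + alpha * td_error * v
    return weights
```

The caller would compute q_current = q_value(weights, features(s, a)) and best_next_q as the max of q_value over the actions available in s', so the same TD error drives every weight, scaled by that weight's feature value.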