10. Reinforcement Learning -- Nov 8, 2011

Video 10.1. Introduction
For an MDP, an agent can successfully navigate an environment if it knows the correct model and where the rewards and penalties are. Reinforcement Learning can guide an agent to an optimal policy even when the rewards are unknown. The agent can explore the territory, find the rewards, and then learn an optimal policy.

Video 10.2. Successes
Backgammon: tried first with expert games, then by having programs play each other.
Helicopter flight: recording expert pilots and then extrapolating from their behavior.

Video 10.3. Forms of Learning
1. Supervised -- training set includes outputs: (X1, Y1) ... learn Y = f(X)
2. Unsupervised -- just data with no outputs, find patterns ... P(X = x)
3. Reinforcement -- state-action transitions (s, a) with rewards; learn an optimal policy

Video 10.4. Forms of Learning Question
Quiz: is each problem best suited to supervised, unsupervised, or reinforcement learning?
1. Speech recognition given speech examples and written texts, trying to learn a model of language. -- Supervised: has input/output pairs.
2. Analyzing spectral emissions of stars and trying to find clusters of similar types that may be of interest to astronomers; the data is the emission frequencies for each star. -- Unsupervised: clustering the data and discovering labels.
3. A rat presses a lever when certain conditions are met. -- Reinforcement: classic behaviorism.
4. Elevator controller -- a policy that decides which elevator goes where, given a sequence of button presses and the resulting wait times. -- Reinforcement: the input is state-action transitions (button pushes), but the wait time is a penalty rather than a target output... hmmm...!? "all we get is the amount we are trying to minimize"

Video 10.5. MDP Review
see image: 10class_MDPsummary.jpg
a set of states: s in S (s is an element of the set of states S)
a start state: s0
a set of actions from each state: a in Actions(s) (a is an element of the set of actions available in state s)
a state transition function: P(s' | s, a), also written T(s, a, s') (the probability that we reach state s' given that we apply action a in state s)
a reward function: R(s, a, s') (over the whole triplet -- the reward for starting in s, taking action a, and arriving in s') or R(s') (over only the result state s', e.g. in gridworld we don't care how we got to a goal state, just that we got there)

Video 10.6. Solving an MDP
see image: 10class_MDPsummary2.jpg
To solve an MDP we are trying to find a policy pi(s) that maximizes the discounted total reward:
    max_pi sum_t ( Y^t * R(s_t, pi(s_t), s_t+1) )
i.e. maximize the sum over all time t of gamma (Y) raised to the power t (the discount over time) times the state-transition rewards. Future rewards count less than current rewards, which also bounds a total that could otherwise sum to infinity.
The utility of a state:
    U(s) = max_a sum_s' ( P(s' | s, a) * U(s') )
the maximum expected value over all possible actions a in s that result in a new state s', where the expected value is the probability of reaching that state times its utility. (Taking the argmax over a instead of the max gives the policy; a small code sketch of this computation appears after the Video 10.7 notes below.)

Video 10.7. Agents of Reinforcement Learning
see image: 10class_agentTypes.jpg
What if you don't know the reward function R or the transition model P? Then you can't solve the MDP directly, but with RL you can learn R and P, or substitutes for them. The choices depend on what we know, what we want to learn, and what we use once we've learned it.
A utility-based agent might know P and need to learn R so it can calculate the utility U for use in solving the MDP.
A Q-learning agent doesn't have to know P or R; it learns a value function Q(s, a), a type of utility over state-action pairs -- for any given state and action, what is the utility of the result? It can then use Q directly (instead of the utility in the MDP), so it never has to learn the transition model.
A reflex agent doesn't have to know P or R either; it directly learns a policy, which it then uses. It's called a reflex agent because it is pure stimulus-response: it doesn't have to think about modeling the world, it just takes actions.
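Referring back to the MDP solution in Videos 10.5-10.6, here is a minimal value-iteration sketch, assuming the model is supplied as plain Python functions; the names (states, actions, T, R, gamma, tol) are illustrative, not from the lecture. It uses the full Bellman backup including the transition reward R(s, a, s') and the discount, a slightly more complete form than the abbreviated utility equation in the notes above.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """states: list of states
    actions(s): list of actions available in s (empty for terminal states)
    T(s, a): list of (probability, s_next) pairs, i.e. P(s'|s,a)
    R(s, a, s_next): reward for that transition
    Returns a dict of utilities U(s)."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                       # terminal: keep its value as-is
                continue
            # U(s) = max_a sum_s' P(s'|s,a) * (R(s,a,s') + gamma * U(s'))
            best = max(sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in T(s, a))
                       for a in acts)
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < tol:
            return U


def greedy_policy(states, actions, T, R, U, gamma=0.9):
    """Read off pi(s) = argmax_a of the expected value, once U is known."""
    pi = {}
    for s in states:
        acts = actions(s)
        if acts:
            pi[s] = max(acts, key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2])
                                                for p, s2 in T(s, a)))
    return pi
```

Once U has converged, greedy_policy reads off pi(s) as the argmax described in Video 10.6.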
Video 10.8. Passive vs Active
How adventurous is the agent?
A passive RL agent has a fixed policy but learns about R and P. e.g.: you're on a ship in uncharted waters, the captain has a fixed policy, and you need to learn the rewards as things proceed.
An active RL agent changes its policy as it goes. e.g.: the captain modifies the policy based on what you've learned along the way. This lets you "cash in" on the learning so far and explore further as needed, using actions that might allow for more learning.

Video 10.9. Passive Temporal Difference Learning (TD)
see image: 10class_RLTD.jpg
Move from one state to the next using a fixed policy pi, look at the difference between states, learn that, and then "back up" the values to previous states.
Keep a table of utilities U(s) for each state, which starts with null values, and a table N(s) of the number of visits to each state, which starts at zero.
Run the policy until you reach a goal, keeping track of utilities and visit counts, then run it again, updating the tables as we go with the update rule:
    U(s) <- U(s) + alpha(N(s)) * ( R(s) + Y * U(s') - U(s) )
The new utility for a state is the old U plus alpha (the learning rate, a function of the visit count N(s)) times the TD error: the reward in state s, plus gamma (Y, the discount) times the utility of the new state, minus the utility of this state. (A code sketch of this update appears after the Video 10.11 notes below.)
We want an alpha that starts out big and then shrinks as we visit a state more often, because we get more confident in the values as we get more samples -- something like alpha = 1/(N+1).
For this system the reward R is 0 for all but the goal states, and gamma is 1, so there is no discounting.
In the image, the first row of red numbers is the calculation for the state to the left of the +1 (called c3 in the old videos, (3, something) in this one) at the time the goal is reached. We only calculate this pre-goal state because all the previous utilities are still 0, so updating them wouldn't change anything anyway.
Then we start over with the same policy: when we reach the previously non-zero state we can update it (old-video c2) -- see the second line in red (we've visited the c2 state twice, so alpha = 1/(N+1) = 1/3) -- and on reaching the goal again we re-update c3 -- the third line in red.
The reward from the goal state propagates slowly backwards, but it takes one trial for each additional state that gets updated. The negative goal also propagates backwards as we explore that path. Eventually this converges to the correct utilities for this policy.

Video 10.10. Passive Agent Results
see image: 10class_RLTDerr.jpg
On the right is the average error in utility over all states versus the number of trials; after about 60 trials it gets down to about 0.05.
On the left are the actual utility estimates for individual states over the trials; it takes a while to converge to something close to the true values.

Video 10.11. Weaknesses Question
Quiz: true or false, which of these are weaknesses of passive TD learning?
1. Long convergence time? -- T
2. Limited by the policy? -- T: with a fixed policy we may deviate because the environment is stochastic, but not because of different choices.
3. Could there be states that never get visited and have no utility estimates? -- T
4. Could there be states that have low visit counts and poor utility estimates? -- T
All are possibilities, but they are environment-dependent and result from executing only one policy. e.g. if the policy is always "go up and right", you only reach the lower states when actions fail in a stochastic environment.
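Referring back to the update rule in Video 10.9, here is a minimal sketch of a passive TD learner, assuming observed transitions are fed in one at a time; the class and method names are illustrative, not from the lecture.

```python
from collections import defaultdict

class PassiveTDLearner:
    """Follows a fixed policy elsewhere; here we only do the utility bookkeeping."""
    def __init__(self, gamma=1.0):              # gamma = 1 in the lecture's gridworld
        self.gamma = gamma
        self.U = defaultdict(float)             # utility estimates, start at 0
        self.N = defaultdict(int)               # visit counts, start at 0

    def alpha(self, s):
        # Learning rate that shrinks as a state is visited more, e.g. 1/(N+1)
        return 1.0 / (self.N[s] + 1)

    def update(self, s, r, s_next):
        """Observe one transition s -> s_next, with reward r received in s."""
        self.N[s] += 1
        td_error = r + self.gamma * self.U[s_next] - self.U[s]
        self.U[s] += self.alpha(s) * td_error
```

With gamma = 1 and rewards of 0 everywhere except the goal states, repeated calls to update() along trial runs reproduce the slow backward propagation described above.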
Video 10.12. Active Reinforcement Learning
Greedy RL -- the same passive algorithm, but after a few passes of estimation we recompute a new policy, which is the result of solving the MDP using our new utility estimates.
If the initial policy was flawed, the greedy algorithm will tend to move towards a better policy.

Video 10.13. Greedy Agent Results
see image: 10class_GreedyErr.jpg
The error is high for about 40 trials but then drops fast. "Policy loss" is the difference between the optimal policy and the estimate, so if the estimate were perfect it would be 0.
It missed a move at 1,2 (b1), where it goes down instead of up, because it was greedy: once it found one good policy it never deviated from it.

Video 10.14. Balancing Policy
The question is how to get the learner out of the OK-but-not-optimal rut. We have to stop always using a good policy -- stop Exploiting the best found so far -- and start Exploring to see if there's a better policy.
Exploring could lead us astray and waste time, so we have to figure out the right trade-off: weigh the long-term benefit of something that hurts in the short term.
One possibility is random exploration: follow the best policy most of the time, but once in a while randomly take an action that isn't part of the best policy. This tends to be slow. To do better we have to really understand what's going on with Exploration vs Exploitation.

Video 10.15. Errors in Utility Questions
What are we doing? Keeping track of the best policy so far and updating it with new ones. Keeping track of the utilities of states and updating them with new estimates. Keeping track of the number of visits to each state.
What can go wrong? We may not have sampled enough to get good utility estimates: the N values are too low and the U we have is mostly noise. Or we could get bad utilities because the policy was wrong: the policy took non-optimal actions, so the utility wasn't as high as it could have been.
Quiz: for sampling error and policy error, true or false:
1. Could the error make the utility estimates too low? -- sampling: T, policy: T
2. Could the error make the utility estimates too high? -- sampling: T, policy: F (with too few samples the estimate can be off in either direction -- a lucky run of transitions overstates U -- but a bad policy can't make the estimate better than the optimal policy's value)
3. Could it be improved with higher N (more samples)? -- sampling: T, policy: F (for a bad policy, more samples will decrease the variance but won't improve the mean values)

Video 10.16. Exploration Agents
This suggests a design for an exploration agent that is proactive about exploring the world while it is uncertain and falls back on exploitation as it gets more certain.
We make the initial utility estimate some large reward, maybe the largest we can expect (+1 in this example):
    U(s) = +R                   whenever the number of visits is below some exploration threshold e, i.e. N(s) < e
    U(s) = the learned estimate once we've visited the state e times
So we explore from new states until we have a good estimate of the true utility, then stop exploring and just go with that utility. (A small sketch of this appears after the Video 10.17 notes below.)

Video 10.17. Exploration Agent Results
see image: 10class_ExpAgentErr.jpg
Much better results: the error starts high, but after only about 20 trials it has come down to a perfect policy (it learned the "up" arrow at 1,2).
But you can have a perfect policy and still not have quite correct utilities.
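Referring back to the exploration agent in Video 10.16, here is a minimal sketch of the optimistic-utility idea, assuming utilities and visit counts are kept in plain dicts; the function and parameter names (optimistic_reward, explore_threshold) are illustrative, not from the lecture.

```python
def exploratory_utility(s, U, N, optimistic_reward=1.0, explore_threshold=5):
    """U: dict of learned utility estimates, N: dict of visit counts.
    Return the optimistic value +R while N(s) < e, the learned U(s) afterwards."""
    if N.get(s, 0) < explore_threshold:
        return optimistic_reward      # pretend the state is worth the biggest reward
    return U.get(s, 0.0)              # enough visits: trust the learned estimate
```

An agent that ranks actions by this value is pulled toward under-visited states and reverts to ordinary exploitation once the states it cares about have each been visited e times.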
Video 10.18. Q Learning 1
see image: 10class_QLearn1.jpg
Once we've learned the policy and the utilities (shown in the gridbox image), if we haven't learned (or been given) the transition-probability model we still can't apply the policy, even though we know the utilities.
But with Q-learning we don't need the model or the utilities: we learn a direct mapping from states and actions to values, Q(s, a), and once we've learned Q we can determine the optimal policy by taking the maximum Q over all possible actions.

Video 10.19. Q Learning 2
see image: 10class_QLearnTable.jpg
Start with the Q table, where each cell is the value of performing that action from that state.
Use the same update formula, but only update the value for the (state, action) pair actually taken. (A code sketch of this update, and of the feature-based version from Video 10.21, appears after the Conclusion below.)

Video 10.20. Pacman 1
This doesn't work that well for larger problems because there are too many states to update and no way to relate similar states.
We want to generalize between similar states, just as supervised learning generalizes.

Video 10.21. Pacman 2 -- Function Generalization
see image: 10class_Pakman2.jpg
Represent the state by a set of important features instead of positions: s = [ F1, F2, F3, ... ] -- the distance to the nearest ghost, to the nearest food, or whatever matters.
Then states with similar features get the same value. This is bad if dissimilar states get the same value because we chose the wrong features, i.e. we need to represent the right things -- e.g. if a feature is "in a tunnel", does the tunnel have an outlet?
We learn how important each feature is using a weight Wi, so the Q value is the sum of each feature times its weight:
    Q(s, a) = sum_i ( Wi * Fi )
We can update the feature weights Wi as we make Q(s, a) updates, both driven by the amount of error: if we're off by a lot, make big changes.
This is the same process as looking for features in supervised learning (classification).

Video 10.22. Conclusion
We can solve known MDPs. If we don't know the MDP, we can estimate it and solve it with a fixed policy pi, or estimate the Q values with an exploration policy.
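Referring back to the Q-table update in Videos 10.18-10.19, here is a minimal tabular Q-learning sketch, assuming a small discrete set of states and actions; the class layout and the 1/(N+1) learning-rate schedule are illustrative assumptions, not code from the lecture.

```python
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma=0.9):
        self.actions = list(actions)           # all possible actions
        self.gamma = gamma
        self.Q = defaultdict(float)            # Q[(s, a)], starts at 0
        self.N = defaultdict(int)              # visit count per (s, a) pair

    def update(self, s, a, r, s_next):
        """After taking action a in state s, receiving reward r, landing in s_next."""
        self.N[(s, a)] += 1
        alpha = 1.0 / (self.N[(s, a)] + 1)     # learning rate decays with visits
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        td_error = r + self.gamma * best_next - self.Q[(s, a)]
        self.Q[(s, a)] += alpha * td_error

    def policy(self, s):
        # Optimal action = the one with the maximum Q(s, a); no model P needed.
        return max(self.actions, key=lambda a: self.Q[(s, a)])
```

Note that only the (s, a) pair actually taken is updated, and the policy is read off directly from the table without ever consulting a transition model.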
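Referring back to the feature-based generalization in Video 10.21, here is a minimal sketch of the weighted-feature Q value and its weight update, assuming features are supplied as a dict of name -> value; the feature names, the fixed alpha, and the helper names are hypothetical, not from the lecture.

```python
def q_value(weights, features):
    """features: dict feature_name -> F_i(s, a); weights: dict feature_name -> W_i."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def update_weights(weights, features, r, best_next_q, q_current, alpha=0.05, gamma=0.9):
    """One approximate Q-learning step: nudge each weight in proportion to the
    TD error times its feature value (off by a lot -> make big changes)."""
    td_error = (r + gamma * best_next_q) - q_current
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + alpha * td_error * v
    return weights
```

The caller would compute q_current = q_value(weights, features(s, a)) and best_next_q as the max of q_value over the actions available in s', so the same TD error drives every weight, scaled by that weight's feature value.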