Unit 11 Hidden Markov Models and Filters
Nov 15, 2011

Video 11.1. Introduction
really fun class

Video 11.2. Hidden Markov Models
HMMs are used to analyze or predict time series in robotics, medical, financial, speech, and language applications

Video 11.3. Bayes Network of Hidden Markov Models
a sequence of states -- Sn -- that evolves over time, where each state depends _only_ on the previous state, is called a Markov Chain
each state emits a "measurement" -- Zn
with a Hidden Markov Chain/Model you don't see the states, only the measurements
HMMs form the basis of Kalman filters and Particle filters

Video 11.4. Examples
Robot Localization Problem
  robot has rangefinders and a map of the environment
  the task is to infer where it is from the range sensor measurements
  this is filtering with an underlying HMM
Underground mapping robot
  also builds the map as it goes
  generates "hypotheses" about its position based on noisy data
  each path is a "particle"
  when the path reconnects someplace, one hypothesis is selected as best
Speech recognition
  uses HMMs to analyze sounds into text

Video 11.5 Markov Chain question & answer
see image: 11class_MarkovChainQuiz1.jpg
not hidden...
Quiz: for Rainy and Sunny weather day to day (yesterday -> today):
  Rainy -> Rainy = 0.6 = P(R1|R0)
  Rainy -> Sunny = 0.4 = P(S1|R0)
  Sunny -> Sunny = 0.8 = P(S1|S0)
  Sunny -> Rainy = 0.2 = P(R1|S0)
start state: P(R0) = 1, P(S0) = 0
probability of rain on the next days:
  P(R1) = 0.6 -- first R->R transition (note: P(S1) = 0.4)
  P(R2) = 0.44 -- total probability: (note: P(S2) = 0.56)
    P(R2|R1)*P(R1) + P(R2|S1)*P(S1) = 0.6*0.6 + 0.2*0.4
  P(R3) = 0.376 -- total probability:
    P(R3|R2)*P(R2) + P(R3|S2)*P(S2) = 0.6*0.44 + 0.2*0.56

Video 11.6 Markov Chain 2 question & answer
see image: 11class_MarkovChainQuiz2.jpg
not hidden...
Quiz: ...for A to B and back again...
  A -> A = 0.5
  A -> B = 0.5
  B -> A = 1.0
start state: P(A0) = 1, P(B0) = 0
probability of A in the next steps:
  P(A1) = 0.5 -- first A->A transition (note: P(B1) = 0.5)
  P(A2) = 0.75 -- total probability: (note: P(B2) = 0.25)
    P(A2|A1)*P(A1) + P(A2|B1)*P(B1) = 0.5*0.5 + 1.0*0.5
  P(A3) = 0.625 -- total probability:
    P(A3|A2)*P(A2) + P(A3|B2)*P(B2) = 0.5*0.75 + 1.0*0.25
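a quick numeric check of the two quizzes above -- a minimal Python sketch (mine, not from the class; the function name and two-parameter form are my own choices):

    def propagate(p_a, p_a_given_a, p_a_given_b, steps):
        """Track P(A_t) for a 2-state Markov chain by total probability."""
        history = []
        for _ in range(steps):
            p_a = p_a_given_a * p_a + p_a_given_b * (1.0 - p_a)
            history.append(p_a)
        return history

    # Rain/Sun chain from Video 11.5: P(R|R)=0.6, P(R|S)=0.2, P(R0)=1
    print(propagate(1.0, 0.6, 0.2, 3))   # -> approx [0.6, 0.44, 0.376]
    # A/B chain from Video 11.6: P(A|A)=0.5, P(A|B)=1.0, P(A0)=1
    print(propagate(1.0, 0.5, 1.0, 3))   # -> [0.5, 0.75, 0.625]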
Video 11.7 Stationary Distribution
see image: 11class_StationaryDist.jpg
What happens if you wait a long time -- an infinite number of time steps into the future?
  P(Ainf) = lim(t->inf) P(At)
a Markov Chain settles to a Stationary Distribution -- or sometimes a Limit Cycle if the transitions are deterministic (which we don't care about...)
in the limit, the probability of the next step is the same as that of the previous step:
  P(At) = P(At-1)
with the total probability expansion as before:
  P(At) = P(At | At-1) * P(At-1) + P(At | Bt-1) * P(Bt-1)
substituting X = P(At-1), noting that P(Bt-1) = 1 - P(At-1), and solving for X with the transition probabilities from the last quiz:
  X = P(At | At-1) * X + P(At | Bt-1) * (1 - X)
  X = 0.5 * X + 1.0 * (1 - X) = 1 - 0.5*X
  1.5 * X = 1
  X = 2/3
so, in the limit, the process is in the A state 2/3 of the time

Video 11.8 Stationary Distribution Question
back to the Rain Sun Markov Chain...
Quiz: what's the stationary distribution of Rain?
  P(Rinf) = X = P(Rt | Rt-1) * X + P(Rt | St-1) * (1 - X)
             = 0.6 * X + 0.2 * (1 - X)
  X = 0.6X + 0.2 - 0.2X = 0.4X + 0.2
  X - 0.4X = 0.2
  0.6X = 0.2
  X = 0.2/0.6 = 1/3
the stationary distribution does not depend on the initial state
"pretty much all" Markov Chains have this property and are called "Ergodic"
the speed with which the initial state "gets lost" is "the Mixing Speed"
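the same algebra in closed form, for any 2-state chain -- a small sketch (mine, not from the class); it just solves x = a*x + b*(1-x), where a = P(A|A) and b = P(A|B):

    def stationary(p_a_given_a, p_a_given_b):
        """Fixed point of x = a*x + b*(1 - x) for a 2-state chain."""
        return p_a_given_b / (1.0 - p_a_given_a + p_a_given_b)

    print(stationary(0.5, 1.0))   # A/B chain from Video 11.7    -> 2/3
    print(stationary(0.6, 0.2))   # Rain/Sun from Video 11.8     -> 1/3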
Video 11.9 Finding Transition Probabilities
see image: 11class_TransitionProb1.jpg
you can learn the transition probabilities from actual data using Maximum Likelihood
using the Rainy Sunny model, first without any smoothing, with the sequence: R S S S R S R
the probability of Rainy on the first day is P(R0) = 1 (the sequence starts with R)
the probability of Rainy to Rainy is P(R|R) = 0 (in this short sequence there are no two Rainy days in a row), so the probability of Rainy to Sunny is P(S|R) = 1
there are 4 Sunny->X transitions, two each of S->S and S->R, so the probability of Sunny to Sunny is P(S|S) = 2/4 and the probability of Sunny to Rainy is P(R|S) = 2/4

Video 11.10 Transition Probabilities Question
see image: 11class_TransitionProb2.jpg
Quiz: with the same model and this sequence: SSSSSRSSSRR (11 samples, 10 transitions)
  P(R0) = 0    (first sample is Sunny...)
  P(S|S) = 6/8 (8 total S->X transitions, 6 S->S, 2 S->R)
  P(R|S) = 2/8
  P(S|R) = 1/2 (2 total R->X transitions, 1 R->S, 1 R->R)
  P(R|R) = 1/2
note that each conditional pair needs to add up to 1:
  P(S|S) + P(R|S) = 1
  P(S|R) + P(R|R) = 1

Video 11.11 Laplacian Smoothing Question
see image: 11class_TPlaplacian.jpg
...one of the oddities of ML is overfitting... so do Laplacian Smoothing with K=1
Quiz: with the same model and this sequence: RSSSS (5 samples, 4 transitions)
(note: add 1 to every count before you start... count the transitions of each kind, add 1 to each count, and divide by the total plus 2, because there are 2 possibilities, R and S)
  P(R0) = 2/3   (first sample is Rainy: (1 + 1) / (1 + 2) = 2/3)
  P(S|S) = 4/5  (3 S->X transitions, all 3 are S->S: (3 + 1) / (3 + 2) = 4/5)
  P(R|S) = 1/5  (0 S->R transitions: (0 + 1) / (3 + 2) = 1/5)
  P(S|R) = 2/3  (1 R->X transition, and it is R->S: (1 + 1) / (1 + 2) = 2/3)
  P(R|R) = 1/3  (0 R->R transitions: (0 + 1) / (1 + 2) = 1/3)
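both estimators in one short Python sketch (mine, not from the class; the Counter-based representation is an assumption) -- k=0 gives the raw Maximum Likelihood counts of Videos 11.9-11.10, k=1 the Laplacian smoothing of Video 11.11:

    from collections import Counter

    def transition_probs(seq, states="RS", k=0):
        """Estimate P(next | current) from a state string, with Laplace k.
        (With k=0 every state must occur at least once as a predecessor.)"""
        pair_counts = Counter(zip(seq, seq[1:]))  # (current, next) -> count
        from_counts = Counter(seq[:-1])           # current -> total outgoing
        return {(cur, nxt): (pair_counts[(cur, nxt)] + k)
                            / (from_counts[cur] + k * len(states))
                for cur in states for nxt in states}

    p = transition_probs("SSSSSRSSSRR")        # ML, Video 11.10
    print(p[("S", "S")], p[("S", "R")])        # 0.75 0.25     (6/8, 2/8)
    p = transition_probs("RSSSS", k=1)         # Laplace, Video 11.11
    print(p[("S", "S")], p[("R", "R")])        # 0.8  0.333... (4/5, 1/3)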
Video 11.12 HMM Happy Grumpy Problem
see image: 11class_HMMProb1.jpg
for Hidden Markov Models you can't see the state
for the Rainy Sunny model (yesterday -> today):
  Rainy -> Rainy = 0.6 = P(R1|R0)
  Rainy -> Sunny = 0.4 = P(S1|R0)
  Sunny -> Sunny = 0.8 = P(S1|S0)
  Sunny -> Rainy = 0.2 = P(R1|S0)
we don't know if it's rainy or sunny, but we know the transition probabilities, and we know if ST is Happy or Grumpy, with probabilities (shown on image)
starting with equal day 0 probabilities: P(R0) = P(S0) = 1/2
given that he's happy on day 1, the probability of Rainy given Happy on day 1 is (by Bayes):
  P(R1 | H1) = P(H1 | R1) * P(R1) / P(H1) = 0.4 * 0.4 / 0.7 ~= 0.229
get the probability of Rainy on day 1 (total probability over the R,S transitions):
  P(R1) = P(R1|R0)*P(R0) + P(R1|S0)*P(S0) = 0.6*0.5 + 0.2*0.5 = 0.4
making the probability of Sunny on day 1 P(S1) = 1 - P(R1) = 0.6
the probability of being happy on a rainy day is P(H1 | R1) = 0.4 (due to the observation model R->H = 0.4)
and P(H1), the total probability of Happy on day 1, is the rainy term plus the probability of being Happy on a Sunny day (observation model S->H = 0.9):
  P(H1|R1)*P(R1) = 0.4 * 0.4 = 0.16
  P(H1|S1)*P(S1) = 0.9 * 0.6 = 0.54
  P(H1) = 0.16 + 0.54 = 0.70

Video 11.13 Happy Grumpy Question
see image: 11class_HMMProb2.jpg
Quiz: same model and probabilities, except it's Rainy on day 0: P(R0) = 1
observe Happy on day 1; what is P(R1 | H1)? answer: 0.4
the probability of Rainy given Happy on day 1 is (by Bayes):
  P(R1 | H1) = P(H1 | R1) * P(R1) / P(H1) = 0.4 * 0.6 / 0.6 = 0.4
get the probability of Rainy on day 1 (total probability over the R,S transitions):
  P(R1) = P(R1|R0)*P(R0) + P(R1|S0)*P(S0) = 0.6*1 + 0.2*0 = 0.6
making the probability of Sunny on day 1 P(S1) = 1 - P(R1) = 0.4
the probability of being happy on a rainy day is P(H1 | R1) = 0.4 (observation model R->H = 0.4)
and P(H1), the total probability of Happy on day 1 (observation model S->H = 0.9):
  P(H1|R1)*P(R1) = 0.4 * 0.6 = 0.24
  P(H1|S1)*P(S1) = 0.9 * 0.4 = 0.36
  P(H1) = 0.24 + 0.36 = 0.60

Video 11.14 Wow You Understand
use HMMs for prediction and state estimation
in prediction you predict the next state, and possibly the measurement
in state estimation you compute the probability of the internal (hidden) state given the measurements

Video 11.15 HMMs and Robot Localization
not clear to me how the decay as it moves works...

Video 11.16 HMM Equations
see image: 11class_HMMParameters.jpg
an HMM chain with states Xn and measurements Zn:

  X1 -> X2 -> X3 -> X4
   |     |     |     |
   v     v     v     v
  Z1    Z2    Z3    Z4

given a certain state, say X2, the past (X1), the future (X3, X4), and the measurement Z2 are all conditionally independent of each other given X2
find the probability that you're in a certain state given a measurement, by Bayes:
  P(X1 | Z1) = P(Z1 | X1) * P(X1) / P(Z1)
but since the denominator P(Z1) does not depend on the target value X1, we can drop it and use a proportionality instead:
  P(X1 | Z1) is proportional to P(Z1 | X1) * P(X1)
this is the basic measurement update for an HMM, but it eventually needs to be normalized by P(Z1)...
the prediction equation "predicts" the distribution of X2 given that we know the distribution of X1 (X1 -> X2)
we use Total Probability over the previous states X1 and the probability of going from those states to X2:
  P(X2) = sum over X1 of P(X1) * P(X2 | X1)
the parameters of an HMM are then:
  the next state distribution: P(X2 | X1)
  the measurement distribution: P(Z1 | X1)
  the initial state distribution: P(X0)
(a Python sketch of these two updates appears at the end of these notes)

Video 11.17 HMM Localization Example
see image: 11class_HMMExample.jpg
measurements are multiplications of the current state distribution with the given measurement probability distribution, which updates the current state
motions become essentially convolutions with added noise

Video 11.18 Particle Filters
see images: 11class_ParticleVid1-3.jpg
one of the most successful AI algorithms for robot localization
represent the space by a collection of points, or Particles
each is a hypothesis about the robot location: X, Y, Heading
the whole set is the belief space
it approximates the posterior probability with many "guesses", and the density of the guesses represents the posterior probability of being in a certain location
the more consistent a Particle is with the measurement, the more likely it is
uses 10 lines of C code... somehow...

Video 11.19 Localization and Particle Filters
start with a bunch of particles
make a measurement, then weight the particles by the Importance Weight (the measurement probability)
create a new particle set that is denser where the weights were high -- this is Resampling: pick particles in proportion to their Importance Weight
add the motion and a little noise to each to make a new particle -- this makes a Forward Prediction Set
then make another measurement and multiply to get a new Importance Weight set
Particle Filters work in continuous spaces, and use computational resources in proportion to how likely things are

Video 11.20 Particle Filter Algorithm
the input is a set S of particles with Importance Weights, plus a control vector v and a measurement vector z: (S, v, z)
construct a new particle set S' and a running normalizer "eta"
for all particles in the input set:
  sample a particle sj (with replacement) in proportion to its Importance Weight
  make a successor state x' according to the state transition probability, using the controls: x' ~ P(x' | v, sj)
  with an Importance Weight which is the measurement probability for that particle: w' = P(z | x')
  these are the new Particle and (non-normalized) Importance Weight; add them into the new state set: S' = S' (union) {<x', w'>}
  also add w' to eta for later normalization
at the very end we have to normalize all the new weights using eta: w' = w' / eta
(a Python sketch of this loop also appears at the end of these notes)

Video 11.21 Particle Filters Pros and Cons
easy to implement
work well in many applications, e.g. self-driving cars
computationally efficient
can deal with non-monotonic distributions -- ones with many peaks (many other filters can't deal with those...)
but the number of particles grows exponentially with the dimension of the space, although some extensions to the algorithm can help...
problems with degenerate conditions, like a small number of particles
need noise in the measurements and controls, to mix things up a little

Video 11.22 Conclusion
particle filters are used in anything involving time, sensors, and uncertainty
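appendix: a minimal Python sketch (mine, not from the class) of the HMM measurement and prediction updates from Videos 11.16-11.17, checked against the Happy Grumpy numbers of Videos 11.12-11.13; the dict representation is my own choice, and the Grumpy probabilities are just the complements of the Happy ones:

    TRANS = {"R": {"R": 0.6, "S": 0.4},   # P(today | yesterday)
             "S": {"R": 0.2, "S": 0.8}}
    MEAS  = {"R": {"H": 0.4, "G": 0.6},   # P(mood | weather), G = 1 - H
             "S": {"H": 0.9, "G": 0.1}}

    def predict(belief):
        """Prediction: P(X2) = sum over X1 of P(X1) * P(X2 | X1)."""
        return {x2: sum(belief[x1] * TRANS[x1][x2] for x1 in belief)
                for x2 in TRANS}

    def measure(belief, z):
        """Measurement update: P(X | Z) = P(Z | X) * P(X) / P(Z)."""
        unnorm = {x: MEAS[x][z] * belief[x] for x in belief}
        p_z = sum(unnorm.values())        # the normalizer P(Z)
        return {x: p / p_z for x in unnorm}

    # Video 11.12: P(R0) = P(S0) = 0.5, observe Happy on day 1
    print(measure(predict({"R": 0.5, "S": 0.5}), "H")["R"])  # ~0.229
    # Video 11.13: P(R0) = 1, observe Happy on day 1
    print(measure(predict({"R": 1.0, "S": 0.0}), "H")["R"])  # 0.4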
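and a minimal sketch of the particle filter loop from Video 11.20, on a 1-D toy problem; the motion model, the Gaussian measurement likelihood, and the parameter names (v as a displacement, meas_sigma, move_noise) are illustrative assumptions, not from the class:

    import math
    import random

    def particle_filter(S, v, z, meas_sigma=1.0, move_noise=0.1):
        """One (S, v, z) step of the Video 11.20 loop for 1-D particles.
        S is a list of (x, w) pairs; returns the resampled, reweighted set."""
        weights = [w for _, w in S]
        S_new, eta = [], 0.0
        for _ in S:
            # sample sj (with replacement) in proportion to its Importance Weight
            sj, _ = random.choices(S, weights=weights)[0]
            # successor state x' ~ P(x' | v, sj) -- assumed shift-plus-noise model
            x_new = sj + v + random.gauss(0.0, move_noise)
            # Importance Weight w' = P(z | x') -- assumed Gaussian likelihood
            w_new = math.exp(-0.5 * ((z - x_new) / meas_sigma) ** 2)
            S_new.append((x_new, w_new))
            eta += w_new
        # at the very end, normalize all the new weights using eta
        return [(x, w / eta) for x, w in S_new]

    # start uniform over [0, 10], move by 1.0, then observe position 4.0
    S = [(random.uniform(0, 10), 1.0) for _ in range(1000)]
    S = particle_filter(S, v=1.0, z=4.0)
    print(sum(x * w for x, w in S))   # weighted mean, should land near 4.0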