Unit 11 Hidden Markov Models and Filters
Nov 15, 2011

Video 11.1. Introduction
really fun class

Video 11.2. Hidden Markov Models
HMMs are used to analyze or predict time series in robotics, medical, financial, speech, and language applications

Video 11.3. Bayes Network of Hidden Markov Models
a sequence of states -- Sn -- that evolves over time, where each state depends _only_ on the previous state, is called a Markov Chain
each state emits a "measurement" -- Zn
with a Hidden Markov Chain/Model you don't see the states, only the measurements
HMMs form the basis of Kalman filters and Particle filters

Video 11.4. Examples
Robot Localization Problem
  robot has rangefinders and a map of the environment
  the task is to infer where it is from the range sensor measurements
  this is filtering with an underlying HMM
Underground mapping robot
  also builds the map as it goes
  generates "hypotheses" about its position based on noisy data
  each path is a "particle"
  when the path reconnects someplace, one hypothesis is selected as best
Speech recognition
  uses HMMs to analyze sounds into text

Video 11.5 Markov Chain question & answer
see image: 11class_MarkovChainQuiz1.jpg
not hidden...
Quiz: for Rainy and Sunny weather day to day (yesterday -> today):
  Rainy -> Rainy = 0.6 = P(R1|R0)
  Rainy -> Sunny = 0.4 = P(S1|R0)
  Sunny -> Sunny = 0.8 = P(S1|S0)
  Sunny -> Rainy = 0.2 = P(R1|S0)
start state: P(R0) = 1, P(S0) = 0
probability of rain on the next days:
  P(R1) = 0.6 -- first R->R transition (note: P(S1) = 0.4)
  P(R2) = 0.44 -- total probability: (note: P(S2) = 0.56)
    P(R2|R1)*P(R1) + P(R2|S1)*P(S1) = 0.6*0.6 + 0.2*0.4
  P(R3) = 0.376 -- total probability:
    P(R3|R2)*P(R2) + P(R3|S2)*P(S2) = 0.6*0.44 + 0.2*0.56

Video 11.6 Markov Chain 2 question & answer
see image: 11class_MarkovChainQuiz2.jpg
not hidden...
Quiz: ...for A to B and back again...
  A -> A = 0.5
  A -> B = 0.5
  B -> A = 1.0
start state: P(A0) = 1, P(B0) = 0
probability of A in the next steps:
  P(A1) = 0.5 -- first A->A transition (note: P(B1) = 0.5)
  P(A2) = 0.75 -- total probability: (note: P(B2) = 0.25)
    P(A2|A1)*P(A1) + P(A2|B1)*P(B1) = 0.5*0.5 + 1.0*0.5
  P(A3) = 0.625 -- total probability:
    P(A3|A2)*P(A2) + P(A3|B2)*P(B2) = 0.5*0.75 + 1.0*0.25
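a quick numeric check of the two quizzes above -- a minimal Python sketch (mine, not from the class; the function name and two-parameter form are my own choices):

    def propagate(p_a, p_a_given_a, p_a_given_b, steps):
        """Track P(A_t) for a 2-state Markov chain by total probability."""
        history = []
        for _ in range(steps):
            p_a = p_a_given_a * p_a + p_a_given_b * (1.0 - p_a)
            history.append(p_a)
        return history

    # Rain/Sun chain from Video 11.5: P(R|R)=0.6, P(R|S)=0.2, P(R0)=1
    print(propagate(1.0, 0.6, 0.2, 3))   # -> approx [0.6, 0.44, 0.376]
    # A/B chain from Video 11.6: P(A|A)=0.5, P(A|B)=1.0, P(A0)=1
    print(propagate(1.0, 0.5, 1.0, 3))   # -> [0.5, 0.75, 0.625]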
Video 11.7 Stationary Distribution
see image: 11class_StationaryDist.jpg
What happens if you wait a long time -- an infinite number of time steps into the future?
  P(Ainf) = lim(t->inf) P(At)
a Markov Chain settles to a Stationary Distribution -- or sometimes a Limit Cycle if the transitions are deterministic (which we don't care about...)
in the limit, the probability of the next step is the same as that of the previous step:
  P(At) = P(At-1)
with the total probability expansion as before:
  P(At) = P(At | At-1) * P(At-1) + P(At | Bt-1) * P(Bt-1)
substituting X = P(At-1), noting that P(Bt-1) = 1 - P(At-1), and solving for X with the transition probabilities from the last quiz:
  X = P(At | At-1) * X + P(At | Bt-1) * (1 - X)
  X = 0.5 * X + 1.0 * (1 - X) = 1 - 0.5*X
  1.5 * X = 1
  X = 2/3
so, in the limit, the process is in the A state 2/3 of the time

Video 11.8 Stationary Distribution Question
back to the Rain Sun Markov Chain...
Quiz: what's the stationary distribution of Rain?
  P(Rinf) = X = P(Rt | Rt-1) * X + P(Rt | St-1) * (1 - X)
             = 0.6 * X + 0.2 * (1 - X)
  X = 0.6X + 0.2 - 0.2X = 0.4X + 0.2
  X - 0.4X = 0.2
  0.6X = 0.2
  X = 0.2/0.6 = 1/3
the stationary distribution does not depend on the initial state
"pretty much all" Markov Chains have this property and are called "Ergodic"
the speed with which the initial state "gets lost" is "the Mixing Speed"
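the same algebra in closed form, for any 2-state chain -- a small sketch (mine, not from the class); it just solves x = a*x + b*(1-x), where a = P(A|A) and b = P(A|B):

    def stationary(p_a_given_a, p_a_given_b):
        """Fixed point of x = a*x + b*(1 - x) for a 2-state chain."""
        return p_a_given_b / (1.0 - p_a_given_a + p_a_given_b)

    print(stationary(0.5, 1.0))   # A/B chain from Video 11.7    -> 2/3
    print(stationary(0.6, 0.2))   # Rain/Sun from Video 11.8     -> 1/3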
Video 11.9 Finding Transition Probabilities
see image: 11class_TransitionProb1.jpg
you can learn the transition probabilities from actual data using Maximum Likelihood
using the Rainy Sunny model, first without any smoothing, with the sequence: R S S S R S R
the probability of Rainy on the first day is P(R0) = 1 (the sequence starts with R)
the probability of Rainy to Rainy is P(R|R) = 0 (in this short sequence there are no two Rainy days in a row), so the probability of Rainy to Sunny is P(S|R) = 1
there are 4 Sunny->X transitions, two each of S->S and S->R, so the probability of Sunny to Sunny is P(S|S) = 2/4 and the probability of Sunny to Rainy is P(R|S) = 2/4

Video 11.10 Transition Probabilities Question
see image: 11class_TransitionProb2.jpg
Quiz: with the same model and this sequence: SSSSSRSSSRR (11 samples, 10 transitions)
  P(R0) = 0    (first sample is Sunny...)
  P(S|S) = 6/8 (8 total S->X transitions, 6 S->S, 2 S->R)
  P(R|S) = 2/8
  P(S|R) = 1/2 (2 total R->X transitions, 1 R->S, 1 R->R)
  P(R|R) = 1/2
note that each conditional pair needs to add up to 1:
  P(S|S) + P(R|S) = 1
  P(S|R) + P(R|R) = 1

Video 11.11 Laplacian Smoothing Question
see image: 11class_TPlaplacian.jpg
...one of the oddities of ML is overfitting... so do Laplacian Smoothing with K=1
Quiz: with the same model and this sequence: RSSSS (5 samples, 4 transitions)
(note: add 1 to every count before you start... count the transitions of each kind, add 1 to each count, and divide by the total plus 2, because there are 2 possibilities, R and S)
  P(R0) = 2/3   (first sample is Rainy: (1 + 1) / (1 + 2) = 2/3)
  P(S|S) = 4/5  (3 S->X transitions, all 3 are S->S: (3 + 1) / (3 + 2) = 4/5)
  P(R|S) = 1/5  (0 S->R transitions: (0 + 1) / (3 + 2) = 1/5)
  P(S|R) = 2/3  (1 R->X transition, and it is R->S: (1 + 1) / (1 + 2) = 2/3)
  P(R|R) = 1/3  (0 R->R transitions: (0 + 1) / (1 + 2) = 1/3)
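both estimators in one short Python sketch (mine, not from the class; the Counter-based representation is an assumption) -- k=0 gives the raw Maximum Likelihood counts of Videos 11.9-11.10, k=1 the Laplacian smoothing of Video 11.11:

    from collections import Counter

    def transition_probs(seq, states="RS", k=0):
        """Estimate P(next | current) from a state string, with Laplace k.
        (With k=0 every state must occur at least once as a predecessor.)"""
        pair_counts = Counter(zip(seq, seq[1:]))  # (current, next) -> count
        from_counts = Counter(seq[:-1])           # current -> total outgoing
        return {(cur, nxt): (pair_counts[(cur, nxt)] + k)
                            / (from_counts[cur] + k * len(states))
                for cur in states for nxt in states}

    p = transition_probs("SSSSSRSSSRR")        # ML, Video 11.10
    print(p[("S", "S")], p[("S", "R")])        # 0.75 0.25     (6/8, 2/8)
    p = transition_probs("RSSSS", k=1)         # Laplace, Video 11.11
    print(p[("S", "S")], p[("R", "R")])        # 0.8  0.333... (4/5, 1/3)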
Video 11.12 HMM Happy Grumpy Problem
see image: 11class_HMMProb1.jpg
for Hidden Markov Models you can't see the state
for the Rainy Sunny model (yesterday -> today):
  Rainy -> Rainy = 0.6 = P(R1|R0)
  Rainy -> Sunny = 0.4 = P(S1|R0)
  Sunny -> Sunny = 0.8 = P(S1|S0)
  Sunny -> Rainy = 0.2 = P(R1|S0)
we don't know if it's rainy or sunny, but we know the transition probabilities, and we know if ST is Happy or Grumpy, with probabilities (shown on image)
starting with equal day 0 probabilities: P(R0) = P(S0) = 1/2
given that he's happy on day 1, the probability of Rainy given Happy on day 1 is (by Bayes):
  P(R1 | H1) = P(H1 | R1) * P(R1) / P(H1) = 0.4 * 0.4 / 0.7 ~= 0.229
get the probability of Rainy on day 1 (total probability over the R,S transitions):
  P(R1) = P(R1|R0)*P(R0) + P(R1|S0)*P(S0) = 0.6*0.5 + 0.2*0.5 = 0.4
making the probability of Sunny on day 1 P(S1) = 1 - P(R1) = 0.6
the probability of being happy on a rainy day is P(H1 | R1) = 0.4 (due to the observation model R->H = 0.4)
and P(H1), the total probability of Happy on day 1, is the rainy term plus the probability of being Happy on a Sunny day (observation model S->H = 0.9):
  P(H1|R1)*P(R1) = 0.4 * 0.4 = 0.16
  P(H1|S1)*P(S1) = 0.9 * 0.6 = 0.54
  P(H1) = 0.16 + 0.54 = 0.70

Video 11.13 Happy Grumpy Question
see image: 11class_HMMProb2.jpg
Quiz: same model and probabilities, except it's Rainy on day 0: P(R0) = 1
observe Happy on day 1; what is P(R1 | H1)? answer: 0.4
the probability of Rainy given Happy on day 1 is (by Bayes):
  P(R1 | H1) = P(H1 | R1) * P(R1) / P(H1) = 0.4 * 0.6 / 0.6 = 0.4
get the probability of Rainy on day 1 (total probability over the R,S transitions):
  P(R1) = P(R1|R0)*P(R0) + P(R1|S0)*P(S0) = 0.6*1 + 0.2*0 = 0.6
making the probability of Sunny on day 1 P(S1) = 1 - P(R1) = 0.4
the probability of being happy on a rainy day is P(H1 | R1) = 0.4 (observation model R->H = 0.4)
and P(H1), the total probability of Happy on day 1 (observation model S->H = 0.9):
  P(H1|R1)*P(R1) = 0.4 * 0.6 = 0.24
  P(H1|S1)*P(S1) = 0.9 * 0.4 = 0.36
  P(H1) = 0.24 + 0.36 = 0.60

Video 11.14 Wow You Understand
use HMMs for prediction and state estimation
in prediction you predict the next state, and possibly the measurement
in state estimation you compute the probability of the internal (hidden) state given the measurements

Video 11.15 HMMs and Robot Localization
not clear to me how the decay as it moves works...

Video 11.16 HMM Equations
see image: 11class_HMMParameters.jpg
an HMM chain with states Xn and measurements Zn:

  X1 -> X2 -> X3 -> X4
   |     |     |     |
   v     v     v     v
  Z1    Z2    Z3    Z4

given a certain state, say X2, the past (X1), the future (X3, X4), and the measurement Z2 are all conditionally independent of each other given X2
find the probability that you're in a certain state given a measurement, by Bayes:
  P(X1 | Z1) = P(Z1 | X1) * P(X1) / P(Z1)
but since the denominator P(Z1) does not depend on the target value X1, we can drop it and use a proportionality instead:
  P(X1 | Z1) is proportional to P(Z1 | X1) * P(X1)
this is the basic measurement update for an HMM, but it eventually needs to be normalized by P(Z1)...
the prediction equation "predicts" the distribution of X2 given that we know the distribution of X1 (X1 -> X2)
we use Total Probability over the previous states X1 and the probability of going from those states to X2:
  P(X2) = sum over X1 of P(X1) * P(X2 | X1)
the parameters of an HMM are then:
  the next state distribution: P(X2 | X1)
  the measurement distribution: P(Z1 | X1)
  the initial state distribution: P(X0)
(a Python sketch of these two updates appears at the end of these notes)

Video 11.17 HMM Localization Example
see image: 11class_HMMExample.jpg
measurements are multiplications of the current state distribution with the given measurement probability distribution, which updates the current state
motions become essentially convolutions with added noise

Video 11.18 Particle Filters
see images: 11class_ParticleVid1-3.jpg
one of the most successful AI algorithms for robot localization
represent the space by a collection of points, or Particles
each is a hypothesis about the robot location: X, Y, Heading
the whole set is the belief space
it approximates the posterior probability with many "guesses", and the density of the guesses represents the posterior probability of being in a certain location
the more consistent a Particle is with the measurement, the more likely it is
uses 10 lines of C code... somehow...

Video 11.19 Localization and Particle Filters
start with a bunch of particles
make a measurement, then weight the particles by the Importance Weight (the measurement probability)
create a new particle set that is denser where the weights were high -- this is Resampling: pick particles in proportion to their Importance Weight
add the motion and a little noise to each to make a new particle -- this makes a Forward Prediction Set
then make another measurement and multiply to get a new Importance Weight set
Particle Filters work in continuous spaces, and use computational resources in proportion to how likely things are

Video 11.20 Particle Filter Algorithm
the input is a set S of particles with Importance Weights, plus a control vector v and a measurement vector z: (S, v, z)
construct a new particle set S' and a running normalizer "eta"
for all particles in the input set:
  sample a particle sj (with replacement) in proportion to its Importance Weight
  make a successor state x' according to the state transition probability, using the controls: x' ~ P(x' | v, sj)
  with an Importance Weight which is the measurement probability for that particle: w' = P(z | x')
  these are the new Particle and (non-normalized) Importance Weight; add them into the new state set: S' = S' (union) {<x', w'>}
  also add w' to eta for later normalization
at the very end we have to normalize all the new weights using eta: w' = w' / eta
(a Python sketch of this loop also appears at the end of these notes)

Video 11.21 Particle Filters Pros and Cons
easy to implement
work well in many applications, e.g. self-driving cars
computationally efficient
can deal with non-monotonic distributions -- ones with many peaks (many other filters can't deal with those...)
but the number of particles grows exponentially with the dimension of the space, although some extensions to the algorithm can help...
problems with degenerate conditions, like a small number of particles
need noise in the measurements and controls, to mix things up a little

Video 11.22 Conclusion
particle filters are used in anything involving time, sensors, and uncertainty
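appendix: a minimal Python sketch (mine, not from the class) of the HMM measurement and prediction updates from Videos 11.16-11.17, checked against the Happy Grumpy numbers of Videos 11.12-11.13; the dict representation is my own choice, and the Grumpy probabilities are just the complements of the Happy ones:

    TRANS = {"R": {"R": 0.6, "S": 0.4},   # P(today | yesterday)
             "S": {"R": 0.2, "S": 0.8}}
    MEAS  = {"R": {"H": 0.4, "G": 0.6},   # P(mood | weather), G = 1 - H
             "S": {"H": 0.9, "G": 0.1}}

    def predict(belief):
        """Prediction: P(X2) = sum over X1 of P(X1) * P(X2 | X1)."""
        return {x2: sum(belief[x1] * TRANS[x1][x2] for x1 in belief)
                for x2 in TRANS}

    def measure(belief, z):
        """Measurement update: P(X | Z) = P(Z | X) * P(X) / P(Z)."""
        unnorm = {x: MEAS[x][z] * belief[x] for x in belief}
        p_z = sum(unnorm.values())        # the normalizer P(Z)
        return {x: p / p_z for x in unnorm}

    # Video 11.12: P(R0) = P(S0) = 0.5, observe Happy on day 1
    print(measure(predict({"R": 0.5, "S": 0.5}), "H")["R"])  # ~0.229
    # Video 11.13: P(R0) = 1, observe Happy on day 1
    print(measure(predict({"R": 1.0, "S": 0.0}), "H")["R"])  # 0.4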
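and a minimal sketch of the particle filter loop from Video 11.20, on a 1-D toy problem; the motion model, the Gaussian measurement likelihood, and the parameter names (v as a displacement, meas_sigma, move_noise) are illustrative assumptions, not from the class:

    import math
    import random

    def particle_filter(S, v, z, meas_sigma=1.0, move_noise=0.1):
        """One (S, v, z) step of the Video 11.20 loop for 1-D particles.
        S is a list of (x, w) pairs; returns the resampled, reweighted set."""
        weights = [w for _, w in S]
        S_new, eta = [], 0.0
        for _ in S:
            # sample sj (with replacement) in proportion to its Importance Weight
            sj, _ = random.choices(S, weights=weights)[0]
            # successor state x' ~ P(x' | v, sj) -- assumed shift-plus-noise model
            x_new = sj + v + random.gauss(0.0, move_noise)
            # Importance Weight w' = P(z | x') -- assumed Gaussian likelihood
            w_new = math.exp(-0.5 * ((z - x_new) / meas_sigma) ** 2)
            S_new.append((x_new, w_new))
            eta += w_new
        # at the very end, normalize all the new weights using eta
        return [(x, w / eta) for x, w in S_new]

    # start uniform over [0, 10], move by 1.0, then observe position 4.0
    S = [(random.uniform(0, 10), 1.0) for _ in range(1000)]
    S = particle_filter(S, v=1.0, z=4.0)
    print(sum(x * w for x, w in S))   # weighted mean, should land near 4.0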