Unit 4 Probabilistic Inference

Video 4.1 Overview and Example
    Inference -- how to answer probability questions using Bayes Networks

    Simple alarm model:
            B   E       Burglary, Earthquake
             \ /
              A         Alarm
             / \
            J   M       JohnCalls, MaryCalls

    Inputs:  B and E -- Evidence Variables -- known values
    others:  A       -- Hidden Variables   -- unknown value, but used in the computation
    Outputs: J and M -- Query Variables    -- computed values

    The result of a query is a joint probability DISTRIBUTION:
        the posterior distribution given the evidence --
        P( Q1, Q2 ... | E1=e1, E2=e2 ... )

    Another question: what is the most likely output?
        Out of all possible query value combinations, which has the highest probability?
        argmax P( Q1=q1, Q2=q2 ... | E1=e1 ... )

    Bayes Nets can go both ways --
        Causal     -- given the evidence (B,E), what are the outputs (J,M)
        Diagnostic -- given the outputs (J,M), what is the evidence (B,E)
        or any combination thereof -- putting arbitrary variables in the
        evidence and query positions

    Quiz: Mary has called to report that the alarm is going off and we want to
          know if there has been a burglary... What roles do the nodes take?
        Evidence nodes: M
        Query nodes:    B
        Hidden nodes:   J, A, E

Video 4.2 Enumeration
    -- goes through all the possibilities and adds them up
    (notation: +B is B=true, ~B is B=false, to avoid confusion...)

    State the problem: P(+B | +J, +M)
        the probability of a burglary given that both John and Mary called

    Expand using conditional probability -- P(Q|E) = P(Q,E) / P(E):
        P(+B, +J, +M) / P(+J, +M)

    Calculate a term by expanding over all possibilities of the hidden nodes (E,A):
        sumE ( sumA P(+B, E, A, +J, +M) )

    Expand each element in terms of the parents of each node, making it a
    product of factors, enumerated over the hidden variables:
        sumE ( sumA ( P(+B) * P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
        (so: B,E have no parents, just a priori probabilities: +B, and enumerate over E
             A has parents B,E: enumerate over A and E, with B fixed at +B
             J,M have parent A: enumerate over A )

    Calling the whole enchilada above f(E,A), then enumerate all values of E,A:
        f(+E,+A) + f(+E,~A) + f(~E,+A) + f(~E,~A)

    Values for each term come from the conditional probability model,
    see: alarmCPT.jpg

    Quiz: fill in the numbers for the term where +E and +A
        P(+B) * P(+E) * P(+A|+B,+E) * P(+J|+A) * P(+M|+A)
        .001  * .002  * .95         * .9       * .7        = .000001197
        look up the probs in each table where both E and A are true,
        then multiply them all together.
        Also remember that to get a real probability the terms have to be
        summed over each value of E,A and then normalized by P(+J,+M).

    The final probability: P(+B | +J, +M) = 0.284
        because the probabilities of J, M, A are all pretty high,
        but the prior P(+B) is very low
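    A minimal sketch of this enumeration in Python. The only CPT numbers quoted
    above are .001, .002, .95, .9, .7; the remaining entries below (P(+A|+B,~E),
    P(+A|~B,+E), P(+A|~B,~E), P(+J|~A), P(+M|~A)) are assumed to be the standard
    alarm-network values from the alarmCPT.jpg table:

    import itertools

    # Alarm-net CPTs: values not quoted in the notes are assumed from alarmCPT.jpg.
    P_B = {True: 0.001, False: 0.999}
    P_E = {True: 0.002, False: 0.998}
    P_A = {(True, True): 0.95, (True, False): 0.94,      # P(+A | B, E)
           (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}                      # P(+J | A)
    P_M = {True: 0.70, False: 0.01}                      # P(+M | A)

    def f(b, e, a):
        """One enumeration term: P(b) * P(e) * P(a|b,e) * P(+J|a) * P(+M|a)."""
        p_a = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
        return P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]

    hidden = list(itertools.product([True, False], repeat=2))   # all (e, a) pairs

    # Sum out the hidden variables E and A for +B and for ~B,
    # then normalize by P(+J,+M) = P(+B,+J,+M) + P(~B,+J,+M).
    p_b_jm  = sum(f(True,  e, a) for e, a in hidden)
    p_nb_jm = sum(f(False, e, a) for e, a in hidden)
    print(round(p_b_jm / (p_b_jm + p_nb_jm), 3))   # -> 0.284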
Video 4.3-4.7 Speeding Up Enumeration

    Pulling out terms -- starting with this:
        sumE ( sumA ( P(+B) * P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    P(+B) is a constant, so it can be pulled out of the summations:
        P(+B) * sumE ( sumA ( P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    P(E) can be pulled out of the inner summation because it doesn't depend on A:
        P(+B) * sumE ( P(E) * sumA ( P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    That reduces the inner summation from 5 terms to 3, but there are still
    the same number of rows to calculate.

    Maximize Independence --
        a linear net can do inference in O(n) time:   X1 -> X2 -> ... -> XN
        a "complete" net, where every node points to every other node,
        takes O(2^n) time (for boolean variables)

    Quiz: with two separate nodes, are they Dependent or Independent?
            J     M
        They are DEPENDENT, because having information about one affects
        what we can know about the other.

    Quiz: adding the A node, is it Dependent or Independent of J and M?
            J -> M
             \  /
              A
        It is DEPENDENT on both, because having information about J,M makes A
        more likely (intuitively?) -- but also look at the CPTs...

    Quiz: adding the B node, is it Dependent or Independent of A,J,M?
            J -> M
             \  /
              A     B
        It is DEPENDENT on A, and independent of J,M GIVEN A.

    Quiz: adding the E node, is it Dependent or Independent of B,A,J,M?
            J -> M
             \  /
              A
             /
            B       E
        It is DEPENDENT on A and B:
            A, because if E is true it's more likely that A is true
            B, because if B is true it's more likely that A is true,
               and thus E is less likely

            J -> M
             \  /
              A
             / \
            B -> E

    Causal Direction --
        "The moral is that Bayes nets are more compact when they go in the
        Causal Direction, from Causes to Evidence," as in the first diagram.

Video 4.8-4.11 Variable Elimination
    -- combine parts of the network and then enumerate over the smaller parts;
       compute by "marginalizing out" some variables.
    "If we make a good choice of the order of elimination, then it can be much
     more efficient than doing the whole enumeration."
    In general, Bayes net inference is NP-hard.

    We need an algebra for manipulating factors; a factor is a multi-dimensional
    array of probabilities. Each CPT is a factor.

        R -> T -> L     (Rain, Traffic, LateToAppointment)
        see CPTs: rain_lateCPT.jpg

    Joining Factors -- combining 2 or more CPTs to make a Joint Probability
    Table (JPT). Example, joining P(R) and P(T|R): point-wise multiply
    P(R) * P(T|R) for each value combination to get -- see: rain_lateCPT_JF.jpg
        P(R,T)
        +R +T   (.1 * .8) = 0.08
        +R ~T   (.1 * .2) = 0.02
        ~R +T   (.9 * .1) = 0.09
        ~R ~T   (.9 * .9) = 0.81
    making a net like this (note that the entries still add to 1):
        RT -> L

    Marginalization -- reduces the JPT back to a simple probability by summing
    out the intermediate variable: add the probabilities where T has the same
    value, + or ~ (still adds to 1):
        [new] P(T)
        +T   (.08 + .09) = 0.17
        ~T   (.02 + .81) = 0.83
    making a net like this (still adds to 1):
        T -> L

    Do it again, with the new P(T) and the old P(L|T)...
    Joining Factors (see rain_lateCPT_JF2.jpg for results):
        P(T,L)
        +T +L   (.17 * .3) = 0.051
        +T ~L   (.17 * .7) = 0.119
        ~T +L   (.83 * .1) = 0.083
        ~T ~L   (.83 * .9) = 0.747
    making a net like this (still adds to 1):
        T,L
    Marginalizing to:
        [new] P(L)
        +L   (.051 + .083) = 0.134
        ~L   (.119 + .747) = 0.866
    making a net like this (still adds to 1):
        L
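    A minimal sketch of these two join/marginalize steps in Python, using only
    the numbers worked out above (P(+R)=.1, P(+T|+R)=.8, P(+T|~R)=.1,
    P(+L|+T)=.3, P(+L|~T)=.1):

    # Factors as dicts; values taken from the worked example above.
    P_R = {'+R': 0.1, '~R': 0.9}
    P_T_given_R = {('+R', '+T'): 0.8, ('+R', '~T'): 0.2,
                   ('~R', '+T'): 0.1, ('~R', '~T'): 0.9}
    P_L_given_T = {('+T', '+L'): 0.3, ('+T', '~L'): 0.7,
                   ('~T', '+L'): 0.1, ('~T', '~L'): 0.9}

    # Join P(R) and P(T|R) into the factor P(R,T) by point-wise multiplication,
    # then marginalize out R by summing over its values.
    P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
            for r in P_R for t in ('+T', '~T')}
    P_T = {t: sum(P_RT[(r, t)] for r in P_R) for t in ('+T', '~T')}
    print(P_T)    # -> {'+T': 0.17, '~T': 0.83}  (up to float rounding)

    # Repeat with the new P(T) and the old P(L|T) to get P(L).
    P_TL = {(t, l): P_T[t] * P_L_given_T[(t, l)]
            for t in P_T for l in ('+L', '~L')}
    P_L = {l: sum(P_TL[(t, l)] for t in P_T) for l in ('+L', '~L')}
    print(P_L)    # -> {'+L': 0.134, '~L': 0.866}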
Video 4.12 Approximate Inference
    -- by means of sampling: take "samples" of the process in order to estimate
       the JPT. For instance, counting coin flips a number of times gives
       approximate results, see: samplingCoins.jpg
    Sampling can be easier than a complex inference computation, and if we
    don't know the CPTs but can simulate the process, we can still do
    sampling, but not inference.

Video 4.13 Sampling Example
    using the Cloudy model net: cloudyCPT.jpg
            C
           / \
          S   R
           \ /
            W
    Pick a value for C using its P(C).
    Then pick values for S and R using P(x|C) and the C value above;
        e.g. for P(S|C) with +C, this will give +S .1 and ~S .9 of the time.
    Then pick a value from P(W|S,R) based on the S,R values selected.

    Quiz: so... having selected +C, and then ~S and +R, what rows of P(W|S,R)
          would we use, and is it more likely that we have +W or ~W?
        Use the ~S +R rows, where +W = .9, so +W is more likely.
        see: cloudyCPT_sample.jpg
    That makes one sample, with the prior and sampled values as shown
    (+C,~S,+R => +W); then repeat for a new sample.

Video 4.14 Approximate Inference 2
    With an infinite number of samples, the sampling procedure reproduces the
    true joint probability table -- it is "Consistent".
    We can also compute any variable's CPT this way.

Video 4.15 Rejection Sampling
    What about computing a conditional probability? e.g.: P(W|~C)
    Do "Rejection Sampling" -- throw out the samples that don't match the
    desired conditions, e.g. every sample that has +C if we are only
    interested in ~C.
    This is also "Consistent".

    Problem with Rejection Sampling -- if the conditions are unlikely,
    you toss out a lot of samples...
        e.g.: for the Burglary->Alarm model, B is very unlikely,
              so P(B|+A) does a lot of useless sampling.

    Then use Likelihood Weighting -- set the values of the evidence variables,
    and sample the rest. This gives you a list like:
        ~B, +A
        ~B, +A
        +B, +A
        ~B, +A
    but it is "Inconsistent".
    That can be fixed by "assigning a probability to each sample and weighing
    them correctly".

Video 4.16-4.18 Likelihood Weighting
    is Consistent (when done right... haha)
    Collect samples with the constrained (evidence) variables fixed, but add a
    probabilistic weight to each sample: sample as before; for constrained
    variables, use only the fixed value and collect its probability given the
    other variables' values; when done, multiply the collected probs to get a
    weight for that sample.

    For P(R|+S,+W):
        select +/~C, say +C this time
        set +S; save the P(+S|C) value in the weight, i.e. P(+S|+C) = 0.1
        select +/~R, say +R this time
        set +W; save the P(+W|S,R) value in the weight, i.e. P(+W|+S,+R) = 0.99
        when done, multiply the saved values to get the weight for the sample
        +C,+S,+R,+W: (.1 * .99) = 0.099
    see: likeyWeight.jpg

    Doesn't always work well -- say we want P(C|+S,+R):
        we fix the values of +S and +R, which almost always produces a "good" W,
        but that has no effect on C, which will produce values at random.
        Often the C values will "not go well with the evidence" --
        they will have a low associated probability... the weights are low.

Video 4.19 Gibbs Sampling -- Josiah Gibbs
    Uses all the evidence, via Markov Chain Monte Carlo: "MCMC".
    Sample one variable at a time, conditioned on all the others.
    Randomly initialize values for the set of variables, say:
        +C +S ~R ~W     as the first sample
    Then on each iteration, pick one non-evidence variable and resample its
    value based on all the other (fixed) variable values:
        +C ~S ~R ~W     giving another sample
    Lather, rinse, repeat.
    "We end up walking around in the space of assignments of variables randomly."
    "In rejection and likelihood sampling, each sample was independent of the others."
    "With MCMC the samples are dependent on each other -- in fact, adjacent
     samples are very similar."
    "However, MCMC is still Consistent."

Video 4.20 Monty Hall Problemo
    Three doors, one with a good prize.
    You choose a door, say #1.
    The host opens a door that doesn't contain the prize, say #3.
    What is the probability of winning if you switch your choice?
        1/3 if you stick, 2/3 if you switch.
    Because... here we go...
        Your first choice's original probability was 1/3... which doesn't change...
        so the second door MUST be 2/3!
    Why doesn't the original 1/3 prob hold for door #2?
        Because we learned that it wasn't the door the host opened,
        and #1 was never an option for the host to open...

Video 4.21 Monty Hall Letter -- Monty didn't understand it either...
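    A quick simulation sketch (not from the lecture) that checks the
    stick-vs-switch answer by sampling:

    import random

    def monty_trial(switch):
        """Play one game; return True if the player wins the prize."""
        doors = [1, 2, 3]
        prize = random.choice(doors)
        pick = random.choice(doors)
        # The host opens a door that is neither the player's pick nor the prize.
        host = random.choice([d for d in doors if d != pick and d != prize])
        if switch:
            pick = next(d for d in doors if d != pick and d != host)
        return pick == prize

    N = 100_000
    print(sum(monty_trial(False) for _ in range(N)) / N)   # sticking: ~1/3
    print(sum(monty_trial(True)  for _ in range(N)) / N)   # switching: ~2/3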