Unit 3 -- Probability
Oct 18, 2011

Video 3.1  Bayes Network

A Bayes network is a compact representation of a distribution over a large joint distribution.
It is composed of nodes, which are known/unknown events, i.e. random variables,
connected by directed "arcs" (links) indicating that a parent node has a
probabilistic influence over a child node.
From the net, one can make observations (e.g. car won't start and lights don't work)
and compute the probability of particular hypotheses (causes).

see image: bayesnet_intro.jpg

Example:
    observation: car won't start
    hypotheses: bad battery, no oil, no gas, ...
    can make measurements: battery meter, gas gauge, dipstick
    some measurements may be affected by a common cause:
        a bad battery affects the battery meter and gas gauge, but not the dipstick

Bayes networks are used for Diagnostics, Prediction, Machine Learning,
and in Finance, Google, Robotics.
They are components of Particle Filters, Hidden Markov Models, MDPs, POMDPs,
Kalman filters, ...

Outline for the rest of the unit:
    1. Will use discrete binary events
    2. Probability review
    3. Simple Bayes networks
    4. Conditional independence
    5. General Bayes networks
    6. D-separation
    7. Parameter counts
    8. Next unit: Probabilistic Inference

Videos 3.2-3.7  Probabilities

For a coin flip the results are heads or tails; for a fair coin:
    P(H) = 0.5, P(T) = 0.5
Probabilities add up to 1.0, so if P(H) = 0.5, then P(T) = 1 - 0.5.

Independence:
    X independent of Y:  P(X) * P(Y) = P(X,Y)
    the product of the marginals -- P(X) times P(Y) -- equals the joint probability P(X,Y)

So for multiple independent trials, probabilities multiply:
    P(H,H,H) = 1/2 * 1/2 * 1/2 = 1/8   (0.125 chance of getting three heads in a row)
    note: this is combinatorics -- three things that can each be in 2 states = 2^3 possibilities

Flip four times:
    Xi is the result of the ith flip, where Xi = {H,T} and Pi(H) = 0.5 for any flip
    (the result of the ith flip can be H or T, and the probability of heads is 0.5 on every flip)

P(X1 = X2 = X3 = X4)?
    What is the probability that all flips give the SAME result: all H or all T?
    two ways to get it: HHHH, TTTT, each with P = 1/16
    so: P = 1/16 + 1/16 = 1/8, or 0.125

P(X1,X2,X3,X4 has at least 3 H)?
    What is the probability that in four flips we get at least 3 H?
    five ways to get it, each with P = 1/16
    so: P = 5/16, or 0.3125   (HHHH, HHHT, HHTH, HTHH, THHH)

note: for combinations, see the Khan unit:
    http://www.khanacademy.org/video/exactly-three-heads-in-five-flips?playlist=Probability
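Not from the lecture -- a quick Python sketch that checks the two four-flip quizzes by
brute-force enumeration of the 16 equally likely outcomes (variable names are mine):

    from itertools import product

    outcomes = list(product("HT", repeat=4))          # all 2^4 = 16 sequences of four flips

    # P(X1 = X2 = X3 = X4): all flips show the same face
    p_all_same = sum(1 for o in outcomes if len(set(o)) == 1) / len(outcomes)

    # P(at least 3 heads in four flips)
    p_three_heads = sum(1 for o in outcomes if o.count("H") >= 3) / len(outcomes)

    print(p_all_same)     # 0.125  = 1/8
    print(p_three_heads)  # 0.3125 = 5/16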
Unit 3.3  Video 3.8  Dependence

Two flips of different coins, where the first coin is fair and its result decides the second coin:
    Heads means we next flip a 90% Heads coin
    Tails means we next flip an 80% Tails coin
What is the probability that the second flip comes up Heads?

    P(X1=H) = 1/2
    if X1 == H then P(X2=H | X1=H) = 0.9
    if X1 == T then P(X2=T | X1=T) = 0.8
        where the last equation reads: the probability that flip-2 is Tails,
        GIVEN that flip-1 was Tails, equals 0.8

The answer is 0.55, because:

    P(X2=H) = P(X2=H | X1=H) * P(X1=H)     [prob that X1=H and X2=H]
            + P(X2=H | X1=T) * P(X1=T)     [plus prob that X1=T and X2=H]

    The probability that X2 will be heads equals the probability that X2 is heads
    given that X1 was heads (weighted by how likely that was), plus the probability
    that X2 is heads given that X1 was tails (weighted likewise).
    Note that the probability that the 80% Tails coin comes up heads is 20%, i.e. 1 - P(T).

    So the numbers are:
        0.9 * 0.5     (90% coin used 50% of the time)
      + 0.2 * 0.5     (20% coin used 50% of the time)
      = 0.45 + 0.1 = 0.55

Lessons

Total probability:
    P(Y) = {sum over i of} P(Y | X=i) * P(X=i)
    the total probability of Y equals the sum of (the probability of Y given X=i)
    times (the probability of X=i), over all values of i

Negation of a probability:
    P(~X | Y) = 1 - P(X | Y)
    the probability of NOT X given Y equals one minus the probability of X given Y

    Probability of X given NOT Y?  No -- you cannot negate the variable you are conditioning on:
    P(X | ~Y) = 1 - P(X | Y)  -- NO!!!

Videos 3.10-3.12  Weather

notes: D1 means Day-1 from the introduction, assuming only two states {sunny, rainy}

quiz 1:
    P(D1=sunny) = 0.9
    P(D2=sunny | D1=sunny) = 0.8
    P(D2=rainy | D1=sunny) = ?
    from Negation: P(~X | Y) = 1 - P(X | Y)
        P(X|Y) = P(D2=sunny | D1=sunny) = 0.8, and rainy = ~sunny
        P(D2=rainy | D1=sunny) = 1 - 0.8 = 0.2

quiz 2:
    P(D2=sunny | D1=rainy) = 0.6
    P(D2=rainy | D1=rainy) = ?
    again from Negation: P(~X | Y) = 1 - P(X | Y)
        P(X|Y) = P(D2=sunny | D1=rainy) = 0.6, and rainy = ~sunny
        P(D2=rainy | D1=rainy) = 1 - 0.6 = 0.4

quiz 3:
    P(D2=sunny) = ?
    we know:
        a. P(D1=sunny) = 0.9
        b. P(D1=rainy) = 1 - 0.9 = 0.1
        c. P(D2=sunny | D1=sunny) = 0.8
        d. P(D2=sunny | D1=rainy) = 0.6
    so:
        P(D2=sunny) = c * a + d * b
                    = P(D2=s | D1=s) * P(D1=s) + P(D2=s | D1=r) * P(D1=r)
                    = 0.8 * 0.9 + 0.6 * 0.1
                    = 0.72 + 0.06 = 0.78
    the probability that D2=sunny is (the prob that D2=s given D1=s) times (the prob that D1=s)
    plus (the prob that D2=s given D1=r) times (the prob that D1=r)

Then, using the same dynamics (replace D1,D2 with D2,D3):
    P(D3=sunny) = ?
    we know:
        a. P(D2=sunny) = 0.78
        b. P(D2=rainy) = 1 - 0.78 = 0.22
        c. P(D3=sunny | D2=sunny) = 0.8   (note: these stay the same
        d. P(D3=sunny | D2=rainy) = 0.6    from day to day!!!)
    so:
        P(D3=sunny) = c * a + d * b
                    = P(D3=s | D2=s) * P(D2=s) + P(D3=s | D2=r) * P(D2=r)
                    = 0.8 * 0.78 + 0.6 * 0.22
                    = 0.624 + 0.132 = 0.756
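Not from the lecture -- a small Python sketch of the total-probability rule applied to the
weather quizzes, folding the same day-to-day transition probabilities forward one day at a time
(constant and function names are mine):

    P_SUNNY_GIVEN_SUNNY = 0.8   # P(D(n+1)=sunny | Dn=sunny)
    P_SUNNY_GIVEN_RAINY = 0.6   # P(D(n+1)=sunny | Dn=rainy)

    def next_day_sunny(p_sunny_today):
        """Total probability: P(next=s) = P(next=s|s)*P(s) + P(next=s|r)*P(r)."""
        p_rainy_today = 1.0 - p_sunny_today
        return (P_SUNNY_GIVEN_SUNNY * p_sunny_today
                + P_SUNNY_GIVEN_RAINY * p_rainy_today)

    p_d1 = 0.9                      # P(D1=sunny)
    p_d2 = next_day_sunny(p_d1)     # 0.78
    p_d3 = next_day_sunny(p_d2)     # 0.756
    print(p_d2, p_d3)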
Videos 3.13-3.16  Cancer

Probability of having (or not having) this kind of cancer:
    P(C)  = 0.01
    P(~C) = 0.99
Probability of getting a positive or negative test result if you have this cancer:
    P(+ | C) = 0.9
    P(- | C) = 0.1
Probability of the test incorrectly being positive (false positives):
    P(+ | ~C) = 0.2
and conversely, correctly being negative (true negatives):
    P(- | ~C) = 0.8

To start with, what are the "joint probabilities" of:

1. positive test and having cancer -- true positive
    P(+, C) = 0.009
    explain: P(+ | C) * P(C) = 0.9 * 0.01 = 0.009
    [prob of + test given you have cancer, times prob of having cancer]

2. negative test and having cancer -- false negative
    P(-, C) = 0.001
    explain: P(- | C) * P(C) = 0.1 * 0.01 = 0.001
    [prob of - test given you have cancer, times prob of having cancer]

3. positive test and NOT having cancer -- false positive
    P(+, ~C) = 0.198
    explain: P(+ | ~C) * P(~C) = 0.2 * 0.99 = 0.198
    [prob of + test given you don't have cancer, times prob of not having cancer]

4. negative test and NOT having cancer -- true negative
    P(-, ~C) = 0.792
    explain: P(- | ~C) * P(~C) = 0.8 * 0.99 = 0.792
    [prob of - test given you don't have cancer, times prob of not having cancer]

So, what's the probability that you have the cancer if you get a positive test?
    P(C | +) = 0.043
    explain: there are two ways to get a positive test:
        true positives:  P(+, C)  = 0.009
        false positives: P(+, ~C) = 0.198
    take the ratio of true positives to total positives:
        0.009 / (0.009 + 0.198) ~= 0.043
    note that the total, 0.207, is P(+)

Interesting point -- because the PRIOR probability of having cancer is so small (0.01),
the chance of a false positive test (0.198) is _much_ higher than the chance of a true
positive test (0.009), so a positive test only slightly raises the POSTERIOR probability (0.043).

Unit 3.7  Video 3.17  Bayes Rule -- Rev. Thomas Bayes, 18th century

    --->>>  P(A|B) = P(B|A) * P(A) / P(B)  <<<---

    P(A|B) -- posterior
    P(B|A) -- likelihood
    P(A)   -- prior
    P(B)   -- marginal likelihood

A is the cause -- cancer; B is some evidence -- a test result.

    P(A|B), "the posterior", is the "diagnostic direction" --
        we want to know how likely the cause is, given the evidence
    equals
    P(B|A), "the likelihood", the "causal direction" --
        how likely is the evidence, given the cause?
    times
    P(A), "the prior" -- how likely is the cause?
    divided by
    P(B), "the marginal likelihood" -- how likely is the evidence?

note: for P(B) see "total probability" above:
    P(B) = {sum over a of} P(B | A=a) * P(A=a)

In the cancer case above:
    P(C | +) = P(+ | C) * P(C) / P(+)
    the probability of having cancer given a positive test
    equals the probability of a positive test given you have cancer
    times the probability of having cancer
    divided by the probability of a positive test

To repeat the numbers from above:
    a. P(C)      = 0.01  -- prior
    b. P(~C)     = 0.99  -- negation of the prior
    c. P(+ | C)  = 0.9   -- likelihood
    d. P(+ | ~C) = 0.2   -- likelihood for the negated cause
and:
    e. P(+) -- "marginal likelihood" or "total probability"
       = c * a + d * b
       = P(+|C) * P(C) + P(+|~C) * P(~C)
       = 0.9 * 0.01 + 0.2 * 0.99
       = 0.009 + 0.198 = 0.207
so:
    P(C|+) -- posterior
       = c * a / e
       = 0.9 * 0.01 / 0.207
       = 0.043   (!!! the same value we got before !!!)

Unit 3.7a  Video 3.18  Bayes Network -- Bayes Rule Graphically

    (A) ---> (B)    where A is the cause and B is the effect (A=cancer, B=test result)

A is not observable, but B is.
we know:
    P(A) -- the probability of the cause, cancer = 1%
    P(B|A) and P(B|~A) -- the probability of the effect given each value of the cause

Causal reasoning: P(B|A) and P(B|~A) -- how likely is the effect, given the cause?
we want to know:
Diagnostic reasoning: P(A|B) and P(A|~B) -- how likely is the cause, given the effect?

There are 3 parameters: P(A), P(B|A), P(B|~A)

see image: bayesrule.jpg
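Not from the lecture -- a minimal Python sketch of the single-test cancer calculation:
Bayes rule with the marginal P(+) computed by total probability (variable names are mine):

    p_c        = 0.01   # prior P(C)
    p_pos_c    = 0.9    # likelihood P(+|C)
    p_pos_notc = 0.2    # false-positive rate P(+|~C)

    p_pos = p_pos_c * p_c + p_pos_notc * (1 - p_c)   # total probability P(+) = 0.207
    p_c_given_pos = p_pos_c * p_c / p_pos            # posterior P(C|+) ~= 0.043

    print(p_pos, p_c_given_pos)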
More Complex Bayes Networks

Bayes Rule (again):
    P(A|B) = P(B|A) * P(A) / P(B)
    P(B|A) -- likelihood
    P(A)   -- prior
    P(B)   -- marginal likelihood

P(B|A) * P(A) -- the likelihood and prior -- are easy to compute.
P(B) -- the marginal likelihood -- is not always so easy, but at least it's just
a function of B (no A's involved). P(B) is called the "normalizer".

Computing Bayes Rule -- using "normalization"

we can find the complementary event, not-A given B:
    P(~A|B) = P(B|~A) * P(~A) / P(B)
and we know that the two need to add to 1:
    P(A|B) + P(~A|B) = 1
leave out the P(B) normalizer in both to get "pseudo-probabilities",
i.e. P' is not a "real" probability at this point:
    P'(A|B)  = P(B|A)  * P(A)
    P'(~A|B) = P(B|~A) * P(~A)
to get a real probability, P' can be multiplied by some normalizer 'a'
(...he uses eta, the book uses alpha...):
    realP  = a * P'(A|B)
    real~P = a * P'(~A|B)
and 'a' is one over the sum of the two P' values (because they eventually need to add up to 1):
    a = 1 / ( P'(A|B) + P'(~A|B) )
    (note: 'a' = 1/P(B) as well!)

Unit 3.8a  Videos 3.20-3.21  Two Test Cancer

Two tests, with a net like:

      C
     / \
   T1   T2

with the same probabilities for either test:
    priors            negations (1 - prior)
    P(C)    = 0.01    P(~C)   = 0.99
    P(+|C)  = 0.9     P(-|C)  = 0.1
    P(-|~C) = 0.8     P(+|~C) = 0.2

What is the probability that you have cancer if both tests are positive?
    P(C | T1=+, T2=+) = P(C | ++) = 0.1698
because you multiply the probabilities:
    P(C|++)  = P(C)  * P(T1=+|C)  * P(T2=+|C)  / P(++)
    P(~C|++) = P(~C) * P(T1=+|~C) * P(T2=+|~C) / P(++)
and doing it with normalization avoids needing P(++);
using P(C)=0.01, P(~C)=0.99, P(+|C)=0.9, P(+|~C)=0.2:

         prior * T1+ * T2+ =  P'     / a      = P(C|++)
    C    0.01  * 0.9 * 0.9 =  0.0081 / 0.0477 = 0.1698
    ~C   0.99  * 0.2 * 0.2 =  0.0396 / 0.0477 = 0.8301
                      total:  0.0477   -- note: this is the joint P(++)

What is the probability that you have cancer if one test is + and one -?
    P(C | T1=+, T2=-) = P(C | +-) = 0.0056
using P(C)=0.01, P(+|C)=0.9, P(+|~C)=0.2, P(-|C)=0.1, P(-|~C)=0.8:

         prior * T1+ * T2- =  P'     / a      = P(C|+-)
    C    0.01  * 0.9 * 0.1 =  0.0009 / 0.1593 = 0.0056
    ~C   0.99  * 0.2 * 0.8 =  0.1584 / 0.1593 = 0.9943
                      total:  0.1593
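Not from the lecture -- a small Python sketch of the normalization trick for the two-test
quizzes: compute the un-normalized pseudo-probabilities for C and ~C, then divide by their
sum instead of computing P(evidence) directly (names and structure are mine):

    from math import prod

    p_c = 0.01
    likelihood = {"+": {"C": 0.9, "~C": 0.2},   # P(test result | C), P(test result | ~C)
                  "-": {"C": 0.1, "~C": 0.8}}

    def posterior_cancer(*tests):
        """P(C | test results), assuming the tests are conditionally independent given C."""
        pseudo_c    = p_c       * prod(likelihood[t]["C"]  for t in tests)
        pseudo_notc = (1 - p_c) * prod(likelihood[t]["~C"] for t in tests)
        return pseudo_c / (pseudo_c + pseudo_notc)   # normalize: divide by the total

    print(posterior_cancer("+", "+"))   # ~0.1698
    print(posterior_cancer("+", "-"))   # ~0.0056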
Unit 3.9  Videos 3.22-3.23  Conditional Independence

For the network above, C is the "hidden variable":

      C
     / \
   T1   T2

and T1 and T2 are assumed to be "conditionally independent", meaning:
if we know C, also knowing T1 will not change P(T2), i.e.:
    P(T2 | C, T1) = P(T2 | C)
    the probability of T2 given C and T1 equals the probability of T2 given just C

In the network, the directed arcs from C to the Tn's "cut off" the Tn's from each
other, and they are "conditionally independent".
T2 is conditionally independent of T1 _only_ if we actually know C.

For this directed network:

      A
     / \
    B   C

given A, then B and C are "conditionally independent":  B T C | A
(the upside-down T is "independence"... so B T C is "absolute independence"
and B T C | A is "conditional independence")

If you don't know A, can they still be conditionally independent?
NO -- because knowing B gives you information about A, which in turn influences
the results for C.

Given the same cancer net and probabilities, what is:
    P(T2=+ | T1=+)
the probability that T2 will be positive if we know T1 was positive and that
C is the parent of both?

This is the "total probability" of T2=+ given T1=+, i.e. add up the
probabilities of T2=+ for C and ~C, conditioned on T1=+:
    P(T2=+ | T1=+, C)  * P(C | T1=+)
  + P(T2=+ | T1=+, ~C) * P(~C | T1=+)

Due to conditional independence we can remove T1=+ from the first factor of each
term -- given we know C, knowledge of T1 gives no more information about T2 --
and use the likelihood values above:
    P(T2=+ | T1=+, C)  reduces to P(T2=+ | C)  = 0.9
    P(T2=+ | T1=+, ~C) reduces to P(T2=+ | ~C) = 0.2
and we already did the Bayes calculation of the second factors:
    P(C | T1=+)  = 0.043
    P(~C | T1=+) = 0.957
so we get:
    P(T2=+|C) * P(C|T1=+) + P(T2=+|~C) * P(~C|T1=+)
    = 0.9 * 0.043 + 0.2 * 0.957
    = 0.2301
Before, the total probability of getting a positive test was 0.207,
so the probability that the second test is + is now slightly higher.

Unit 3.9d  Video 3.24  Absolute and Conditional Independence

(note: using 'T' instead of the upside-down version in the video)

For this directed network (note: the video swapped the letters around from before...):

      C
     / \
    A   B

Does absolute independence imply conditional independence?  NO!
    A T B  does not imply  A T B | C
    even with absolute independence, things might not be conditionally independent
    (...explained later... see Conditional Dependence below)

Does conditional independence imply absolute independence?  NO!
    A T B | C  does not imply  A T B
    because C is the "intermediary" which influences both results;
    from above: knowing A gives you information about C, and knowing C in turn
    influences the results of B.

Unit 3.10  Video 3.25  Confounding Cause (a different type of Bayes network)

For this directed network:

    S   R
     \ /
      H

two independent hidden causes are confounded in one observation -- Sunny, Raise -> Happy.

example values:
    P(S) = 0.7
    P(R) = 0.01
and:
    P(H |  S,  R) = 1.0
    P(H | ~S,  R) = 0.9
    P(H |  S, ~R) = 0.7
    P(H | ~S, ~R) = 0.1
("a perfectly fine specification of a probability distribution"
 schip note: we have the prob of each cause and the conditional prob of the
 result for every combination of causes)

What's the probability of a Raise given that it's Sunny?
    P(R | S) = ??? = 0.01
I didn't get an explanation video, but I presume it's because R and S are
defined as absolutely independent events...

Unit 3.11  Videos 3.26-3.28  Explaining Away

If we know that we are happy, then sunny weather can "explain away" the cause
of the happiness. If we also know that it is sunny, then it's less likely that
we received a raise.
Or: if you see an effect that has multiple causes, then seeing one of the
causes can "explain away" the other potential causes.

Using the same network and probabilities as above --
What's the probability of a raise given that I'm happy and it's sunny?
    P(R | H, S) = ??? = 0.0142

Using Bayes Rule, the above inverts to:
    (...I don't understand this at all...)
    P(H | R, S) * P(R | S) / P(H | S)
    the probability of Happy given Raise and Sunny
    times the probability of Raise given Sunny
    divided by the total probability of Happy given Sunny
then we can change P(R|S) to P(R) because they are independent,
and expand P(H|S) to -----ugh---- the total probability:
    P(H|R,S) * P(R) + P(H|~R,S) * P(~R)
so:
    P(H|R,S) * P(R) / ( P(H|R,S) * P(R) + P(H|~R,S) * P(~R) )
    = 1.0 * 0.01 / ( 1.0 * 0.01 + 0.7 * 0.99 )
    ~= 0.0142
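Not from the lecture -- a small Python sketch of the explaining-away calculation for the
Sunny/Raise -> Happy network: condition on S, then normalize over R (names are mine):

    p_s = 0.7
    p_r = 0.01
    p_h = {( True,  True): 1.0,   # P(H |  S,  R)
           (False,  True): 0.9,   # P(H | ~S,  R)
           ( True, False): 0.7,   # P(H |  S, ~R)
           (False, False): 0.1}   # P(H | ~S, ~R)

    def p_raise_given_happy_and(sunny):
        """P(R | H, S=sunny): S and R are independent a priori, so P(R|S) = P(R)."""
        num   = p_h[(sunny, True)] * p_r                   # P(H | S=sunny,  R) * P(R)
        denom = num + p_h[(sunny, False)] * (1 - p_r)      # + P(H | S=sunny, ~R) * P(~R)
        return num / denom

    print(p_raise_given_happy_and(True))    # ~0.0142  (sunny explains the happiness away)
    print(p_raise_given_happy_and(False))   # ~0.0833  (not sunny, so a raise is more likely)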
Then... what's the probability of a raise given only that I'm happy?
    P(R | H) = ??? = 0.0185

To do it, use Bayes Rule to invert the equation:
    P(R | H) = P(H | R) * P(R) / P(H)

Calculate the total probability of happiness, P(H) --
i.e. sum over all four combinations of parent values of H
(note: P(S,R) = P(S) * P(R) = probability of (S and R), since S and R are independent):
    P(H) = P(H| S, R) * P( S, R)
         + P(H|~S, R) * P(~S, R)
         + P(H| S,~R) * P( S,~R)
         + P(H|~S,~R) * P(~S,~R)
         = 1.0 * (0.7 * 0.01)
         + 0.9 * (0.3 * 0.01)
         + 0.7 * (0.7 * 0.99)
         + 0.1 * (0.3 * 0.99)
         = 0.5245

Calculate the total probability of happiness given a raise, P(H|R) --
i.e. sum over the two values of the other parent, S:
    P(H|R) = P(H| S, R) * P( S)
           + P(H|~S, R) * P(~S)
           = 1.0 * 0.7 + 0.9 * 0.3
           = 0.97

Plug the numbers into the Bayes inversion equation:
    P(R | H) = P(H | R) * P(R) / P(H)
             = 0.97 * 0.01 / 0.5245
             = 0.0185

The lesson:
    P(R|H,S) = 0.0142
    P(R|H)   = 0.0185
If you know ST is happy AND it's sunny, the sunny part "explains away" the happy
result, so it's LESS likely he got a raise. But if you don't know whether it's
sunny (and he's still happy), then it's more likely he got the raise...

One last question -- what's the probability of a raise given happy and NOT sunny?
    P(R|H,~S) = ??? = 0.0833
By Bayes inversion this is:
    P(H|R,~S) * P(R|~S) / P(H|~S)
then we can change P(R|~S) to P(R) because they are independent,
and expand P(H|~S) to the total probability:
    P(H|R,~S) * P(R) + P(H|~R,~S) * P(~R)
so:
    P(H|R,~S) * P(R) / ( P(H|R,~S) * P(R) + P(H|~R,~S) * P(~R) )
    = 0.9 * 0.01 / ( 0.9 * 0.01 + 0.1 * 0.99 )
    ~= 0.0833

Unit 3.11f  Video 3.29  Conditional Dependence

From before:
    P(R | H, S)  = 0.0142 -- if he's happy and it's sunny, the prob of having
                             gotten a raise is only slightly higher
    P(R | H, ~S) = 0.0833 -- if he's happy and it's NOT sunny, the prob of having
                             gotten a raise is much higher
So... in the given Bayes net, S and R are independent, but H adds a dependence
between them.

Repeating, for this directed network:

    S   R
     \ /
      H

    P(R|H,S)  = 0.0142
    P(R|S)    = P(R) = 0.01
    P(R|H,~S) = 0.0833

In the absence of H, R and S are independent:  R T S
But if you know about H, then R and S become dependent:
    P(R|H,S) != P(R|H,~S)   -- 0.0142 != 0.0833
    given H, varying the value of S affects the probability of R

    --> So: independence does NOT imply conditional independence. <--
    see the answer in Unit 3.9d, Video 3.24

Unit 3.12  Videos 3.30-3.33  General Bayes Networks

(note that the parameter calculations below only work for binary variables;
 it's more complicated, as in the final exam, with more states.
 I think the formula is: InStates * (OutStates - 1),
 because you would normally have (InStates * OutStates) combinations of
 input/output probabilities, but since each row must sum to 1 you can drop
 one element from each output table...
 see cloudyCPT.jpg for the full tables)

Bayes networks define distributions over graphs of random variables.
Instead of enumerating all possible combinations of the variables,
the network is defined by probabilities that are local to each node:

    A   B        P(A), P(B)       -- one parameter each
     \ /
      C          P(C|A,B)         -- four parameters
     / \
    D   E        P(D|C), P(E|C)   -- two parameters each

The joint probability represented by a Bayes network is the product, over all
nodes, of each node's probability conditioned only on its incoming arcs:

    P(A,B,C,D,E) = P(A) * P(B) * P(C|A,B) * P(D|C) * P(E|C)
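Not from the lecture -- a Python sketch of that factored joint for the five-node network
above; the CPT numbers are made up purely to show the shape (all names and values are mine):

    from itertools import product

    p_a = 0.3                                          # P(A=True), assumed
    p_b = 0.6                                          # P(B=True), assumed
    p_c = {(True, True): 0.9, (True, False): 0.5,
           (False, True): 0.4, (False, False): 0.1}    # P(C=True | A, B), assumed
    p_d = {True: 0.8, False: 0.2}                      # P(D=True | C), assumed
    p_e = {True: 0.7, False: 0.3}                      # P(E=True | C), assumed

    def bern(p_true, value):
        """Probability of a binary variable taking `value` when P(True) = p_true."""
        return p_true if value else 1.0 - p_true

    def joint(a, b, c, d, e):
        """P(A,B,C,D,E) = P(A) * P(B) * P(C|A,B) * P(D|C) * P(E|C)."""
        return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
                * bern(p_d[c], d) * bern(p_e[c], e))

    # sanity check: the factored joint sums to 1 over all 2^5 assignments
    print(sum(joint(*vals) for vals in product([True, False], repeat=5)))   # 1.0 (up to float rounding)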
The advantage is that this reduces the number of values needed to spec the full
joint probability: 5 binary variables need 2^5 - 1 = 31 values to spec every
combination, but the Bayes net shown needs only 10:

    P(A)     -- 1   (single probability; T/F sums to 1)
    P(B)     -- 1
    P(C|A,B) -- 4   (A and B can each be T or F, so 4 parent combinations)
    P(D|C)   -- 2   (C can be T or F, so 2 parent combinations)
    P(E|C)   -- 2

So... using Bayes nets you get a representation that scales significantly
better for large networks.  !!Key Advantage!!

Quiz: how many values are needed to spec this Bayes net?  13

        A            P(A)                    -- 1
       /|\
      B C D          P(B|A), P(C|A), P(D|A)  -- 2+2+2 = 6
      |  \|
      E   F          P(E|B), P(F|C,D)        -- 2+4   = 6

Any (boolean) variable that has K inputs needs 2^K values.

Video 3.33  Value of a Network

Quiz: how many values are needed to spec this Bayes net?  19

    A  B  C          P(A), P(B), P(C)          -- 1+1+1 = 3
     \ | /|
       D  |          P(D|A,B,C)                -- 2^3   = 8
      / | \|
     E  F  G         P(E|D), P(F|D), P(G|C,D)  -- 2+2+4 = 8

Quiz: how many parameters in the Bayes net from the beginning?  47
    ...note that the full joint distribution would take 2^16 - 1 = 65535 parameters...
    see image: bayesnet_intro.jpg

    row 1: 3 nodes w/ 0 in                  = 3
    row 2: 1 w/ 1 in, 1 w/ 2 in             = 6
    row 3: 1 w/ 1 in, 1 w/ 2 in, 4 w/ 0 in  = 10
    row 4: 2 w/ 1 in, 2 w/ 2 in, 1 w/ 4 in
           2*2^1 + 2*2^2 + 1*2^4            = 28

Unit 3.13  Videos 3.34-3.36  D-Separation, or Reachability

Quiz, for this Bayes network:

      A
     / \
    B   D
    |   |
    C   E

    is C independent of A?              no  -- A influences C by way of B
    is C ind of A given B  (C T A | B)? yes -- knowing B makes A not matter
    is C independent of D?              no  -- A influences both C and D
    is C ind of D given A  (C T D | A)? yes -- knowing A makes D not matter
    is E ind of C given D  (E T C | D)? yes -- knowing D makes E not matter

Rule: two variables are independent if they are not linked by just unknowns
(variables are independent if they are linked only through a known variable).
schip: a known variable on a direct path between variables cuts off the link
between the two sides -- it makes the link INACTIVE and the variables INDEPENDENT.
In the example: if you know B, everything "downstream" of B is independent of
everything upstream, so C is ind of A given B, and E is ind of C given B --
it works both ways... but knowing B doesn't make A and E independent.

Quiz, for this Bayes network:

    A   B
     \ /
      C
     / \
    D   E

    is A ind of E?          no  -- only unknowns in between
    is A ind of E given B?  no  -- only unknowns in between
    is A ind of E given C?  yes -- knowing C makes A not matter
    is A ind of B?          yes -- no incoming arcs
    is A ind of B given C?  no  -- "the explaining-away effect", conditional dependence
        (see the sunny/raise/happy example above): if we know A can cause C,
        then it's less likely that B caused C; and vice versa, if A is false
        and C is true, then B is more likely.

D-separation is the general study of conditional independence in Bayes networks.

    active triplets:   make variables dependent
    inactive triplets: make variables independent

For a chain of variables (note that B is in the middle):

    A -> B -> C

or for this directed tree of variables (note that B is at the top, in the middle):

      B
     / \
    A   C

    active:   if the value of B is unknown, then A and C are dependent
    inactive: if the value of B is known, then A and C are independent
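Not from the lecture -- a small Python check of the chain triplet A -> B -> C: with B
unknown, A and C are dependent; given B, they are independent. The CPT numbers are made up
purely for illustration, and all names are mine:

    from itertools import product

    p_a = 0.3                        # P(A=T), assumed
    p_b = {True: 0.9, False: 0.2}    # P(B=T | A), assumed
    p_c = {True: 0.7, False: 0.1}    # P(C=T | B), assumed

    def bern(p_true, v):
        return p_true if v else 1.0 - p_true

    def joint(a, b, c):
        # chain factorization: P(A,B,C) = P(A) * P(B|A) * P(C|B)
        return bern(p_a, a) * bern(p_b[a], b) * bern(p_c[b], c)

    def prob(event):
        return sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3)
                   if event(a, b, c))

    def cond(event, given):
        return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

    # B unknown: P(C|A=T) != P(C|A=F)  -->  A and C are dependent
    print(cond(lambda a, b, c: c, lambda a, b, c: a),            # 0.64
          cond(lambda a, b, c: c, lambda a, b, c: not a))        # 0.22

    # B known: P(C|A=T,B=T) == P(C|A=F,B=T)  -->  A and C are independent given B
    print(cond(lambda a, b, c: c, lambda a, b, c: a and b),          # 0.7
          cond(lambda a, b, c: c, lambda a, b, c: (not a) and b))    # 0.7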
But for this directed tree of variables, conditional dependence reverses it
(note that B is at the bottom, in the middle):

    A   C
     \ /
      B

    active:   if the value of B is KNOWN, then A and C are dependent
    inactive: if the value of B is UNKNOWN, then A and C are independent

AND for this directed tree of variables (note that B and D are at the bottom, in the middle):

    A   C
     \ /
      B
      |
      D

If we know D, then we don't need to know B: if we know a successor of B, we
don't have to know B itself, because we can get knowledge of B from its successors.

    active:   if the value of _D_ is KNOWN, then A and C are dependent
    inactive: if the value of _D_ is UNKNOWN, then A and C are independent

see: bayes_independence.jpg

Quiz: see bayes_indQuiz.jpg for the network

    is F ind of A?          yes -- A & F both depend on D, but we don't know D
    is F ind of A given D?  no  -- we know D, the successor of B & E
    is F ind of A given G?  no  -- we know G, the successor of D
    is F ind of A given H?  yes -- no known variables in between

Video 3.37  Congratulations!

main points:
    Graph structure of Bayes networks
    Compact representation
    Conditional independence

...He hopes we enjoyed the unit... oy...