Unit 4 Probabilistic Inference

Video 4.1 Overview and Example
    Inference -- how to answer probability questions using Bayes Networks

    Simple alarm model:
            B   E       Burglary, Earthquake
             \ /
              A         Alarm
             / \
            J   M       JohnCalls, MaryCalls

    Inputs:  B and E -- Evidence Variables -- known values
    others:  A       -- Hidden Variables   -- unknown value, but used in the computation
    Outputs: J and M -- Query Variables    -- computed values

    The result of a query is a joint probability DISTRIBUTION:
        the posterior distribution given the evidence --
        P( Q1, Q2 ... | E1=e1, E2=e2 ... )

    Another question: what is the most likely output?
        Out of all possible query value combinations, which has the highest probability?
        argmax P( Q1=q1, Q2=q2 ... | E1=e1 ... )

    Bayes Nets can go both ways --
        Causal     -- given the evidence (B,E), what are the outputs (J,M)
        Diagnostic -- given the outputs (J,M), what is the evidence (B,E)
        or any combination thereof -- putting arbitrary variables in the
        evidence and query positions

    Quiz: Mary has called to report that the alarm is going off and we want to
          know if there has been a burglary... What roles do the nodes take?
        Evidence nodes: M
        Query nodes:    B
        Hidden nodes:   J, A, E

Video 4.2 Enumeration
    -- goes through all the possibilities and adds them up
    (notation: +B is B=true, ~B is B=false, to avoid confusion...)

    State the problem: P(+B | +J, +M)
        the probability of a burglary given that both John and Mary called

    Expand using conditional probability -- P(Q|E) = P(Q,E) / P(E):
        P(+B, +J, +M) / P(+J, +M)

    Calculate a term by expanding over all possibilities of the hidden nodes (E,A):
        sumE ( sumA P(+B, E, A, +J, +M) )

    Expand each element in terms of the parents of each node, making it a
    product of factors, enumerated over the hidden variables:
        sumE ( sumA ( P(+B) * P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
        (so: B,E have no parents, just a priori probabilities: +B, and enumerate over E
             A has parents B,E: enumerate over A and E, with B fixed at +B
             J,M have parent A: enumerate over A )

    Calling the whole enchilada above f(E,A), then enumerate all values of E,A:
        f(+E,+A) + f(+E,~A) + f(~E,+A) + f(~E,~A)

    Values for each term come from the conditional probability model,
    see: alarmCPT.jpg

    Quiz: fill in the numbers for the term where +E and +A
        P(+B) * P(+E) * P(+A|+B,+E) * P(+J|+A) * P(+M|+A)
        .001  * .002  * .95         * .9       * .7        = .000001197
        look up the probs in each table where both E and A are true,
        then multiply them all together.
        Also remember that to get a real probability the terms have to be
        summed over each value of E,A and then normalized by P(+J,+M).

    The final probability: P(+B | +J, +M) = 0.284
        because the probabilities of J, M, A are all pretty high,
        but the prior P(+B) is very low
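    A minimal sketch of this enumeration in Python. The only CPT numbers quoted
    above are .001, .002, .95, .9, .7; the remaining entries below (P(+A|+B,~E),
    P(+A|~B,+E), P(+A|~B,~E), P(+J|~A), P(+M|~A)) are assumed to be the standard
    alarm-network values from the alarmCPT.jpg table:

    import itertools

    # Alarm-net CPTs: values not quoted in the notes are assumed from alarmCPT.jpg.
    P_B = {True: 0.001, False: 0.999}
    P_E = {True: 0.002, False: 0.998}
    P_A = {(True, True): 0.95, (True, False): 0.94,      # P(+A | B, E)
           (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}                      # P(+J | A)
    P_M = {True: 0.70, False: 0.01}                      # P(+M | A)

    def f(b, e, a):
        """One enumeration term: P(b) * P(e) * P(a|b,e) * P(+J|a) * P(+M|a)."""
        p_a = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
        return P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]

    hidden = list(itertools.product([True, False], repeat=2))   # all (e, a) pairs

    # Sum out the hidden variables E and A for +B and for ~B,
    # then normalize by P(+J,+M) = P(+B,+J,+M) + P(~B,+J,+M).
    p_b_jm  = sum(f(True,  e, a) for e, a in hidden)
    p_nb_jm = sum(f(False, e, a) for e, a in hidden)
    print(round(p_b_jm / (p_b_jm + p_nb_jm), 3))   # -> 0.284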
Video 4.3-4.7 Speeding Up Enumeration

    Pulling out terms -- starting with this:
        sumE ( sumA ( P(+B) * P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    P(+B) is a constant, so it can be pulled out of the summations:
        P(+B) * sumE ( sumA ( P(E) * P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    P(E) can be pulled out of the inner summation because it doesn't depend on A:
        P(+B) * sumE ( P(E) * sumA ( P(A|+B,E) * P(+J|A) * P(+M|A) ) )
    That reduces the inner summation from 5 terms to 3, but there are still
    the same number of rows to calculate.

    Maximize Independence --
        a linear net can do inference in O(n) time:   X1 -> X2 -> ... -> XN
        a "complete" net, where every node points to every other node,
        takes O(2^n) time (for boolean variables)

    Quiz: with two separate nodes, are they Dependent or Independent?
            J     M
        They are DEPENDENT, because having information about one affects
        what we can know about the other.

    Quiz: adding the A node, is it Dependent or Independent of J and M?
            J -> M
             \  /
              A
        It is DEPENDENT on both, because having information about J,M makes A
        more likely (intuitively?) -- but also look at the CPTs...

    Quiz: adding the B node, is it Dependent or Independent of A,J,M?
            J -> M
             \  /
              A     B
        It is DEPENDENT on A, and independent of J,M GIVEN A.

    Quiz: adding the E node, is it Dependent or Independent of B,A,J,M?
            J -> M
             \  /
              A
             /
            B       E
        It is DEPENDENT on A and B:
            A, because if E is true it's more likely that A is true
            B, because if B is true it's more likely that A is true,
               and thus E is less likely

            J -> M
             \  /
              A
             / \
            B -> E

    Causal Direction --
        "The moral is that Bayes nets are more compact when they go in the
        Causal Direction, from Causes to Evidence," as in the first diagram.

Video 4.8-4.11 Variable Elimination
    -- combine parts of the network and then enumerate over the smaller parts;
       compute by "marginalizing out" some variables.
    "If we make a good choice of the order of elimination, then it can be much
     more efficient than doing the whole enumeration."
    In general, Bayes net inference is NP-hard.

    We need an algebra for manipulating factors; a factor is a multi-dimensional
    array of probabilities. Each CPT is a factor.

        R -> T -> L     (Rain, Traffic, LateToAppointment)
        see CPTs: rain_lateCPT.jpg

    Joining Factors -- combining 2 or more CPTs to make a Joint Probability
    Table (JPT). Example, joining P(R) and P(T|R): point-wise multiply
    P(R) * P(T|R) for each value combination to get -- see: rain_lateCPT_JF.jpg
        P(R,T)
        +R +T   (.1 * .8) = 0.08
        +R ~T   (.1 * .2) = 0.02
        ~R +T   (.9 * .1) = 0.09
        ~R ~T   (.9 * .9) = 0.81
    making a net like this (note that the entries still add to 1):
        RT -> L

    Marginalization -- reduces the JPT back to a simple probability by summing
    out the intermediate variable: add the probabilities where T has the same
    value, + or ~ (still adds to 1):
        [new] P(T)
        +T   (.08 + .09) = 0.17
        ~T   (.02 + .81) = 0.83
    making a net like this (still adds to 1):
        T -> L

    Do it again, with the new P(T) and the old P(L|T)...
    Joining Factors (see rain_lateCPT_JF2.jpg for results):
        P(T,L)
        +T +L   (.17 * .3) = 0.051
        +T ~L   (.17 * .7) = 0.119
        ~T +L   (.83 * .1) = 0.083
        ~T ~L   (.83 * .9) = 0.747
    making a net like this (still adds to 1):
        T,L
    Marginalizing to:
        [new] P(L)
        +L   (.051 + .083) = 0.134
        ~L   (.119 + .747) = 0.866
    making a net like this (still adds to 1):
        L
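    A minimal sketch of these two join/marginalize steps in Python, using only
    the numbers worked out above (P(+R)=.1, P(+T|+R)=.8, P(+T|~R)=.1,
    P(+L|+T)=.3, P(+L|~T)=.1):

    # Factors as dicts; values taken from the worked example above.
    P_R = {'+R': 0.1, '~R': 0.9}
    P_T_given_R = {('+R', '+T'): 0.8, ('+R', '~T'): 0.2,
                   ('~R', '+T'): 0.1, ('~R', '~T'): 0.9}
    P_L_given_T = {('+T', '+L'): 0.3, ('+T', '~L'): 0.7,
                   ('~T', '+L'): 0.1, ('~T', '~L'): 0.9}

    # Join P(R) and P(T|R) into the factor P(R,T) by point-wise multiplication,
    # then marginalize out R by summing over its values.
    P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
            for r in P_R for t in ('+T', '~T')}
    P_T = {t: sum(P_RT[(r, t)] for r in P_R) for t in ('+T', '~T')}
    print(P_T)    # -> {'+T': 0.17, '~T': 0.83}  (up to float rounding)

    # Repeat with the new P(T) and the old P(L|T) to get P(L).
    P_TL = {(t, l): P_T[t] * P_L_given_T[(t, l)]
            for t in P_T for l in ('+L', '~L')}
    P_L = {l: sum(P_TL[(t, l)] for t in P_T) for l in ('+L', '~L')}
    print(P_L)    # -> {'+L': 0.134, '~L': 0.866}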
Video 4.12 Approximate Inference
    -- by means of sampling: take "samples" of the process in order to estimate
       the JPT. For instance, counting coin flips a number of times gives
       approximate results, see: samplingCoins.jpg
    Sampling can be easier than a complex inference computation, and if we
    don't know the CPTs but can simulate the process, we can still do
    sampling, but not inference.

Video 4.13 Sampling Example
    using the Cloudy model net: cloudyCPT.jpg
            C
           / \
          S   R
           \ /
            W
    Pick a value for C using its P(C).
    Then pick values for S and R using P(x|C) and the C value above;
        e.g. for P(S|C) with +C, this will give +S .1 and ~S .9 of the time.
    Then pick a value from P(W|S,R) based on the S,R values selected.

    Quiz: so... having selected +C, and then ~S and +R, what rows of P(W|S,R)
          would we use, and is it more likely that we have +W or ~W?
        Use the ~S +R rows, where +W = .9, so +W is more likely.
        see: cloudyCPT_sample.jpg
    That makes one sample, with the prior and sampled values as shown
    (+C,~S,+R => +W); then repeat for a new sample.

Video 4.14 Approximate Inference 2
    With an infinite number of samples, the sampling procedure reproduces the
    true joint probability table -- it is "Consistent".
    We can also compute any variable's CPT this way.

Video 4.15 Rejection Sampling
    What about computing a conditional probability? e.g.: P(W|~C)
    Do "Rejection Sampling" -- throw out the samples that don't match the
    desired conditions, e.g. every sample that has +C if we are only
    interested in ~C.
    This is also "Consistent".

    Problem with Rejection Sampling -- if the conditions are unlikely,
    you toss out a lot of samples...
        e.g.: for the Burglary->Alarm model, B is very unlikely,
              so P(B|+A) does a lot of useless sampling.

    Then use Likelihood Weighting -- set the values of the evidence variables,
    and sample the rest. This gives you a list like:
        ~B, +A
        ~B, +A
        +B, +A
        ~B, +A
    but it is "Inconsistent".
    That can be fixed by "assigning a probability to each sample and weighing
    them correctly".

Video 4.16-4.18 Likelihood Weighting
    is Consistent (when done right... haha)
    Collect samples with the constrained (evidence) variables fixed, but add a
    probabilistic weight to each sample: sample as before; for constrained
    variables, use only the fixed value and collect its probability given the
    other variables' values; when done, multiply the collected probs to get a
    weight for that sample.

    For P(R|+S,+W):
        select +/~C, say +C this time
        set +S; save the P(+S|C) value in the weight, i.e. P(+S|+C) = 0.1
        select +/~R, say +R this time
        set +W; save the P(+W|S,R) value in the weight, i.e. P(+W|+S,+R) = 0.99
        when done, multiply the saved values to get the weight for the sample
        +C,+S,+R,+W: (.1 * .99) = 0.099
    see: likeyWeight.jpg

    Doesn't always work well -- say we want P(C|+S,+R):
        we fix the values of +S and +R, which almost always produces a "good" W,
        but that has no effect on C, which will produce values at random.
        Often the C values will "not go well with the evidence" --
        they will have a low associated probability... the weights are low.

Video 4.19 Gibbs Sampling -- Josiah Gibbs
    Uses all the evidence, via Markov Chain Monte Carlo: "MCMC".
    Sample one variable at a time, conditioned on all the others.
    Randomly initialize values for the set of variables, say:
        +C +S ~R ~W     as the first sample
    Then on each iteration, pick one non-evidence variable and resample its
    value based on all the other (fixed) variable values:
        +C ~S ~R ~W     giving another sample
    Lather, rinse, repeat.
    "We end up walking around in the space of assignments of variables randomly."
    "In rejection and likelihood sampling, each sample was independent of the others."
    "With MCMC the samples are dependent on each other -- in fact, adjacent
     samples are very similar."
    "However, MCMC is still Consistent."

Video 4.20 Monty Hall Problemo
    Three doors, one with a good prize.
    You choose a door, say #1.
    The host opens a door that doesn't contain the prize, say #3.
    What is the probability of winning if you switch your choice?
        1/3 if you stick, 2/3 if you switch.
    Because... here we go...
        Your first choice's original probability was 1/3... which doesn't change...
        so the second door MUST be 2/3!
    Why doesn't the original 1/3 prob hold for door #2?
        Because we learned that it wasn't the door the host opened,
        and #1 was never an option for the host to open...

Video 4.21 Monty Hall Letter -- Monty didn't understand it either...
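    A quick simulation sketch (not from the lecture) that checks the
    stick-vs-switch answer by sampling:

    import random

    def monty_trial(switch):
        """Play one game; return True if the player wins the prize."""
        doors = [1, 2, 3]
        prize = random.choice(doors)
        pick = random.choice(doors)
        # The host opens a door that is neither the player's pick nor the prize.
        host = random.choice([d for d in doors if d != pick and d != prize])
        if switch:
            pick = next(d for d in doors if d != pick and d != host)
        return pick == prize

    N = 100_000
    print(sum(monty_trial(False) for _ in range(N)) / N)   # sticking: ~1/3
    print(sum(monty_trial(True)  for _ in range(N)) / N)   # switching: ~2/3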