Probability Terminology
A brief brief on probability terminology and notation, with reference to the AIMA book.
The probability equations use symbols defined on page 244, in the Logic section of the AIMA book -- here's a scan. Unfortunately, many of those symbols are not in standard font sets (some of them appear to be in the Windows "Symbol" font but didn't reproduce correctly in my browser) and are not readily usable in source code or program console output. So, some of the operators and such have been mapped to more "logical" characters. The set used here and in the code is as follows:
- P (probability of): P(variable) = 0.6
- ~ (not or inverse): P(~variable) = 0.4
- OR (or, also "disjunct"): P(x OR y) = 0.5
- AND (and, also "conjunct"): P(x AND y) = 0.2
- , (joint): P(x,y) = 0.2
- | (given): P(x|y)
- <> (distribution): P(x) = <0.6, 0.4> or P<>(x) = <0.6, 0.4>
The domain of a variable is the set of values it can take. In most cases in this book variables are boolean, so they can have the values {true, false}, and these are apparently the default. In other cases variables may be multi-valued, e.g.:
P(weather) = {sunny, rain, cloudy, snow} = <0.6, 0.1, 0.29, 0.01>
where the {names} are the possible values and the <numbers> are their associated probabilities. For any particular value you could have this: P(weather=rain) = 0.1. By convention probabilities cover the range from 0 to 1 and each variable's set must add up to 1. The assigned probabilities, and calculations made directly with them, are called unconditioned or prior values, I suppose because they are defined before anything happens.
The logical operators are, well, logical and behave sort of as one might expect.
AND is the product of probabilities where the variables are independent -- when they have no effect on each other -- and have the specified value (the default value is true for all of the below):
P(x AND y) = P(x) * P(y)
and when not independent:
P(x AND y) = P(x) * P(y|x) = P(y) * P(x|y)
Note that when the variables are independent P(x) = P(x|y), so the second equation becomes the first.
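As a minimal sketch (plain Python with made-up numbers, not the book's or this page's code), the two forms of the product rule look like this:

```python
# Product rule for AND (conjunction). The numbers are illustrative only.
p_x = 0.6          # P(x)
p_y = 0.5          # P(y)
p_y_given_x = 0.3  # P(y|x), needed when x and y are not independent

# Independent variables: just multiply.
p_x_and_y_independent = p_x * p_y        # 0.30

# Dependent variables: scale by the conditional probability.
p_x_and_y_dependent = p_x * p_y_given_x  # 0.18

print(p_x_and_y_independent, p_x_and_y_dependent)
```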
OR is the sum where either variable has the specified value. When they are mutually exclusive -- only one can occur in any situation:
P(x OR y) = P(x) + P(y)
But when the variables are not mutually exclusive OR will count probabilities twice -- when both variables have the given values -- so "one copy" of those is subtracted from the sum:
P(x OR y) = P(x) + P(y) - P(x AND y)
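A corresponding sketch for OR (again plain Python; the numbers are made up but consistent):

```python
# OR (disjunction) via inclusion-exclusion. Illustrative numbers only.
p_x = 0.6
p_y = 0.3
p_x_and_y = 0.2   # probability that both are true

# Mutually exclusive case (P(x AND y) == 0): a simple sum would do.
# General case: subtract the doubly counted overlap.
p_x_or_y = p_x + p_y - p_x_and_y   # 0.7
print(p_x_or_y)
```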
Joint is actually a synonym for AND: it means to combine the probabilities of the provided variables by multiplying. When the variables are absolutely independent this is just the product of their probabilities. But often they will be conditioned such that the product needs to be scaled by the conditional probability.
Given means that we want to know the probability of one variable being a certain value given that (when) we already know the value of another variable. You can read the equation P(x|y) as: the probability that x is true given that y is true. This is a conditional or posterior probability, again I suppose because you find it after something happens. The value can be calculated with this identity:
P(x|y) = P(x AND y) / P(y)
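Here's that identity as a one-liner (plain Python, illustrative numbers):

```python
# Conditional probability from the identity P(x|y) = P(x AND y) / P(y).
p_x_and_y = 0.2
p_y = 0.5

p_x_given_y = p_x_and_y / p_y   # 0.4
print(p_x_given_y)
```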
Distributions are another tricky concept. The standard notation specifies that a bold P(x) means we want the whole set of possible values for the variable x -- rather than just the value where x is true -- but to make it easier on everyone the code uses:
P<>(x) = <0.6, 0.4>
to indicate that we want the distribution of probabilities. This is a bit more useful with multi-valued variables such as the P<>(weather) = <0.6, 0.1, 0.29, 0.01> example above.
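In code, a distribution can be represented as a mapping from domain values to their probabilities; a minimal sketch (plain Python) using the weather example above:

```python
# The P<>(weather) distribution from the text, as a dict keyed by domain value.
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}

# A distribution must cover the whole domain and sum to 1.
assert abs(sum(weather.values()) - 1.0) < 1e-9

print(weather["rain"])   # P(weather=rain) = 0.1
```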
Going further, the notation P(x,y) or P<>(x,y) means that we want the joint distribution of both variables, that is, the probabilities of all combinations of the values of each variable. If x and y are boolean that would be a 2x2 table like this:

          x=t    x=f
    y=t   .3     .2
    y=f   .4     .1
Note that, if x and y are the only variables in the system, the sum of
all values in the table should be 1 -- because it contains all the
possibilities -- and it would then be called a full joint distribution.
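A sketch of that table as a data structure (plain Python), with the full-joint check:

```python
# The 2x2 full joint distribution from the table above, as a nested dict:
# joint[x_value][y_value] = probability.
joint = {
    True:  {True: 0.3, False: 0.4},   # x = t
    False: {True: 0.2, False: 0.1},   # x = f
}

# A full joint distribution contains every combination, so it must sum to 1.
total = sum(p for row in joint.values() for p in row.values())
assert abs(total - 1.0) < 1e-9
```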
We can use a full joint distribution to calculate the marginal probability -- so called because folks wrote the results in the margins of the book -- of a particular variable. This would be the sum of all probabilities where the variable has a certain value. In the example above the marginal probability of x=true is .3 + .4 = .7, written as P(x) = 0.7. The marginal probabilities of each variable's values should sum to 1, so P(~x) = (.2 + .1) = .3, and .3 + .7 totals 1.
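Continuing the sketch, marginalization is just summing out the other variable:

```python
# Marginal probability of x from the same 2x2 table: sum over y.
joint = {
    True:  {True: 0.3, False: 0.4},
    False: {True: 0.2, False: 0.1},
}

p_x = sum(joint[True].values())        # P(x)  = 0.3 + 0.4 = 0.7
p_not_x = sum(joint[False].values())   # P(~x) = 0.2 + 0.1 = 0.3
assert abs(p_x + p_not_x - 1.0) < 1e-9
print(p_x, p_not_x)
```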
A further operational trick is normalization, indicated by the lower case Greek letter alpha: α. This operation allows one to avoid a bunch of calculation when evaluating a conditional probability over only part of a joint distribution. It is explained in detail on AIMA page 493, but to cut to the chase... If you have a system with more than two variables and you want to calculate a conditional probability using only a subset of the variables -- in other words you don't care about the values of some variables -- you can proceed like this (given the toothache, cavity, catch model and ignoring catch):
P<>(cavity | toothache) = α P<>(cavity, toothache)
where alpha indicates that we want to normalize the resulting distribution such that it adds up to 1. From the example full joint distribution the probability of having:
- Both a cavity and a toothache is: P(cavity AND toothache) = (.108 + .012) = 0.12
- No cavity but still have a toothache is: P(~cavity AND toothache) = (.016 + .064) = 0.08
This gives a conditional joint distribution of <0.12, 0.08>, which covers all the possibilities for the reduced system, just as if it had only two variables. Unfortunately, it doesn't add up to 1. So we normalize it by summing the distributed probabilities and dividing each by that sum. That sum is alpha (actually 1/alpha, but who quibbles here, eh?): 0.12 + 0.08 = 0.20, and <0.12, 0.08> / 0.20 = <0.6, 0.4>. The sum turns out to be P(toothache), which is needed in the full calculation but can be extracted from the partial results. Et voilà!
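A sketch of the whole normalization step in plain Python. The full joint table for toothache/cavity/catch isn't reproduced on this page, so the values below are the standard AIMA example numbers and should be treated as an assumption here (only the toothache=true entries are confirmed by the figures quoted above):

```python
# Full joint distribution over (cavity, toothache, catch) -- assumed AIMA
# example values; the text above only quotes .108, .012, .016 and .064.
full_joint = {
    (True,  True,  True ): 0.108,
    (True,  True,  False): 0.012,
    (True,  False, True ): 0.072,
    (True,  False, False): 0.008,
    (False, True,  True ): 0.016,
    (False, True,  False): 0.064,
    (False, False, True ): 0.144,
    (False, False, False): 0.576,
}

# Sum out 'catch' over the entries where toothache is true.
unnormalized = [
    sum(p for (cavity, toothache, _), p in full_joint.items()
        if toothache and cavity == val)
    for val in (True, False)
]                                          # [0.12, 0.08]

# Normalize: divide by the sum, which is P(toothache) (i.e. 1/alpha).
p_toothache = sum(unnormalized)            # 0.20
p_cavity_given_toothache = [p / p_toothache for p in unnormalized]
print(p_cavity_given_toothache)            # approximately [0.6, 0.4]
```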
Joint probability tables can be rather unwieldy as they scale up. Fortunately, if one can find subsets of variables that are independent or conditionally independent of one another -- effectively they have no direct influence over each other -- those relationships can be removed from the joint probability table. This produces a conditional probability table for each variable which contains entries for only the parent variables that directly affect it. In effect it creates a network of variables that influence others. This is described in AIMA section 14.1 starting on page 510.
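As a final sketch (plain Python, with hypothetical variable names and made-up numbers, not the book's network), a conditional probability table can be stored as a map from parent-value combinations to the probability that the child is true:

```python
# Hypothetical CPT for a boolean variable 'grass_wet' with parents
# 'rain' and 'sprinkler'. Numbers are invented for illustration.
cpt_grass_wet = {
    # (rain, sprinkler): P(grass_wet=true | rain, sprinkler)
    (True,  True ): 0.99,
    (True,  False): 0.90,
    (False, True ): 0.85,
    (False, False): 0.05,
}

# Only the parents appear in the table; everything else is assumed
# conditionally independent of 'grass_wet' given these two.
p_wet = cpt_grass_wet[(True, False)]   # P(grass_wet | rain, ~sprinkler) = 0.90
p_dry = 1.0 - p_wet                    # P(~grass_wet | rain, ~sprinkler) = 0.10
print(p_wet, p_dry)
```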
Lastly, a discussion of Bayesian Probability can be found here.