Probability Terminology
A brief brief on probability terminology and notation, with reference to the AIMA book.
The probability equations use symbols defined on page 244, in the Logic section of the AIMA book -- here's a scan. Unfortunately, many of those symbols are not in standard font sets (some of them appear to be in the Windows "Symbol" font but didn't reproduce correctly in my browser) and are not readily usable in source code or program console output. So, some of the operators and such have been mapped to more "logical" characters. The set used here and in the code is as follows:
- P (probability of): P(variable) = 0.6
- ~ (not or inverse): P(~variable) = 0.4
- OR (or, also "disjunct"): P(x OR y) = 0.5
- AND (and, also "conjunct"): P(x AND y) = 0.2
- , (joint): P(x,y) = 0.2
- | (given): P(x|y)
- <> (distribution): P(x) = <0.6, 0.4> or P<>(x) = <0.6, 0.4>
The domain of a variable is the set of values it can take. In most cases in this book variables are boolean, so they can have the values {true, false}, and these are apparently the default. In other cases variables may be multi-valued, e.g.:
P(weather) = {sunny, rain, cloudy, snow} = <0.6, 0.1, 0.29, 0.01>
where the {names} are the possible values and the <numbers> are their associated probabilities. For any particular value you could have this: P(weather=rain) = 0.1. By convention probabilities cover the range from 0 to 1 and each variable's set must add up to 1. The assigned probabilities, and calculations made directly with them, are called unconditioned or prior values, I suppose because they are defined before anything happens.
The logical operators are, well, logical and behave sort of as one might expect.
AND is the product of probabilities where the variables are independent -- when they have no effect on each other -- and have the specified value (the default value is true for all of the below):
P(x AND y) = P(x) * P(y)
and when not independent:
P(x AND y) = P(x) * P(y|x) = P(y) * P(x|y)
Note that when the variables are independent P(x) = P(x|y), so the second equation becomes the first.
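As a minimal sketch (plain Python with made-up numbers, not the book's or this page's code), the two forms of the product rule look like this:

```python
# Product rule for AND (conjunction). The numbers are illustrative only.
p_x = 0.6          # P(x)
p_y = 0.5          # P(y)
p_y_given_x = 0.3  # P(y|x), needed when x and y are not independent

# Independent variables: just multiply.
p_x_and_y_independent = p_x * p_y        # 0.30

# Dependent variables: scale by the conditional probability.
p_x_and_y_dependent = p_x * p_y_given_x  # 0.18

print(p_x_and_y_independent, p_x_and_y_dependent)
```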
OR is the sum where either variable has the specified value. When they are mutually exclusive -- only one can occur in any situation:
P(x OR y) = P(x) + P(y)
But when the variables are not mutually exclusive OR will count probabilities twice -- when both variables have the given values -- so "one copy" of those is subtracted from the sum:
P(x OR y) = P(x) + P(y) - P(x AND y)
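A corresponding sketch for OR (again plain Python; the numbers are made up but consistent):

```python
# OR (disjunction) via inclusion-exclusion. Illustrative numbers only.
p_x = 0.6
p_y = 0.3
p_x_and_y = 0.2   # probability that both are true

# Mutually exclusive case (P(x AND y) == 0): a simple sum would do.
# General case: subtract the doubly counted overlap.
p_x_or_y = p_x + p_y - p_x_and_y   # 0.7
print(p_x_or_y)
```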
Joint is actually a synonym for AND: it means to combine the probabilities of the provided variables by multiplying. When the variables are absolutely independent this is just the product of their probabilities. But often they will be conditioned such that the product needs to be scaled by the conditional probability.
Given means that we want to know the probability of one variable being a certain value given that (when) we already know the value of another variable. You can read the equation P(x|y) as: the probability that x is true given that y is true. This is a conditional or posterior probability, again I suppose because you find it after something happens. The value can be calculated with this identity:
P(x|y) = P(x AND y) / P(y)
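Here's that identity as a one-liner (plain Python, illustrative numbers):

```python
# Conditional probability from the identity P(x|y) = P(x AND y) / P(y).
p_x_and_y = 0.2
p_y = 0.5

p_x_given_y = p_x_and_y / p_y   # 0.4
print(p_x_given_y)
```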
Distributions are another tricky concept. The standard notation specifies that a bold P(x) means we want the whole set of possible values for the variable x -- rather than just the value where x is true -- but to make it easier on everyone the code uses:
P<>(x) = <0.6, 0.4>
to indicate that we want the distribution of probabilities. This is a bit more useful with multi-valued variables such as the P<>(weather) = <0.6, 0.1, 0.29, 0.01> example above.
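In code, a distribution can be represented as a mapping from domain values to their probabilities; a minimal sketch (plain Python) using the weather example above:

```python
# The P<>(weather) distribution from the text, as a dict keyed by domain value.
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}

# A distribution must cover the whole domain and sum to 1.
assert abs(sum(weather.values()) - 1.0) < 1e-9

print(weather["rain"])   # P(weather=rain) = 0.1
```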
Going further, the notation P(x,y) or P<>(x,y) means that we want the joint distribution of both variables, that is, the probabilities of all combinations of the values of each variable. If x and y are boolean that would be a 2x2 table like this:

          x=t    x=f
    y=t   .3     .2
    y=f   .4     .1
Note that, if x and y are the only variables in the system, the sum of
all values in the table should be 1 -- because it contains all the
possibilities -- and it would then be called a full joint distribution.
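A sketch of that table as a data structure (plain Python), with the full-joint check:

```python
# The 2x2 full joint distribution from the table above, as a nested dict:
# joint[x_value][y_value] = probability.
joint = {
    True:  {True: 0.3, False: 0.4},   # x = t
    False: {True: 0.2, False: 0.1},   # x = f
}

# A full joint distribution contains every combination, so it must sum to 1.
total = sum(p for row in joint.values() for p in row.values())
assert abs(total - 1.0) < 1e-9
```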
We can use a full joint distribution to calculate the marginal probability -- so called because folks wrote the results in the margins of the book -- of a particular variable. This would be the sum of all probabilities where the variable has a certain value. In the example above the marginal probability of x=true is .3 + .4 = .7, written as P(x) = 0.7. The marginal probabilities of each variable's values should sum to 1, so P(~x) = (.2 + .1) = .3, and .3 + .7 totals 1.
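Continuing the sketch, marginalization is just summing out the other variable:

```python
# Marginal probability of x from the same 2x2 table: sum over y.
joint = {
    True:  {True: 0.3, False: 0.4},
    False: {True: 0.2, False: 0.1},
}

p_x = sum(joint[True].values())        # P(x)  = 0.3 + 0.4 = 0.7
p_not_x = sum(joint[False].values())   # P(~x) = 0.2 + 0.1 = 0.3
assert abs(p_x + p_not_x - 1.0) < 1e-9
print(p_x, p_not_x)
```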
A further operational trick is normalization, indicated by the lower case Greek letter alpha: α. This operation allows one to avoid a bunch of calculation when evaluating a conditional probability over only part of a joint distribution. It is explained in detail on AIMA page 493, but to cut to the chase... If you have a system with more than two variables and you want to calculate a conditional probability using only a subset of the variables -- in other words you don't care about the values of some variables -- you can proceed like this (given the toothache, cavity, catch model and ignoring catch):
P<>(cavity | toothache) = α P<>(cavity, toothache)
where alpha indicates that we want to normalize the resulting distribution such that it adds up to 1. From the example full joint distribution the probability of having:
- Both a cavity and a toothache is: P(cavity AND toothache) = (.108 + .012) = 0.12
- No cavity but still have a toothache is: P(~cavity AND toothache) = (.016 + .064) = 0.08
This gives a conditional joint distribution of <0.12, 0.08>, which covers all the possibilities for the reduced system, just as if it had only two variables. Unfortunately, it doesn't add up to 1. So we normalize it by summing the distributed probabilities and dividing each by that sum. That sum is alpha (actually 1/alpha, but who quibbles here, eh?): 0.12 + 0.08 = 0.20, and <0.12, 0.08> / 0.20 = <0.6, 0.4>. The sum turns out to be P(toothache), which is needed in the full calculation but can be extracted from the partial results. Et voilà!
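A sketch of the whole normalization step in plain Python. The full joint table for toothache/cavity/catch isn't reproduced on this page, so the values below are the standard AIMA example numbers and should be treated as an assumption here (only the toothache=true entries are confirmed by the figures quoted above):

```python
# Full joint distribution over (cavity, toothache, catch) -- assumed AIMA
# example values; the text above only quotes .108, .012, .016 and .064.
full_joint = {
    (True,  True,  True ): 0.108,
    (True,  True,  False): 0.012,
    (True,  False, True ): 0.072,
    (True,  False, False): 0.008,
    (False, True,  True ): 0.016,
    (False, True,  False): 0.064,
    (False, False, True ): 0.144,
    (False, False, False): 0.576,
}

# Sum out 'catch' over the entries where toothache is true.
unnormalized = [
    sum(p for (cavity, toothache, _), p in full_joint.items()
        if toothache and cavity == val)
    for val in (True, False)
]                                          # [0.12, 0.08]

# Normalize: divide by the sum, which is P(toothache) (i.e. 1/alpha).
p_toothache = sum(unnormalized)            # 0.20
p_cavity_given_toothache = [p / p_toothache for p in unnormalized]
print(p_cavity_given_toothache)            # approximately [0.6, 0.4]
```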
Joint probability tables can be rather unwieldy as they scale up. Fortunately, if one can find subsets of variables that are independent or conditionally independent of one another -- effectively they have no direct influence over each other -- those relationships can be removed from the joint probability table. This produces a conditional probability table for each variable which contains entries for only the parent variables that directly affect it. In effect it creates a network of variables that influence others. This is described in AIMA section 14.1 starting on page 510.
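As a final sketch (plain Python, with hypothetical variable names and made-up numbers, not the book's network), a conditional probability table can be stored as a map from parent-value combinations to the probability that the child is true:

```python
# Hypothetical CPT for a boolean variable 'grass_wet' with parents
# 'rain' and 'sprinkler'. Numbers are invented for illustration.
cpt_grass_wet = {
    # (rain, sprinkler): P(grass_wet=true | rain, sprinkler)
    (True,  True ): 0.99,
    (True,  False): 0.90,
    (False, True ): 0.85,
    (False, False): 0.05,
}

# Only the parents appear in the table; everything else is assumed
# conditionally independent of 'grass_wet' given these two.
p_wet = cpt_grass_wet[(True, False)]   # P(grass_wet | rain, ~sprinkler) = 0.90
p_dry = 1.0 - p_wet                    # P(~grass_wet | rain, ~sprinkler) = 0.10
print(p_wet, p_dry)
```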
Lastly, a discussion of Bayesian Probability can be found here.