Bayes' Theorem
v1.1
The Bayesian thing is pretty cool once you wrap your head around it (presuming that I have). To start, I found this nice page that describes what it means from my particularly practical standpoint: http://yudkowsky.net/rational/bayes
Basic Bayes Background
This is pretty much verbatim from section 13.5.1 of AIMA (page 496). I'm going to use disease for cause and symptom for effect because the subsequent examples are medically motivated. The Bayes equation is:
P(disease | symptom) = (P(symptom | disease) * P(disease)) / P(symptom)
Which reads as:
The Probability of having a disease, given that you have a symptom
EQUALS
The Probability of having that symptom, given that you have that disease
TIMES
The Probability of that disease
(both)
DIVIDED BY
The Probability of having that symptom
In the meningitis example given there, this becomes:
- P(symptom | disease) = 0.7 -- 70% of meningitis patients have a stiff neck
- P(disease) = 0.00002 -- 1/50,000 of the population has meningitis
- P(symptom) = 0.01 -- 1% of the population has a stiff neck, no matter the cause
Plugging those numbers into the Bayes equation gives:
(0.7 * 0.00002) / 0.01 = 0.0014
So about 0.14% of the people who have a stiff neck also have meningitis, or roughly 1 in 700.
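
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the equation above. The function name `posterior` and its parameter names are made up for illustration; they are not from AIMA.

```python
def posterior(p_effect_given_cause, p_cause, p_effect):
    """Bayes: P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    if p_effect <= 0:
        raise ValueError("P(effect) must be greater than zero")
    return p_effect_given_cause * p_cause / p_effect


# Meningitis numbers from the text:
# P(stiff neck | meningitis) = 0.7, P(meningitis) = 1/50,000, P(stiff neck) = 0.01
p_meningitis_given_stiff_neck = posterior(0.7, 0.00002, 0.01)
print(f"{p_meningitis_given_stiff_neck:.4f}")  # 0.0014
```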
Running the Numbers
The Yudkowsky page is way long. Reduced to the minimum it says that, for a seemingly binary test, there are four possible results:
- True Positives -- Positive results that are correct;
- False Negatives -- Negative results that are actual positives;
- False Positives -- Positive results that are actual negatives;
- True Negatives -- Negative results that are correct.
To figure out what a test result means you really need to know the False bits (the False Positive and False Negative rates) and one other piece of information: the expected rate in the population, or Prior.
For the Breast Cancer example on the Yudkowsky page it goes like this:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
To recap:
- Actual rate in the population: 1% = 0.01
- True Positives: 80% = 0.80
- False Negatives: 20% = 0.20
- False Positives: 9.6% = 0.096
Out of a Population of 10,000 where the Rate is 0.01, there should be 100 who are Real Positives, and the test results will be:
- True Positive: 80 with cancer and positive [0.01 * 0.80 * 10,000];
- False Negative: 20 with cancer and negative [0.01 * 0.20 * 10,000];
- False Positive: ~950 without cancer and positive [0.096 * (10,000 - 100)];
- True Negative: ~8,950 without cancer and negative [(1 - 0.096) * (10,000 - 100)].
(Note that the numbers magically add up to 10,000!) So 1030 [80 + 950] will test Positive, but a positive result is really positive only about 7.8% of the time [80 / 1030 = 0.0776]. Without knowing the "expected prior probability" of the result we cannot evaluate its actual probability. This also works nicely in the degenerate cases. If you have a perfect test with 0% False Positive/Negative results, you get the expected 100 True Positives and 9900 True Negatives. And if you have 0% Real Positives in the population you get the expected 960 False Positives and 9040 True Negatives.
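
The four-way split is easy to tabulate. The following is a rough sketch, not anything from the Yudkowsky page; `expected_counts` and its parameter names are hypothetical. It leaves the counts unrounded, so it reports 950.4 False Positives where the text rounds to 950.

```python
def expected_counts(population, rate, true_positive_rate, false_positive_rate):
    """Split a population into the four expected test outcomes."""
    with_condition = population * rate
    without_condition = population - with_condition
    return {
        "true_positive": true_positive_rate * with_condition,
        "false_negative": (1 - true_positive_rate) * with_condition,
        "false_positive": false_positive_rate * without_condition,
        "true_negative": (1 - false_positive_rate) * without_condition,
    }


# Breast cancer example: 10,000 women, 1% rate, 80% True Positives, 9.6% False Positives
counts = expected_counts(10_000, 0.01, 0.80, 0.096)
positive_tests = counts["true_positive"] + counts["false_positive"]
print(f"{counts['true_positive'] / positive_tests:.3f}")  # 0.078
```

Passing `true_positive_rate=1.0, false_positive_rate=0.0` or `rate=0.0` reproduces the two degenerate cases.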
So, why's that Bayesian?
Let's do it another way then. Here's what we know in a different light:
- Rate in the population: 1% = 0.01 -- aka P(disease)
- True Positives: 80% = 0.80 -- aka P(symptom | disease)
- Positive Tests: 1030/10,000 = 0.103 -- aka P(symptom)
Plugging that into Bayes we get:
(P(symptom | disease) * P(disease)) / P(symptom) = (0.80 * 0.01) / 0.103
Which, incredibly enough, gives us:
P(disease | symptom) = 0.0776, so 7.8% of Test Positives are Real Positives (!!!)
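
The population size is not actually needed: P(symptom) can be built directly from the rates, as P(symptom | disease) * P(disease) + P(symptom | no disease) * P(no disease). A small sketch under that assumption follows; the function name `posterior_from_rates` is made up for illustration.

```python
def posterior_from_rates(prior, true_positive_rate, false_positive_rate):
    """P(disease | positive) using only rates, no head count."""
    # Total probability of a positive test across both groups.
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive


# Breast cancer example: prior 0.01, True Positives 0.80, False Positives 0.096
print(f"{posterior_from_rates(0.01, 0.80, 0.096):.4f}")  # 0.0776
```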
Finding the Prior
Estimating the expected Prior probability gives you a better handle on the problem, and a way to revise the actual results. But I think you can also run all this backwards to calculate the Real Positive count when it is not known, like this.
The things we know are:
- Population = 10,000
- PositiveTest = FalsePositive + TruePositive = 1030
- FalsePositive = 0.096 * (Population - RealPositive) = ((0.096 * Population) - (0.096 * RealPositive))
- TruePositive = 0.80 * RealPositive
Therefore:
PositiveTest = ((0.096 * Population) - (0.096 * RealPositive)) + (0.80 * RealPositive)
PositiveTest - (0.096 * Population) = RealPositive * (0.80 - 0.096)
RealPositive = (PositiveTest - (0.096 * Population)) / (0.80 - 0.096)
RealPositive = (1030 - (0.096 * 10,000)) / 0.704 = 99.43 ~= 100
!!! I think I did that right anyway !!! (The shortfall from 100 comes from rounding the 950.4 False Positives down to 950 earlier; with 1030.4 Positive Tests the answer is exactly 100.)
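
The same rearrangement as a minimal Python sketch, assuming the True Positive and False Positive rates of the test are known and differ from each other. The function name `estimate_real_positives` is hypothetical.

```python
def estimate_real_positives(positive_tests, population,
                            true_positive_rate, false_positive_rate):
    """Solve PositiveTest = FPR * (Population - Real) + TPR * Real for Real."""
    if true_positive_rate == false_positive_rate:
        # The test carries no information, so the equation has no unique solution.
        raise ValueError("rates must differ to solve for RealPositive")
    return ((positive_tests - false_positive_rate * population)
            / (true_positive_rate - false_positive_rate))


print(round(estimate_real_positives(1030, 10_000, 0.80, 0.096), 2))  # 99.43
```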
Just for the Exercise
Chapter 13 of the AIMA book covers this sort of probability reasoning
and Exercise 13.15, AIMA page 508, is exactly this problem:
After your yearly checkup, the doctor has bad news and good news. The
bad news is that you tested positive for a serious disease and that the
test is 99% accurate (i.e., the probability of testing positive when
you do have the disease is 0.99, as is the probability of testing
negative when you don't have the disease). The good news is that this
is a rare disease, striking only 1 in 10,000 people of your age. What
are the chances that you actually have the disease?
To recap:
- Actual rate in the population: 1/10,000 = 0.0001
- True Positives: 99% = 0.99
- True Negatives: 99% = 0.99
Giving:
- False Positives: 1% = 0.01
- False Negatives: 1% = 0.01
Out of a Population of 1,000,000 where the Rate is 0.0001, there should be 100 who are Real Positive. The test results will be:
- True Positive: 99 with the disease and positive [0.0001 * 0.99 * 1,000,000];
- False Negative: 1 with the disease and negative [0.0001 * 0.01 * 1,000,000];
- False Positive: 9,999 without the disease and positive [0.01 * (1,000,000 - 100)];
- True Negative: 989,901 without the disease and negative [(1 - 0.01) * (1,000,000 - 100)].
So the probability of being a Real Positive, given a positive test, is 99/10,098 ~= 0.0098 or 0.98% [99 True Positives out of 99 + 9,999 Positive Tests]. Why bother going to the doctor at all?
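
As a cross-check on the exercise, here is a sketch that does the same calculation with exact fractions from Python's standard `fractions` module, so no rounding creeps in. The function name `posterior_exact` is invented for this example and is not from AIMA.

```python
from fractions import Fraction


def posterior_exact(prior, true_positive_rate, true_negative_rate):
    """P(disease | positive test), computed with exact rational arithmetic."""
    false_positive_rate = 1 - true_negative_rate
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive


# Exercise 13.15: 1 in 10,000 have the disease, test is 99% accurate both ways
answer = posterior_exact(Fraction(1, 10_000), Fraction(99, 100), Fraction(99, 100))
print(answer)                  # 1/102
print(f"{float(answer):.4f}")  # 0.0098
```

So 99/10,098 reduces to exactly 1 in 102.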