Information Theory Simplified(?)
by way of explanation
Shannon's Information Theory can be a useful tool for measuring the amount of order in a system. However, the name is somewhat misleading... Information in this context has nothing to do with Meaning, and this trips up many newcomers. It has also been suggested (I believe I owe this coinage to Murray Gell-Mann, but my belief is not strong enough to generate quotation marks) that it should have been called Confusion Theory, partially because what is being measured is the amount of Randomness in a system. A system that has one-and-only-one state for all time has an Information value of 0, whereas the seemingly random universe has an estimated value of 10^91 bits (http://www.mdpi.com/1099-4300/10/3/150/pdf).
Adding to the confusion, the Information Measure is called Entropy -- we have Johnny von Neumann to thank for this, as he was looking over Claude Shannon's shoulder when names were being assigned and noticed that the soon-to-become-familiar equation looked just like the one for Thermodynamic Entropy. This also trips up many newcomers, but in-probable-fact it is an apt analogy (see the Maxwell's Demon discussion on the wiki page: http://en.wikipedia.org/wiki/Entropy_%28information_theory%29). There are also many <IMHO>misguided</IMHO> attempts to use entropy as a measure of complexity, which is often an orthogonal quantity to randomness (the Crutchfield & Young 1989 paper chases some of that down if you are really interested).
So what is Shannon Information? From the wiki page, the entropy of a probability distribution X is:

H(X) = -Σ P(x_i) * log_b(P(x_i))   (summed over all outcomes x_i in X)

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is bit for b = 2, nat for b = e, and dit (or digit) for b = 10. The scale of the entropy value is selected by the base of the log function used, and the most common scale is log2, giving a measure in bits.
Fortunately it is easy to convert between scales because:

log2(N) = log10(N) / log10(2)

So you can do all your calculations with "regular" log functions and then divide the final result by log10(2).
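To make that conversion concrete, here is a minimal sketch (in Python; the post itself works in a spreadsheet, so treat this as illustration rather than the original) that computes the fair-coin entropy with base-10 logs and then divides by log10(2). It matches the direct base-2 calculation:

    import math

    # Minimal sketch: entropy of a fair coin computed in dits (base 10),
    # then converted to bits by dividing by log10(2).
    fair_coin = [0.5, 0.5]

    h_dits = -sum(p * math.log10(p) for p in fair_coin)        # ~0.30103 dits
    h_bits_converted = h_dits / math.log10(2)                   # dits -> bits
    h_bits_direct = -sum(p * math.log2(p) for p in fair_coin)   # 1.0 bit

    print(h_dits, h_bits_converted, h_bits_direct)   # ~0.30103, ~1.0, 1.0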
What the equation says is: Given a probability distribution X, i.e., a set of probabilities for things that might happen where the full set adds up to 1.0, such as heads or tails in a coin flip (2 equal probabilities of 0.5 each) or the numbers rolled on a six-sided die (6 equal values of 0.1666... each), the Shannon Entropy, H(X), is the negative of the sum of each probability in the set times the logarithm of that probability. The amazing thing is that this works just as well for things that are not equally probable. What the entropy measures is how random the distribution is. If the coin or die is weighted, or unfair, the measured entropy will be less than that predicted for the equal distribution. It puts a fundamental upper limit on how much Information you can have in a particular system.
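To see that upper limit in action, here is a small Python sketch: the fair die hits the log2(6) ceiling, while a weighted die (the weights below are made-up numbers, purely for illustration) comes in lower:

    import math

    def shannon_entropy_bits(probs):
        # H(X) = -sum(p * log2(p)), skipping zero-probability events
        return -sum(p * math.log2(p) for p in probs if p > 0)

    fair_die = [1/6] * 6
    loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # made-up weighting, sums to 1.0

    print(shannon_entropy_bits(fair_die))     # ~2.585 bits, the ceiling of log2(6)
    print(shannon_entropy_bits(loaded_die))   # ~2.161 bits, less than the fair die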
In many situations it is helpful to think of the Shannon Information Entropy as a measure of how surprised you are to see a new piece of data. For instance, our old friend the coin flip: if the coin is fair, every flip gives you some new and unexpected (not predictable) result. You get one more bit of information, and a sequence of 8 coin flips gives you 8 bits. But if the coin had two heads you would not be surprised to get a head after every flip, and you would have 0 bits of new information. Since many interesting things are boolean in nature, bits of Information are quite useful. Getting entropy values in bits also makes it possible to estimate the amount of computer storage one might need: a byte of storage can represent 256 equally likely values and contains 8 bits.
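The same two extremes in code, as a sketch (Python again, with the two-headed coin represented simply as a distribution where P(heads) = 1.0):

    import math

    def shannon_entropy_bits(probs):
        # summing -p*log2(p) keeps the certain and zero-probability cases tidy
        return sum(-p * math.log2(p) for p in probs if p > 0)

    # Fair coin: 1 bit per flip, so 8 independent flips carry 8 bits --
    # equivalently, 2**8 = 256 equally likely sequences, one byte's worth.
    print(8 * shannon_entropy_bits([0.5, 0.5]))   # 8.0
    print(math.log2(256))                         # 8.0, same thing

    # Two-headed coin: heads is certain, so no surprise and no new information.
    print(shannon_entropy_bits([1.0, 0.0]))       # 0.0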
To illustrate I made a little spreadsheet that calculates the entropy in dits and bits for some simple probability distributions. The basic calculation process is (the same steps are sketched in code after this list):
- Get a list of events that can occur, with their associated probabilities. For a coin this would be heads or tails, each with a probability of 0.5;
- Multiply each probability, P, by its logarithm. You can use any base you like, but for instance: N = P * log10(P). Note that for P = 0 the log is undefined, and it is logically ok to use N = 0. For the coin with P = 0.5, you get: N = -0.150515;
- Sum up all the N's that you calculated and negate the result. This will give you the entropy value in decimal dits. Note that you need to negate it because logarithms of numbers between 0 and 1 are negative, but by convention we want to end up with a positive result. Again for the coin, you have two equal probabilities -- one for heads and one for tails -- so your -Sum = 0.30103;
- To get entropy in bits, divide your dits by log10(2). This will give you: Bits = 0.30103 / 0.301029996 = 1... Not surprisingly, a coin flip gives you one bit of information because the result was either heads or tails!
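Here is that recipe as a short Python sketch (my own translation of the spreadsheet steps, not the spreadsheet itself), including the P = 0 convention:

    import math

    def entropy_dits_and_bits(probs):
        # Step 2: N = P * log10(P), treating P = 0 as contributing 0
        terms = [p * math.log10(p) if p > 0 else 0.0 for p in probs]
        # Step 3: sum the N's and negate to get a positive value in dits
        dits = -sum(terms)
        # Step 4: divide by log10(2) to convert dits into bits
        bits = dits / math.log10(2)
        return dits, bits

    print(entropy_dits_and_bits([0.5, 0.5]))   # (~0.30103, 1.0)     fair coin
    print(entropy_dits_and_bits([0.8, 0.2]))   # (~0.21732, ~0.7219) unfair 80%/20% coin
    print(entropy_dits_and_bits([1/6] * 6))    # (~0.77815, ~2.585)  fair six-sided die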
On the spreadsheet I calculate the Entropy of the following (the binned cases are approximated in a sketch after this list):
- A fair coin, as we saw above: H(X) = 1 bit;
- An unfair 80%/20% coin: H(X) = 0.7219 bits;
- An 8 bit number, in a fairly complicated way, but strangely enough: H(X) = 8 bits;
- A six-sided fair die: H(X) = 2.584 bits;
- The entropy of several data distributions. Some are made up and some are generated by Excel distribution functions and then histogrammed into 13 bins. They suffer from having only around 100 samples per distribution and the coarse-grained binning, but they serve to illustrate that as things get more evenly distributed -- i.e., more random -- their entropy value goes up:
  - A single bin with probability = 1.0: H(X) = 0 bits;
  - A wider step function having 3 bins with P = 0.33 each: H(X) = 1.584 bits;
  - A "normal" Gaussian distribution with stddev = 1: H(X) = 2.195 bits;
  - Another "normal" Gaussian distribution with stddev = 3: H(X) = 3.271 bits;
  - A generated "uniform" distribution, which is not very uniform: H(X) = 3.505 bits;
  - An (artificially) flat distribution made by hand: H(X) = 3.789 bits.
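The spreadsheet itself isn't reproduced here, but the binned-distribution idea is easy to approximate. The sketch below (Python with numpy; the sample size, bin count, and bin range are my own stand-ins for whatever the spreadsheet used) draws ~100 samples, histograms them into 13 fixed bins, and computes the entropy of the bin probabilities. The exact numbers will not match the spreadsheet's, but the trend should: the more spread out the samples, the higher the entropy.

    import math
    import numpy as np

    def entropy_bits(probs):
        return sum(-p * math.log2(p) for p in probs if p > 0)

    def binned_entropy(samples, n_bins=13, value_range=(-6.5, 6.5)):
        # histogram into fixed bins, then treat the bin fractions as probabilities
        counts, _ = np.histogram(samples, bins=n_bins, range=value_range)
        probs = counts / counts.sum()
        return entropy_bits(probs)

    rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
    narrow = rng.normal(0, 1, size=100)       # Gaussian, stddev = 1
    wide = rng.normal(0, 3, size=100)         # Gaussian, stddev = 3
    flat = rng.uniform(-6.5, 6.5, size=100)   # roughly uniform over the binned range

    print(binned_entropy(narrow))   # stddev = 1: mass piles into a few central bins
    print(binned_entropy(wide))     # stddev = 3: spread over more bins, higher entropy
    print(binned_entropy(flat))     # should land nearest the log2(13) ~ 3.70 ceiling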