Now that you have a basic understanding of information theory, we'll
give you a simplified mathematical derivation of the formulas.
First, we've seen that to make a choice between two equiprobable symbols
requires one bit of information, while four symbols require two bits.
Going one step further, you should see that
eight symbols require three bits to pick one of them.
From this we see the relationship:

    2^b = M        (1)

or

    b = \log_2 M        (2)

where b is the number of bits required to determine which of M different symbols is the current one.
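The relationship in equation (2) is easy to check numerically; a minimal Python sketch (not part of the original article):

```python
import math

# b = log2(M): the number of bits needed to single out
# one of M equally likely symbols
for M in (2, 4, 8, 16):
    b = math.log2(M)
    print(f"M = {M:2d} symbols -> b = {b:g} bits")
```

Doubling the number of symbols always adds exactly one bit, which is the pattern described above for 2, 4, and 8 symbols.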
This formula assumes that all of the symbols are equally likely, but what if they
are not?
First, we rearrange equation (2). Since (1/M)^{-1} = M, we can substitute this into (2), yielding the equation

    b = \log_2 (1/M)^{-1}.

By pulling out the exponent (according to the rule that \log_2 x^n = n \log_2 x) and by using the rule that -1 \cdot a = -a, we obtain the equation:

    b = -\log_2 (1/M)        (3)
For M equally probable symbols, 1/M is the probability of each symbol, P, so:

    b = -\log_2 P        (4)
While information theory
[5,6,7]
is based on probabilities,
sequence data can only provide frequencies, so we must use those instead, and we have to be a little careful because the two differ. A probability is the chance that something (in this case, a particular symbol) will occur at a specific location in an infinite set, whereas a frequency is the number of times something occurs in a finite set (our sequence) divided by the number of elements in that set. That is, a probability describes an entire population, while a frequency comes from a sample of that population.
So, the larger the sample one is working with, the closer the frequency will be to the
probability, but the frequency will only be an approximation of the
probability.
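This convergence of frequency toward probability can be illustrated with a small simulation; a sketch in Python (the symbol alphabet and the probability 0.3 are hypothetical, chosen only for illustration):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible
p_a = 0.3       # hypothetical true probability of symbol 'a'

# Draw ever-larger samples and watch the observed frequency
# of 'a' approach its underlying probability
for n in (100, 10_000, 1_000_000):
    sample = random.choices("ab", weights=[p_a, 1 - p_a], k=n)
    freq = sample.count("a") / n
    print(f"n = {n:>9,}: frequency of 'a' = {freq:.4f}")
```

The frequency is never exactly the probability, but the larger the sample, the smaller the typical discrepancy.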
Now, suppose that we have M different symbols (like a, b, c, etc.)
and that
each has a different probability
P_{i} where i denotes which symbol is being referred to.
From these, we
create a sequence that is N symbols long (for example, the sequence
aabcba has N = 6 and M = 3).
When there is a large sample size, the frequency
of a symbol approximates its probability,
so we substitute frequency for probability in equation (4).
Because of this substitution, it is important to use a correction for small sample
sizes [8], but we won't discuss that here.
Finally, we define probability so that the sum of the probabilities over all M symbols is 1 (100%):

    \sum_{i=1}^{M} P_i = 1.

In other words, no symbol outside the set will appear.
When a symbol appears, you are surprised to see it to some degree: if a relatively common symbol appears, you are not very surprised, but if a very rare symbol appears, you are extremely surprised.
Tribus [14] quantified this ``surprisal'' and defined it as:

    u_i = -\log_2 P_i        (5)
Thus, when the probability of a symbol is low its surprisal is high, and when the probability is high its surprisal is low. Notice the resemblance of equation (5) to equation (4).
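Equation (5) can be made concrete with two example probabilities; a minimal Python sketch (the probabilities 1/2 and 1/1024 are illustrative, not from the article):

```python
import math

def surprisal(p):
    """Tribus's surprisal u = -log2(p), measured in bits."""
    return -math.log2(p)

print(surprisal(1 / 2))     # common symbol: 1 bit, not very surprising
print(surprisal(1 / 1024))  # rare symbol: 10 bits, very surprising
```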
No matter what symbol you receive, your uncertainty before
you receive it is the same because a symbol you don't have
can't influence your uncertainty.
So uncertainty is the average surprisal for all of the N symbols.
Uncertainty, defined this way, has several properties which
are desirable for information theory
[5].
So if we sum the surprisals for each symbol received and divide by N (the total number of symbols), we obtain the average surprisal:

    H = (u_1 + u_2 + \cdots + u_N) / N        (6)

where u_j is the surprisal of the j^{th} symbol received.
Using summation notation, this can be expressed as:

    H = \left( \sum_{i=1}^{M} N_i u_i \right) / N        (7)
where, once again, M is the number of symbol types,
N is the total number of symbols in the sequence,
N_{i} is the number of times the i^{th} symbol appears within the entire
sequence, and u_{i} is the surprisal for that symbol.
This summation is equivalent to expression (6) because we have regrouped the surprisals: instead of grouping them in the order in which they occur, we have grouped them by the symbol to which they relate. Consider the string xxyyzyyzzxxy. The first expression (6) would represent this as:

    H = (u_x + u_x + u_y + u_y + u_z + u_y + u_y + u_z + u_z + u_x + u_x + u_y) / 12        (8)

whereas the second expression (7) would represent it as:

    H = (4 u_x + 5 u_y + 3 u_z) / 12        (9)
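The equivalence of the two groupings can be verified numerically; a Python sketch using the example string above (here the frequencies of x, y, and z in the string stand in for their probabilities, as the text's later substitution allows):

```python
import math
from collections import Counter

s = "xxyyzyyzzxxy"
N = len(s)
counts = Counter(s)                             # N_i for each symbol type
freqs = {c: n / N for c, n in counts.items()}   # F_i = N_i / N

# Surprisal of each symbol type (frequencies used in place of probabilities)
u = {c: -math.log2(f) for c, f in freqs.items()}

# Expression (6): sum surprisals in the order received, divide by N
H_by_position = sum(u[c] for c in s) / N

# Expression (7): group the same surprisals by symbol type
H_by_symbol = sum(counts[c] * u[c] for c in counts) / N

print(H_by_position, H_by_symbol)  # identical: same terms, merely regrouped
```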
Now, by bringing the denominator inside the summation of (7), we get:

    H = \sum_{i=1}^{M} (N_i / N) u_i        (10)
Since the frequency of the i^{th} symbol (F_i) is found by dividing the number of times that symbol occurs (N_i) by the total number of symbols (N), F_i = N_i / N can be substituted into (10), giving:

    H = \sum_{i=1}^{M} F_i u_i        (11)
Also, since F_i gets closer to P_i as N gets larger (mathematically, \lim_{N \to \infty} F_i = P_i), and since the number of sequences one could be dealing with might be pretty high (after all, life has existed for billions of years, and will continue, we hope, to flourish), P_i can be substituted for F_i, giving:

    H = \sum_{i=1}^{M} P_i u_i        (12)
Finally, by substituting for u_i from equation (5), we obtain Shannon's famous general formula for uncertainty:

    H = - \sum_{i=1}^{M} P_i \log_2 P_i   (bits per symbol)        (13)
Tom Schneider
20030212