
The Nitty Gritty Bit

Now that you have a basic understanding of information theory, we'll give you a simplified mathematical derivation of the formulas.

First, we've seen that to make a choice between two equiprobable symbols requires one bit of information, while four symbols require two bits. Going one step further, you should see that eight symbols require three bits to pick one of them. From this we see the relationship:

 
 $2^b = M$ (1)

or

 $b = \log_2 M$ (2)

where b is the number of bits required to determine which of M different symbols is the current one. This formula assumes that all of the symbols are equally likely, but what if they are not?

First, we rearrange equation (2). Since $(M^{-1})^{-1} = M$, we can substitute into (2), yielding the equation $b = \log_2 (M^{-1})^{-1}$. By pulling out the exponent (according to the rule that $\log_a x^n = n \log_a x$) and by using the rule that $M^{-1} = 1/M$ we obtain the equation:

 $b = -\log_2 \frac{1}{M}$ (3)

For M equally probable symbols, $1/M$ is the probability of each symbol, P, so:

 $b = -\log_2 P$ (4)
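
As a quick numerical check (our own illustration, not part of the original text), equations (2) and (4) can be evaluated with a few lines of Python:

    import math

    for M in (2, 4, 8):
        b = math.log2(M)            # equation (2): b = log2(M)
        P = 1.0 / M                 # each of the M symbols is equally probable
        print(M, b, -math.log2(P))  # equation (4): -log2(P) gives the same b
    # prints: 2 1.0 1.0, then 4 2.0 2.0, then 8 3.0 3.0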

While information theory [5,6,7] is based on probabilities, sequence data can only be used to generate frequencies, so we have to use frequencies instead, and we have to be a little careful since the two differ. Probability is the chance that something (in this case a particular symbol) will occur at any specific location in an infinite set, while frequency is the number of times something occurs in a finite set divided by the number of elements in the set (our sequence). That is, probability describes an entire population, while frequency comes from a sample of that population. So the larger the sample one is working with, the closer the frequency will be to the probability, but the frequency will only ever be an approximation of the probability.

Now, suppose that we have M different symbols (like a, b, c, etc.) and that each has a different probability Pi, where i denotes which symbol is being referred to. From these, we create a sequence that is N symbols long (for example, the sequence aabcba has N = 6 and M = 3). When the sample size is large, the frequency of a symbol approximates its probability, so we substitute frequency for probability in equation (4). Because of this substitution, it is important to use a correction for small sample sizes [8], but we won't discuss that here. Finally, we define the probabilities so that they sum to 1 (100%); in other words, no symbols outside the set will appear.
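
Here is a minimal Python sketch of this counting step, using the example string aabcba from the text (the variable names are ours):

    import collections

    seq = "aabcba"                     # the example from the text: N = 6, M = 3
    N = len(seq)
    counts = collections.Counter(seq)  # N_i: occurrences of each symbol
    freqs = {s: n / N for s, n in counts.items()}  # F_i = N_i / N
    print(freqs)  # {'a': 0.5, 'b': 0.333..., 'c': 0.1666...}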

Each time a symbol appears, you are surprised to some degree: if a relatively common symbol appears, you are not very surprised, but if a very rare symbol appears, you are extremely surprised. Tribus [14] quantified this ``surprisal'' and defined it as:

 $u_i = -\log_2 P_i$ (5)

Thus, when the probability of a symbol is low, its surprisal is high, and when the probability is high, its surprisal is low. Notice the resemblance of equation (5) to equation (4).
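
A small Python sketch (ours, not from the paper) makes the behavior of equation (5) concrete:

    import math

    def surprisal(p):
        # equation (5): u = -log2(p)
        return -math.log2(p)

    print(surprisal(0.5))    # a common symbol: 1.0 bit, mildly surprising
    print(surprisal(0.001))  # a rare symbol: ~9.97 bits, very surprising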

No matter what symbol you receive, your uncertainty before you receive it is the same, because a symbol you have not yet received cannot influence your uncertainty. So uncertainty is the average surprisal over all of the N symbols. Uncertainty, defined this way, has several properties which are desirable for information theory [5]. So if we sum the surprisals for each symbol received and divide by N (the total number of symbols), we obtain the average surprisal:

 $\frac{u_1 + u_2 + \cdots + u_N}{N}$ (6)

Using summation notation this can be expressed as:

 $\frac{\sum_{i=1}^{M} N_i u_i}{N}$ (7)

where, once again, M is the number of symbol types, N is the total number of symbols in the sequence, Ni is the number of times the ith symbol appears within the entire sequence, and ui is the surprisal for that symbol. This summation is equivalent to expression (6) because we have regrouped the surprisals. Instead of grouping them in the order in which they occur, we have grouped them by the symbol which they are related to. Consider the string xxyyzyyzzxxy. The first expression (6) would represent this as:

 $\frac{u_x + u_x + u_y + u_y + u_z + u_y + u_y + u_z + u_z + u_x + u_x + u_y}{12}$ (8)

whereas the second expression (7) would represent it as:

 $\frac{4u_x + 5u_y + 3u_z}{12}$ (9)
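
The regrouping can also be verified numerically. In this Python sketch (our illustration), the u values are computed from the string's own symbol frequencies, an assumption made only so the numbers are concrete; any u values would show the same equality:

    import collections
    import math

    seq = "xxyyzyyzzxxy"               # N = 12, M = 3
    N = len(seq)
    counts = collections.Counter(seq)  # N_x = 4, N_y = 5, N_z = 3
    # Assumption for illustration: take each u_i from the string's own frequencies.
    u = {s: -math.log2(n / N) for s, n in counts.items()}

    h6 = sum(u[s] for s in seq) / N                    # expression (6): in received order
    h7 = sum(n * u[s] for s, n in counts.items()) / N  # expression (7): grouped by symbol
    print(h6, h7, math.isclose(h6, h7))                # the same average, merely regrouped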

Now, by bringing the denominator inside the summation of (7), we get:

 $\sum_{i=1}^{M} \frac{N_i}{N} u_i$ (10)

Since the frequency of the ith symbol occurring (Fi) is found by dividing the number of times the particular symbol occurs (Ni) by the number of symbols (N), Fi can be substituted for $\frac{N_i}{N}$ giving:

 $\sum_{i=1}^{M} F_i u_i$ (11)

Also, since Fi gets closer to Pi as N gets larger (mathematically, that's $\lim_{N \to \infty} F_i = P_i$), and since the number of sequences one could be dealing with might be pretty high (after all, life has existed for billions of years, and will continue, we hope, to flourish), Pi can be substituted for Fi, giving:

 $\sum_{i=1}^{M} P_i u_i$ (12)
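
The claim above, that Fi approaches Pi as N grows, can be watched happening in a quick simulation. This Python sketch (ours) draws samples from an assumed toy distribution, not one from the paper, and tracks a single frequency:

    import random

    random.seed(0)                        # reproducible illustration
    P = {"a": 0.5, "b": 0.3, "c": 0.2}    # an assumed toy distribution
    symbols, weights = zip(*P.items())

    for N in (10, 1000, 100000):
        sample = random.choices(symbols, weights=weights, k=N)
        print(N, sample.count("a") / N)   # F_a wanders toward P_a = 0.5 as N grows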

Finally, by substituting for ui from equation (5), we obtain Shannon's famous general formula for uncertainty:


 \begin{displaymath}\framebox{$\displaystyle H = -\sum_{i=1}^{M}P_i\log_2P_i \;\;\;\;\;\mbox{(bits per symbol).}$}\end{displaymath} (13)
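
To close the loop, here is a short Python version of equation (13) (our sketch, not the paper's code); note how the uniform case recovers equation (2):

    import collections
    import math

    def uncertainty(probs):
        # equation (13): H = -sum_i P_i log2 P_i, in bits per symbol
        return -sum(p * math.log2(p) for p in probs if p > 0)

    M = 8
    print(uncertainty([1.0 / M] * M))  # uniform case: 3.0 bits = log2(8), as in (2)

    seq = "xxyyzyyzzxxy"
    freqs = [n / len(seq) for n in collections.Counter(seq).values()]
    print(uncertainty(freqs))          # about 1.55 bits per symbol for the example string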

