Data for calculating
*R*_{sequence} comes from two sources. One is the
nucleotide sequences at which a recognizer has been shown to bind. The
other is the nucleotide composition of the genome in which
the recognizer functions. The sequences are aligned
by one base (the zero base) to give the largest
possible homology between them (see figure 9 for an example).
Some positions have little variation, while others have more. We
tabulate the frequency of each base *B* at each position *L* in the site,
to make
a table called *f*(*B*,*L*).
Focusing on one position at a time, we want to measure
the possible variations. For this we have chosen the "uncertainty" measure
introduced by Shannon in 1948 (Shannon, 1948; Shannon and Weaver, 1949;
Weaver, 1949;
Abramson, 1963; Singh, 1966; Gatlin, 1972; Sampson, 1976; Pierce, 1980;
Campbell, 1982; Schneider, 1984).

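As a rough illustration of the tabulation step, the following Python sketch builds a table in the spirit of *f*(*B*,*L*) from a handful of aligned sites. The function name `frequency_table`, the 0-based column index used for *L*, and the example sequences are all assumptions made for illustration; the sequences are invented and are not real binding sites.

```python
from collections import Counter

BASES = "ACGT"


def frequency_table(aligned_sites):
    """Return {position: {base: frequency}} for equal-length aligned sites.

    Positions are 0-based column indices here (an assumption of this sketch);
    they are not numbered relative to the zero base of the alignment.
    """
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    table = {}
    for pos in range(length):
        # Count how often each base occurs in this column of the alignment.
        counts = Counter(site[pos] for site in aligned_sites)
        table[pos] = {base: counts[base] / n for base in BASES}
    return table


# Hypothetical aligned sites, made up for illustration only.
sites = ["ACGT", "ACGA", "ACTT", "AGGT"]
f = frequency_table(sites)
```

Here the first column shows no variation (`f[0]["A"]` is 1.0), while the others are split between two bases, which mirrors the observation that some positions vary little and others more.
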
When there are *M* possible symbols, with
probabilities *P*_{i} (such that $\sum_{i=1}^{M} P_i = 1$),
the general formula for uncertainty is

$$H = -\sum_{i=1}^{M} P_i \log_2 P_i \quad \text{(bits per symbol)}$$

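One bit of information resolves the uncertainty of choice between two equally likely symbols. For nucleotide sequences, there are four possible symbols (*M* = 4), the bases A, C, G and T, so the uncertainty at one position can be at most $\log_2 4 = 2$ bits.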

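To make the uncertainty measure concrete, here is a minimal Python sketch of Shannon's formula, the negative sum of each probability times its base-2 logarithm. The function name `uncertainty` is hypothetical, and terms with zero probability are taken to contribute nothing, which is the usual convention.

```python
import math


def uncertainty(probabilities):
    """Shannon uncertainty in bits: -sum(p * log2(p)), skipping p == 0 terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


h_coin = uncertainty([0.5, 0.5])             # two equally likely symbols
h_bases = uncertainty([0.25] * 4)            # four equally likely bases
h_fixed = uncertainty([1.0, 0.0, 0.0, 0.0])  # a position with no variation
```

Two equally likely symbols give 1 bit, four equally likely bases give 2 bits, and a position that never varies gives 0 bits.
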
If we sequenced randomly in the genome
and aligned the sequences arbitrarily,
we would see all 4 bases, with probabilities *P*(*B*), and our
uncertainty about what base we would see next would be:

$$H_g = -\sum_{B \in \{A,C,G,T\}} P(B) \log_2 P(B) \quad \text{(bits per base)}$$

This number is close to 2 bits for the organism

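Once the sites are aligned, the uncertainty at position *L* is calculated in the same way from the frequencies *f*(*B*,*L*), and the information at that position is the decrease from the genomic uncertainty:

$$R_{sequence}(L) = \left[-\sum_{B} P(B) \log_2 P(B)\right] - \left[-\sum_{B} f(B,L) \log_2 f(B,L)\right] \quad \text{(bits per base)}$$

This is a measure of the sequence information gained by aligning the sites. The total information gained will be the total decrease in uncertainty:

$$R_{sequence} = \sum_{L} R_{sequence}(L) \quad \text{(bits per site)}$$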

(By summing, we make the simplifying assumption that the frequencies at one position are not influenced by those at another position. It is also possible to calculate

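The calculation described above, information as the decrease from genomic uncertainty to aligned-site uncertainty, summed over positions, might be sketched in Python as follows. The names `uncertainty` and `r_sequence`, the uniform genome composition, and the three-position frequency table are assumptions chosen for illustration. The sketch treats positions as independent and applies no correction for a small number of sites.

```python
import math


def uncertainty(probabilities):
    """Shannon uncertainty in bits: -sum(p * log2(p)), skipping p == 0 terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def r_sequence(freq_table, genome_probs):
    """Return (information per position, total information) in bits.

    freq_table maps position -> {base: f(B, L)}; genome_probs maps base -> P(B).
    """
    h_genome = uncertainty(genome_probs.values())
    per_position = {
        pos: h_genome - uncertainty(freqs.values())
        for pos, freqs in freq_table.items()
    }
    # Summing assumes frequencies at one position do not depend on another.
    return per_position, sum(per_position.values())


# Hypothetical inputs, made up for illustration only.
genome = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
table = {
    0: {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully conserved
    1: {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},      # two bases, equal
    2: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no preference
}
per_position, total = r_sequence(table, genome)
```

With these inputs the conserved position contributes 2 bits, the two-base position 1 bit, and the unconstrained position 0 bits, for a total of 3 bits.
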