For the exact calculation of *E*(*H*_{nb}), note that there are four choices of base
at each position of a site. If one were to calculate *H* for every possible
combination and then average the results, there would be 4^{n} calculations to perform,
where *n* is the number of sites sequenced. The exact calculation would be
impractical for all but the smallest values of *n*: note that *n* = 17 implies
more than 10^{10} calculations.
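
To make the cost concrete, the direct approach can be sketched as follows. This is only an illustration for very small *n*, not the method used in the paper: the function name and the dictionary of base frequencies are hypothetical, and *H* is taken to be the uncertainty of the observed base frequencies, -Σ (*nb*/*n*) log2(*nb*/*n*).

```python
from collections import Counter
from itertools import product
from math import log2


def brute_force_expected_uncertainty(n, freqs):
    """Average H over all 4**n possible columns of n bases.

    freqs maps each base to its genomic frequency (hypothetical input).
    The loop body runs 4**n times, so this is usable only for tiny n.
    """
    total = 0.0
    for column in product("ACGT", repeat=n):
        # Probability of drawing exactly this ordered column of bases.
        prob = 1.0
        for base in column:
            prob *= freqs[base]
        # Uncertainty of the base frequencies observed in this column.
        h = 0.0
        for nb in Counter(column).values():
            f = nb / n
            h -= f * log2(f)
        total += prob * h
    return total


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
small_case = brute_force_expected_uncertainty(2, equal)
```

With two sequences and equiprobable bases, a quarter of the columns hold two identical bases (*H* = 0) and the rest hold two different bases (*H* = 1 bit), so the average should come out at 0.75 bits.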

Fortunately, the formula for a multinomial distribution allows one to
calculate many combinations at once (Breiman, 1969).
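If *na*, *nc*, *ng* and *nt* are the numbers of
A's, C's, G's and T's in a site and
*Pa*, *Pc*, *Pg*, *Pt* are the frequencies of each base in the genome,
then the probability of
obtaining a particular combination of *na* to *nt* (called *nb*)
is estimated by:

$$P_{nb} = \frac{n!}{na!\; nc!\; ng!\; nt!}\; Pa^{na}\, Pc^{nc}\, Pg^{ng}\, Pt^{nt} \tag{11}$$

where

$$n = na + nc + ng + nt .$$

The uncertainty of that combination is

$$H_{nb} = -\sum_{b=A}^{T} \frac{nb}{n} \log_2 \frac{nb}{n} .$$

Finally, to obtain the average uncertainty as decreased owing to sampling:

$$E(H_{nb}) = \sum_{\text{all } nb} P_{nb}\, H_{nb} \tag{13}$$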
As a practical matter, one should note that equation (11) can be calculated quickly by taking the logarithm of the right side and spreading out all the components (including the factorials) into a set of precalculated sums (followed by exponentiation).
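
A minimal sketch of that logarithm trick, assuming equation (11) is the multinomial probability *n*!/(*na*! *nc*! *ng*! *nt*!) times the product of each base frequency raised to its count. The function names are hypothetical, and all base frequencies are assumed to be strictly positive.

```python
from math import exp, log


def log_factorials(n):
    """Precalculated sums: table[k] = ln(k!) for k = 0..n."""
    table = [0.0] * (n + 1)
    for k in range(1, n + 1):
        table[k] = table[k - 1] + log(k)
    return table


def multinomial_probability(counts, freqs, log_fact):
    """Probability of the counts (na, nc, ng, nt) given base frequencies.

    Everything is summed in log space and exponentiated once at the end,
    so the large factorials are never formed explicitly.
    """
    n = sum(counts)
    log_p = log_fact[n]
    for nb, pb in zip(counts, freqs):
        log_p -= log_fact[nb]      # divide by nb!
        log_p += nb * log(pb)      # multiply by pb**nb (pb > 0 assumed)
    return exp(log_p)


table = log_factorials(4)
p_one_of_each = multinomial_probability((1, 1, 1, 1), (0.25,) * 4, table)
```

The table of log-factorials is built once for the largest *n* of interest and reused for every combination, which is where the time is saved.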

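The catch in formula (13) is to avoid calculating all 4^{n} combinations.
A nested series of sums will cover all the required combinations in
alphabetical order:

$$E(H_{nb}) = \sum_{na=0}^{n}\; \sum_{nc=0}^{n-na}\; \sum_{ng=0}^{n-na-nc} P_{nb}\, H_{nb}, \qquad nt = n - na - nc - ng$$

At the innermost sum, *nt* is fixed by the other three counts, so the product *P*_{nb} *H*_{nb} is evaluated only

$$\frac{(n+1)(n+2)(n+3)}{6}$$

times. Since this is polynomial in *n* rather than exponential, the calculation is feasible for far larger *n* than direct enumeration allows.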
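
The nested sums can be sketched as three loops over *na*, *nc* and *ng*, with *nt* determined by the remainder. This is an illustrative sketch rather than the original program: the function name is hypothetical, base frequencies are assumed strictly positive, *n* is assumed to be at least 1, and `math.lgamma` stands in for a precalculated table of log-factorials.

```python
from math import exp, lgamma, log, log2


def expected_uncertainty(n, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Return (E(Hnb), number of innermost evaluations) for n sequences.

    freqs holds (Pa, Pc, Pg, Pt); all must be > 0 and n must be >= 1.
    """
    log_fact = [lgamma(k + 1) for k in range(n + 1)]  # ln(k!)
    log_freq = [log(p) for p in freqs]
    total = 0.0
    evaluations = 0
    for na in range(n + 1):
        for nc in range(n - na + 1):
            for ng in range(n - na - nc + 1):
                nt = n - na - nc - ng  # fixed by the other three counts
                log_p = log_fact[n]
                h = 0.0
                for nb, lf in zip((na, nc, ng, nt), log_freq):
                    log_p += nb * lf - log_fact[nb]
                    if nb > 0:
                        f = nb / n
                        h -= f * log2(f)
                total += exp(log_p) * h
                evaluations += 1
    return total, evaluations


value_17, count_17 = expected_uncertainty(17)
```

For *n* = 17 the innermost statement runs 18 × 19 × 20 / 6 = 1140 times, compared with more than 10^{10} evaluations for direct enumeration.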

With large numbers of sites, the exact calculation of *E*(*H*_{nb}) still becomes
enormously expensive. For ribosome binding sites, *n* varies with
position in the site. Even if the entire sequence around the site were
available, there are sites at the 5' end of a transcript, so there are regions
in the aligned set that must be blank. It is not practical to calculate
*E*(*H*_{nb}) exactly when *n* is between 108 and 149 (for the range -60 to +40).
