In
Fig. 1,
we show the curve
*R*_{sequence}(*L*) for either 61 (a), 17 (b) or 6
(c) *Hin*cII sites
(GTPyPuAC; Roberts, 1983)
chosen from the left end of bacteriophage T7 (Dunn and
Studier, 1983). Here, the G's in the *Hin*cII
sites have been placed at position
*L*=0, and
*R*_{sequence}(*L*) was calculated for 20 bases on either side.
There are
two major 2-bit peaks of information content surrounding a 1-bit valley in
curve (a). None of the curves go to zero (the solid straight line) outside
the sites, although they come close at several points. This effect is not
small: for six sites
(Fig. 1c)
the background is at 0.44 bits per base
so that with sequences 41 bases long,
*R*_{sequence} will be overestimated by
18 bits. A sampling error correction for *H*_{s}(*L*) (*e*(*n*), Appendix I)
can be joined
with *H*_{g} to give the final formula:

With this correction, the information content measured at various positions of an aligned set of random sequences will vary above and below zero. On the average it should be zero outside a binding site. The information content inside a site will rise above zero. These features can be seen in all figures, where the corrected zero is shown as a dashed line.
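
As a rough illustration of the behaviour described above, the sketch below computes a small-sample-corrected information curve for a set of aligned sites. It rests on several assumptions that are not taken from the text: the corrected measure is assumed to have the form *H*_{g} − [*H*_{s}(*L*) + *e*(*n*)], with *e*(*n*) taken as *H*_{g} minus the expected uncertainty of *n* bases drawn at random; the bases are assumed equiprobable by default; and the expectation is obtained by enumerating every possible base composition, which may differ from the procedure in Appendix I. The function names and the toy alignment are hypothetical.

```python
from itertools import product
from math import factorial, log2

BASES = "ACGT"
UNIFORM = (0.25, 0.25, 0.25, 0.25)  # assumed genomic base frequencies


def entropy_bits(counts):
    """Uncertainty (bits) of the base frequencies implied by a tuple of counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def expected_entropy(n, probs=UNIFORM):
    """Mean uncertainty of n bases drawn at random with frequencies probs.

    Enumerates every composition (na, nc, ng, nt) with na+nc+ng+nt = n and
    weights its uncertainty by the multinomial probability. Practical only
    for modest n, since the loop visits (n+1)^3 candidates.
    """
    total = 0.0
    for na, nc, ng in product(range(n + 1), repeat=3):
        nt = n - na - nc - ng
        if nt < 0:
            continue
        counts = (na, nc, ng, nt)
        ways = factorial(n)
        for c in counts:
            ways //= factorial(c)  # multinomial coefficient, exact integer
        prob = float(ways)
        for c, p in zip(counts, probs):
            prob *= p ** c
        total += prob * entropy_bits(counts)
    return total


def sampling_error(n, probs=UNIFORM):
    """Assumed e(n): genomic uncertainty minus the expected small-sample uncertainty."""
    h_g = -sum(p * log2(p) for p in probs if p > 0)
    return h_g - expected_entropy(n, probs)


def corrected_profile(seqs, probs=UNIFORM):
    """Corrected information (bits) at each column of equal-length A/C/G/T strings."""
    e_hnb = expected_entropy(len(seqs), probs)
    profile = []
    for column in zip(*seqs):
        counts = tuple(column.count(b) for b in BASES)
        profile.append(e_hnb - entropy_bits(counts))
    return profile


if __name__ == "__main__":
    toy_sites = ["GTCAAC", "GTTGAC", "GTCGAC", "GTTAAC"]  # made-up alignment
    for pos, r in enumerate(corrected_profile(toy_sites)):
        print(pos, round(r, 3))
```

Under these assumptions the correction shrinks as the number of sites grows, so a curve built from 61 sites is shifted far less than one built from 6. A fully conserved column scores the expected small-sample uncertainty rather than the full genomic value.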

The standard deviation reported for each
*R*_{sequence} is based on the
variance of *H*_{nb} (Appendix I),
which is sensitive to the number of sequence
examples, but not to the actual sequences. It is only a measure of variance
in the correction for small sample sizes; the variation in the information
content of individual sites will be described elsewhere. The variance of the
sampling correction is shown in all figures as a bar extending one standard
deviation above and below the
*R*_{sequence}(*L*) curve.
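
To make concrete why this standard deviation depends only on the number of examples, the following sketch computes the spread of the uncertainty of *n* randomly drawn bases by enumerating all base compositions. It is an illustration under stated assumptions, not the procedure of Appendix I: equiprobable bases are assumed by default, and the function name `entropy_mean_sd` is hypothetical.

```python
from itertools import product
from math import factorial, log2, sqrt


def entropy_mean_sd(n, probs=(0.25, 0.25, 0.25, 0.25)):
    """Mean and standard deviation (bits) of the uncertainty of n random bases."""
    mean = 0.0
    mean_sq = 0.0
    for na, nc, ng in product(range(n + 1), repeat=3):
        nt = n - na - nc - ng
        if nt < 0:
            continue
        counts = (na, nc, ng, nt)
        ways = factorial(n)
        for c in counts:
            ways //= factorial(c)  # multinomial coefficient, exact integer
        prob = float(ways)
        for c, p in zip(counts, probs):
            prob *= p ** c
        # Uncertainty of this particular composition.
        h = -sum((c / n) * log2(c / n) for c in counts if c > 0)
        mean += prob * h
        mean_sq += prob * h * h
    variance = max(mean_sq - mean * mean, 0.0)  # guard against rounding below zero
    return mean, sqrt(variance)


if __name__ == "__main__":
    for n in (6, 17, 61):  # the three sample sizes of Fig. 1
        mean, sd = entropy_mean_sd(n)
        print(n, round(mean, 3), round(sd, 3))
```

The function takes no sequences as input, only *n* and the assumed base frequencies, which mirrors the point that the reported bar reflects the sampling correction and not the particular sites.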
