Abstract for the 1998 American Association for the Advancement of Science (AAAS) Annual Meeting and Science Innovation Exposition
Philadelphia, Pennsylvania
Monday, February 16, 1998, 3:00pm-6:00pm
Track: Emerging Science: Transforming the Next Generation
Session number: 101.0
Information theory provides general solutions to the problem of how to recognize members of a group of related nucleic acid sequences. The average information in the sequence patterns at sites that interact with the same recognizer, Rsequence, represents the total sequence conservation of all of the sites. Rsequence is the sum of the information at each position in the site. This conservation can be visualized with a sequence logo, which displays the number of bits and the base frequencies at each position. Using the sequence logo, PCR*-based diagnostic tests for infectious microorganisms were designed by selecting primer sequences with high information content that flank an intervening region with low information content. Orthologous, heterogeneous 16S rDNA sequences from >100 pathogenic species were amplified and validated. The individual information, Ri, of a single member of a sequence family is the dot product of that sequence's vector and a weight matrix, Ri(b,l), which is based on the log2 of the frequency of each nucleotide (b) at each position (l). The average of the Ri values over a family of sites is Rsequence. Individual information can be used to quantify and visualize specific genetic elements in nucleic acid sequences. The Ri(b,l) matrix can be used to rank-order sequences, to search for new sequence elements, to compare sequences with other quantitative data such as binding energy or the distance between binding sites, to distinguish mutations from polymorphisms, to engineer sequences of a given strength, and to detect errors in databases. These properties will be demonstrated in studies of nucleotide substitutions in human splice donor and acceptor sites and in other binding sites.
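
The relationship between Ri, the Ri(b,l) weight matrix, and Rsequence described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the four bases are taken as equally likely in the background (so each position carries at most 2 bits), the small-sample correction is omitted, and a base never observed at a position gets a weight of negative infinity. The function names and the aligned example sites are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def frequencies(sites):
    """Per-position base frequencies f(b,l) from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    freqs = []
    for l in range(length):
        counts = Counter(s[l] for s in sites)
        freqs.append({b: counts.get(b, 0) / n for b in BASES})
    return freqs

def weight_matrix(freqs):
    """Weights Ri(b,l) = 2 + log2 f(b,l), in bits.

    Assumption: no small-sample correction; unobserved bases get -inf.
    """
    return [
        {b: (2.0 + math.log2(f) if f > 0 else float("-inf"))
         for b, f in column.items()}
        for column in freqs
    ]

def individual_information(seq, weights):
    """Ri of one sequence: sum of the weights selected by its bases.

    This equals the dot product of the one-hot sequence matrix with Ri(b,l).
    """
    return sum(weights[l][b] for l, b in enumerate(seq))

def r_sequence(freqs):
    """Rsequence: sum over positions of (2 - entropy at that position)."""
    total = 0.0
    for column in freqs:
        entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
        total += 2.0 - entropy
    return total

# Made-up aligned sites, loosely shaped like splice donor sequences.
sites = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGT", "TAGGTAAGA"]
freqs = frequencies(sites)
weights = weight_matrix(freqs)
ri_values = [individual_information(s, weights) for s in sites]
mean_ri = sum(ri_values) / len(ri_values)

# Rank-order the family members by individual information, strongest first.
for ri, s in sorted(zip(ri_values, sites), reverse=True):
    print(f"{s}  Ri = {ri:.2f} bits")
print(f"mean Ri = {mean_ri:.2f} bits, Rsequence = {r_sequence(freqs):.2f} bits")
```

Under these assumptions the mean of the Ri values over the sites used to build the matrix equals Rsequence, which is the identity the abstract states. Scoring a candidate sequence that was not in the original set with individual_information is the basis for the ranking and searching uses listed above.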