Research Interests

There must be mathematical laws that describe nucleic acid sequences and molecular interactions; my goal is to find these laws. What aspects of nucleic acids can be approached, and what mathematics should one use? The most fruitful answer for me has been to apply Shannon’s information theory to nucleic acid binding sites. During my Ph.D. thesis work I discovered that in many genetic systems the information in the DNA or RNA sequences to which proteins bind is just enough for the sites to be found in the genome [9]. This result is surprising because the number of sites and the size of the genome are determined by history and physiology, so the amount of information in the binding sites must evolve toward the amount predicted from the genome size and the number of sites. I confirmed these ideas both experimentally [12] and by computer simulation [50]. (You can try this model on your own computer by going to http://alum.mit.edu/www/toms/papers/ev.) Thus my work has three major components: theory, computer analysis and genetic engineering experiments.
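The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in [9]: the function names, the toy alignment and the toy genome size are assumptions made for the example, the four bases are assumed equally likely in the genome (so a fully conserved position carries 2 bits), and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_needed(genome_size, num_sites):
    """Bits needed to pick num_sites positions out of genome_size positions."""
    return math.log2(genome_size / num_sites)


def information_in_sites(aligned_sites):
    """Bits conserved across an alignment of equal-length binding sites.

    Each position contributes 2 bits minus the Shannon entropy of the
    bases observed there (assumes equiprobable bases, no sampling correction).
    """
    total_bits = 0.0
    for pos in range(len(aligned_sites[0])):
        counts = Counter(site[pos] for site in aligned_sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total_bits += 2.0 - entropy
    return total_bits


# Hypothetical toy data: four aligned sites in a 4096-position genome with 16 sites.
toy_sites = ["ACGT", "ACGT", "ACGA", "ACGC"]
needed = information_needed(4096, 16)
observed = information_in_sites(toy_sites)
print(f"needed: {needed:.2f} bits, observed: {observed:.2f} bits")
```

In a real system the two numbers would be compared after aligning all known sites; an observed value well above the needed value is the kind of anomaly discussed in the next paragraph.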

Whenever one has a strong theory, anomalies are interesting. We have investigated several major ones at the lab bench because they lead to new insights into biology. One is the excess information found at bacteriophage T7 promoters [9, 12]. These sequences conserve twice as much information as the T7 polymerase requires to locate them in the presence of the bacterial genome. One possible explanation is that a second protein binds to the DNA. Alternatively the bacteriophage may be set up to overwhelm the bacterial defenses. We have found evidence supporting the latter hypothesis. In a second case, we discovered that the E. coli F plasmid incD region, which is responsible for correct plasmid partitioning to the daughter cells, has a three-fold excess conservation. This implies that three proteins bind there and we were able to identify three candidate binding proteins [17]. Another anomaly I found is unusually conserved bases involved in DNA replication and RNA transcription [22, 33]. Such cases can be detected by inspecting the sequence information along a binding site since the major groove of DNA can carry up to 2 bits of information while the minor groove can only support 1 bit. When the minor groove has more than 1 bit of information the DNA must not be in B form. We tested this idea in the bacteriophage P1 RepA system. Our experimental evidence suggests that the proteins are flipping bases out of the DNA to start helix melting, thereby initiating replication and transcription [54, 55].

Shannon’s measure of information has the form of an average, which raises the question: for binding sites, what are the individual components that make up this average? The obvious answer is to treat it as the average of the information of the individual sequences in the set of binding sites. This immediately allows one to write down an equation that defines the individual information, and Dr. John Spouge proved that this solution is unique [34].
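One way to make the averaging property concrete is the sketch below. It assumes a per-position weight of 2 + log2 f(b, l) bits, where f(b, l) is the frequency of base b at position l in the aligned sites; this omits any small-sample correction, and the function names and toy alignment are illustrative assumptions rather than the published implementation.

```python
import math
from collections import Counter


def weight_matrix(aligned_sites):
    """Per-position weights: 2 + log2(frequency) bits for each observed base."""
    n = len(aligned_sites)
    matrix = []
    for pos in range(len(aligned_sites[0])):
        counts = Counter(site[pos] for site in aligned_sites)
        matrix.append({base: 2.0 + math.log2(c / n) for base, c in counts.items()})
    return matrix


def individual_information(site, matrix):
    """Sum the weights along one sequence.

    A base never observed at a position has frequency 0, so it scores -inf.
    """
    return sum(matrix[pos].get(base, float("-inf")) for pos, base in enumerate(site))


# Hypothetical toy alignment.
toy_sites = ["ACGT", "ACGT", "ACGA", "ACGC"]
matrix = weight_matrix(toy_sites)
scores = [individual_information(site, matrix) for site in toy_sites]
average = sum(scores) / len(scores)
print(scores, average)
```

Under these assumptions the mean of the individual scores over the sites used to build the matrix equals the total conserved information of the set, which is the sense in which the Shannon measure is an average of individual components.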

To help visualize these results, we invented two graphical methods: sequence logos [13], which display the average over a set of binding sites, and sequence walkers [34, 35, 36, 37], which display individual sequences. These graphics have revealed many interesting details of a variety of binding sites and are now used by researchers around the world. They allow rapid and quantitative visualization of genetic regions, detection of database errors, analysis of single nucleotide polymorphisms (SNPs) to distinguish polymorphisms from mutations (http://alum.mit.edu/www/toms/g863a.html) and quantitative genetic engineering of sequences. We have found a correlation between information measures of splice junctions and the severity of genetic diseases [37], and obtained a patent on this method [43].

For convenience, I divide my theoretical work into several levels. Level 0 is the study of genetic sequences bound by proteins or other macromolecules, briefly described above. The success of this theory suggested that other work of Shannon should also apply to molecular biology. Level 1 theory introduces the more general concept of the molecular machine, which dissipates energy to make choices. From this I was able to develop the concept of a machine capacity equivalent to Shannon’s channel capacity [15]. In Level 2, the Second Law of Thermodynamics is connected to the capacity theorem [16], and the limits on the functioning of Maxwell’s Demon become clear [25]. Level 3, the efficiency of molecular machines (often 70%), and Level 4, the explanation of that observed efficiency, are in preparation, though a short version [78] and a review [79] have been published. My next major goal is to understand Level 5, the coding of molecular machines, by investigating the detailed structure and motions of molecules from the viewpoint of information and coding theory.
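A standard way to express a link between the Second Law and information is a minimum energy cost of k_B T ln 2 joules per bit gained. The sketch below simply evaluates that bound; it is offered as an assumption about the form such a limit takes, not as a restatement of the derivation in [16], and the function name and the temperature of 310 K are illustrative choices.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant
AVOGADRO_PER_MOL = 6.02214076e23  # exact SI value of the Avogadro constant


def min_energy_per_bit(temperature_kelvin):
    """Minimum energy in joules dissipated per bit at the given temperature."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


# Assumed temperature: 310 K, roughly body temperature.
joules = min_energy_per_bit(310.0)
kj_per_mol = joules * AVOGADRO_PER_MOL / 1000.0
print(f"{joules:.3e} J per bit, {kj_per_mol:.2f} kJ/mol per bit")
```

Dividing the energy a molecular machine actually dissipates per operation by this bound gives an upper limit on the number of bits it could gain per operation, which is the kind of quantity an efficiency measurement compares against.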

