
# Tying This to Sequence Logos

Understanding the concepts of information and uncertainty is crucial to understanding how a sequence logo is designed and what it shows. On a logo, the horizontal axis represents the position within the aligned sequences, and the vertical axis represents the amount of information (in bits) at that position. So, if there is a letter "T" that is two bits tall at position 7 on a DNA logo, this tells us that however many sequences were analyzed, all or nearly all of them had a "T" at that position.

The sequence logo also stacks the letters at each position in order of frequency. The most common letter at a position is placed at the top of the stack, while the least common letters are placed at the bottom. Read across the logo, the letters on top are therefore the equivalent of the consensus sequence.

While the height of the stack is the information content at that position, the height of each letter within the stack is proportional to the frequency of that letter at that position. Take position +3 of the "Lambda cI and cro" logo (Fig. 1). Here the predominant base is guanine, but adenine also occurs. So the "G" is both taller than the "A" and stacked on top of it, meaning that guanine is more common than adenine, cytosine, or thymine at position +3.
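The stacking rule just described can be sketched in a few lines of Python. This is only an illustration, not the program that drew Fig. 1: the function name `letter_heights`, the example column, and the stack height of 1.5 bits are all hypothetical, chosen to show how a stack is divided among its letters in proportion to their frequencies.

```python
from collections import Counter

def letter_heights(column, stack_height):
    """Divide a stack of total height `stack_height` (in bits) among the
    letters of `column` in proportion to their frequencies.

    Returns (letter, height) pairs ordered from the bottom of the stack
    to the top, so the most common letter comes last (on top).
    """
    counts = Counter(column.upper())
    n = sum(counts.values())
    heights = {letter: stack_height * count / n
               for letter, count in counts.items()}
    # Sort by increasing height: rarest letter at the bottom.
    return sorted(heights.items(), key=lambda pair: pair[1])

# Hypothetical column: nine G's and one A, with an assumed stack height.
stack = letter_heights("GGGGGGGGGA", stack_height=1.5)
for letter, height in stack:
    print(f"{letter}: {height:.2f} bits")
```

With these made-up inputs the "A" sits at the bottom of the stack and the "G" on top, and the letter heights add up to the height of the whole stack.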

Looking now at column -9, you will see the same letters there ("gattttctcttt") that we used in the example for calculating equation (14). That was the calculation of $H_{after}$, the uncertainty remaining after the sites are found. Before the sites are found, the protein is not in contact with the DNA and all 4 bases are possible, so the uncertainty $H_{before}$ is 2 bits. Using equation (15), the information at position -9 is 2 - 1.42 = 0.58 bits. A small-sample correction [8] reduces this to the 0.38 bits you see in Fig. 1.
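For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation for column -9. The function name `uncertainty` is ours, and the small-sample correction of [8] is deliberately left out, so the result is the uncorrected value rather than the height drawn in Fig. 1.

```python
import math
from collections import Counter

def uncertainty(column):
    """Uncertainty in bits per position: H = -sum(f * log2(f)) over the
    letter frequencies f observed in `column`."""
    counts = Counter(column.lower())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

column = "gattttctcttt"          # the letters at position -9
h_before = math.log2(4)          # four equally likely bases: 2 bits
h_after = uncertainty(column)    # uncertainty after the sites are found
information = h_before - h_after # uncorrected information at this position

print(f"H_after     = {h_after:.2f} bits")      # 1.42
print(f"information = {information:.2f} bits")  # 0.58
```

The column holds one g, one a, two c's, and eight t's, which is why the uncertainty falls well below the 2 bits available for four equally likely bases.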

This method is ideal for analysis of binding sites on both DNA and mRNA, as well as for analyzing proteins. In a consensus sequence, the base at each position is merely the most common one appearing at that location, which suggests that each base of the binding site is of equal importance. However, the more highly conserved bases are usually the most important for binding. If a consensus were a good model, then all of the bases across the binding site would be of equal importance and their corresponding letters on the logo would be of equal height. With the exception of certain restriction enzymes, logos almost invariably show that this is not the case, because they display varying conservation at different points in binding sites. In the sequence logo of Lambda cI and cro binding sites, the difference in importance of each position can easily be seen in the ups and downs of the logo. (It is curious that the conservation alternates between high and low values, but this is not true for other binding sites, so whether it is significant to the biology of these sites is unknown.)

Also in the logo, you will notice that there are error bars on the top of each stack of letters. These bars, which look like the letter "I", show the likely error (1 standard deviation) in the height of the entire letter stack due to the limited sample size.

The cosine wave running above the logo represents the major and minor grooves of the DNA helix as seen from one side. The high points of the wave represent the major groove facing the protein, and the low points represent the minor groove facing the protein (with the major groove facing away). This wave is there merely to help you visualize the grooves of the DNA and is not some sort of measurement of how much information should be at the site. However, the height of the wave on the vertical axis is significant. In B-form DNA, two bits of information can be conserved by protein contacts approaching the DNA from the major groove, but only one bit of information can be conserved by contacts in the minor groove [15]. The logo demonstrates this effect, since the stacks of letters are not as high in the middle, where the minor groove faces the protein. An overlay diagram (Fig. 2) shows many sequence logos printed on top of each other, and as you can see, the height of the letter stacks rarely goes above the wave. Those stacks that do have error bars which allow for them to pass under the wave.
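To make the shape of this wave concrete, the following Python sketch draws a cosine that swings between one bit (minor groove facing the protein) and two bits (major groove facing the protein). Both parameters are assumptions for illustration only: the helical repeat of 10.6 base pairs per turn is a typical figure for B-form DNA that is not stated in the text above, and `major_peak`, the position where the major groove faces the protein, is a hypothetical value that would have to be chosen for each binding site.

```python
import math

def groove_wave(position, major_peak=0.0, period=10.6):
    """Height of the groove wave (in bits) at `position`.

    The wave is 2 bits where the major groove faces the protein
    (at `major_peak` and every `period` bases from it) and 1 bit
    half a turn away, where the minor groove faces the protein.
    `period` is an assumed helical repeat for B-form DNA.
    """
    phase = 2 * math.pi * (position - major_peak) / period
    return 1.5 + 0.5 * math.cos(phase)

# Sample the wave across a hypothetical 21-base site centered on 0.
for position in range(-10, 11):
    print(f"{position:+3d}  {groove_wave(position):.2f} bits")
```

Comparing each stack height with the wave at the same position is then a quick way to see which stacks stay under the curve.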

Now that we've cleared that up, let's take a look at what's happening in that peculiar minor groove. The reason for the depression in the middle of these logos is that if a protein is using contacts to the minor groove, it is difficult or impossible to determine base pair orientation. That is, it is possible to distinguish an A-T pair from a G-C pair, but it is not possible to determine whether an A-T pair is oriented as A-T or T-A, or whether a G-C pair is oriented as G-C or C-G. The structure of the bases and the phosphate backbone which make up the DNA is such that the minor groove only allows a distinction between the two possible base pairs (a one bit decision) but not their orientation. The major groove allows not only the distinction between A-T and G-C, but also the orientation of the individual base pairs. You may have noticed that the cosine wave is at a height of one bit in the minor groove. This is because, as we said earlier, only one binary question can be answered in the minor groove, so only one bit of information can be obtained there. In the major groove, the cosine wave has a height of two bits, since two binary questions can be answered there. And now with this last loose end tied up, your sequence logo lesson is complete.

Other strings can be analyzed by this method [5]. For example, in the logo for an English dictionary (Fig. 3), we can see that the first letter is predominantly a consonant (s, c or p), the second letter is a vowel, and the third is again a consonant. Curiously, E trails over a hump for the remainder of the words. For the text of this paper the letter usage is different (Fig. 4). There are so many instances of "the" (7%) that they show up in the first three positions of the logo. The predominance of N and Y at position 11 is particularly telling: it indicates the prominent use of the words probability, uncertainty and information in this paper.

The sequence logo is a powerful tool for analyzing DNA, RNA, and protein sequences, and even words in a language [1]. It goes far beyond the old consensus method. The logo method of analysis reveals the importance of each position in a sequence, along with the importance of each base occurring at each position. A logo represents the amount of information present using a standard unit of measure, the bit, which allows for comparison of different types of sites.

Tom Schneider
2003-02-12
