# Recommendations for Making Sequence Logos

Thomas D. Schneider

Sequence logos are a graphical technique for displaying a summary of a set of aligned sequences. They were invented by Tom Schneider and his first high school student Mike Stephens. The original paper is available.

WebLogo is a web-based server to create sequence logos, written and supported by Steven Brenner and Gavin Crooks's groups. Although WebLogo is highly useful for biologists to generate logos, like any other tool it can be misused. Below are recommendations for proper use of the logos so that they provide useful data for further studies. Links to the Glossary are provided. Please follow the links provided for more detail on each point.

1. References: Please cite the original reference:
@article{Schneider.Stephens1990,
author = "T. D. Schneider
and R. M. Stephens",
title = "{Sequence logos: a new way to display consensus sequences}",
journal = "Nucleic Acids Res",
volume = "18",
pages = "6097--6100",
pmid = "2172928",
pmcid = "PMC332411",
year = "1990"}

so that your paper can be tracked in the literature.
2. Avoid Flat Frequency Logos.

Under most circumstances one should not use frequency logos because important biological information is lost. A 'frequency logo' or 'equallogo' is like a regular sequence logo but all stacks have the same height so that each letter height is proportional to the frequency of the corresponding nucleotide or amino acid. The height of the stack of the standard sequence logo represents the sequence conservation of the sequences measured in bits, a precise unique unit that is related (but not proportional) to the binding energy (edmm). This is a biologically important summary and if it is not given then a person cannot easily tell what parts are more important than other parts. Furthermore, the user will miss the beautiful sine wave on many conventional DNA sequence logos (gallery).
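The difference between the two kinds of logo can be sketched in a few lines. This is a minimal illustration, not the code used by any logo program: the function name `stack_heights` is hypothetical, the estimate is the uncorrected log2(s) - H, and the small-sample correction discussed under "Error bars" is deliberately left out. It assumes every character in the column belongs to the alphabet (no gaps).

```python
import math
from collections import Counter

def stack_heights(column, alphabet="ACGT"):
    """Return (information in bits, {letter: height}) for one aligned column.

    Uncorrected estimate R = log2(s) - H; no small-sample correction.
    """
    n = len(column)
    counts = Counter(column)
    freqs = {b: counts.get(b, 0) / n for b in alphabet}
    # Shannon uncertainty of the column, in bits
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    r = math.log2(len(alphabet)) - h
    # Sequence logo: each letter's height is its frequency times the information
    heights = {b: f * r for b, f in freqs.items()}
    return r, heights

# Two hypothetical columns: one fully conserved, one with equal base use
r_conserved, _ = stack_heights("AAAAAAAA")
r_mixed, _ = stack_heights("ACGTACGT")
print(r_conserved, r_mixed)
```

In a frequency logo both columns would be drawn as stacks of the same height, so the contrast between a fully conserved position and an uninformative one disappears.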

For example, the figure of a 'flat' logo for the RepA DNA binding protein from bacteriophage P1 (helixrepa) does not give much indication of anything special. However, the corresponding sequence logo shows two major clusters of sequence conservation in positions -1 to +3 and +11 to +13 with an additional strong conservation at +7 and +8. By placing a sine wave over the logo that has a wavelength of 10.6 bases, it becomes clear that the two big patches of sequence conservation are one turn of double stranded DNA apart. That RepA protein binds to the face of the DNA with the two strong conservation patches was subsequently confirmed experimentally (Papp et al 1993, Papp and Chattoraj 1994), as indicated by the solid (instead of dashed) sine wave. A protein binding to DNA through the major groove can distinguish up to 2 bits of information and this is consistent with the two large conservation patches. A protein binding to DNA through the minor groove cannot specify more than 1 bit (baseflip) so the T at position +7 violates B-form DNA. We proposed that RepA flips a base out of the DNA to initiate DNA replication and partially confirmed this experimentally (repan3). The discovery of base flipping to initiate DNA replication would not have occurred using frequency logos.

Indeed, using a frequency logo means that one will miss the original reason we invented sequence logos. We observed that human donor and acceptor splice sites have the same consensus sequence: CAG|GT. Yet the information conservation across the binding sites is not the same at each position. How could two RNA binding sites have the same consensus but be different? We invented sequence logos to visualize and resolve this paradox (Stephens.Schneider-splice1992).

The point of these examples is not to discourage people from using flat logos, but rather to show that there is a risk of missing important biological phenomena if the conservation of the binding site is not presented to the user.

Examples of "flat" logos:

• @article{Jager.Schmitz2009,
author = "D. Jager
and C. M. Sharma
and J. Thomsen
and C. Ehlers
and J. Vogel
and R. A. Schmitz",
title = "{Deep sequencing analysis of the \emph{Methanosarcina
mazei} G\"{o}1 transcriptome in response to nitrogen availability}",
journal = "Proc. Natl. Acad. Sci. USA",
volume = "106",
pages = "21878--21882",
pmid = "19996181",
pmcid = "PMC2799843",
comment = "2013/03/05 15:24:00",
year = "2009"}
Reference, Figure with legend, Image (jpg)

• Differences in microRNA detection levels are technology and sequence dependent.
Leshkowitz D, Horn-Saban S, Parmet Y, Feldmesser E.
RNA. 2013 Feb 19. [Epub ahead of print]
PMID: 23431331

• @article{Humphreys.Preiss2012,
author = "D. T. Humphreys
and C. J. Hynes
and H. R. Patel
and G. H. Wei
and L. Cannon
and D. Fatkin
and C. M. Suter
and J. L. Clancy
and T. Preiss",
title = "{Complexity of murine cardiomyocyte miRNA biogenesis,
sequence variant expression and function}",
journal = "PLoS One",
volume = "7",
pages = "e30933",
pmid = "22319597",
pmcid = "PMC3272019",
year = "2012"}

• @article{Chou.Schwartz2011,
author = "M. F. Chou
and D. Schwartz",
title = "{Biological sequence motif discovery using motif-x}",
journal = "Curr Protoc Bioinformatics",
volume = "Chapter 13",
pages = "Unit 13.15--24",
pmid = "21901740",
year = "2011"}

• @article{Oman.vanderDonk2010,
author = "T. J. Oman
and W. A. {van der Donk}",
title = "{Follow the leader: the use of leader peptides to guide
natural product biosynthesis}",
journal = "Nat Chem Biol",
volume = "6",
pages = "9--18",
pmid = "20016494",
pmcid = "PMC3799897",
year = "2010"}
Figure 2

• @article{Viola.Gonzalez2012,
author = "I. L. Viola
and R. Reinheimer
and R. Ripoll
and N. G. Manassero
and D. H. Gonzalez",
title = "{Determinants of the DNA binding specificity of class I and
class II TCP transcription factors}",
journal = "J Biol Chem",
volume = "287",
pages = "347--356",
pmid = "22074922",
pmcid = "PMC3249086",
year = "2012"}
Figure 1

• @article{Ugolev.Schuldiner2013,
author = "Y. Ugolev
and T. Segal
and D. Yaffe
and Y. Gros
and S. Schuldiner",
title = "{Identification of conformationally sensitive residues
essential for inhibition of vesicular monoamine transport by the
noncompetitive inhibitor tetrabenazine}",
journal = "J Biol Chem",
volume = "288",
pages = "32160--32171",
pmid = "24062308",
pmcid = "PMC3820856",
year = "2013"}
Figure 4 and Figure 6

• @article{Ranjani.Goh2014,
author = "V. Ranjani
and S. Janecek
and K. P. Chai
and S. Shahir
and R. N. {Abdul Rahman}
and K. G. Chan
and K. M. Goh",
title = "{Protein engineering of selected residues from conserved
sequence regions of a novel Anoxybacillus alpha-amylase}",
journal = "Sci Rep",
volume = "4",
pages = "5850",
pmid = "25069018",
year = "2014"}
Figure 1

• @article{Borrok.Tsui2015,
author = "M. J. Borrok
and Y. Wu
and N. Beyaz
and X. Q. Yu
and V. Oganesyan
and W. F. Dall'Acqua
and P. Tsui",
title = "{pH-dependent Binding Engineering Reveals an FcRn Affinity
Threshold That Governs IgG Recycling}",
journal = "J Biol Chem",
volume = "290",
pages = "4282--4290",
pmid = "25538249",
pmcid = "PMC4326836",
comment = "2015/03/14 17:04:41 ",
year = "2015"}

Figure 3.
• Introduction to Computational and Systems Biology, MIT OpenCourseWare, Modeling Biological Function teaches flat logos which could prevent students from making discoveries. The course is part of Foundations of Computational and Systems Biology.
• @article{Jolma.Taipale2013,
author = "A. Jolma
and J. Yan and T. Whitington and J. Toivonen and K. R. Nitta and P.
Rastas and E. Morgunova and M. Enge and M. Taipale and G. Wei and K.
Palin and J. M. Vaquerizas and R. Vincentelli and N. M. Luscombe and
T. R. Hughes and P. Lemaire and E. Ukkonen and T. Kivioja and J.
Taipale",
title = "{DNA-binding specificities of human transcription factors}",
journal = "Cell",
volume = "152",
pages = "327--339",
pmid = "23332764",
year = "2013"}

• pmcid = "PMC1664720",
pubmed = "23422071",


3. Coordinates: Choose a sensible coordinate system. The zero coordinate is used in sequence walkers, so it is important to have a zero somewhere in the sequence. Usually we choose a well conserved base somewhere in the middle of the logo. See the glossary entry on binding site symmetry for further information.
4. Bits:
5. Range: Before producing a final figure, use a range larger than the region you are interested in. Example: -200 to +200 bases around a binding site. This allows you to see whether you have cut off part of a binding site. It will also make the noisiness of your logo clear, and the variation should be about the size of the error bars. This will help you to avoid overinterpreting the result.
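As a rough sketch of the range and coordinate recommendations, the following hypothetical helper pulls a fixed window around a chosen zero coordinate and refuses to return a shortened sequence. The name `extract_window` and its parameters are assumptions made for illustration. It handles only a linear sequence on one strand; circular genomes and complementary-strand sites would need more care.

```python
def extract_window(genome, zero_index, start=-200, end=200):
    """Return (bases, coordinates) for positions start..end around zero_index.

    genome     : str, the full sequence (0-based Python indexing)
    zero_index : index of the base that is assigned coordinate 0
    Raises ValueError rather than silently returning a shorter sequence.
    """
    lo = zero_index + start
    hi = zero_index + end
    if lo < 0 or hi >= len(genome):
        raise ValueError("requested range runs off the end of the sequence")
    bases = genome[lo:hi + 1]
    coords = list(range(start, end + 1))  # includes a zero coordinate
    return bases, coords

# Hypothetical 800-base sequence with the zero base at index 400
genome = "ACGT" * 200
bases, coords = extract_window(genome, 400)
print(len(bases), coords[0], coords[-1])
```

Every extracted window then has the same length and the same coordinates, so the columns of the alignment line up without further editing.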
6. Alignment:
• Report the alignment you used so that others can reproduce your logo!
• Give the exact source of each sequence (GenBank Accession number and version)
• Give the exact coordinates you used. Do not make your reader depend on the sequence to locate the sites. We have had cases where the given sequences in E. coli were ambiguous. This prevented us from extracting the sequences ourselves to analyze ranges around the site larger than those initially provided.
• Do not give partial sequences or variable length sequences (unless the sequence does not exist, as on the 5' end of an mRNA). That is, don't embed your model of the sites into the reporting of the alignment.
• A simple but precise way to express aligned sequences is with Delila instructions.
7. Number: Publish the number of sequences used to create the logo, preferably on the logo image itself. Providing a logo without indicating how many sequences are involved makes it impossible to judge how much to trust the image.
8. Information: Report the total information content of the logo. For DNA, RNA (and perhaps protein) binding sites, this is an important number called Rsequence. It is generally related to the size of the genome and number of sites. See the paper on Ev and run the Evj program to see how this works. The total information is also essential for computing the efficiency.
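The relation to genome size and number of sites can be sketched as follows, assuming the usual definition from Schneider1986 in which Rfrequency is the number of bits needed to pick out the sites from all possible positions. The function names and the example numbers are hypothetical, not data from any paper.

```python
import math

def r_frequency(genome_positions, n_sites):
    """Bits needed to locate n_sites among genome_positions possible positions."""
    return math.log2(genome_positions / n_sites)

def r_sequence_total(per_position_bits):
    """Total information of a logo: the sum of the per-position values in bits."""
    return sum(per_position_bits)

# Hypothetical example: 4 sites among 1024 possible positions
print(r_frequency(1024, 4))                     # 8.0 bits
print(r_sequence_total([2.0, 1.5, 0.5, 2.0]))   # 6.0 bits
```

Comparing the two numbers for a real set of sites is the kind of check that reporting the total information makes possible.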
9. Error bars: Publish error bars. Without these one cannot tell how good the logo is. Because of the small sampling effect, this is closely related to the number of sequences (see the appendix of Schneider1986). Publish the error on the total information too. The total error is important for computing the efficiency. See also:
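To get a feeling for the small sampling effect, one can simulate it. The sketch below is a Monte Carlo illustration, not the exact calculation given in the appendix of Schneider1986. It draws columns of n bases from equiprobable bases, for which the true information is zero, and reports the apparent information and its spread. All names are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

def column_information(column, alphabet_size=4):
    """Uncorrected information log2(s) - H of one column, in bits."""
    n = len(column)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - h

def sampling_noise(n, trials=2000, alphabet="ACGT", seed=0):
    """Mean and standard deviation of the apparent information in columns of
    n bases drawn from equiprobable bases (true information is 0 bits)."""
    rng = random.Random(seed)
    values = [
        column_information(rng.choices(alphabet, k=n), len(alphabet))
        for _ in range(trials)
    ]
    return statistics.mean(values), statistics.stdev(values)

for n in (5, 20, 100):
    mean, sd = sampling_noise(n)
    print(f"n={n:4d}  apparent information {mean:.3f} +/- {sd:.3f} bits")
```

The apparent information is positive even though the columns are random, and both the bias and the spread shrink as n grows. This is why the number of sequences and the error bars belong on the published logo.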
10. Symmetry: If your binding site is symmetric, publish a symmetrical sequence logo. See the LexA example at the top of the page. You can publish an asymmetric site for a dimeric protein if you can show statistically that the asymmetric site has more information or if it is correlated with a particular direction such as the direction of transcription. Arbitrary orientation in the alignment is not an acceptable practice.
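One common way to obtain a symmetrical logo, assumed here for illustration, is to include each site together with its reverse complement, with the alignment window centered on the axis of symmetry. The helper names are hypothetical.

```python
# Translation table for complementing DNA bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

def symmetrize(sites):
    """Return the sites together with their reverse complements."""
    return list(sites) + [reverse_complement(s) for s in sites]

print(symmetrize(["AACG"]))  # ['AACG', 'CGTT']
```

The doubled list contains twice as many sequences but no additional independent sites, so the number reported with the logo should state which of the two is meant.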
11. Sine wave: For DNA (and even RNA!) put a sine wave on the logo and align it with major and minor grooves. This makes interesting predictions! See the papers:
• oxyr - Reading of DNA Sequence Logos: Prediction of Major Groove Binding by Information Theory. See also How To Read Sequence Logos.
• baseflip - Bases that do not match the sine wave can represent abnormal structures or base flipping.
• repan3 - An experiment that suggests DNA base flipping by the bacteriophage P1 repA protein
• flexrbs - Ribosome binding sites in E. coli have a region 5' to the initiation codon, the Shine-Dalgarno (SD), that base pairs to the 3' end of the 16S rRNA forming a helix. The logo of the SD appears to follow a sine wave, implying a helical structure.
• flexprom - The sigma70 subunit of E. coli RNA polymerase can be aligned at the -35 region to a co-crystal structure. This allows determination of the face of the DNA where the -10 contacts the polymerase and reveals a base that is probably flipping out of the DNA at transcriptional initiation.
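A sine wave for overlaying on a DNA logo can be sketched as below. The 10.6 base wavelength and the 2 bit height come from the discussion of RepA above; the function names and the choice of crest coordinate are hypothetical, and where to place the crest is a judgment made from the logo itself.

```python
import math

def groove_wave(position, crest, wavelength=10.6, height=2.0):
    """Height in bits of a cosine wave that peaks at coordinate `crest`.

    The wave runs between 0 and `height` bits.
    """
    phase = 2 * math.pi * (position - crest) / wavelength
    return height / 2 * (1 + math.cos(phase))

def crests(first_crest, start, end, wavelength=10.6):
    """Coordinates of all wave crests between start and end (inclusive)."""
    k_min = math.ceil((start - first_crest) / wavelength)
    k_max = math.floor((end - first_crest) / wavelength)
    return [first_crest + k * wavelength for k in range(k_min, k_max + 1)]

# Hypothetical crest at +1: the next crest falls one helical turn later
print(crests(1.0, -5, 15))
```

Positions whose conservation rises well above the wave where it dips are the ones worth a second look, as in the base flipping examples cited above.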
12. Avoid consensus sequences: Despite the implication of the title of the original paper, sequence logos are NOT consensus sequences! Note that one can not only read the consensus sequence (most frequent base at every position) from the top of the logo but one can also read the anti-consensus sequence (least frequent base at every position) from the bottom of the logo. One can also read everything in between. So logos, in themselves, do not represent a consensus and it is inappropriate to call a logo a 'consensus'. See the paper Consensus Sequence Zen.
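The point can be made concrete with a small sketch that reads both the consensus and the anti-consensus from an alignment. The function name and the example sequences are hypothetical. Note that ties must be broken somehow (here by alphabet order), which is itself an arbitrary choice.

```python
from collections import Counter

def consensus_and_anticonsensus(sequences, alphabet="ACGT"):
    """Most frequent and least frequent base at every aligned position."""
    consensus, anti = [], []
    for column in zip(*sequences):
        counts = Counter(column)
        # Stable sort by descending count; ties keep alphabet order
        ranked = sorted(alphabet, key=lambda b: -counts.get(b, 0))
        consensus.append(ranked[0])
        anti.append(ranked[-1])
    return "".join(consensus), "".join(anti)

sites = ["ACG", "ACT", "AGT", "TCT"]  # hypothetical aligned sites
print(consensus_and_anticonsensus(sites))  # ('ACT', 'GTC')
```

Both strings throw away the frequencies at each position, which is exactly the information a sequence logo keeps.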
13. Publish the raw sequence data used to make the logo. This allows others to reconstruct your sequence logo and to make computations on it. Give the sources of the data. This can be supplementary material or made available on the web.
14. Notes on using the Weblogo 3 Server from the WebLogo 3 : User's Manual
• Using natural log on the y axis simply makes thinking about the results more difficult. See the Information Theory Primer appendix which shows how to do logarithms in your head by using base 2.
• Putting any energy units on the y axis would be a mistake unless you know the efficiency. Energy and information are different things. The easiest way to see this is to note that a coin can store exactly one bit of information. But a coin held above a table that might be flipping has both potential and kinetic energy. That energy must be dissipated to set the coin down as heads or tails. The amount dissipated can vary, but the minimum is determined by the second law of thermodynamics. So there is an inequality relationship between energy and information and to put an energy scale on a logo generated from symbols is incorrect.
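Both notes can be illustrated numerically. The sketch below converts a value computed with natural logarithms (nats) to bits, and computes the second-law lower bound on energy dissipated per bit, taken here as k_B T ln 2, the form used in the edmm and emmgeo papers. The function names and the temperature are assumptions for illustration.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K, exact SI value

def nats_to_bits(nats):
    """Convert information measured with natural logs to bits."""
    return nats / math.log(2)

def min_joules_per_bit(temperature_kelvin):
    """Lower bound k_B * T * ln(2) on the energy dissipated per bit gained."""
    return BOLTZMANN * temperature_kelvin * math.log(2)

def max_bits_for_energy(joules_dissipated, temperature_kelvin):
    """Upper bound on the bits obtainable from a given energy dissipation."""
    return joules_dissipated / min_joules_per_bit(temperature_kelvin)

print(nats_to_bits(math.log(4)))   # 4 equally likely bases, about 2 bits
print(min_joules_per_bit(300.0))   # joules per bit at an assumed 300 K
```

Because the bound is an inequality, a measured energy only limits the information from above. Converting a logo's bits into an energy axis would require knowing the efficiency, which the logo alone does not supply.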
15. Avoid Relative Entropy. If you use relative entropy, then your results WILL NOT BE BITS and so this is a serious mistake. The simplest way to see this is to consider the states of a coin. A coin has only two states - heads and tails. (We ignore the possibility of balancing on the edge as this will not be stable in noisy situations.) A coin can only store 1 bit of information. It cannot store more than 1 bit of information since there are only two states. For the four nucleotides, the maximum information is therefore log2 4 = 2 bits. Yet the relative entropy measure can give values more than this. It is clear that the information needed to describe the sequence patterns never takes more than 2 bits, so the relative entropy is not a measurement in bits. If you use relative entropy then your results will not be comparable to energy because that comparison depends on using state functions for both energy and bits and relative entropy is not a state function since it mixes up two different states. See the papers ccmm, edmm and emmgeo. Further the isothermal efficiency cannot be correctly computed. If you use relative entropy for biological sequences, other workers will have to throw out your work and start over from the raw sequences.
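A short numerical check shows how relative entropy can exceed the 2 bit limit. The background frequencies below are hypothetical and chosen only to make the effect visible; the function names are likewise illustrative.

```python
import math

def information_bits(freqs):
    """log2(s) - H for a position with s possible symbols, in bits."""
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(len(freqs)) - h

def relative_entropy(freqs, background):
    """Sum of f * log2(f / background) over the symbols that occur."""
    return sum(
        f * math.log2(f / background[b]) for b, f in freqs.items() if f > 0
    )

conserved_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
skewed_background = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}  # hypothetical

print(information_bits(conserved_a))                      # 2 bits, the maximum
print(relative_entropy(conserved_a, skewed_background))   # more than 2
print(math.log2(20))                                      # limit for 20 amino acids
```

With an equiprobable background the two measures agree. With a skewed background the relative entropy of a fully conserved position rises above log2 4, which no choice among four bases can require.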

For example, two papers on malarial proteins were published back-to-back in Science in which sequence logos were given for similar data. One paper (Hiller.Haldar2004) apparently used relative entropy and so showed an impossible amount of sequence conservation, near 5 bits for the 20 amino acids. To choose one object in 20 never takes more than log2 20 = 4.3 bits, see their Figure 2. The other paper (Marti.Cowman2004) did not cite the source of their method but it was presumably the original logo paper since the height of a fully conserved position is around 4.3 bits, see their Figure 1, and so the two logos show inconsistent heights. A reader could be left puzzled by the discrepancy. (Note also the lack of error bars on the figures.)

@article{Marti.Cowman2004,
author = "M. Marti
and R. T. Good
and M. Rug
and E. Knuepfer
and A. F. Cowman",
title = "{Targeting malaria virulence and remodeling proteins to
the host erythrocyte}",
journal = "Science",
volume = "306",
pages = "1930--1933",
pmid = "15591202",
year = "2004"}

@article{Hiller.Haldar2004,
author = "N. L. Hiller
and S. Bhattacharjee
and C. {van Ooij}
and K. Liolios
and T. Harrison
and C. Lopez-Estrano
and K. Haldar",
title = "{A host-targeting signal in virulence proteins reveals a
secretome in malarial infection}",
journal = "Science",
volume = "306",
pages = "1934--1937",
pmid = "15591203",
year = "2004"}


16. Beyond Sequence Logos: Sequence logos are only the first step towards understanding a pattern or binding site.
• The total information is significant (Schneider1986, ev).
• The patterns of base use give clues to the mechanism of DNA binding (helixrepa).
• Sine waves on a DNA logo predict the face of DNA being bound or show anomalies such as base flipping (see references above).
• Use information theory to look at individual sites using sequence walkers. Because it is based on information theory, this method has the advantage that the second law of thermodynamics provides a natural cutoff for the results, which are in bits for all systems.
• Compare the total information to the binding energy to determine the isothermal efficiency (emmgeo).
• Avoid other pitfalls in information theory and molecular information theory.
For a review see brmit.
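The individual information idea behind sequence walkers can be sketched as a weight matrix, taking the weight of base b at position l to be 2 + log2 f(b, l) bits for DNA. This is a simplified illustration: it omits the small-sample correction, and it assigns minus infinity to bases never seen at a position, which is a convenience of this sketch rather than the treatment used in the published method. All names and the example sites are hypothetical.

```python
import math
from collections import Counter

def weight_matrix(sites, alphabet="ACGT"):
    """Per-position weights log2(s) + log2 f(b, l), in bits."""
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({
            b: (math.log2(len(alphabet)) + math.log2(counts[b] / n))
               if counts[b] else float("-inf")  # unseen base: sketch-only choice
            for b in alphabet
        })
    return matrix

def individual_information(seq, matrix):
    """Sum the weights of the bases in seq: the information of one site."""
    return sum(weights[base] for base, weights in zip(seq, matrix))

sites = ["AC", "AC", "AG", "AT"]  # hypothetical aligned sites
m = weight_matrix(sites)
print([individual_information(s, m) for s in sites])  # [3.0, 3.0, 2.0, 2.0]
```

The average over the sites used to build the matrix equals the uncorrected total information of the logo, 2.5 bits in this example. Reading zero bits as the natural cutoff mentioned above is our interpretation for this sketch.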

Schneider Lab

origin:    2009 Jan 30
updated: 2015 Jul 01
