ev: Evolution of Biological Information
Experimenting with Evolution: A Guide to Evj

Evj is an evolutionary model that runs inside your computer using the Java language. This page is a beginner's guide to experimenting with Evj. For more information you can read the original scientific paper, "Evolution of Biological Information".

Evj models the evolution of genetic control systems. In all living organisms, genes are parts of the DNA that (usually!) code for proteins. The proteins do all kinds of things for the cell, such as controlling whether other proteins are made or not. The regulatory proteins do this by binding to the DNA and turning on or off the genes of the other proteins. The classical example of gene regulation is the Lac Operon. Here are some pointers to read more about it:

  1. Start the Program.
    Click Here to Start the Model
    • A starter window will appear. If something goes wrong, see the messages there or see the Evjava page for other ways to launch the program.
    • If all goes well, another window will appear: the Evj Display.
  2. I just want to see it go!!
    Ok ... in the new Evj Display window:
    Click Run
    You should see the letters at the bottom of the Evj Display change rapidly and lots of colored boxes appear and (maybe) disappear. (If nothing happened, did you click in the new window?)
  3. Woah!! What was that all about???
    Ok ... in the Evj Display window:
    Click Pause
  4. Layout of the Evj Display (If you don't have Evj running, you can look at this old screenshot, or a standard run of version 2.37 jpg or tiff.)
    1. Controls and Displays
      Genome
      A C G T . . .
      At the top of the Evj Display are controls and displays for the program and on the bottom of the Display you will see the genome of a creature that is inside your computer.
      1. Controls and Displays. You can learn about each of the controls by using the Tooltips. Just move your mouse pointer over each item in the control panel to learn more about that control or display. If you hold the pointer still, the tooltip will show up as some text inside a box.
      2. Genome. The genome is made up of colored letters A, C, G and T. The string of these letters is called a `sequence'. It is a model for a DNA sequence as found in living organisms. (The letters in genetic material are sometimes called 'bases'.) The coordinate system of the genome is given by numbers every 10 bases, with a '+' sign every 5 bases.
    2. Little colored boxes mark different parts of the genome. Blue or cyan boxes mark parts of a gene. Why do you think we chose blue? (Answer)
    3. The gene sequence is `translated' into a model of a protein that can search DNA sequence. We won't go into the details of how this is done yet, but it's described in the original scientific paper and we will discuss it below.
    4. There are three other kinds of colored boxes. Certain boxes are colored either red or green, and at the start they are probably all red. Both mark places on the genome where, if the protein were to bind, it would help the organism control its genome. There are thousands of examples of genetic controls like this known to scientists.
    5. A red box means that the protein won't bind to that spot. As in real living things, this is bad for the organism because it can't turn on or off its genes.
    6. A green box means that the protein will bind to that spot. As in real living things, this is good for the organism because it can control its genes.
    7. Red and green boxes can change color back and forth depending on whether the protein can bind there.
    8. A yellow box means that the protein binds to some other place in the genome. As in real living things, this is bad for the organism because it wastes protein or disturbs other genetic functions.

    Color    Meaning
    red      binding site, not functional
    green    binding site, functional
    yellow   binding at the wrong place
    blue     gene weight, positive
    cyan     gene weight, negative

  5. Mistakes. Every time there is a red or yellow box the organism has made a mistake: the protein either fails to bind where it is needed (red) or binds where it should not (yellow). The saturation (shading) of the box colors shows how strongly the protein model binds to the sequence, but this has no influence on the mistakes in the Ev model.
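    The counting rule above can be sketched in a few lines of Python. This is only an illustration, not Evj's own code (Evj is written in Java): the function name, the set of site positions and the toy protein that binds only TTT are all made up for the example, and the sketch assumes that every position of the genome is checked for binding.

```python
def count_mistakes(genome, site_starts, width, binds):
    """Count red and yellow boxes for one creature (illustrative sketch).

    site_starts: positions where the protein is supposed to bind.
    binds: function that says whether the protein binds a given sequence.
    """
    missed = 0  # red: a real site that the protein fails to bind
    wrong = 0   # yellow: binding at a place that is not a site
    for start in range(len(genome) - width + 1):
        bound = binds(genome[start:start + width])
        if start in site_starts and not bound:
            missed += 1
        elif start not in site_starts and bound:
            wrong += 1
    return missed + wrong


def toy_binds(sequence):
    """Made-up protein for the example: it binds only the sequence TTT."""
    return sequence == "TTT"


# Sites are expected at positions 0 and 3. The site at 0 is bound (green),
# the site at 3 is missed (red), and TTT at position 6 is bound by
# accident (yellow), so there are two mistakes.
print(count_mistakes("TTTAAATTTCCC", {0, 3}, 3, toy_binds))  # 2
```

    Note that the function returns only a count. How strongly the protein binds does not enter into it, which matches the remark above about color saturation.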
  6. Evolution by Natural Selection. When Evj is running, your computer carries a small population of these creatures. Only the one with the fewest mistakes is shown. The evolution cycle is:
    [Figure: The cycle of evolution as done in the Ev program. A circle of arrows runs from mutate to evaluate, sort, kill, replicate and back to mutate. An inner arrow running from evaluate to sort to kill is labeled 'selection'.]

    Mutate each creature by changing its genome randomly. For example, the program might change the letter at position 82 from an A to a T. Mutation happens everywhere in the genome: in the gene, in the binding sites, and in the spaces between. The location is random and the change is random. Note that mutations often occur during DNA replication in nature when the wrong base is inserted. Another mechanism is DNA damage, which can cause the wrong complementary base to be inserted.

    Mutations happen naturally, often from radioactive and other compounds in the environment, but you can speed up the process. If you smoke cigarettes, you put chemicals into your body that cause mutations in your cells' DNA. Eventually, some of your cells might lose their genetic growth controls, and will start to grow wildly. They could take over your lungs and kill you. This disease is cancer.
    Evaluate each creature: count the number of mistakes each one makes.
    This is the first step of selection.
    Some mutated lung cells will not do well, but others might be able to grow without normal controls. Most mutations wreck controls; however, on occasion the random changes will improve a control protein binding site. It's easier to wreck your (old fashioned!) television or radio by hitting it than to make it work, but on occasion you might be lucky. Of course electronic equipment is complex and hitting is blunt, so unless you jiggle a disconnected wire into place, hitting it is not likely to help. In contrast, binding sites are so simple that changing them may often make the protein bind better.
    Sort the creatures by their mistakes.
    This is the second step of selection.
     
    Kill half of the creatures, the ones that make the most mistakes.
    This is the final step of selection.
    You probably kill some of your lung cells by smoking. (Searching PubMed for 'lung cells death smoke' led to: Piperi C, Pouli AE, Katerelos NA, Hatzinikolaou DG, Stavridou A, Psallidopoulos MC. Study of the mechanisms of cigarette smoke gas phase cytotoxicity. Anticancer Res. 2003 May-Jun;23(3A):2185-90.)
    Replicate (make another copy of) the creatures that make the fewest mistakes. Lung cells will grow back if they survive the mutagens in the cigarettes. Cells that lose their genetic growth controls could grow much faster and then you would get cancer. That is, cancer is caused by evolution by natural selection of cells inside your body. It's your choice whether you want to help it along or not by increasing your mutation rate by smoking. Likewise, tanning exposes you to UV radiation from the sun that can mutate your DNA, and eating fried foods can too.

    The cycle is repeated every 'generation'.
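
    The five steps above can be written down as a short Python sketch. It is not the Evj source (which is Java) and it makes several assumptions that are labeled in the comments: each creature gets exactly one random base change per generation, and a made-up `toy_mistakes` function stands in for the real evaluation of binding sites. All function names are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(genome, rng):
    """Change one randomly chosen position to a different random base.

    Assumption for this sketch: one mutation per creature per generation.
    """
    pos = rng.randrange(len(genome))
    new_base = rng.choice([b for b in BASES if b != genome[pos]])
    return genome[:pos] + new_base + genome[pos + 1:]


def generation(population, count_mistakes, rng):
    """One turn of the cycle: mutate, evaluate, sort, kill, replicate."""
    mutated = [mutate(g, rng) for g in population]          # mutate
    scored = [(count_mistakes(g), g) for g in mutated]      # evaluate
    scored.sort(key=lambda pair: pair[0])                   # sort, fewest mistakes first
    survivors = [g for _, g in scored[:len(scored) // 2]]   # kill the worse half
    return survivors + survivors                            # replicate the survivors


def toy_mistakes(genome):
    """Made-up stand-in for the real evaluation: every base that is not T
    counts as one mistake."""
    return sum(1 for base in genome if base != "T")


rng = random.Random(1)
population = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(16)]
for _ in range(200):
    population = generation(population, toy_mistakes, rng)
print(min(toy_mistakes(g) for g in population))
```

    The evaluate, sort and kill lines together are the 'selection' part of the cycle. Skipping the kill and replicate steps, so that every mutated creature survives, is one way to imitate a run with selection turned off.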

  7. Questions
    1. If a creature makes fewer mistakes than all the others, will it survive? (Answer)
    2. How quickly would a creature that makes one fewer mistake than the others take over a population of 16 creatures? (Answer)
    3. Is 16 a reasonable population size? For comparison, how many bacteria are on a normal human? (Answer)

  8. Experiment Number 1: Flying through Evolution.
    In the control box
    Click Restart
    Click Run
    Crank up the Speed
    What colors do you see? What happens to the colors after a minute or so? Why? (How Speed is determined.)

  9. Experiment Number 2: The Effects of Selection.
    Perform Experiment Number 1. Then in the control box
    Click on the check box next to the word Selection.
    This turns off selection. What happens to the colors? Why? What happens if you turn selection back on again?

  10. Experiment Number 3: Flickering Bases - Sequence Conservation versus Neutral Drift.

    Here is a screenshot for reference. These are the results of a standard run to 10,000 generations on my machines. You should be able to get the same result just by repeating Experiment 1 and letting it go until it stops.

    Now take a look at the gene, which is marked with blue and cyan boxes. Each box is marked with a position number and a base (0A, 0C, 0T, etc.), followed by a number called a 'weight':
    0A -159        
    The gene consists of a set of these 'weights'. Blue is the color for negative weights, and cyan is the color for positive weights. Numbers closer to zero have less saturated colors. (Where do those numbers come from?) The boxes can be rearranged into a table:
              0     1     2     3     4     5
    A      -159  -386  -148  -326   -21  +363
    C      -450  +193  +127  +341   -71  -178
    G      -266   -28   -52   -10  -481  -149
    T      -151  -342  +510  -252  -187  -178
    This is called a 'weight matrix'. It is a model of how a protein binds to DNA. Suppose, for example, that we look at the sequence at coordinate +190 on the genome: C G T C T A. How will the protein react to this sequence? To find this out, we put the sequence under the table, so that
    C is under position 0 of the table,
    G is under position 1,
    T is under position 2,
    C is under position 3,
    T is under position 4,
    A is under position 5.
    Then we pick the number from the row corresponding to the letter:
    C is -450,
    G is  -28,
    T is +510,
    C is +341,
    T is -187,
    A is +363.
    Finally, we add these together to get +549. This is the number written on the site at position +190. If you look at the genome, you will see that there is one more number at the end of the gene called 'th', which stands for 'threshold'. In this case, the threshold is +300. If the sum of the weights is bigger than the threshold, then the protein model has found a binding site. For the site at +190, this is true, so the site is marked with a green box.
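
    The same lookup-and-add calculation can be checked with a short Python sketch. The weights and the threshold are copied from the example table above; the function names `site_score` and `binds` are made up for this illustration and are not taken from the Evj program.

```python
# Weight matrix from the example: one row per base, one column per position 0..5.
WEIGHTS = {
    "A": [-159, -386, -148, -326,  -21, +363],
    "C": [-450, +193, +127, +341,  -71, -178],
    "G": [-266,  -28,  -52,  -10, -481, -149],
    "T": [-151, -342, +510, -252, -187, -178],
}
THRESHOLD = 300  # the 'th' value at the end of the gene in the example


def site_score(sequence):
    """Add up the weight for each base at its position in the site."""
    return sum(WEIGHTS[base][pos] for pos, base in enumerate(sequence))


def binds(sequence):
    """The protein model binds when the sum is bigger than the threshold."""
    return site_score(sequence) > THRESHOLD


print(site_score("CGTCTA"))  # 549, the number written on the site at +190
print(binds("CGTCTA"))       # True, so the box is green
print(site_score("AAAAAA"))  # -677, far below the threshold
```

    Sliding a six-base window along the whole genome and calling `binds` at each position is one way to picture how the boxes get their colors.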
    1. The value of weight 2T is +510, so T is preferred at coordinate +2 in the binding sites. Is this true? (Note: this is the third base because we are counting from zero!) Look at the binding sites to find out. How many of them have a T in the third spot? (Answer)
    2. Perform Experiment Number 1 and then set the Number of cycles large so that the simulation runs continuously. (How do I set a number?) While the simulation is running, pick out a weight that is strongly positive (it will be colored a strong cyan). Now watch the corresponding position in the binding sites. For the weight 2T, which is +510 in the example screenshot, the corresponding bases are at +128, +136, +144 etc. What do you see? (Answer)
    3. Pick a position on the genome that is not part of the gene and is not in one of the binding sites. These are places that are not above any colored rectangle. (For example, position 189.) What do you see when you run the evolution quickly? Why? (Answer)
    4. While the simulation is running at top speed, position your mouse over the Selection check box but don't click it yet. Now find a well-conserved binding site base in the genome (i.e. one that has a high corresponding weight in the weight matrix and which is therefore stable). Watch this position while you click your mouse to turn off selection. What happens? Why? (Answer)
    5. Having watched the decay of sequence conservation, what happens if you turn selection on again? Is the base still conserved? Why or why not? Can you control this? (Answer)

    What is Neutral Drift? Neutral drift is the accumulation of changes to the genome that have little effect on survival. Kimura was the person who introduced the idea. Here's one of his papers:
    Kimura M, Crow JF. The number of alleles that can be maintained in a finite population. Genetics. 1964 Apr;49:725-38. PMID: 14156929

    Of course back in the 1960s they didn't have lots of sequence data so he made mathematical models. In contrast, the Ev model has an explicit genome and actual functions. A Google search for Kimura neutral gives a page by Gert Korthof who says:
    "I included this work of Kimura to show that a critique of Darwinism is possible, without being ridiculed or ignored by the scientific community."
    Intelligent design is being ridiculed and ignored because it is bad science. The ideas don't hold up to careful scrutiny and when the ideas are disproven, it is not acknowledged. Kimura's idea did hold up to scrutiny, and we can see it in the Ev program when running full blast. Just watch the regions outside a binding site flicker!

  11. Experiment Number 4: Understanding Aligned Sites.

    By now you have surely noticed the jumping piles of letters in the control and display region. Sorry to make you wait for an explanation about them!

    Suppose we list all of the sites vertically:

    [Figure: Aligned listing of 16 DNA sequences.]

    Here are what the columns mean:
    • g10k is just the name I gave to the sites. (g10k stands for generation 10,000)
    • The next column, starting with 126, 134 ... etc., is the first base of the sites in the genome.
    • '+' means that the sites are all in the same direction.
    • The next number is just the number of the site.
    • The sequence of the site follows.
    Above the sites are three lines that give the number of each position in the site (numbering starts at zero). You should be able to confirm this aligned listing of the sites by comparing the sequences above with the genome.

    How many T's are at position 2 in the sites? (Answer)

  12. Experiment Number 5: Understanding Sequence Logos.

    How can we represent this complex pattern of letters? The sequence logo can do this:

    [Figure: Sequence logo for 16 aligned sequences.]

    The positions of the logo correspond to the positions in the sites. So let's look at position 2. At that position is a stack of letters, with T on the top because it is the most frequent base at that position. Below the T is a C because that is the next most frequent base. Under that (if you look closely!) you will find an A (in green) and a G (in orange). The rule is that the height of each letter is proportional to the frequency of that base at that position in the site.

    What determines the height of the stack of letters? The answer is that we can measure how conserved the letters are in bits of information. This is a longer story than can be fully explained here, but here is a pointer to get you started with the definition of a bit.

    So, to summarize, the sequence logo shows you in a compact graphical form which parts of the binding site are conserved and precisely by how much.
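
    For readers who like to see the arithmetic, here is a rough Python sketch of how the stacks in a logo can be computed from a set of aligned sites. It assumes the usual simplified recipe: the stack height at a position is 2 bits minus the uncertainty (entropy) of the bases there, and each letter's height is its frequency times the stack height. It leaves out the small-sample correction that real sequence logo programs apply, the function name `logo_stacks` is made up, and the four short sites are invented for the example.

```python
import math
from collections import Counter


def logo_stacks(sites):
    """Return, for each position, (stack height in bits, letters bottom to top)."""
    n = len(sites)
    stacks = []
    for column in zip(*sites):  # walk through the aligned positions
        freqs = {base: count / n for base, count in Counter(column).items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        information = 2.0 - entropy  # 2 bits is the most that 4 bases can carry
        # Letter height = frequency times stack height; sorting puts the
        # most frequent base last, that is, on top of the stack.
        letters = sorted((f * information, base) for base, f in freqs.items())
        stacks.append((information, letters))
    return stacks


# Invented example with four aligned sites of three bases each.
for position, (height, letters) in enumerate(logo_stacks(["ACT", "ACT", "AGT", "ATC"])):
    print(position, round(height, 3), [base for _, base in letters])
```

    In this toy example position 0 is always A, so its stack is the full 2 bits tall, while position 1 is split between C, G and T and gets only half a bit.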

    Restart the simulation, set the "Cycles to run" to 1000 and click on Run. Watch the last base of the site in both the sequence logo and the genome. What happens? (Answer)


  13. Experiment Number 6: Watching Evolution with Sequence Logos.

    Restart the model, crank up the "Cycles to run" to at least 100,000, and click Run. Once a sequence logo has emerged, turn off selection. (The selection box may not respond. In that case, Pause, click the selection box and Run.) What happens to the logo? (Answer) What happens to the logo when you turn selection on again? (Answer)

  14. Experiment Number 7: How binding sites evolve: Rsequence and Rfrequency.

    The heights of the sequence logo stacks are in bits of information. It turns out that these are related to the size of the genome and the number of sites. To learn more about this, you can read the article The Nitty Gritty Bit.
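
    As a rough guide to that relationship, here is a small Python sketch. It uses the definition of Rfrequency from the original scientific paper as I understand it: the number of bits needed to pick out the sites from all the positions in the genome, log2(genome size / number of sites). The function name is made up, and the genome size of 256 bases is only an assumed example value to go with the 16 sites shown earlier.

```python
import math


def r_frequency(genome_size, number_of_sites):
    """Bits needed to locate the sites among all positions in the genome."""
    return math.log2(genome_size / number_of_sites)


# Assumed example: 16 sites in a genome of 256 bases.
print(r_frequency(256, 16))  # 4.0
```

    Rsequence is the total height of all the stacks in the sequence logo, so the two numbers can be compared directly: both are measured in bits.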


This page is: http://alum.mit.edu/www/toms/paper/ev/evj/evj-guide.html.
A tinyurl for this page is http://tinyurl.com/evolution-in-a-nutshell.
You can preview the tinyurl with http://preview.tinyurl.com/evolution-in-a-nutshell

Acknowledgments. Thanks to Pete Lemkin and Adam Diehl for useful comments on this page.

Problems? Comments? Please email me, Tom Schneider, at schneidt@mail.nih.gov.



Schneider Lab

origin: 2005 Jun 3
updated: 2012 Jan 01 version = 1.47 of evj-guide.html
