ev: Evolution of Biological Information:
Frequently Asked Questions
Briefly, what is Ev?
Ev is a computer program that allows one to model
the way that information is gained in living organisms
by natural selection.
The example used is the patterns in DNA
to which proteins bind to regulate genes.
This is a well-understood system and so it makes
a good demonstration of evolution.
Also, the mathematics is precise and gives quantitative
results that match the results seen in nature.
What is information?
To do real science, we need precise definitions.
The one used in the Ev program is Shannon's information measure.
This measure has been around since 1948
and it is well respected.
All modern communications and data storage systems,
including satellite communications,
are based on Shannon's theory.
This measure has been successfully used to study
many biological systems.
See the rest of this web site
for many examples.
Isn't everything 'information'?
No, that way lies madness.
Here's the definition I'm starting to use to distinguish
between physical phenomena and data or information
that can be measured using Shannon's method:
Phenomena are recorded as data (information) when the state of a
device (including things like rhodopsin in the eye and CCDs)
associated with a living organism is changed by the phenomena.
I'm only a few thousand years behind in this thought:
"By convention there is colour, by convention sweetness, by convention
bitterness, but in reality there are atoms and space."
--- Democritus - 400 B.C.E.
What information are we talking about in the Ev program?
There are two information measures.
The only information
measured in the Ev program
is from the patterns in the binding sites.
The measurement of the information is called Rsequence.
This is the
information that we are interested
in tracking in the Ev model.
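Here is a minimal sketch of how the information in a set of aligned binding sites can be measured with Shannon's method (the quantity called Rsequence elsewhere in this FAQ). It assumes a 4-letter DNA alphabet, so each position can carry at most 2 bits, and it leaves out the small sample correction that the real measurement needs. The function names and the example sequences are made up for illustration; they are not taken from the Ev source code.

```python
import math
from collections import Counter

def column_information(column):
    """Information (bits) at one aligned position: 2 - H for a 4-letter alphabet."""
    n = len(column)
    counts = Counter(column)
    # Shannon uncertainty H of the observed base frequencies at this position.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy

def rsequence(sites):
    """Sum the per-position information over equal-length aligned sites."""
    return sum(column_information(col) for col in zip(*sites))

# Hypothetical example data (not from a real genome):
conserved = ["ACGT"] * 8                     # every site identical
mixed = ["ACGT", "CGTA", "GTAC", "TACG"]     # every base equally common per column

print(rsequence(conserved))  # 8.0 (2 bits at each of 4 positions)
print(rsequence(mixed))      # 0.0 (no pattern at any position)
```

Perfectly conserved positions give 2 bits each, and positions where all four bases are equally frequent give zero.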
Doesn't the program contain that information from the start?
No. The program starts with random genetic sequences, and
when one measures the information in the binding sites
it is (approximately) zero.
(There will be small fluctuations.
See also the discussion on the Evolution Fairytale Forum.)
But isn't Rfrequency an information measure too?
Very perceptive of you.
Rfrequency is a measure of the information needed to locate sites
in the genome:
Rfrequency = -log2(γ / G)
Because Rfrequency is a function of
the size of the genome
(the number of potential binding sites is G)
and the number of sites (γ),
it is fixed when the model begins and (usually) is
not changed during an evolution run.
So it doesn't teach us about how
the information in the sites evolves.
However, Rsequence does evolve towards Rfrequency,
as you can see in
Figure 2b of the Ev paper
and in the figure to the right.
The dashed line shows Rfrequency, the green curve shows Rsequence.
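The Rfrequency formula is simple enough to check by hand, but here is a short sketch of it anyway. The function name is made up, and the genome sizes and site counts are illustrative values chosen to give round numbers, not a statement of what any particular Ev run used.

```python
import math

def rfrequency(genome_positions, num_sites):
    """Bits needed to pick out num_sites sites among genome_positions positions."""
    return -math.log2(num_sites / genome_positions)

# Illustrative values only (G = potential binding positions, gamma = sites):
print(rfrequency(256, 16))   # 4.0
print(rfrequency(1024, 16))  # 6.0
```

Because only G and γ enter the formula, the value is set before the first generation and stays put unless those parameters are changed.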
Aren't you surprised that the information gain Rsequence
is exactly as predicted by Rfrequency?
Did you know that
Rsequence would evolve to Rfrequency
when you first ran the program?
No, I ran the program to see if this would happen or not.
I was testing
my PhD thesis.
If the program had failed, my thesis would have been in jeopardy!
But you set up the size of the genome and the number of sites,
so didn't you put information into the organisms that way?
More precisely, the input parameters define Rfrequency,
which is determined by
information put into the program,
but that is not
the information being measured from the organisms.
Remember that we are measuring Rsequence
from patterns in the genome,
and this starts
out near zero bits, as you can see from the green curve in the graph.
Also, the size of the genome and number of required sites can be set
to a wide variety of values and yet Rsequence still
evolves towards Rfrequency.
This only happens by
replication, mutation and selection,
demonstrating that those factors
are necessary and sufficient
for information gain to occur.
Replication, mutation and selection
are necessary and sufficient
for information gain to occur.
This process is called evolution.
Where is the environment in this picture of evolution
of binding sites?
The size of the genome and number of required sites
are the 'environment' from the viewpoint of the
binding site recognizer.
This is, of course, an exact mirror of the situation in nature.
DNA recognition proteins
can be activated to bind
or blocked from binding
DNA by outside factors
but once that has taken place,
the recognizers function by locating
positions on the DNA.
So they are buffered from the external environment;
they only face the problem of locating their sites
on the genome.
The Ev organism recognizers have the same challenge.
Does the Special Rule smuggle information into the Ev program?
This question is answered in the on-line paper
Effect of Ties on the Evolution of Information by the Ev program.
Basically, changing the rule still gives an
information gain, so Dembski's prediction was wrong.
Has Dembski ever acknowledged this error?
Not to my knowledge.
Don't scientists admit their errors?
Yes, by publishing a retraction explaining what happened.
Don't you make errors too? Do you admit them?
Yes and Yes, see:
Schneider Lab Errata and Corrigenda.
If you had a different recognition method
would you get a different result?
No, so long as the recognition function
gives a finely graded and ordered response to input sequences.
In the Ev program, recognition is done using a
weight matrix of numbers, encoded in the genomes.
In nature, DNA is copied to RNA,
the RNA is translated
into a polypeptide and then the polypeptide
folds to make a protein.
Finally, the protein recognizes the binding sites
by physical interactions with the DNA.
We already know that when the recognition
method is the natural one, Rsequence is close
to Rfrequency.
Even these vastly different mechanisms give the same results,
so the answer is no.
However, you are quite welcome to put a different
recognition method into the
Ev program source code
and see what happens.
If you do that you might be able to publish the results!
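To make "a finely graded and ordered response" concrete, here is a minimal sketch of a weight matrix recognizer. It is not the Ev source code: the weights, the threshold, the test genome and the choice of "score at or above threshold" as the recognition rule are all assumptions made for illustration.

```python
BASES = "ACGT"

def score(weights, window):
    """Add up one weight per position, selected by the base found there."""
    return sum(weights[i][BASES.index(b)] for i, b in enumerate(window))

def recognized_positions(weights, threshold, genome):
    """Positions whose window scores at or above the threshold (assumed rule)."""
    width = len(weights)
    return [p for p in range(len(genome) - width + 1)
            if score(weights, genome[p:p + width]) >= threshold]

# Hypothetical 3-position matrix; columns are weights for A, C, G, T.
weights = [
    [3, -1, -1, -1],   # position 0 prefers A
    [-1, 3, -1, -1],   # position 1 prefers C
    [-1, -1, 3, -1],   # position 2 prefers G
]

# The response is graded: closer matches score higher.
print(score(weights, "ACG"))  # 9
print(score(weights, "ACT"))  # 5
print(score(weights, "TTT"))  # -3

print(recognized_positions(weights, 9, "TTACGTTACGAT"))  # [2, 7]
```

Any replacement recognizer you try should keep that graded property, so that a single mutation can make a sequence a slightly better or slightly worse match.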
Why don't you do a real biological experiment instead
of just a computer model?
The primary reason is that we don't have infinite
resources and time.
If you have the resources (a molecular biology lab),
are interested in doing an experiment, and would
like to discuss it
please contact me.
The second reason is that nature has already done
experiments, and we generally see
that Rsequence is close to Rfrequency in real examples.
The third reason is that many people have already done
related evolutionary experiments,
such as SELEX
and similar experiments
( J Am Chem Soc. 2004 Apr 28;126(16):5130-7.
Informational complexity and functional activity of RNA structures.
Carothers JM, Oestreich SC, Davis JH, Szostak JW.)
though to my knowledge
no one has tested whether Rsequence evolves to Rfrequency in these experiments.
If you were to change the Ev program by making
X into Y, then I predict that there won't
be an information gain.
Could you change the Ev program for me?
No. Don't be lazy,
go do it yourself!
The so-called random numbers really are not random,
they are made by an algorithm.
So is there 'information' imparted by the random number generator?
First, you can use a different random number series
by changing a parameter in the program.
You can also substitute in a different random number generator.
Finally, you could supply random numbers from a
physical source of randomness.
This is available from:
Genuine random numbers, generated by radioactive decay!
None of these changes should affect the results.
If they do, suspect that you have a bad random number generator!
Isn't the standard Ev mutation rate of one base change
per genome per generation excessive?
If you think about it (or try it yourself)
you will see that
if you slow it down
you get the same results:
Rsequence still will evolve towards Rfrequency.
Of course it will take longer to get the results.
Isn't the Ev mutation rate much higher than natural rates?
It's only 10-fold faster than HIV.
Interestingly, there are mutations in the bacteriophage T4
DNA polymerase that reduce mutation rates.
So the rate of mutation is itself under evolutionary
control (though not in the Ev program).
Won't a slower evolution take too long in nature?
For practical reasons we usually use
a tiny population in Ev, generally only 16 organisms.
In nature there are usually populations of millions.
For example, in the lab a single cubic centimeter (ml, a milliliter)
of E. coli culture can easily contain 10^8 cells.
(That's 100 million.)
With an error rate of one in 10^6
(i.e., one in a million)
at each genetic location,
there will be plenty of variation to drive evolution.
Notice that we have 6 billion people on the planet,
so there is lots of opportunity for us to continue evolving.
(Have you been wearing your seatbelt?
People who don't wear seatbelts are being selected against ...)
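The E. coli numbers above can be turned into a quick back-of-the-envelope calculation. This sketch assumes that every cell in the milliliter has copied its genome once and that errors at a given genetic location are independent between cells; both are simplifications, and the variable names are made up.

```python
cells_per_ml = 10**8             # 100 million cells in one milliliter
copies_per_error = 10**6         # one error per million copies of a location

# Expected number of cells carrying a change at any one chosen location.
expected_mutants = cells_per_ml // copies_per_error
print(expected_mutants)  # 100

# Chance that at least one cell in the culture changed at that location.
p_no_change = (1 - 1 / copies_per_error) ** cells_per_ml
p_at_least_one = 1 - p_no_change
print(p_at_least_one)
```

So under these assumptions each location in the genome is expected to be hit about a hundred times over in a single milliliter, which is what "plenty of variation" means here.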
If you had a reasonably sized genome would you find
that there won't be an information gain?
No. Don't be lazy,
go try it yourself!
But notice that it will take a lot more computation,
and the runs may take some years unless you write
a version that uses parallel processors.
Where did you get
that cool dinosaur picture?
It is copyrighted and is used with permission.
Do you believe in evolution?
No. I don't need to believe it.
It's blatantly written in tons of evidence.
See also: "Do you believe in Evolution?" and "riggins do you believe in evolution".
Is there an easy way to run the program myself without lots of work?
Yes. See:
Run an Evolutionary Model on Your Own Computer.
This is a Java version of the program, and it runs
on Suns, Macs, Windows and Linux (Ubuntu: you will need to install Java - follow the directions).
Can the mistakes be expressed as Type I and Type II errors?
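Yes.
Let the Null Hypothesis (Ho) be:
"there is no site at this position in the genome".
See: Type I and II errors in HyperStat Online.
The outcomes are color coded in the Java version:

                    | True state of null hypothesis
                    | Ho true               | Ho false
                    | = there is no site    | = there is a site
--------------------+-----------------------+--------------------
Reject Ho           | Type I error          | Correct
= site found        |                       |
--------------------+-----------------------+--------------------
Do not reject Ho    | Correct               | Type II error
= site not found    |                       |

The colors mark
sites found in the right place (correct),
sites missed from the right place (Type II error), and
sites found in the wrong place (Type I error).
The remaining case (no site found where there is no site)
is normally not displayed.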
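The four outcomes can be counted mechanically. Here is a sketch that compares the set of true site positions with the set of positions a recognizer reports; the function name, the dictionary keys and the example positions are made up for illustration. The number of mistakes discussed elsewhere in this FAQ corresponds to the Type I and Type II counts added together.

```python
def classify(true_sites, found_sites, genome_positions):
    """Count the four outcomes of testing Ho = 'no site here' at every position."""
    true_sites, found_sites = set(true_sites), set(found_sites)
    counts = {"found_right": 0, "type_I": 0, "type_II": 0, "not_displayed": 0}
    for p in range(genome_positions):
        is_site, found = p in true_sites, p in found_sites
        if found and is_site:
            counts["found_right"] += 1     # site found in the right place
        elif found:
            counts["type_I"] += 1          # site found in the wrong place
        elif is_site:
            counts["type_II"] += 1         # site missed from the right place
        else:
            counts["not_displayed"] += 1   # correctly nothing found
    return counts

# Hypothetical example: 3 real sites in a 16-position genome.
result = classify(true_sites={2, 7, 11}, found_sites={2, 5, 11}, genome_positions=16)
print(result)
# {'found_right': 2, 'type_I': 1, 'type_II': 1, 'not_displayed': 12}
print(result["type_I"] + result["type_II"])  # 2 mistakes
```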
Why do the sequence logos sometimes go below zero?
The computed information has to be corrected for small sample size.
In the method used, this correction can produce small negative values.
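Here is a sketch of how a small sample correction can push a value below zero. This FAQ does not say which correction is used, so the sketch assumes a common first-order approximation, (s - 1) / (2 n ln 2) bits for an alphabet of s symbols and n sequences, added to the measured uncertainty. Treat both the formula choice and the function name as assumptions.

```python
import math
from collections import Counter

def corrected_information(column, alphabet_size=4):
    """Information at one position with an approximate small sample correction."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Assumed first-order correction; uncertainty is underestimated in small samples.
    correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    return math.log2(alphabet_size) - (entropy + correction)

# 16 sequences with all four bases equally common: no real pattern,
# so the corrected value lands slightly below zero.
print(corrected_information("ACGT" * 4))

# 16 sequences that are all A: strongly conserved, a little under 2 bits.
print(corrected_information("A" * 16))
```

When a position has no real conservation the uncorrected value is already near zero, so subtracting the correction leaves a small negative number, and the logo dips below the axis.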
Why are there so few mistakes at the beginning?
I ran the program with a few different seeds, and the
best organism is at the first step already in a great
shape, with only around 20 mistakes. I think that is
not a reasonable starting state for the population;
the best organism at the first step should have at
least about 200 mistakes, if not be even closer to the
maximum number of mistakes. (Unfortunately, I cannot
modify the threshold to deal with that, and I am not
going to try more seeds either, since it does not
appear to go anywhere far from those values.)
You didn't say what your parameters were, but suppose that you have 16
sites and 64 organisms as in the standard Java run. Sorting gives the
best organism, of course, so right away you have a strong bias. Why
20? I guess that this is most easily "accomplished" by having a
weight matrix that does not recognize ANYTHING, or has little
recognition capability. If it didn't recognize anything there would
be exactly 16 mistakes. This could happen by having a very high
initial threshold. If it accidentally recognized 4 more sites in the
wrong locations that would account for your 20. This is a hypothesis
and so you can test it by looking closely at the organism that has
that situation. I do agree it is a somewhat curious effect. Would it
happen in nature? Sure. All that has to happen is a recognition
protein is duplicated (apparently a common occurrence since we see lots
of nearly identical genes in various organisms and the recombination
mechanism for doing this is pretty well understood). Then one copy
diverges so that it doesn't recognize much at all on the DNA. As it
then starts to locate a few spots, if it matches, WHOOSH, selection
takes over and it locks on. This effect occurs in Ev too of course.
What would happen if the threshold were forced to always be zero?
Evidently, it doesn't make much difference.
Why is the number of initial mistakes often the number of sites?
When the organisms are generated randomly at the beginning of a run,
some will have a high weight matrix threshold
and this means that their weight matrix cannot recognize anything.
In that case,
the non-sites are not recognized and the sites are missed too,
so the initial number of mistakes of the best organism
is the number of sites.
In a large enough initial random population of creatures, it is likely
that one has a high threshold. That creature is likely to make the
fewest mistakes and so that one is displayed.
Here's an example:
There were 64 creatures and 16 sites.
The Pascal version of the program was run 2000
times, once a second,
using the timeseed so that a different initial random number started
each run.
About 16% (317 in 2000) of the organisms had 16 mistakes initially.
About 1% (15 in 2000) of the organisms had fewer than 16 mistakes initially.
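The argument that an organism which recognizes nothing makes exactly as many mistakes as there are sites can be checked with a small sketch. This is not the Ev program: the genome length, matrix width, weight range, site positions and the "score at or above threshold" rule are all made-up assumptions. The only point is that a threshold above the highest possible score leaves every site missed and nothing falsely found.

```python
import random

BASES = "ACGT"

def count_mistakes(weights, threshold, genome, site_positions):
    """Mistakes = sites missed + positions wrongly recognized."""
    width = len(weights)
    found = {
        p for p in range(len(genome) - width + 1)
        if sum(weights[i][BASES.index(genome[p + i])] for i in range(width)) >= threshold
    }
    sites = set(site_positions)
    return len(sites - found) + len(found - sites)

rng = random.Random(1)  # any seed gives the same count below
genome = "".join(rng.choice(BASES) for _ in range(256))               # hypothetical size
weights = [[rng.randint(-10, 10) for _ in BASES] for _ in range(6)]   # hypothetical matrix
sites = list(range(0, 256, 16))                                       # 16 hypothetical sites

# A threshold above the best achievable score means nothing is recognized.
max_score = sum(max(row) for row in weights)
print(count_mistakes(weights, max_score + 1, genome, sites))  # 16
```

Lowering the threshold in this sketch lets the random matrix start recognizing positions, and with a random matrix those are far more likely to be wrong locations than sites, which is why the do-nothing organism tends to come out on top at the start.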
How is the distribution of threshold values related to the distribution
of mistakes initially? I presume the lower the threshold, the higher
the number of mistakes.
Yup! but it's not a strong effect - notice the regression line.
(Density plot of the same data.)
origin: 2005 May 24
updated: 2013 May 08
U.S. Department of Health and Human Services
National Institutes of Health
National Cancer Institute