Delila Program: rsim

# rsim program

## By downloading this code you agree to the Source Code Use License (PDF). Pascal source code: rsim.p (wget instructions) Instructions on compiling MacOS binary: rsim Alphabetic List of Delila Programs Delila Programs by Most Recent Update Please report broken links delilabundle.zip = All Programs and MacOS Binaries Copyright Statement for Delila Programs

### Documentation for the rsim program is below, with links to related programs in the "see also" section.

{   version = 2.26; (* of rsim.p 2016 Jan 16}

(* begin module describe.rsim *)
(*
name
rsim: Rsequence simulation

synopsis
rsim(rsimp: in, cmp: in, xyin: out, output: out)

files
rsimp: paramters to control the program:

n: number of sequences to use to generate each fbl(simulated)
rangelow,
rangehigh: low and high bounds of the range of the matrix

Rs: estimated value of Rsequence from the rsgra program

SD: Standard Deviation of Rs based on sample size from the rsgra
program.
This defines the range Rslower = Rs - SD; Rsupper = Rs + SD.

seed: a real number between 0 and 1 used to start the random
number generator.  The date and time is used if this
number is greater than 1.  If the number is less than 0
use whatever the system random generator initializes with.

Some random generators start at the same place.  To force
a particular starting seed, the program will run through
the random numbers until it finds one at the desired seed.
Since this can take a while (seconds) the option for using
seed < 0 speeds things up.  It should make little difference
in the results.

simulations: number of fbl(true) to make

Rtlower: lower limit to Rsequence(true) to work with.  This allows
one to remove the small ones and get on with the ones of interest.

Rtupper: upper limit to Rsequence(true) to work with.

selection: if the first character of the line is 's', then only
those points which fall in the Rslower to Rsupper range are put
into xyin.  (Ie, only the 'p' values.)  This allows very large
crunches to be done which don't create such a large xyin file.

cmp: composition file from comp program.  If it is empty, the program
will assume equiprobable bases.

xyin: output of the program, input to the xyplo program
column 1: values of R(simulated) that fall within the Rslower
and Rsupper range are indicated by a 'p', others by 'n'.
column 2: Rsequence(true)
column 3: Rsequence(simulated)
These are followed by a comment that reports the Rsequence and
its standard deviation.

output: messages to the user.  The same comment at the end of xyin
is printed to output.

description
Rsim stands for Rsequence-simulation.  The program generates a set of
Rsequence values to determine the variation of Rsequence for small sample
sizes.

Method.

A frequency table is constructed with zero information content, namely
it contains 0.25 in all positions (l) and bases (b).  This table,
fbltrue, is 'evolved' by altering the frequencies until it has an
information content Rsequence(true) (=Rtrue) in the range Rtlower to
Rtupper (with a flat distribution).

A set of n sequences is generated using the fbltrue probabilities, and
the information content, Rsimulated, is calculated for the set.  We
select out those Rsimulated values which fall within the range of the
Rs+/-SD.  This is repeated many times.  The distribution of Rtrue values
(which correspond to the selected Rsimulated values) represents the range
of possible information contents of frequency tables which could have
produced the observed results.  In this way, we bootstrap ourselves to
get the range.  Note that SD is only a measure of small sample size.

Use.

Run an information analysis of the sites.  This analysis determines n,
rangelow and rangehigh for the rsimp.  From the output of rseq (rsdata
file), determine Rs and SD over the same range.  Begin with only a few
simulations.  It is preferable to determine how long each simulation
takes using a timing program like the UNIX /usr/5bin/time, so that the
time for the final simulations can be predicted.  Set Rtlower and
Rtupper wide at first to be sure to capture the whole distribution.
10,000 simulations is generally sufficient for the final analysis,
assuming that Rtlower and Rtupper are not too big.

Graph the results with the xyplo program, using the rsim.xyplop
file for parameters.  The output looks like:

Rsimulated
|                    .  . .
|                  .  .
|                 .  .
Rs + SD     |              o  o
Rs          |             oo o
Rs - SD     |           ooo
|         ..
|        .
|    .
|  .
| ..
----------------------------
Rtrue

^                       ^
Rtlower                 Rtupper

The program choses a random number between Rtlower and Rtupper, Rtrue.
Then it creates the fbltrue matrix with all 0.25 values.  This places
Rtrue at 0 initially.  The matrix is evolved up to the current Rtrue
value.  Therefore the set of all fbltrue matricies should have a flat
information content distribution.  YOU MUST CHECK THAT THIS IS TRUE!!
Copy the xyin file to the name 'data' and use the genhis program with
these parameters:

c 2
x n 30

to get a histogram of the distribution of Rtrue, coming from column 2 of
the file.  The distribution should be reasonably flat over the entire
region of the small circles (o) above.  If it is not, you must determine
what is wrong before continuing.

Those small circles represent the range that Rs +/- SD slices
horizontally from the distribution of Rtrue versus Rsimulated.  Recall
that an each Rtrue leads to an fbltrue from which a single simulation of
n binding sites is created; the information content of that is
Rsimulated.

So we want the distribution of Rtrue within the bounds of the slice.  To
do this, we select that slice for analysis.  In UNIX, we pull out all
lines from xyin which have 'p' in them (p means: "plot this").  Use:

grep p xyin > data

Then run genhis with these parameters:

c 2
p g
x n 30

Notice how well or poorly the plotted gaussian ("p g") fits your
distribution.  If it is a good fit you are done.  You can use the
standard deviation which genhis provides.  Use the original Rsequence
value for the mean.   (The mean found on the genhis listing this way
will be approximately Rsequence, but it has been created by passage
through the simulation, so is not as good as the orginal data.)
Alternatively, use the values from the final report at the end of the
xyin file (and shown to output).  NOTE: do NOT use the SEM value, use
the SD value, since this simulation is to determine the standard
deviation of Rsequence.

examples

rsimp:

1800       n: number of sequences to generate each fbl(simulated)
-3         rangelow: low bound of the range of the matrix
+6         rangehigh: high bound of the range of the matrix
8.036      Rs: estimated value of Rsequence
0.006      SD: variance of Rs based on sample size
0          seed: random number seed.  <0,>1 => timeseed used
10000      simulations: number of fbl(true) to make
7          Rtlower: lower limit to Rsequence(true) to work with.
9          Rtupper: upper limit to Rsequence(true) to work with.
no         selection of values within Rs+/-SD

documentation

@article{Schneider1986,
author = "T. D. Schneider
and G. D. Stormo
and L. Gold
and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}

@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}

rsimp, rseq.p, xyplo.p, genhis.p, rsim.xyplop, genhisp.rsim

author
Thomas D. Schneider

bugs
Does not handle di-nucleotides or longer oligos

technical notes
Constants maxsize (procedure calehnb) and kickover (procedure
makehnblist) determine the largest n for which e(hnb) is used.  Above
this, ae(hnb) is used.  Do not set these below 50 without careful
analysis.  Other constants are in module rsim.const.

Although it is possible to create more than one Rsimulated from
each Rtrue, this causes vertical streaks on the graph, and so
will distort the simulation.  It's better to get a completely
clean one each time.

Originally, a psudo random generator was used to create fbltrue from a
random matrix (rather than 0.25) but this causes problems because such a
matrix contains information and so low information points are under
represented and higher ones over represented.  This distorts the
statistics!

The program contains a portable random number generator.  Unfortunately
this can be 10 times slower than the non-portable one available on most
systems.  The procedure rnd allows one to switch between the two.  When
the system generator is used, one may find that the random numbers repeat
exactly from one run to the next.  The seed parameter would not affect
the results.  To avoid this problem, the random number generator is run
until the requested seed is produced, within the tolerance given by the
constant seedtolerance.  The runs are displayed on the output.

equiprobable values.  It would still have to be evolved.  However, the
evolution would have to allow for loss of information in the matrix,
otherwise the distribution of Rtrue would not be flat.  (It would need
to work with a range rather than a bound.)  This would have the
advantage of making the resulting matricies "like" the original.  This
might give a better simuation, but it has not been tried.

*)
(* end module describe.rsim *)
{This manual page was created by makman 1.45}