Delila Program: rsim

rsim program

Documentation for the rsim program is below, with links to related programs in the "see also" section.

{version = 2.24; (* of rsim.p 1996 April 12}

(* begin module describe.rsim *)
(*
name
      rsim: Rsequence simulation

synopsis
      rsim(rsimp: in, cmp: in, xyin: out, output: out)

files
      rsimp: paramters to control the program:

         n: number of sequences to use to generate each fbl(simulated)
         rangelow,
         rangehigh: low and high bounds of the range of the matrix

         Rs: estimated value of Rsequence from the rsgra program

         SD: Standard Deviation of Rs based on sample size from the rsgra
            program.
            This defines the range Rslower = Rs - SD; Rsupper = Rs + SD.

         seed: a real number between 0 and 1 used to start the random
            number generator.  The date and time is used if this
            number is greater than 1.  If the number is less than 0
            use whatever the system random generator initializes with.

            Some random generators start at the same place.  To force
            a particular starting seed, the program will run through
            the random numbers until it finds one at the desired seed.
            Since this can take a while (seconds) the option for using
            seed < 0 speeds things up.  It should make little difference
            in the results.

         simulations: number of fbl(true) to make

         Rtlower: lower limit to Rsequence(true) to work with.  This allows
            one to remove the small ones and get on with the ones of interest.

         Rtupper: upper limit to Rsequence(true) to work with.

         selection: if the first character of the line is 's', then only
            those points which fall in the Rslower to Rsupper range are put
            into xyin.  (Ie, only the 'p' values.)  This allows very large
            crunches to be done which don't create such a large xyin file.

      cmp: composition file from comp program.  If it is empty, the program
         will assume equiprobable bases.

      xyin: output of the program, input to the xyplo program
         column 1: values of R(simulated) that fall within the Rslower
                   and Rsupper range are indicated by a 'p', others by 'n'.
         column 2: Rsequence(true)
         column 3: Rsequence(simulated)
         These are followed by a comment that reports the Rsequence and
         its standard deviation.

      output: messages to the user.  The same comment at the end of xyin
         is printed to output.

description
      Rsim stands for Rsequence-simulation.  The program generates a set of
      Rsequence values to determine the variation of Rsequence for small sample
      sizes.

      Method.

      A frequency table is constructed with zero information content, namely
      it contains 0.25 in all positions (l) and bases (b).  This table,
      fbltrue, is 'evolved' by altering the frequencies until it has an
      information content Rsequence(true) (=Rtrue) in the range Rtlower to
      Rtupper (with a flat distribution).

      A set of n sequences is generated using the fbltrue probabilities, and
      the information content, Rsimulated, is calculated for the set.  We
      select out those Rsimulated values which fall within the range of the
      Rs+/-SD.  This is repeated many times.  The distribution of Rtrue values
      (which correspond to the selected Rsimulated values) represents the range
      of possible information contents of frequency tables which could have
      produced the observed results.  In this way, we bootstrap ourselves to
      get the range.  Note that SD is only a measure of small sample size.

      Use.

      Run an information analysis of the sites.  This analysis determines n,
      rangelow and rangehigh for the rsimp.  From the output of rseq (rsdata
      file), determine Rs and SD over the same range.  Begin with only a few
      simulations.  It is preferable to determine how long each simulation
      takes using a timing program like the UNIX /usr/5bin/time, so that the
      time for the final simulations can be predicted.  Set Rtlower and
      Rtupper wide at first to be sure to capture the whole distribution.
      10,000 simulations is generally sufficient for the final analysis,
      assuming that Rtlower and Rtupper are not too big.

      Graph the results with the xyplo program, using the rsim.xyplop
      file for parameters.  The output looks like:

               Rsimulated
               |                    .  . .
               |                  .  .
               |                 .  .
   Rs + SD     |              o  o
   Rs          |             oo o
   Rs - SD     |           ooo
               |         ..
               |        .
               |    .
               |  .
               | ..
               ----------------------------
                        Rtrue

                 ^                       ^
                 Rtlower                 Rtupper

      The program choses a random number between Rtlower and Rtupper, Rtrue.
      Then it creates the fbltrue matrix with all 0.25 values.  This places
      Rtrue at 0 initially.  The matrix is evolved up to the current Rtrue
      value.  Therefore the set of all fbltrue matricies should have a flat
      information content distribution.  YOU MUST CHECK THAT THIS IS TRUE!!
      Copy the xyin file to the name 'data' and use the genhis program with
      these parameters:

c 2
x n 30

      to get a histogram of the distribution of Rtrue, coming from column 2 of
      the file.  The distribution should be reasonably flat over the entire
      region of the small circles (o) above.  If it is not, you must determine
      what is wrong before continuing.

      Those small circles represent the range that Rs +/- SD slices
      horizontally from the distribution of Rtrue versus Rsimulated.  Recall
      that an each Rtrue leads to an fbltrue from which a single simulation of
      n binding sites is created; the information content of that is
      Rsimulated.

      So we want the distribution of Rtrue within the bounds of the slice.  To
      do this, we select that slice for analysis.  In UNIX, we pull out all
      lines from xyin which have 'p' in them (p means: "plot this").  Use:

          grep p xyin > data

      Then run genhis with these parameters:

c 2
p g
x n 30

      Notice how well or poorly the plotted gaussian ("p g") fits your
      distribution.  If it is a good fit you are done.  You can use the
      standard deviation which genhis provides.  Use the original Rsequence
      value for the mean.   (The mean found on the genhis listing this way
      will be approximately Rsequence, but it has been created by passage
      through the simulation, so is not as good as the orginal data.)
      Alternatively, use the values from the final report at the end of the
      xyin file (and shown to output).  NOTE: do NOT use the SEM value, use
      the SD value, since this simulation is to determine the standard
      deviation of Rsequence.

examples

      rsimp:

1800       n: number of sequences to generate each fbl(simulated)
-3         rangelow: low bound of the range of the matrix
+6         rangehigh: high bound of the range of the matrix
8.036      Rs: estimated value of Rsequence
0.006      SD: variance of Rs based on sample size
0          seed: random number seed.  <0,>1 => timeseed used
10000      simulations: number of fbl(true) to make
7          Rtlower: lower limit to Rsequence(true) to work with.
9          Rtupper: upper limit to Rsequence(true) to work with.
no         selection of values within Rs+/-SD

documentation

@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}

@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
  and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}

see also
      rsimp, rseq.p, xyplo.p, genhis.p, rsim.xyplop, genhisp.rsim

author
      Thomas D. Schneider

bugs
      Does not handle di-nucleotides or longer oligos

technical notes
      Constants maxsize (procedure calehnb) and kickover (procedure
      makehnblist) determine the largest n for which e(hnb) is used.  Above
      this, ae(hnb) is used.  Do not set these below 50 without careful
      analysis.  Other constants are in module rsim.const.

      Although it is possible to create more than one Rsimulated from
      each Rtrue, this causes vertical streaks on the graph, and so
      will distort the simulation.  It's better to get a completely
      clean one each time.

      Originally, a psudo random generator was used to create fbltrue from a
      random matrix (rather than 0.25) but this causes problems because such a
      matrix contains information and so low information points are under
      represented and higher ones over represented.  This distorts the
      statistics!

      The program contains a portable random number generator.  Unfortunately
      this can be 10 times slower than the non-portable one available on most
      systems.  The procedure rnd allows one to switch between the two.  When
      the system generator is used, one may find that the random numbers repeat
      exactly from one run to the next.  The seed parameter would not affect
      the results.  To avoid this problem, the random number generator is run
      until the requested seed is produced, within the tolerance given by the
      constant seedtolerance.  The runs are displayed on the output.

      One could start with a frequency table from an rsdata file instead of
      equiprobable values.  It would still have to be evolved.  However, the
      evolution would have to allow for loss of information in the matrix,
      otherwise the distribution of Rtrue would not be flat.  (It would need
      to work with a range rather than a bound.)  This would have the
      advantage of making the resulting matricies "like" the original.  This
      might give a better simuation, but it has not been tried.

*)
(* end module describe.rsim *)
{This manual page was created by makman 1.44}
{created by htmlink 1.55}
National Cancer Institute    National Institutes of Health    Health and Human Services    USA Gov - Official Web Portal    Viewing Files    Accessibility