By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.26; (* of rsim.p 2016 Jan 16} (* begin module describe.rsim *) (* name rsim: Rsequence simulation synopsis rsim(rsimp: in, cmp: in, xyin: out, output: out) files rsimp: paramters to control the program: n: number of sequences to use to generate each fbl(simulated) rangelow, rangehigh: low and high bounds of the range of the matrix Rs: estimated value of Rsequence from the rsgra program SD: Standard Deviation of Rs based on sample size from the rsgra program. This defines the range Rslower = Rs - SD; Rsupper = Rs + SD. seed: a real number between 0 and 1 used to start the random number generator. The date and time is used if this number is greater than 1. If the number is less than 0 use whatever the system random generator initializes with. Some random generators start at the same place. To force a particular starting seed, the program will run through the random numbers until it finds one at the desired seed. Since this can take a while (seconds) the option for using seed < 0 speeds things up. It should make little difference in the results. simulations: number of fbl(true) to make Rtlower: lower limit to Rsequence(true) to work with. This allows one to remove the small ones and get on with the ones of interest. Rtupper: upper limit to Rsequence(true) to work with. selection: if the first character of the line is 's', then only those points which fall in the Rslower to Rsupper range are put into xyin. (Ie, only the 'p' values.) This allows very large crunches to be done which don't create such a large xyin file. cmp: composition file from comp program. If it is empty, the program will assume equiprobable bases. xyin: output of the program, input to the xyplo program column 1: values of R(simulated) that fall within the Rslower and Rsupper range are indicated by a 'p', others by 'n'. column 2: Rsequence(true) column 3: Rsequence(simulated) These are followed by a comment that reports the Rsequence and its standard deviation. output: messages to the user. The same comment at the end of xyin is printed to output. description Rsim stands for Rsequence-simulation. The program generates a set of Rsequence values to determine the variation of Rsequence for small sample sizes. Method. A frequency table is constructed with zero information content, namely it contains 0.25 in all positions (l) and bases (b). This table, fbltrue, is 'evolved' by altering the frequencies until it has an information content Rsequence(true) (=Rtrue) in the range Rtlower to Rtupper (with a flat distribution). A set of n sequences is generated using the fbltrue probabilities, and the information content, Rsimulated, is calculated for the set. We select out those Rsimulated values which fall within the range of the Rs+/-SD. This is repeated many times. The distribution of Rtrue values (which correspond to the selected Rsimulated values) represents the range of possible information contents of frequency tables which could have produced the observed results. In this way, we bootstrap ourselves to get the range. Note that SD is only a measure of small sample size. Use. Run an information analysis of the sites. This analysis determines n, rangelow and rangehigh for the rsimp. From the output of rseq (rsdata file), determine Rs and SD over the same range. Begin with only a few simulations. It is preferable to determine how long each simulation takes using a timing program like the UNIX /usr/5bin/time, so that the time for the final simulations can be predicted. Set Rtlower and Rtupper wide at first to be sure to capture the whole distribution. 10,000 simulations is generally sufficient for the final analysis, assuming that Rtlower and Rtupper are not too big. Graph the results with the xyplo program, using the rsim.xyplop file for parameters. The output looks like: Rsimulated | . . . | . . | . . Rs + SD | o o Rs | oo o Rs - SD | ooo | .. | . | . | . | .. ---------------------------- Rtrue ^ ^ Rtlower Rtupper The program choses a random number between Rtlower and Rtupper, Rtrue. Then it creates the fbltrue matrix with all 0.25 values. This places Rtrue at 0 initially. The matrix is evolved up to the current Rtrue value. Therefore the set of all fbltrue matricies should have a flat information content distribution. YOU MUST CHECK THAT THIS IS TRUE!! Copy the xyin file to the name 'data' and use the genhis program with these parameters: c 2 x n 30 to get a histogram of the distribution of Rtrue, coming from column 2 of the file. The distribution should be reasonably flat over the entire region of the small circles (o) above. If it is not, you must determine what is wrong before continuing. Those small circles represent the range that Rs +/- SD slices horizontally from the distribution of Rtrue versus Rsimulated. Recall that an each Rtrue leads to an fbltrue from which a single simulation of n binding sites is created; the information content of that is Rsimulated. So we want the distribution of Rtrue within the bounds of the slice. To do this, we select that slice for analysis. In UNIX, we pull out all lines from xyin which have 'p' in them (p means: "plot this"). Use: grep p xyin > data Then run genhis with these parameters: c 2 p g x n 30 Notice how well or poorly the plotted gaussian ("p g") fits your distribution. If it is a good fit you are done. You can use the standard deviation which genhis provides. Use the original Rsequence value for the mean. (The mean found on the genhis listing this way will be approximately Rsequence, but it has been created by passage through the simulation, so is not as good as the orginal data.) Alternatively, use the values from the final report at the end of the xyin file (and shown to output). NOTE: do NOT use the SEM value, use the SD value, since this simulation is to determine the standard deviation of Rsequence. examples rsimp: 1800 n: number of sequences to generate each fbl(simulated) -3 rangelow: low bound of the range of the matrix +6 rangehigh: high bound of the range of the matrix 8.036 Rs: estimated value of Rsequence 0.006 SD: variance of Rs based on sample size 0 seed: random number seed. <0,>1 => timeseed used 10000 simulations: number of fbl(true) to make 7 Rtlower: lower limit to Rsequence(true) to work with. 9 Rtupper: upper limit to Rsequence(true) to work with. no selection of values within Rs+/-SD documentation @article{Schneider1986, author = "T. D. Schneider and G. D. Stormo and L. Gold and A. Ehrenfeucht", title = "Information content of binding sites on nucleotide sequences", journal = "J. Mol. Biol.", volume = "188", pages = "415-431", year = "1986"} @article{Stephens.Schneider.Splice, author = "R. M. Stephens and T. D. Schneider", title = "Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites", journal = "J. Mol. Biol.", volume = "228", pages = "1124-1136", year = "1992"} see also rsimp, rseq.p, xyplo.p, genhis.p, rsim.xyplop, genhisp.rsim author Thomas D. Schneider bugs Does not handle di-nucleotides or longer oligos technical notes Constants maxsize (procedure calehnb) and kickover (procedure makehnblist) determine the largest n for which e(hnb) is used. Above this, ae(hnb) is used. Do not set these below 50 without careful analysis. Other constants are in module rsim.const. Although it is possible to create more than one Rsimulated from each Rtrue, this causes vertical streaks on the graph, and so will distort the simulation. It's better to get a completely clean one each time. Originally, a psudo random generator was used to create fbltrue from a random matrix (rather than 0.25) but this causes problems because such a matrix contains information and so low information points are under represented and higher ones over represented. This distorts the statistics! The program contains a portable random number generator. Unfortunately this can be 10 times slower than the non-portable one available on most systems. The procedure rnd allows one to switch between the two. When the system generator is used, one may find that the random numbers repeat exactly from one run to the next. The seed parameter would not affect the results. To avoid this problem, the random number generator is run until the requested seed is produced, within the tolerance given by the constant seedtolerance. The runs are displayed on the output. One could start with a frequency table from an rsdata file instead of equiprobable values. It would still have to be evolved. However, the evolution would have to allow for loss of information in the matrix, otherwise the distribution of Rtrue would not be flat. (It would need to work with a range rather than a bound.) This would have the advantage of making the resulting matricies "like" the original. This might give a better simuation, but it has not been tried. *) (* end module describe.rsim *) {This manual page was created by makman 1.45}{created by htmlink 1.62}