Delila Program: multiscan

# multiscan program

### Documentation for the multiscan program is below, with links to related programs in the "see also" section.

{   version = 3.96; (* of multiscan.p 2013 Feb 07}

(* begin module describe.multiscan *)
(*

name
multiscan: scan a Delila system book for many multipart binding sites

synopsis
multiscan(book: in, ribl: in, scanp: in, histog: in, multiscanp: in,
data: out, scanfeatures: out, scaninst: out, output: out);

files

book: a book from the delila system

ribl: a concatenated file containing all the ribls of all elements used in
the scan.  The order of these ribls needs to match the order of
elements in the scanp file file, and the order of elements in the
multiscanp file.

scanp: a concatenated file containing scanp files for all of the
elements.  The scanp files are the same as with the scan program,
and a detailed description of these files is in scan.p.

multiscanp: the parameter file for multiscan.

Parameterversion: Real;   This is the version of the paramter file, if
it is less than the version used by the program, the program will not
work.

Maximum range:  Integer;   This is the range with which the sites have
to be inside for the them to be able to work together.  So if a minus
10, a minus 35, and an Activator do not fall within this range then
they will not be considered as a feasible combination of sites.  This
is used so that the user can limit the spacing histograms, you may want
to do a scan that has a smaller range than what your maximum histogram
spacings allow, without having to change your histograms, so you change
your maximum range to limit it.

Total Ri cutoff: Real;  The total information for a site has to be
above this value.  The equation of total information is given in
greater detail in the program description.

Filter toggle: char; This tells whether or not to do a filter based on
the strongest parent site.  This is the first site in the list of
sites.  So if your first site was a minus 10 site, followed by a -35,
and activator, it would find the best combination of -35 and activators
for the -10 and only report that combination.

'f' means filter.  'n' means don't filter.

Essential number cutoff: Real; So if a number is non-essential we only
include it if it contributes positive information.  If you want your
cutoff to be higher, you raise this value and the non-essential (info -
GS) has to be greater than this to be included.  If it is negative it
uses the scanp cutoff value only and doesn't subtract the gap, so if
the scanp allows for it to be a site, then it will be included, no
matter if it is positive or not, or if its GS is larger than it.

Individual model string: "string", (string), char, char, char,
char+integer; This is the string that defines how you treat each
model.  The paramters are as follows:

Model name: string; This is the name of the model.  This name needs
to be the same as that in the Ribl file.  The model name needs to be
in " ".

Parent name: string;  The name that this model points to.  This needs
to be the exact same name as the parent model.  The parent name needs
to be in ( ).

Essentiality: char;  Is it essential 'e' or non-essential 'n'.  See
above for a better explanation of these terms.

Orientation toggle: char;  Does the site have to have the same 's',
different 'd', or any 'n' orientation relative to its parent.  So the
minus 35 has to have the same orientation as the minus 10 so its
toggle would be 's'.  But, an activator could have either orientation
so its toggle would be 'n'.

GS toggle: char;  It tells whether or not to include the gap surprisal.
'g' you include the gap surprisal in the total information.
'n' you do not include the gap surprisal.

If the gap surprisal is turned off, then the program uses the
furthest range in the histog that has some value greater than
zero, and all bases in between, regardless of whether they are
above zero.  Also it will write out the GS as 'NA' in the lister
map, which means the GS is not counted (ie set to zero).

Bar toggle: char; If you want to write out the gap surprsial bar then
use the symbol 'b'.  Any other symbol will keep the GS bar from

Overlap toggle and overlap difference: char and integer;  This
controls the amount of overlapping that the site will allow.  You can
have some overlap, no overlap, or complete overlap.
'o' means allow overlap, no matter how much it overlaps.
'n' means allow no overlap at all.
'd' means allow overlap as long as the zero coordinates are further
than the overlap distance apart.

histog: concatenated histog files for each of the elements.  These files
need to be in the same order as the multiscanp, scanp, and ribl files.
Histog files need to be whole numbers and can be generated using
genhis.

data: The results.  Comments are lines that begin with '*'.  This file
contains information containing all of the components in the scan.  A
line is given for each element that gives the model name, coordinate,
matrix orienation, Ri evaluation, gap size, and gap surprisal.  For
each grouping the total information and the order of elements, and all
elements used are reported.

scanfeatures: The results in the "features" format for input to the lister
program.  This consists of comment lines (beginning with "*"),

@ U00096 3646159.0 -1 "p10" " 4.0 bits"     3.97    -0.0     0.0
definition lines (as shown above), and features of the form:

See program lister.p for details.

scaninst:  The results are given in the form of delila instructions:
name "dnaA"; piece K01789; get from 229 -100 to 229 +100 direction -;
The name is the long name in the book.
This is turned on by the last scanp parameter set in the scanp file.

output: messages to the user

description

Multiscan searches a Delila book with an information-theory based
flexible binding site model and reports the located features.

Multiscan is the 3rd upgrade of scan. The first upgrade was to be able
to handle two part models, and was called Discan.  Discan worked by
comparing scanfeature files of two different models with some spacing
histogram to compute the total information of the site based on the
information of the the two parts and the the Gap surprisal of the
spacing between the two parts.  Biscan did the same computation but
did not use the scanfeatures output, instead it scanned the sequence
directly.

The biscan algorithm was to find the first site, and then based on the
spacing histogram provided, looked to see if a second site was near
by, using the the same code that Scan uses.  If it found a suitable
site it would then report the site and all of the relevant information
about it.   Biscan is much faster and has more flexibility in control
than discan did, and therefore maded discan obsolete.

Discan and Biscan could only handle two part models, Multiscan can
handle an infinite number of models binding together with a large
number of controls over the binding characteristics of each model.
Multiscan can replace Biscan, but cannot replace scan (yet).

Here is a brief description of multiscan and the way that it can be
manipulated:

The models that are used by multiscan are either "essential" or
"non-essential".  Essential sites have to be present for a site to be
found (for example, in a promoter the minus 10 has to bind in order
for the polymerase to work).  Non-essential sites, are sites that do
not have be present for the system to work (for example and activator
does not have to be present for the RNAp to start initiation).  This
distinction between essential and non-essential sites is important in
understandind the multiscan algorithm.

Non-essential sites are only included if the site contributes to the
total information of the system.  That is, if some activator had an
individual information of 3 bits, but a gap surprisal value of 4 bits,
it would have an overall net nformation contribution of -1 bits to the
system, not improving the total strength of the site.  Therefore it
will not be included.

Anoter important concept is the idea of parent sites.  Since we based
the gap surprisal value on the distance between two sites, we have to
identify which model another model will bind relative to.  For
example, the minus 35 will bind relative to the minus 10 so its parent
is the minus 10.  Some activator will bind relative to the minus 35 so
its parent is the minus 35 and its grandparent is the minus 10.
Another activator may bind relative to the minus 35 so its parent is
also the minus 35 and can be thought of as the brother of the first
activator.  But like a parent, that molecule has to be bound in order
to have a child bind relative to it.  So if a parent does not find a
binding site, no children can bind.

The math for the calculating the total information of a site is the
same as presented in biscan and in the Shultaberger, Schneider
ribosome paper.  The only difference is that instead of two parts, we
now have unlimited parts.  So the math is:

Total info = Ri(UP) - GS(UP-35) + Ri(35) + Ri(10) - GS(10-35)  if
Ri(UP) - GS(UP-35) > 0 bits

else

Total info = Ri(10) + Ri(35) - GS(10-35).

The lister maps are fairly complicated now.  Since there are so many
components which could be part of a model, we used the following
techniques to help group them.  First:  All connecting bars have the
coordinate of the parent molecule.  That is if you had a p10, p35, and
Fis and p10 was the first site in your model (in the paramter file,
not spacially on the DNA), the the connector bars for all the other
sites would contain its coordinate in their feature name tags.

Second: The total information line has the names of all the components
used in the model, and the order that they fall on the DNA.  So if Fis
was the most upstream site, followed by p35, and then p10 the line

You use catted scanp files, catted ribl files, and catted histog files
as well as a multiscanp file.

examples

an example multiscanp file:

1.03    version of multiscan that this parameter file is designed for.
200     maximum range in bases
5.0     total Ri cutoff
f       f means filter out duplicates of the features
0       essential number cutoff (Ri - gap), if negative use scanp cutoff, no gap.
"p10" (null) e s n n o3  parent, essential or non, orient tog, use GS, bar, overlap tog
"p35" (p10)  e s g n o3  parent, essential or non, orient tog, use GS, bar, overlap tog
"Fis" (p35)  e s n b o3  parent, essential or non, orient tog, use GS, bar, overlap tog
.

The order of gap surprisal numbers is now (2005 Sep 1) defined to be:
gapsurp - gap surprisal (in Ri position)
MinGapSurprisal - minimum gap (in P position)
MaxGapSurprisal - maximum gap (in Z position)
This makes the computation of the sum easier.

documentation

@article{Schneider.Ri,
author = "T. D. Schneider",
title = "Information Content of Individual Genetic Sequences",
journal = "J. Theor. Biol.",
volume = "189",
number = "4",
pages = "427-441",
note = "http://www-lecb.ncifcrf.gov/$\sim$toms/paper/ri/",
comment = "indiv.tex",
comment = "Submitted, April 1997",
year = "1997"}

@article{Schneider.walker,
author = "T. D. Schneider",
title = "Sequence Walkers:
a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences",
journal = "Nucl. Acids Res.",
volume = "25",
comment = "walker.tex, November 1, issue 21",
note = "http://www-lecb.ncifcrf.gov/$\sim$toms/paper/walker/,
erratum: NAR 26(4): 1135, 1998",
pages = "4408-4415",
year = "1997"}

@article{Shultzaberger.Schneider2001,
author = "R. K. Shultzaberger
and R. E. Bucheimer
and K. E. Rudd
and T. D. Schneider",
title = "{Anatomy of \emph{Escherichia coli}
Ribosome Binding Sites}",
journal = "J. Mol. Biol.",
volume = "313",
pages = "215-228",
comment = "Shultzaberger.Schneider.flexrbs",
{http://www.lecb.ncifcrf.gov/\~{}toms/paper/flexrbs/}
{http://www.lecb.ncifcrf.gov/\~{}toms/paper/flexrbs/}",
year = "2001"}

@article{Shultzaberger.Schneider-flexprom2006,
author = "R. K. Shultzaberger
and Zehua Chen
and Karen A. Lewis
and T. D. Schneider",
title = "{Anatomy of \emph{Escherichia coli} $\sigma^{70}$ promoters}",
journal = "Nucleic Acids Res.",
volume = "35",
pages = "771--788",
{http://www.ccrnp.ncifcrf.gov/\~{}toms/paper/f{l}exprom/}
{http://www.ccrnp.ncifcrf.gov/\~{}toms/paper/f{l}exprom/}",
year = "2006"}

sites.p, ri.p, genhis.p, lister.p, dnaplot.p

The first published paper using the program:
http://www.ccrnp.ncifcrf.gov/~toms/paper/flexprom/

This example lister map using multiscan output
shows multiple sequence walkers:

author
Ryan Kent Shultzaberger

bugs
* Multiscan cannot handle negative coordinates!!

* The large connecting bar may not be right if the series of features do
not have the most important site as the furthest downstream, and a positive
orientation.  You need to check this.

* Sites with even symmetry have coordinates at position x.5, so the
distribution has to have half base values (-0.5 0.5, 1.5, 2.5).
I don't know if you can use them.

* There is a problem with coupling non-essential sites.  Let me give an
example:  You want to look at where the sigma 70 model (composed of a
minus 10 and minus 35) binds realative to a RBS, but in your scan you do
not want the promoter to be essential.  That is, you can identify RBSs
and only include the promoter if it has positive information.  Since the
promoter is a flexible thing, and you need both a p10 and p35 for the
site to be complete, then the algorithm should only include the promoter
if it contains both a p10, and p35.  The current algorithm does not know
that both things need to be present and would include the p10 if it was
good, even if the p35 wasn't.   This is a tricky concept.  My
recommendation is to make all flexible elements essential, and that way
you will not have any problems.

* The rho normalization factor parameter has not been added, so the
information contribution of any regulator is the Ri of the site.

* It would be nice if the program could detwemine the range size
based on the sizes of the individual histograms.

* If some weird segmentation fault comes up, it may be related to freef
being a multifeatureptr, instead of a featuretypeptr.  For some reason
the program worked with freef as a featuretypeptr.  Try changing it back
in themain and see what happens.

* 2005 Sep 1: The Ri value is now properly reported for the features,
but the P and Z values are set to zero.  This should be provided
at some point.

technical notes
The mean and standard deviation of the Ri distribution are stored just
after the Ri(b,l) table in the ribl file.  They are produced automatically
by the ri program.
*)
(* end module describe.multiscan *)
{This manual page was created by makman 1.44}