Delila Program: cluster

cluster program

Documentation for the cluster program is below, with links to related programs in the "see also" section.

{version = 5.06; (* of cluster.p 1992 September 18}

(* begin module describe.cluster *)
   cluster: cluster indana subindexes into groups of duplicate entries

   cluster(clusterp: in, subind: in, inst: in, book: in,
           pairs: out, clumps: out, output: out)

   clusterp: The cluster parameter file that consists of the following:
             FIRST LINE  'y' turns the flag on, 'n' turns it off
                  (debugging) allows one to look at raw data in the bags.
             The debugging flag controls the printing of the raw data above the
             regular output of the cluster program, which is created solely by
             procedure showRAWbag.  This can then be compared with the data in
             the chart for correctness.  Raw data consists of the series of
             coordinate pairs in the bag and the sides they are matched on,
             printed above the standard output structure.

             example: -  (  630,   69)   R
                      L  (  649,   88)   -  {20}  {20}
                                          |   630       663
                      HUMUK               |     ----------
                                          |         34
                      HUMUPA              |     ----------
                                          |    69       102

             It is important to note that the raw data will only appear in the
             pairs output file, and will not be written in clumps at all.  This
             means that parameter 3, writepairs, must also be turned on for
             this flag to be effective.

             SECOND LINE 'y' turns the flag on, 'n' turns it off
                  (showfragments) allows one to see pairs that are fragmented.
             The showfragments toggle controls printing the output of pairs
             with "imperfect" matches.  That is, in some cases a repeating
             sequence will match in several frames, causing repeated sequence
             matching and producing a large list of coordinate pairs.  If the
             parameter is turned on, this list is shown; if it is turned off,
             the statement "WARNING:  sequence pairs are overmatched" appears
             instead.  The full list is excessively long, but the actual
             sequences are shown in either case, so the user can always do
             the comparison by hand.

             example:    1     acggatcgtgtgtgtgtgtgtgtgtacgatcggatcgat
                         2     acggatcgtgtgtgtgtgtgtgtgtacgatcggatcgat

             These sequences will have matches between all of the 'gt' base
             pairs, resulting in an overwhelming number of matches.  The
             maximum number of possible matches is found by taking the length
             of the sequences and dividing it by the value in the overmatched
             parameter (FIFTH LINE) times the number of instructions that
             match between any two pieces in the dbinst.  This results in
             a maximum number of matches between any two pieces.  Any pieces
             above this limit can either have their output completely shown or
             generate a warning message (see showfragments, SECOND LINE).
             In addition to suppressing the example case, turning showfragments
             off will also suppress the display of any other case that may
             cause an excessive number of matches.

             THIRD LINE 'y' turns the flag on, 'n' turns it off
                  (writepairs) controls the printing of the pairs output file.
             If writepairs is on, the original clustering pairlist will be
             printed into the output file pairs.  If it is off, this file will
             not be printed.  This parameter must be turned on to effectively
             use the debugging parameter (see FIRST LINE).

             FOURTH LINE 'y' turns the flag on, 'n' turns it off
                  (writeclumps) controls printing of the clumps output file.
             If writeclumps is on, the original clustering pairlist will be
             sent through the clumping procedures.  The output file clumps will
             contain the sequences involved in the matches on the pair in
             addition to the clumped version of the pairlist.  The clumping
             process takes an excessive amount of time for very large files,
             since the program must traverse the entire pairlist to find all
             related pairs, then put the pairs on to the clumplist, then go
             through the book and find sequences to match every instruction
             in every pair of every clump.  Although it is much easier to
             determine which pieces are true repeats through use of the clumps
             file, it is certainly possible to do so by simply using the pairs
             output file.

             FIFTH LINE any integer
                  (matchparameter) is the number of matches to be allowed
             between two instructions.  This can be determined by dividing the
             sequence length from the book by the minimum window size from the
             subindex, or a maximum number of matches between instructions can
             be set.  An integer less than or equal to 0 will calculate maximum
             matches using the above method.  Any number greater than 0 will be
             used as the new maximum matches.

             example:  if the instructions call for the sequences

               piece1: get from 100 -50 to 100 +50;
               piece2: get from 200 -50 to 200 +50;

               The sequence length is 101.  If the windowsize read from the
               subindex = 15, then 6 possible matches can occur between these
               two instructions (101 div 15 = 6).

             The TOTAL number of matches between two pieces is found by
             multiplying matchparameter by the number of instructions in a
             given pair.  If a piece has more matches than this, it is
             considered to be overmatched, the bag will not be printed, and the
             statement 'WARNING: sequence pairs have too many matches.' will
             appear.  Overmatched pairs can be printed using the showfragments
             parameter (see SECOND LINE).

   subind: a subindex from the indana program matching the inst and the book
   inst: a set of delila instructions that correspond to the book
   book: a delila book that contains the sequences being clumped
   pairs: the output list of paired sequences
   clumps: the output list of clumped sequences
   output: When errors occur, the program halts and produces an error message
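
   The clusterp layout and the match limit described above can be sketched
   as follows.  This is an illustrative Python sketch, not the Pascal code of
   cluster.p: the names ClusterParams, parse_clusterp, matches_per_instruction,
   total_match_limit, raw_data_will_print, and is_overmatched are hypothetical,
   and it assumes that only the first non-blank character of each flag line
   is significant.

```python
from dataclasses import dataclass


@dataclass
class ClusterParams:
    """The five clusterp lines, in file order (hypothetical container)."""
    debugging: bool
    showfragments: bool
    writepairs: bool
    writeclumps: bool
    matchparameter: int


def parse_flag(line: str) -> bool:
    """Assumption: the first non-blank character decides; 'y' on, 'n' off."""
    first = line.strip().lower()[:1]
    if first == "y":
        return True
    if first == "n":
        return False
    raise ValueError(f"expected 'y' or 'n', got {line!r}")


def parse_clusterp(text: str) -> ClusterParams:
    """Read four y/n flag lines followed by one integer line."""
    lines = text.splitlines()
    if len(lines) < 5:
        raise ValueError("clusterp needs five lines")
    flags = [parse_flag(line) for line in lines[:4]]
    matchparameter = int(lines[4].split()[0])
    return ClusterParams(*flags, matchparameter)


def matches_per_instruction(params: ClusterParams,
                            sequence_length: int,
                            window_size: int) -> int:
    """A matchparameter above 0 is used directly; otherwise the limit is
    the sequence length div the window size."""
    if params.matchparameter > 0:
        return params.matchparameter
    return sequence_length // window_size


def total_match_limit(params: ClusterParams, sequence_length: int,
                      window_size: int, instructions_in_pair: int) -> int:
    """Per-instruction limit times the number of instructions in the pair."""
    per_instruction = matches_per_instruction(
        params, sequence_length, window_size)
    return per_instruction * instructions_in_pair


def raw_data_will_print(params: ClusterParams) -> bool:
    """Raw data goes only to the pairs file, so both flags must be on."""
    return params.debugging and params.writepairs


def is_overmatched(match_count: int, limit: int) -> bool:
    """A piece with more matches than the limit counts as overmatched."""
    return match_count > limit
```

   With a matchparameter of 0, a sequence length of 101, and a window size
   of 15, matches_per_instruction gives 101 div 15 = 6, as in the FIFTH LINE
   example.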

   Duplicate entries in the subind subindex are clustered into a unified list
   of pairs and copied to output files as sequence numbers, lengths, and
   sequence base pairs.

   Pairs are determined by the indana program, which designates sequence
   similarities with an '*'.  Cluster takes the subindex and shows the
   coordinate range and length of the similarity by pairs.  The pairs file is
   a list of relationships between two sequences; the clumps file takes this
   list of pairs and groups related ones together.  The seqalign modules of the
   program then access the book and get the corresponding sequences to print
   out with the instruction number and piece name.
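
   The step from pairs to clumps can be pictured with the sketch below.  It
   is not the clumping procedure of cluster.p, which traverses the pairlist
   directly; it swaps in a union-find structure and assumes that "related"
   means two pairs share a piece, directly or through a chain of other
   pairs.  The function name clump_pairs and the piece names PIECE3 and
   PIECE4 are hypothetical.

```python
def clump_pairs(pairs):
    """Group pieces so that any two pieces joined by a chain of pairs
    end up in the same clump (assumed meaning of "related")."""
    parent = {}

    def find(piece):
        # Register unseen pieces as their own root, then walk to the root,
        # shortening the path along the way.
        parent.setdefault(piece, piece)
        while parent[piece] != piece:
            parent[piece] = parent[parent[piece]]
            piece = parent[piece]
        return piece

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    clumps = {}
    for piece in parent:
        clumps.setdefault(find(piece), []).append(piece)
    # Sort members and clumps so the result does not depend on input order.
    return sorted(sorted(members) for members in clumps.values())


pairlist = [("HUMUK", "HUMUPA"), ("HUMUPA", "PIECE3"), ("PIECE4", "PIECE3")]
clumps = clump_pairs(pairlist)
```

   Here all four pieces fall into a single clump, because each pair shares
   a piece with another pair.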


see also
   index.p, indana.p

author
   R. Michael Stephens

bugs
   None currently known.

technical notes
   The read for the indana window size is based on the '[' character before
   the number in the subind heading.  Any changes to indana that alter this
   format must be reflected in the getwindowsize procedure.
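
   The dependence on the '[' character can be illustrated with the sketch
   below.  It is written in Python rather than Pascal and is not the
   getwindowsize procedure itself; the function name get_window_size and the
   sample heading text are hypothetical, since the exact subind heading
   format is not given here.

```python
def get_window_size(heading: str) -> int:
    """Return the integer that follows the first '[' in a heading line."""
    start = heading.find("[")
    if start < 0:
        raise ValueError("no '[' found in subind heading")
    # Skip blanks after '[', then collect the run of digits.
    rest = heading[start + 1:].lstrip()
    digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        digits += ch
    if not digits:
        raise ValueError("no number found after '['")
    return int(digits)


# Hypothetical heading text, used only to exercise the function.
window = get_window_size("hypothetical subind heading [15]")
```

   If indana changed the heading so that the number no longer followed a
   '[', a reader of this kind would fail, which is why the two must be kept
   in step.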
(* end module describe.cluster *)
{This manual page was created by makman 1.44}
{created by htmlink 1.55}