[BiO BB] Need fair alignment tool comparison/ using DSCAM for tool testing

Mike Marchywka marchywka at hotmail.com
Mon Feb 18 11:12:10 EST 2008

As I mentioned in previous posts, I'm using the drosophila DSCAM genes for testing some tools.
I assembled a fasta file composed of 3 fly entries,

$ cat all_fasta | grep ">"
>AF260530  Drosophila melanogaster Dscam gene, complete cds.
>DQ317106  Drosophila yakuba Dscam gene, exons 3 through 24.
>DQ317109  Drosophila pseudoobscura Dscam gene, exons 3 through 24.

and tried aligning them with clustalw but minutes later still didn't have a result. I was wondering if
someone could suggest a set of parameters or alternative alignment tool to do a fast
alignment, even if a bit sloppy. I had always used to slow/accurate approach and don't
know what options may be available for faster work- these sequences are each about 50k long.

In the meantime, I was able to get a satisfactory result using exact string matches using successively
shorter and shorter strings. This approach yields acceptable results in under a minute and, if needed, you
could segment the questionable areas and feed them to clustal or other tool for "better" alignment.
It seems to be fast due to only comparing sequences to a reference sequence ( O(n*l^2) but "l" can be smaller
than sequence length as unique features can be found O(l*log(l))  ) .  There are, of course, likely to
be various pathological cases but for sequences known to be similar it seems to work ok and the indexing
feature allows extraction of substrings with particular distributions ( occuring only once in each sample for example).
I have aligned 2 ecoli strains in perhaps a few minutes and there weren't any obvious pathological
results ( I obviously didn't check the whole thing either by eye or programatically). 

Others have asked about testing method, I'd like to show how I'm going about this with the DSCAM example.
The alignment is only one part of more general interest in finding similar/different features between samples.
These sequences, it turns out, have exon locations in the ncbi entries. So, it was pretty easy to check the alignments
by examining the locations of the exons in the aligned composite. In this case, I aligned as follows,

$ string_test -fastas all_fasta -index 8 -length 25 -fix 12 -output 3 -filterN -filterID -status -fcompare_all> anchor_hits

and could feed the "anchor_hits" locations to an aligner that could start with these ( actually, it can now start on its own
but that is a developmental issue ) and refine subsections ( or spot gross transpositions perhaps, would have to check),

$ mm_align_tool -fastas all_fasta -pair_rules anchor_hits -ref 0 -all_samples -refine marlow -pair_params uniq -pair_align monotonize -doall -dest algned_fasta -output fasta

The aligned sequences can be output in several formats and the "algned_fasta" file can be presented along with various
rules or annotations using another tool to create bmp, html, or txt files :
( right now it requires some input parameters, hence the dummy echo)

716  echo | $progpath/annotater -source manno.src> testcomp5.html
464  echo | $progpath/annotater -source bmp.src
585  echo |$progpath/annotater -source text.src>text.txt 2>junk

where the ".src" file just contains command line parameters,
$ cat text.src
-width 120 -font $progpath/4x6-KOI8.pcf
 -mrules fixed_exons2
-merge_rules comp_5_hitss
-font /cygdrive/c/mydocs/scripts/cc/affx//4x6-KOI8-R.pcf
-acid_rank /cygdrive/c/mydocs/scripts/cc/affx//nmstrings
-acid_map 20
-xlate -inter -banner
-annotate algned_fasta

So, I could first look at a different alignment metric by outputting a table of correspondences between input and aligned locations,

  602  $progpath/mm_align_tool -fastas all_fasta -pair_rules anchor_hits -ref 0 -all_samples -refine marlow -pair_params uniq -pair_align monotonize -doall -dest xxxx_raw  -output table

and using the table to move "absolute location" rules to their location in the  aligned sequence:

 $table_tool -v -table table_table  -table_rules ncbi_exons  | sed -e 's/exon/exon /g'| sed -e 's/[.:]/ /g' | sort -k 3 -g | more

This  generates a cryptic feature comparison map which shows that most of the exons
end up in the same location on each sequence but see the publication below,
even the differently named exons were aligned from different species in most cases
( these exon rules are followed by {sequence number, aligned position, offset from first entry } ):
>exon |1|exon 3 {2,16121,0}{3,16121,0}{4,16157,36}
>exon |1|exon 4 1 {2,18582,0}{3,18582,0}{4,18582,0}
>exon |1|exon 4 10 {2,22532,0}{3,22532,0}{4,22532,0}
>exon |1|exon 4 11 {2,22836,0}{3,22836,0}{4,22845,9}
>exon |1|exon 4 12 {2,23217,0}{3,23217,0}{4,23217,0}
>exon |1|exon 4 2 {2,19006,0}{3,19006,0}{4,19006,0}
>exon |1|exon 4 3 {2,19736,0}{3,19736,0}{4,19736,0}
>exon |1|exon 4 4 {2,20545,0}{3,20545,0}{4,20545,0}
>exon |1|exon 4 5 {2,20872,0}{3,20872,0}{4,20872,0}
>exon |1|exon 4 6 {2,21269,0}{3,21269,0}{4,21269,0}
>exon |1|exon 4 7 {2,21597,0}{3,21597,0}{4,21597,0}
>exon |1|exon 4 8 {2,21895,0}{3,21895,0}{4,21895,0}
>exon |1|exon 4 9 {2,22229,0}{3,22229,0}{4,22212,-17}
>exon |1|exon 5 {2,25020,0}{3,25020,0}{4,25020,0}
>exon |1|exon 6 1 {2,27251,0}{3,27249,-2}{4,27251,0}
>exon |1|exon 6 10 {2,29659,0}{3,29826,167}{4,29451,-208}
>exon |1|exon 6 11 {2,29845,0}{3,30074,229}{4,29640,-205}
>exon |1|exon 6 12 {2,30074,0}{3,30614,540}{4,30114,40}
>exon |1|exon 6 13 {2,30614,0}{3,31054,440}{4,30438,-176}
>exon |1|exon 6 14 {2,30831,0}{3,31255,424}{4,30614,-217}
>exon |1|exon 6 15 {2,31054,0}{3,31475,421}{4,30831,-223}
>exon |1|exon 6 16 {2,31255,0}{3,31684,429}{4,31054,-201}
>exon |1|exon 6 17 {2,31475,0}{3,32161,686}{4,32160,685}
>exon |1|exon 6 18 {2,31688,0}{3,32437,749}{4,32376,688}
>exon |1|exon 6 19 {2,31926,0}{3,33251,1325}{4,32590,664}
>exon |1|exon 6 2 {2,27524,0}{3,27524,0}{4,27497,-27}
>exon |1|exon 14 {2,66424,0}{3,66424,0}{4,66424,0}
>exon |1|exon 15 {2,66631,0}{3,66631,0}{4,66631,0}
>exon |1|exon 16 {2,66884,0}{3,66884,0}{4,66884,0}
>exon |1|exon 17 1 {2,70469,0}{3,70469,0}{4,70469,0}
>exon |1|exon 17 2 {2,71004,0}{3,71004,0}{4,71004,0}
>exon |1|exon 18 {2,72995,0}{3,72995,0}{4,72995,0}
>exon |1|exon 19 {2,74100,0}{3,74100,0}{4,74063,-37}
>exon |1|exon 20 {2,74948,0}{3,74948,0}{4,74948,0}
>exon |1|exon 21 {2,75334,0}{3,75334,0}{4,75334,0}
>exon |1|exon 22 {2,75594,0}{3,75594,0}{4,75594,0}
>exon |1|exon 23 {2,76979,0}{3,76979,0}{4,76997,18}
>exon |1|exon 24 {2,78233,0}{3,78233,0}{4,78233,0}

that can be reconciled with known inter-species exon similarities, for example 


It turns out that the exon3 offset for sequence 4 is probably due to a rule issue, not an alignment issue
( excerpt from the test alignment map including aligner diagnostics such as '{' in fasta file and 
likely translation products in all 3 fwd frames ) as the alignment in this high-similarity region appears good:

New Section : 16080 to 16200 section 134 of 669
8         9         0         1         2         3         4         5         6         7         8         9
>>AF260530  Drosophila melanogaster Dscam gene, complete cds.
>>AF260530  Drosophila melanogaster Dscam gene, complete cds.
 >exon|1|exon3 XXX
>>DQ317106  Drosophila yakuba Dscam gene, exons 3 through 24.
 >exon|1|exon3 XXX
>>DQ317109  Drosophila pseudoobscura Dscam gene, exons 3 through 24.
 >exon|1|exon3 XXX

I'm aware of the following related alignment literature, open to ideas:

$ string_test -about|unix2dos >/dev/clipboard

Contact: marchywka at hotmail.com Nov 2007
Comment: uses some indexing to get speed up, 
Comment: motivation for RC rules from this etc , 
Commment: and should work well on text or (modified slightly ) binary code too
Note: More code in mm_align_tool
Note: Based loosely on references such as these but 'common sense'
Note: seemed to work well as these are after-the-fact lookups
Ref: http://www.google.com/search?hl=en&safe=off&q=string+alignment+site%3Aciteseer.ist.psu.edu 
Ref: http://citeseer.ist.psu.edu/csuros05rapid.html
Comment: Csuros, M., Ma, B.: Rapid homology search with two-stage extension and
Comment: daughter seeds. In: Proc. 11th Int. Computing and Combinatorics Conf. (COCOON).
Comment: Volume 3595 of LNCS., Springer-Verlag (2005) 104-- 114
Ref: http://citeseer.ist.psu.edu/468459.html
Ref: http://citeseer.ist.psu.edu/kahveci04speeding.html
Feb  2 2008 09:35:40 string_test.h182


Mike Marchywka
586 Saint James Walk
Marietta GA 30067-7165
404-788-1216 (C)<- leave message
989-348-4796 (P)<- emergency only
marchywka at hotmail.com
Note: Hotmail is blocking my mom's entire
ISP claiming it is to reduce spam but probably
to force users to use hotmail. Please DON'T
assume I am ignoring you and try
me on marchywka at yahoo.com if no reply
here. Thanks.

Need to know the score, the latest news, or you need your Hotmail®-get your "fix".

More information about the BBB mailing list