[BiO BB] time efficient global alignment algorithm

Tue Aug 4 10:26:20 EDT 2009

>> I'm trying to perform a large amount of sequence alignments of long DNA
>> sequences, some up to 163,000+ bp in length. I was trying to use the
>> standard Needleman-Wunsch algorithm, but the matrix used requires a
>> large amount of memory...about 100 GB of memory. This obviously won't work.
> 
> How many were you trying to align? You mean 163kb or 163Mb?
> I was looking for test or comparisons for some alignment code I 
> had which indexed the target sequences, don't recall the suggestions
> for that discussion but I was able to do simple genomes reasonably well 
> ( I think I used 2 strains of e coli or something about 5 megs long)
> on a desktop. If you can find responses to my request from a few years 
> ago that may ( or may not ) help. I'd offer my code, and indeed I think
> I have it on a website, but I stopped development and not sure
> it is nearly useful as-is unless you just want coarse alignment on
> two similar sequences. 

Hundreds of thousands.  I'm trying to eliminate duplicates or near 
duplicates (>90% similarity).  I'm using the methodology from 
cd-hit-est.  However I'm not successful in getting that application to 
run on the number of sequences I have.  Right now, I'm trying to cluster 
the nt database, however later I would like to cluster other sequences 
from other sources.

> Many implementations of just about anything are bad with
> memory management- sometimes just blocking or sorting or
> compacting the internal representation can make a big improvement.
> Not sure what exists along these lines but often some simplifcations
> don't change results but decrease time/memory on futile possibilities. 

Agreed.  However in doing the dynamic programming matrix, you still need 
to allocate an m x n matrix of ints.  With sequences of 163,000 bp in 
length, you need about 100GB of RAM.  Unless there is a way to using a 
compact representation of the DP matrix that I'm not aware of.

> Are all of these nominally the same or are you trying to align
> noise to noise? 

Yes, they are nominally the same...they have at least 50% of the 
non-overlapping words of the shorter of the two sequences.

>> Ideally, something with the speed similar to BLAST.
> 
> I guess in an odd way my approach could get there as it essentially 
> queries each string for "interesting" short sequences but I'd have to
> check order ( howmany of these does it use etc). Last time I checked the
> academic lit, IIRC this exact-string matching was an open research area maybe there have been advancements
> in last few years that are trivial to code or exist in an academic's lab.

If there are, I haven't heard of any.  My thought was to run a BLAST 
alignment on the two sequences using bl2seq.  Then string together the 
non-overlapping HSPs and perform a global alignment on the regions in 
between the HSPs.  This is easy enough, but I want to see if there is a 
solution already out there first.

Ryan