[BiO BB] time efficient global alignment algorithm

Wed Aug 5 03:36:14 EDT 2009

2009/8/4 Ryan Golhar <golharam at umdnj.edu>:
>>> I'm trying to perform a large number of sequence alignments of long DNA
>>> sequences, some up to 163,000+ bp in length. I was trying to use the
>>> standard Needleman-Wunsch algorithm, but the matrix it uses requires a
>>> large amount of memory...about 100 GB. This obviously won't work.
>>
>> How many were you trying to align? Do you mean 163 kb or 163 Mb?
>> I was once looking for tests or comparisons for some alignment code I had
>> that indexed the target sequences. I don't recall the suggestions from
>> that discussion, but I was able to do simple genomes reasonably well
>> (I think I used 2 strains of E. coli or something, about 5 megs long)
>> on a desktop. If you can find the responses to my request from a few
>> years ago, they may (or may not) help. I'd offer my code, and indeed I
>> think I have it on a website, but I stopped development and I'm not sure
>> it is useful as-is unless you just want a coarse alignment of
>> two similar sequences.
>
> Hundreds of thousands.  I'm trying to eliminate duplicates or near
> duplicates (>90% similarity).  I'm using the methodology from cd-hit-est.
> However, I haven't been able to get that application to run on the number
> of sequences I have.  Right now, I'm trying to cluster the nt database;
> later I would like to cluster sequences from other sources as well.
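
For what it's worth, the memory figure quoted at the top of the thread can be
checked with a few lines. The sketch below is only an illustration, not Ryan's
code and not what cd-hit-est does: it estimates the size of a full
Needleman-Wunsch matrix and computes the global alignment score while keeping
just two rows in memory. The function names, the scoring values
(match=1, mismatch=-1, gap=-2) and the 4 bytes per cell are all assumptions
made for the example.

```python
def full_matrix_bytes(n, m, bytes_per_cell=4):
    """Memory needed to hold the whole (n+1) x (m+1) DP matrix.

    bytes_per_cell=4 is an assumption (one 32-bit score per cell).
    """
    return (n + 1) * (m + 1) * bytes_per_cell


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty.

    Only the previous and current rows are kept, so memory is
    proportional to the shorter sequence instead of len(a) * len(b).
    This returns the score only, not the alignment itself.
    """
    # The scoring is symmetric, so put the shorter sequence along the row.
    if len(b) > len(a):
        a, b = b, a

    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]

    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr

    return prev[-1]


if __name__ == "__main__":
    # Two 163,000 bp sequences: on the order of 100 GB for the full matrix.
    print(full_matrix_bytes(163_000, 163_000) / 1e9, "GB")
    print(nw_score("GATTACA", "GATACA"))
```

The two-row version still takes time proportional to the product of the two
lengths, and a pure Python loop like this would be far too slow at 163 kb; it
is only meant to show where the memory goes. Recovering the alignment itself
in linear space needs something like Hirschberg's divide-and-conquer scheme,
and for a >90% identity cutoff a banded alignment would also cut the work
considerably.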

First thing that came to mind when I read the above was cd-hit. What
is cd-hit-est and how come it fails?

I'm curious because I'm maintaining (or was) the cd-hit website for
the project on bioinformatics.org:

http://www.bioinformatics.org/cd-hit/

I'm planning to move that over into the wiki where it can (hopefully)
stay more up to date.

Dan.