[Biodevelopers] Could banded search help a general purpose
Smith-Waterman based algorithm?
Theodore H. Smith
delete at elfdata.com
Fri Apr 6 10:01:44 EDT 2007
Hi people,
As a hobbyist (although hopefully I'll make something real people
will use in the real world), I am working on some bio-informatics
code involving Smith-Waterman. Sort of like SSEARCH, but faster.
Now, I am trying to make my algorithm fast AND accurate for general
purpose use. I do not want to lose any accuracy over the full Smith-
Waterman implementation, so any bandedness might do, will have to be
banded in such a way that it only ignores cells that would never
contribute to the final answer anyhow.
That is tricky to implement, making the bandedness accurate. I've got
some potential solutions which are based around making a second
matrix, which would be a "score recovery matrix". That would contain
the most amount of score that any cell could recover, assuming
everything went perfect!
Or I could just skip bandedness altogether, which would save me a lot
of brain time and theoretical complication in inventing solutions to
minimise CPU time taken to process the vast amount of data needed to
process.
...
So my question is... is banding helpful to a general purpose Smith-
Waterman based searcher?
I can imagine a few situations. But I'm not sure if I am imagining
them properly. Maybe someone can correct on this? This is what I
think right now:
1) Searching for a short protein sequence (30-ish?) within a database
of long sequences (about 300?). I don't think bandedness will help me
more than 3% speed increase here overall.
2) Searching for a short protein sequence (30-ish) within a database
of short sequences (30-ish). Bandedness might help here. Not sure how
much. Maybe 3x speed increase.
3) Searching for a large sequence in a database of short sequences,
allowing for loose matches... I think bandnedness wouldn't help,
maybe 3% gain at best, possibly the overhead would hurt.
4) Searching for large sequences in a database of large sequences,
and we want tight matches. Bandedness would DEFINITELY help here! We
may get 20x or more speed up.
I guess from my guesswork... it seems like bandedness would help me
sometimes a lot. And othertimes hurt a bit or just be useless.
How often do bio-informatists do large sequence searching against
databases containing many large sequences?
My leaning is towards implementing the bandedness... After all it's
supposed to be general purpose.
--
http://elfdata.com/plugin/
"String processing, done right"
More information about the Biodevelopers
mailing list