[Biodevelopers] Question on gaps vs substitution
Theodore H. Smith
delete at elfdata.com
Sat Jul 7 11:59:28 EDT 2007
Hi people,
I have a question on gap penalties and substitution scoring.
The answer to this question could make a factor of around 1.5x
difference the speed of a blast-like algorithm I'm working on. And
double the complexity of one small component used in the algorithm.
So, I'm using a Smith-Waterman approach combined with some other
approaches to speed up Smith-Waterman by a few order of magnitude. So
my problem area is totally Smith-Waterman based and anyone who knows
Smith-waterman algorithm might be able to help me.
Is there ever a case where two inserts, could be less costly than a
substituion? In fact it's not even just "two inserts", it's an insert
followed by a delete.
Basically, a case where going right, then down, is less costly than
going diagonally?
So if we had a matrix like this:
... , 5, 4, ...
... , 4, 3, ...
Let's say in this case, the replacement from the "5" cell, to the "3"
cell would have taken 7 points, leaving us with -2 (probably would be
raised to zero as how S-W works). But going right from 5 to 4 only
took 1 point away, and then going down from 4 to 3 took 1 point.
This would require that an insert followed by a deletion, is cheaper
than a replacement.
Can this be true?
I've inspected the BLOSUM62 matrix (-4 is the worst penalty for
replacement), and the default gap penalties (-11,-1), and it seems
like using BLOSUM62, it's impossible to get "insert+delete" instead
of one replacement.
So this would be a good case for me, as it would simplify my
algorithm. I'm happier if insert+delete is always worse than
replacement.
But does this pattern hold across all biological uses? Can there be a
case where an insert followed by a delete, is better than a
replacement? If so, would it be something so contrived and unlikely
and not really useful, that I can just ignore it and tell my users
that my software has a design restriction that insert+delete must
always score worse than a substitution?
Sorry for this basic question! I know I am asking a basic question,
but then it really does help to have your basics correct before
forming some super-fast algorithm. I can't have it super-fast and
wrong now can I :) Basics always come first.
Also, if an insert is followed by a delete, does it count as -11-11
(-22) or -11-1 (-12)?
Thanks to all who can help! My algorithm is really close to
finishing. I can see it almost in my hands. I will be very proud to
finish it. Whether or not the algorithm fast enough to be is of use,
is another matter, but it will function correctly at least!
--
http://elfdata.com/plugin/
"String processing, done right"
More information about the Biodevelopers
mailing list