[BiO BB] Understanding Smith-Waterman scoring
Theodore H. Smith
delete at elfdata.com
Fri Feb 10 09:53:18 EST 2006
On 10 Feb 2006, at 14:22, Peter Rice wrote:
> Theodore H. Smith wrote:
>
>> How does it score alignments that come in sections? Does it give
>> a penalty if a sequence must be split up?
>
> You get one alignment.
>
> If more than one "section" aligns ... with the parts in the same
> order in both proteins ... you can have a misaligned region and/or
> gaps in the sequences. There are penalty scores for the
> misalignments and the gaps.
OK, I understand. So the most popular tools in use today find only the
best (or at least one) locally aligned section, not all of them.
Is this a problem in general? Or are queries with multiple alignable
sections quite rare in the kind of work that biologists do today?
> There is also a Smith-Waterman-Eggert variation of the algorithm
> that finds a second, third, fourth ... alignment that excludes all
> those already reported.
Am I right in thinking that this isn't talked about as much as
Smith-Waterman, though? It sounds promising for the line of work I am
doing. Thanks very much for telling me about Smith-Waterman-Eggert; it
looks like a good lead.
>> What would matching BBBBAAAA to AAAABBBB give?
>
> AAAA matching AAAA or BBBB matching BBBB (unless A has a positive
> score to match B, then other results are possible)
Which would I get? Does it depend on the tool? Do I get the first
alignment, the last, or the best?
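To make the question concrete, here is a minimal sketch of the Smith-Waterman matrix fill for this example. The scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions for illustration only; real tools normally use a substitution matrix and affine gap penalties. It shows that two cells tie for the best score, so which alignment is reported comes down to how a given implementation breaks the tie.

```python
def sw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix H using a linear gap penalty.

    The scoring values are illustrative assumptions, not any tool's defaults.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
    return H


def best_cells(H):
    """Return the best score and every (i, j) cell that reaches it."""
    best = max(max(row) for row in H)
    cells = [(i, j)
             for i, row in enumerate(H)
             for j, value in enumerate(row)
             if value == best]
    return best, cells


H = sw_matrix("BBBBAAAA", "AAAABBBB")
best, cells = best_cells(H)
print(best, cells)  # prints: 4 [(4, 8), (8, 4)]
```

Cell (4, 8) is the end of BBBB against BBBB and cell (8, 4) is the end of AAAA against AAAA. Both score 4 under these assumptions, and a traceback started from either one yields a single four-residue alignment.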
>> I'd expect it to generate two "sections", like this:
>
> No, but you will get the second section from the Smith-Waterman-
> Eggert algorithm. Each will have its own local alignment score.
Thanks. Sounds very interesting.
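A rough sketch of the "report the next alignment, excluding those already reported" idea follows. It is a simplification: it refills the whole matrix on every round with the cells of earlier alignments blocked, whereas the published Waterman-Eggert method is more economical about what it recomputes. The function names, the tie-breaking rule (first best cell in row order) and the scoring values are all assumptions.

```python
def local_align(a, b, forbidden, match=1, mismatch=-1, gap=-1):
    """Best local alignment that avoids the cells listed in `forbidden`.

    Returns (score, path), where path is a list of 1-based (i, j) cells.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, end = 0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in forbidden:
                continue  # blocked cell keeps the value 0
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:  # strict '>' keeps the first best cell
                best, end = H[i][j], (i, j)
    path = []
    if end is None:
        return 0, path
    # Trace back from the best cell until the score falls to zero.
    i, j = end
    while H[i][j] > 0:
        path.append((i, j))
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return best, path


def top_alignments(a, b, k):
    """Report up to k alignments, each avoiding cells used by earlier ones."""
    forbidden = set()
    results = []
    for _ in range(k):
        score, path = local_align(a, b, forbidden)
        if score == 0:
            break
        results.append((score, path))
        forbidden.update(path)
    return results


results = top_alignments("BBBBAAAA", "AAAABBBB", 2)
for score, path in results:
    print(score, path)
```

With these assumptions the first round reports BBBB against BBBB and the second reports AAAA against AAAA. Each has its own local score of 4, and the two are never combined into a single score of 8.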
>> But what should the overall score be? Is it still 8? Or should we
>> give a penalty because we've had to split this up? Is it normal
>> for alignment tools to give penalties to segmented sequences?
>> Also is there some kind of "minimum length" that a Smith-Waterman
>> based aligner would allow? Would it say that you can't have
>> sections below a certain length? Are there any tools which let
>> you specify such a minimum section length?
>
>> If you don't like that example above of AAAABBBB (as it can be
>> reversed), then try this example. Assume all the proteins get a
>> score of 1 against themselves. The protein: ABCDEFGH, if I did a
>> Smith-Waterman score comparison against DCHABGEF, would the score
>> still be 8? After all, all the proteins are there, just in a
>> different order.
>> I would expect this to get a score of zero or below.
>
> Be careful not to confuse protein (the whole sequence) with amino
> acid or residue (one character).
You might not be surprised to find out that I come from a software
developer background. I won't make that mistake again.
> You will get at least 1 residue matching. Maybe more as some of the
> mismatches will have a positive score.
>
> Hope that helps. It is complicated :-)
Yes, it's been of great help. And yes, it is complicated :)
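For the ABCDEFGH versus DCHABGEF question, a small score-only sketch makes the effect of residue order visible. The scoring scheme (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration. With a real substitution matrix some mismatches score positively, so the numbers would differ.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman local score, keeping only two matrix rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(0,
                           prev[j - 1] + s,    # align x with y
                           prev[j] + gap,      # gap in b
                           cur[j - 1] + gap))  # gap in a
        best = max(best, max(cur))
        prev = cur
    return best


# Same residues in the same order: every position matches.
print(sw_score("ABCDEFGH", "ABCDEFGH"))  # prints: 8

# Same residues, shuffled: only short runs such as AB or EF stay in order.
print(sw_score("ABCDEFGH", "DCHABGEF"))  # prints: 2

# With mismatches and gaps costing nothing, the score becomes the length
# of the longest common subsequence (ABEF here).
print(sw_score("ABCDEFGH", "DCHABGEF", mismatch=0, gap=0))  # prints: 4
```

So under these assumptions the shuffled sequence scores neither 8 nor zero. The score is positive but small, because only residues that remain in the same relative order can contribute to one alignment.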