[BiO BB] Understanding Smith-Waterman scoring

Peter Rice pmr at ebi.ac.uk
Fri Feb 10 09:22:39 EST 2006


Theodore H. Smith wrote:

> How does it score alignments that come in sections? Does it give a  
> penalty if a sequence must be split up?

You get one alignment.

If more than one "section" aligns, with the parts in the same order in both 
proteins, the single alignment can span them, with a misaligned region and/or 
gaps in between. The misalignments and the gaps each carry a penalty score.

There is also a Smith-Waterman-Eggert variation of the algorithm that finds a 
second, third, fourth ... alignment, each excluding all those already reported.

Smith-Waterman is a local alignment method, so any unaligned parts of either 
sequence do not count in the score.


> What would matching BBBBAAAA to AAAABBBB give?

AAAA matching AAAA or BBBB matching BBBB (unless A has a positive score 
against B, in which case other results are possible).
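
As a rough illustration, here is a minimal Python sketch of the Smith-Waterman 
score (score only, no traceback). The function name and the scoring values 
(match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for this 
example, not values taken from any particular tool.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (illustrative scoring only)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a fresh local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("BBBBAAAA", "AAAABBBB"))  # 4
```

With these assumed scores the result is 4, not 8: the two blocks appear in 
opposite order in the two sequences, so no single alignment can contain both.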

> I'd expect it to generate two "sections", like this:

No, but you will get the second section from the Smith-Waterman-Eggert 
algorithm. Each will have its own local alignment score.
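
To give a feel for how the further alignments come out, here is a simplified 
Python sketch. It is not the real Waterman-Eggert procedure: it simply zeroes 
the matrix cells used by earlier alignments and recomputes the whole matrix, 
where the real algorithm only recomputes the affected part. The function name 
and the scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
def local_alignments(a, b, count=2, match=1, mismatch=-1, gap=-1):
    """Report up to `count` non-overlapping local alignments (naive sketch)."""
    used = set()     # matrix cells consumed by earlier alignments
    results = []
    n, m = len(a), len(b)
    for _ in range(count):
        H = [[0] * (m + 1) for _ in range(n + 1)]
        best, best_pos = 0, None
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if (i, j) in used:
                    continue  # excluded cell: score stays 0
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] + gap, H[i][j - 1] + gap)
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        if best_pos is None:
            break  # nothing left with a positive score
        # Trace back from the best cell until the score drops to 0.
        i, j = best_pos
        top, bottom = [], []
        while i > 0 and j > 0 and H[i][j] > 0:
            used.add((i, j))
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
            elif H[i][j] == H[i - 1][j] + gap:
                top.append(a[i - 1]); bottom.append("-")
                i -= 1
            else:
                top.append("-"); bottom.append(b[j - 1])
                j -= 1
        results.append((best, "".join(reversed(top)), "".join(reversed(bottom))))
    return results


for score, top, bottom in local_alignments("BBBBAAAA", "AAAABBBB"):
    print(score, top, bottom)
```

Under these assumptions both sections are reported, each as a separate 
alignment with its own score of 4. The scores are not added together.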

> But what should the overall score be? Is it still 8? Or should we give 
> a penalty because we've had to split this up? Is it normal for 
> alignment tools to give penalties to segmented sequences. Also is there 
> some kind of "minimum length" that a Smith-Waterman based aligner would 
> allow? Would it say that you can't have sections below a certain 
> length? Are there any tools which let you specify such a minimum 
> section length?

> If you don't like that example above of AAAABBBB (as it can be 
> reversed), then try this example. Assume all the proteins get a score 
> of 1 against themselves. The protein: ABCDEFGH, if I did a 
> Smith-Waterman score comparison against DCHABGEF, would the score still 
> be 8. After all, all the proteins are there, just in a different order.
> 
> I would expect this to get a score of zero or below.

Be careful not to confuse protein (the whole sequence) with amino acid or 
residue (one character).

You will get at least 1 residue matching. Maybe more, as some of the 
mismatches will have a positive score.
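
A small Python sketch makes this concrete. Everything in it is an assumption 
for illustration: the function names, the gap penalty of -1, the identity 
scoring (+1 match, -1 mismatch), and especially the toy scoring that pretends 
C and G are a favourable substitution. Real tools use a substitution matrix 
such as BLOSUM62 instead.

```python
def local_score(a, b, score, gap=-1):
    """Best local alignment score using a pluggable residue scoring function."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(0,
                          prev[j - 1] + score(a[i - 1], b[j - 1]),
                          prev[j] + gap,
                          curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def identity(x, y):
    # +1 for identical residues, -1 otherwise (assumed values)
    return 1 if x == y else -1


def toy(x, y):
    # Hypothetical: treat C/G as a positively scoring mismatch
    if {x, y} == {"C", "G"}:
        return 1
    return identity(x, y)


print(local_score("ABCDEFGH", "DCHABGEF", identity))  # 2
print(local_score("ABCDEFGH", "DCHABGEF", toy))
```

With identity scoring the best local alignment is just AB against AB (or EF 
against EF), scoring 2 rather than 8. Once a mismatch is allowed a positive 
score, as in the toy scoring, the alignment can extend through it and the 
score goes up.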

Hope that helps. It is complicated :-)

Peter



