[BiO BB] Understanding Smith-Waterman scoring

Fri Feb 10 09:53:18 EST 2006

On 10 Feb 2006, at 14:22, Peter Rice wrote:

> Theodore H. Smith wrote:
>
>> How does it score alignments that come in sections? Does it give
>> a penalty if a sequence must be split up?
>
> You get one alignment.
>
> If more than one "section" aligns ... with the parts in the same  
> order in both proteins ... you can have a misaligned region and/or  
> gaps in the sequences. There are penalty scores for the  
> misalignments and the gaps.
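
To make the gap penalty concrete, here is a minimal sketch of the Smith-Waterman score in Python. The function name `sw_score` and the scoring values (match +1, mismatch -1, a flat gap penalty of -1) are assumptions for illustration only; real tools typically use a substitution matrix and separate gap-open and gap-extend penalties.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a vs b (score only, no traceback).

    Scoring values are illustrative assumptions, not any tool's defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell is floored at 0, which is what makes the alignment local.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Two matching sections in the same order, separated by one extra residue.
print(sw_score("AAAACBBBB", "AAAABBBB"))  # 7
```

Here the two sections (AAAA and BBBB) end up in a single alignment scoring 4 - 1 + 4 = 7, because paying one gap penalty for the extra C costs less than the second section gains.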

OK. I understand. The most popular tools in use today find only the  
best (or at least one) locally aligned section, but not all of them.

Is this a problem in general? Or are multiple aligned sections quite  
rare in the kinds of queries that biologists run today?

> There is also a Smith-Waterman-Eggert variation of the algorithm  
> that finds a second, third, fourth ... alignment that excludes all  
> those already reported.

Am I right in seeing that this isn't talked about as much as  
Smith-Waterman, though? It sounds promising for the line of work I am  
doing. Thanks very much for telling me about Smith-Waterman-Eggert; it  
looks like a good lead.

>> What would matching BBBBAAAA to AAAABBBB give?
>
> AAAA matching AAAA or BBBB matching BBBB (unless A has a positive  
> score to match B, then other results are possible)

Which would I get? Does it depend on the tool? Do I get the first  
alignment, the last, or the best?
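
One way to see why the answer could depend on the tool: in this example two different cells of the scoring matrix tie for the best score, and a traceback normally starts from just one of them. The sketch below lists every cell that reaches the maximum. The function name `sw_best_ends` and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration, not taken from any particular tool.

```python
def sw_best_ends(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, list of (i, j) cells where that score is reached).

    i and j are 1-based end positions in a and b. Scoring is an assumption.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, ends = 0, []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, ends = H[i][j], [(i, j)]
            elif H[i][j] == best and best > 0:
                ends.append((i, j))
    return best, ends


print(sw_best_ends("BBBBAAAA", "AAAABBBB"))  # (4, [(4, 8), (8, 4)])
```

Both the BBBB match (ending at cell 4, 8) and the AAAA match (ending at cell 8, 4) score 4. Which one gets reported is then down to how the implementation breaks the tie, for example first cell found versus last cell found.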

>> I'd expect it to generate two "sections", like this:
>
> No, but you will get the second section from the Smith-Waterman- 
> Eggert algorithm. Each will have its own local alignment score.
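
The idea of reporting a second alignment that excludes the first can be sketched roughly as follows. This is a simplified illustration in the spirit of the description above, not the published algorithm: it recomputes the whole matrix each round, which is fine for short strings but not how an efficient implementation would work. The names `sw_matrix`, `best_path` and `top_alignments`, and the scoring values (match +1, mismatch -1, gap -1), are all assumptions.

```python
def sw_matrix(a, b, blocked, match=1, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix, forcing blocked cells to stay at 0."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if (i, j) in blocked:
                continue  # cell already used by a reported alignment
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H


def best_path(H, a, b, match=1, mismatch=-1, gap=-1):
    """Trace back from the first maximal cell; return (score, path) or None."""
    score, i, j = 0, 0, 0
    for r, row in enumerate(H):
        for c, v in enumerate(row):
            if v > score:
                score, i, j = v, r, c
    if score == 0:
        return None
    path = []
    while i > 0 and j > 0 and H[i][j] > 0:
        path.append((i, j))
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1      # match or mismatch
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1                   # gap in b
        else:
            j -= 1                   # gap in a
    return score, path[::-1]


def top_alignments(a, b, k):
    """Report up to k local alignments that share no matrix cells."""
    blocked, found = set(), []
    for _ in range(k):
        hit = best_path(sw_matrix(a, b, blocked), a, b)
        if hit is None:
            break
        score, path = hit
        (i0, j0), (i1, j1) = path[0], path[-1]
        found.append((score, a[i0 - 1:i1], b[j0 - 1:j1]))
        blocked.update(path)  # exclude these cells from later rounds
    return found


print(top_alignments("BBBBAAAA", "AAAABBBB", 2))
# [(4, 'BBBB', 'BBBB'), (4, 'AAAA', 'AAAA')]
```

Each reported section carries its own local score (4 and 4 here); nothing in this sketch combines them into one overall score.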

Thanks. Sounds very interesting.

>> But what should the overall score be? Is it still 8? Or should we
>> give a penalty because we've had to split this up? Is it normal
>> for alignment tools to give penalties to segmented sequences?
>> Also, is there some kind of "minimum length" that a
>> Smith-Waterman-based aligner would allow? Would it say that you
>> can't have sections below a certain length? Are there any tools
>> which let you specify such a minimum section length?
>
>> If you don't like the example above of AAAABBBB (as it can be
>> reversed), then try this one. Assume all the proteins get a
>> score of 1 against themselves. If I did a Smith-Waterman score
>> comparison of the protein ABCDEFGH against DCHABGEF, would the
>> score still be 8? After all, all the proteins are there, just in
>> a different order.
>> I would expect this to get a score of zero or below.
>
> Be careful not to confuse protein (the whole sequence) with amino  
> acid or residue (one character).

You might not be surprised to find out that I come from a software  
developer background. I won't make that mistake again.

> You will get at least 1 residue matching. Maybe more as some of the  
> mismatches will have a positive score.
>
> Hope that helps. It is complicated :-)
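
For the shuffled ABCDEFGH example, a small score-only sketch shows what happens under one specific, assumed scoring: +1 for identical residues, -1 for any mismatch, -1 per gap position. Those values and the function name `sw_score` are assumptions for illustration; with a real substitution matrix some mismatches score positively and the result would differ.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a vs b under an assumed simple scoring."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Floor at 0: a local alignment score can never go negative.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(sw_score("ABCDEFGH", "DCHABGEF"))  # 2
print(sw_score("ABC", "XYZ"))            # 0
```

Under these assumptions the score is 2, not 8: only the short runs AB and EF survive in the same order, and joining them would cost as much in penalties as it gains. The score also cannot drop below zero, since every cell is floored at 0.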

Yes, it's been a great help. And yes, it is complicated :)