Scoring matrix

From Bioinformatics.Org Wiki


The aim of a sequence alignment is to match "the most similar elements" of two sequences. This similarity must be evaluated somehow. For example, consider the following two alignments:

(a)   AIWQH
      :  ::
      AL-QH

(b)   AIWQH
      :  ::
      A-LQH

They seem quite similar: both contain one "indel" and one substitution, just at different positions. However, if we think of the letters as amino acid residues rather than elements of strings, alignment (a) is the better one, because isoleucine (I) and leucine (L) have similar side chains, while tryptophan (W) has a very different structure. This is a physico-chemical argument; these days we might prefer to say simply that leucine substitutes for isoleucine more frequently, without giving an underlying "reason" for this observation.

However we explain it, it is much more likely that a mutation changed I into L and that W was lost, as in (a), than that W changed into L and I was lost, as in (b). We would expect a change from I to L to affect the function less than a mutation from W to L, but this deserves its own topic.

To quantify the similarity achieved by an alignment, scoring matrices are used: they contain a value for each possible substitution, and the alignment score is the sum of the matrix entries for all aligned amino acid pairs. Gaps (indels) need a separate gap score; a very simple choice is to add a constant penalty for each indel. The optimal alignment is the one that maximizes the alignment score.
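As a minimal sketch of this scoring scheme, the following Python snippet scores alignments (a) and (b) from above. The substitution values and the gap penalty are assumed for illustration only and are not taken from a published matrix; the names SCORES, GAP_PENALTY and alignment_score are likewise hypothetical.

```python
# Illustrative substitution scores (assumed values, not a published matrix).
# Keys are unordered residue pairs, so each pair is stored only once.
SCORES = {
    frozenset("A"): 2,    # A aligned with A
    frozenset("Q"): 4,    # Q aligned with Q
    frozenset("H"): 6,    # H aligned with H
    frozenset("IL"): 2,   # I aligned with L (similar side chains)
    frozenset("LW"): -2,  # W aligned with L (dissimilar side chains)
}
GAP_PENALTY = -8  # assumed constant score added for every indel


def alignment_score(top, bottom, scores=SCORES, gap=GAP_PENALTY):
    """Sum the matrix entries over all aligned columns; '-' marks an indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        else:
            total += scores[frozenset((x, y))]
    return total


score_a = alignment_score("AIWQH", "AL-QH")  # alignment (a)
score_b = alignment_score("AIWQH", "A-LQH")  # alignment (b)
print(score_a, score_b)
```

With these assumed values, alignment (a) scores higher than alignment (b), because the I/L pair contributes a positive entry while the W/L pair contributes a negative one; the gap penalty is the same in both.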

PAM matrices are a common family of scoring matrices. PAM stands for Point Accepted Mutation (sometimes expanded as Percent Accepted Mutations), where "accepted" means that the mutation has been retained in the sequence in question. Thus, using the PAM 250 scoring matrix assumes that about 250 mutations per 100 amino acids may have happened (a single position can mutate more than once), while PAM 10 assumes only 10 mutations per 100 amino acids, so that only very similar sequences will reach useful alignment scores.

PAM matrices contain positive and negative values. If the alignment score is greater than zero, the sequences are considered to be related (they are similar with respect to the scoring matrix used); if the score is negative, it is assumed that they are not related. "Relationship" here may refer to evolution as well as to the function of the proteins. The choice of matrix affects the result, so one has to make an assumption about the similarity of the sequences in order to obtain a useful result: rather distant sequences won't produce a good alignment using PAM 10, and the optimal alignment of two very similar sequences with PAM 500 may be less useful than that with PAM 50.

Finally, it should be noted that only some scoring matrices use similarity to evaluate alignments; others use distance, so be careful when interpreting the results!
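To illustrate the difference in interpretation, here is a small Python sketch of a distance-style score, assuming unit costs (every substitution and every indel costs 1, every match costs 0). The function name unit_distance and the unit costs are assumptions made for this example; real distance matrices assign different costs to different substitutions.

```python
def unit_distance(top, bottom):
    """Count the columns that hold a substitution or an indel.

    This is a distance: lower values mean more similar sequences,
    the opposite of a similarity score.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(top, bottom) if x != y)


dist_a = unit_distance("AIWQH", "AL-QH")  # alignment (a)
dist_b = unit_distance("AIWQH", "A-LQH")  # alignment (b)
print(dist_a, dist_b)
```

Here the best alignment is the one with the smallest value, not the largest. Under these unit costs the two example alignments get the same distance, since each contains one substitution and one indel; telling them apart requires costs that depend on which residues are exchanged.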

After this brief and necessarily superficial overview, you might want to read some more about scoring matrices.
