[BiO BB] Substitution matrices vs HMM

Fri Oct 29 14:37:50 EDT 2004

Goel, Manisha wrote:

> Hi All,
>
> I was trying to develop an algorithm for describing/predicting a 
> pattern (e.g. transmembrane region, signal peptide etc) in protein 
> sequences.
>
> I want to derive this pattern from the multiple sequence alignments.
> But I was wondering if I should use substitution matrices or HMMs to 
> describe/represent these patterns.
> Are there any definite advantages of using one over the other ? Does 
> the choice depend on what I am trying to define ?
> Can somebody please direct me to relevant literature or suggest 
> something from personal experience ?
>
>
> Thanks in advance,
> Manisha Goel
>

Judging by the wording of your question, you should read up a bit more
on sequence analysis before you try this. Substitution matrices are NxN
matrices whose entries score the substitution of one alphabet letter by
another (in the case of proteins, each letter normally represents an
amino acid, hence N=20). The scores are usually log-odds values derived
from substitution probabilities. They are not a tool for pattern
detection; they are used for sequence alignment.
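
To make the role of a substitution matrix concrete, here is a minimal
sketch of how one is used to score two already aligned sequences. The
matrix covers only four residues and its values are made up for
illustration (they are not BLOSUM or PAM entries); the function names
are my own.

```python
# Toy substitution scores for a 4-letter subset of the amino acid alphabet.
# The values are invented for illustration; they are NOT BLOSUM or PAM entries.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "L"): -1, ("A", "K"): -1, ("A", "D"): -2,
    ("L", "L"): 4, ("L", "K"): -2, ("L", "D"): -4,
    ("K", "K"): 5, ("K", "D"): -1,
    ("D", "D"): 6,
}


def pair_score(a, b):
    """Look up the score for aligning residue a with residue b (symmetric)."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def ungapped_alignment_score(seq1, seq2):
    """Sum the pair scores over two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(ungapped_alignment_score("ALKD", "ALKD"))
    print(ungapped_alignment_score("ALKD", "LLKA"))
```

Note that the same score is applied to a residue pair wherever it occurs
in the sequence, which is why such a matrix cannot by itself describe a
position-dependent pattern.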

You may want to look at position-specific scoring matrices (PSSMs).
Those are LxN matrices generated from multiple alignments, where L is
the length of the protein (or of the part of the protein you wish to
investigate) and N is the number of letters in your alphabet (again,
usually N=20 for proteins). Each entry in the matrix is the probability
of a given amino acid appearing at that particular position in the
multiple alignment. So given a set of known good multiple alignments,
you can generate a PSSM for them. From the PSSM you can generate a
*profile*, which is a matrix of the same LxN size, with each cell being
some sort of transformation of the raw value in the PSSM (for example, a
log-odds score against background frequencies; I'm being a bit
superficial here). You can then use the profile to fish new sequences
out of a database of sequences, or out of a database of profiles.
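
The PSSM-to-profile idea can be sketched in a few lines. This is only an
illustration under stated assumptions, not how any particular tool does
it: the alignment is ungapped, every sequence gets equal weight, a flat
pseudocount is added to every residue, the background is uniform (1/20),
and the "transformation" is a log2 odds ratio. All function names are
hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Return an L x N table of per-column residue probabilities.

    alignment: list of equal-length strings (assumed ungapped here).
    pseudocount: flat count added to every residue (an assumption).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in alignment:
            if seq[col] in counts:  # letters outside ALPHABET are ignored
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: counts[aa] / total for aa in ALPHABET})
    return pssm


def to_log_odds(pssm, background=None):
    """Turn probabilities into log2 odds against a background distribution."""
    bg = background or {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    return [{aa: math.log2(col[aa] / bg[aa]) for aa in ALPHABET} for col in pssm]


def best_window(profile, sequence):
    """Slide the profile along a sequence; return (score, start) of the best
    ungapped match, or None if the sequence is shorter than the profile."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # Unknown letters contribute a neutral score of 0.0.
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    toy_alignment = ["LLKV", "LLRV", "LIKV"]  # made-up example data
    profile = to_log_odds(build_pssm(toy_alignment))
    print(best_window(profile, "AAALLKVAAA"))
```

Real tools add sequence weighting, better pseudocounts and gap handling,
so treat this only as a picture of the data structure.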

As you pointed out, another way of doing this is to use HMMs to describe
the patterns. Without getting into the details, I'll just say that
profile HMMs are also profiles generated from PSSMs, but unlike the
previous kind of profile, a more robust probabilistic model is used to
generate them.

Lots of work has been done on this. To avoid duplicating previous work,
I suggest you jump-start your background research with the following
review:

*Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y.*
Cell Mol Life Sci. 2003 Dec;60(12):2637-50.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14685688

Table 1 lists many resources you should look into before you attempt 
something new. It might be a good idea to see who has done what there, 
and how they did it.

For general background, I recommend the following book:

*Biological Sequence Analysis : Probabilistic Models of Proteins and 
Nucleic Acids*
by Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison

http://www.amazon.com/exec/obidos/tg/detail/-/0521629713/qid=1099074233/sr=1-1/ref=sr_1_1/002-6013332-8507249?v=glance&s=books

BTW, the fact that lots of work has already been done shouldn't
discourage you from going in. The field of pattern detection is far from
perfect, to put it mildly...

Good luck,

Iddo

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo