[BiO BB] Substitution matrices vs HMM
Iddo
idoerg at burnham.org
Fri Oct 29 14:37:50 EDT 2004
Goel, Manisha wrote:
> Hi All,
>
> I was trying to develop an algorithm for describing/predicting a
> pattern (e.g. transmembrane region, signal peptide etc) in protein
> sequences.
>
> I want to derive this pattern from the multiple sequence alignments.
> But I was wondering if I should use substitution matrices or HMMs to
> describe/represent these patterns.
> Are there any definite advantages of using one over the other ? Does
> the choice depend on what I am trying to define ?
> Can somebody please direct me to relevant literature or suggest
> something from personal experience ?
>
>
> Thanks in advance,
> Manisha Goel
>
Judging by the wording of your question, you should read up a bit more
on sequence analysis before you go and try this. Substitution matrices
are NxN matrices describing the probability of substitution of one
alphabet letter by another (in the case of proteins, each letter
normally represents an amino acid, hence N=20). They are not a tool for
pattern detection; they are used for scoring sequence alignments.
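To make the point concrete, here is a minimal Python sketch of what a
substitution matrix is and how it gets used. Everything in it is an
assumption made for illustration: the two-letter alphabet, the toy
aligned pairs, the add-one pseudocount, and the function names
log_odds_matrix and score_alignment are all hypothetical. Note that the
matrices used in practice (BLOSUM, PAM) store log-odds scores rather
than raw probabilities, which is what the sketch computes.

```python
import math
from collections import Counter
from itertools import product


def log_odds_matrix(aligned_pairs, alphabet):
    """Build a symmetric log-odds substitution matrix from ungapped aligned pairs.

    Toy procedure, not the BLOSUM/PAM derivation: count aligned letter pairs,
    add a pseudocount of 1 to every cell, and compare the observed pair
    frequency to the frequency expected if letters paired at random.
    """
    n = len(alphabet)
    pair_counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for x, y in zip(seq_a, seq_b):
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1  # count both directions: matrix is symmetric
    total = sum(pair_counts.values()) + n * n  # includes the pseudocounts

    # Marginal (background) frequency of each letter, pseudocounts included.
    marginal = {
        x: sum(pair_counts[(x, y)] + 1 for y in alphabet) / total for x in alphabet
    }

    matrix = {}
    for x, y in product(alphabet, repeat=2):
        observed = (pair_counts[(x, y)] + 1) / total
        expected = marginal[x] * marginal[y]
        matrix[(x, y)] = math.log2(observed / expected)
    return matrix


def score_alignment(seq_a, seq_b, matrix):
    """Score an ungapped pairwise alignment by summing per-column scores."""
    return sum(matrix[(x, y)] for x, y in zip(seq_a, seq_b))


# Hypothetical toy data over a two-letter alphabet.
toy_pairs = [("AAGG", "AAGG"), ("AGGA", "AGGA"), ("AAGA", "AAGG")]
toy_matrix = log_odds_matrix(toy_pairs, "AG")
```

The matrix has no notion of position: A aligned with G gets the same
score wherever it occurs in the sequence, which is why it is suited to
alignment but not, on its own, to describing a pattern.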
You may want to look at position-specific scoring matrices (PSSMs).
Those are LxN matrices generated from multiple alignments, where L
is the length of the protein (or the part of the protein you wish to
investigate), and N is the number of letters in your alphabet (again,
with proteins N=20, usually). Each entry in the matrix is the
probability of the amino acid appearing in that particular position in a
multiple alignment. So given a set of known good multiple alignments,
you can generate a PSSM for those. From the PSSM you can generate a
*profile*, which is a matrix of the same LxN size, with each cell being
some sort of transformation of the raw value in the PSSM. (I'm being a
bit superficial here). You can then use the profile to fish out new
sequences from a database of sequences, or from a database of profiles.
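The PSSM-to-profile steps described above can be sketched in a few
lines of Python. This is an illustration under stated assumptions, not
how any particular tool does it: the toy alignment, the add-one
pseudocount, the uniform background frequencies, the log2-odds
transformation, and the function names build_pssm, to_profile and
best_window are all hypothetical choices made for the example. Gaps in
the alignment are simply ignored.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, N = 20


def build_pssm(alignment, pseudocount=1.0):
    """Return an L x N table of per-position amino acid probabilities.

    `alignment` is a list of equal-length aligned sequences. Characters
    outside ALPHABET (e.g. gaps) are skipped. The pseudocount keeps
    unobserved residues from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for pos in range(length):
        column = Counter(seq[pos] for seq in alignment if seq[pos] in ALPHABET)
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        pssm.append({aa: (column[aa] + pseudocount) / total for aa in ALPHABET})
    return pssm


def to_profile(pssm, background=None):
    """Transform probabilities into log2-odds scores against a background.

    Assumption: a uniform background (1/20 per residue) unless one is given.
    """
    if background is None:
        background = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    return [
        {aa: math.log2(row[aa] / background[aa]) for aa in ALPHABET} for row in pssm
    ]


def best_window(profile, sequence):
    """Slide the profile along `sequence`; return (start, score) of the best hit.

    Residues not in ALPHABET score 0. Returns None if the sequence is
    shorter than the profile.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(profile[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical toy alignment of a five-residue pattern.
toy_alignment = ["LIVAL", "LIVGL", "LLVAL", "LIVAI"]
toy_pssm = build_pssm(toy_alignment)
toy_profile = to_profile(toy_pssm)
hit = best_window(toy_profile, "MKTLIVALGG")
```

Here hit[0] is the start of the best-scoring window in the query
sequence. Searching a database amounts to repeating best_window over
every sequence and keeping the hits above some score threshold.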
As you pointed out, another way of doing this is using HMMs to describe
the patterns. Without getting into the details, I'll just say that
profile HMMs are also profiles generated from PSSMs, but unlike the
previous kind of profile, a more robust probabilistic model is used to
generate them.
Lots of work has been done on this. In order to avoid duplication of
previous work, I suggest you jump start your background research with
the following review:
*Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y.*
Cell Mol Life Sci. 2003 Dec;60(12):2637-50.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14685688
Table 1 lists many resources you should look into before you attempt
something new. It might be a good idea to see who has done what there,
and how they did it.
For general background, I recommend the following book:
*Biological Sequence Analysis : Probabilistic Models of Proteins and
Nucleic Acids*
by Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison
http://www.amazon.com/exec/obidos/tg/detail/-/0521629713/qid=1099074233/sr=1-1/ref=sr_1_1/002-6013332-8507249?v=glance&s=books
BTW, the fact that lots of work has been done already shouldn't
discourage you from going in. The field of pattern detection is far from
perfect. Putting it mildly...
Good luck,
Iddo
--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo