[BiO BB] About clustering genes to gene family

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Thu Aug 7 14:57:19 EDT 2003

What you describe can occur for 2 good reasons...

You are forming a 'complex cluster', created by *multiple domain* proteins:

A has domains in common with B,
B has domains in common with C.

A and C have no domains in common, and hence no homology.


A: |------W------/-----X-----|
B:               |-----x-----/------Y------|
C:                           |------y------/------Z------|
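
To make this concrete, here is a minimal Python sketch. The domain
assignments are made up to follow the picture above, and the function name is
hypothetical. Linking any two proteins that share a domain and then taking
connected components puts A, B and C in one cluster, even though A and C
share nothing.

```python
from itertools import combinations

# Hypothetical domain assignments, following the A/B/C picture above.
domains = {
    "A": {"W", "X"},
    "B": {"X", "Y"},
    "C": {"Y", "Z"},  # "Z" is an assumed label for C's second domain
}

def shared_domain_clusters(domains):
    """Group proteins connected through any chain of shared domains."""
    # Link every pair of proteins with at least one domain in common.
    neighbours = {p: set() for p in domains}
    for p, q in combinations(domains, 2):
        if domains[p] & domains[q]:
            neighbours[p].add(q)
            neighbours[q].add(p)

    # Collect connected components with a simple depth-first walk.
    clusters, seen = [], set()
    for start in domains:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            p = stack.pop()
            if p in members:
                continue
            members.add(p)
            stack.extend(neighbours[p] - members)
        seen |= members
        clusters.append(members)
    return clusters

clusters = shared_domain_clusters(domains)
print(clusters)                     # one cluster holding A, B and C
print(domains["A"] & domains["C"])  # empty set: A and C share no domain
```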


Alternatively, A and C are too distantly related for sequence searches to
uncover their true homology. However, sequence B is *intermediate* between
A and C, having homology to both...

          B
        /   \
      /       \
    /           \
 A              C

NB: Sequence similarity is not a metric, as it does not obey the triangle
inequality (I think it is metric at high levels of similarity though?)
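
As a quick check of that NB, here is a sketch with invented distances (think
of something like 1 minus fractional identity; the numbers are not from any
real alignment). A metric needs d(A,C) <= d(A,B) + d(B,C). If the A-C pair
has no detectable similarity and so sits at the maximum distance, that fails.

```python
# Invented distances (e.g. 1 - fractional identity); not from real alignments.
dist = {
    frozenset("AB"): 0.4,
    frozenset("BC"): 0.4,
    frozenset("AC"): 1.0,  # no detectable similarity between A and C
}

def obeys_triangle_inequality(dist, x, y, z):
    """True if going x -> z directly is no longer than going via y."""
    def d(p, q):
        return dist[frozenset((p, q))]
    return d(x, z) <= d(x, y) + d(y, z)

print(obeys_triangle_inequality(dist, "A", "B", "C"))  # False
```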

In this case you have used the transitive nature of sequence similarity to
uncover distant homology via an intermediate sequence.
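
The intermediate-sequence idea can be sketched as a path search over pairwise
hits. This is only an illustration with made-up hits; the `hits` table and
`find_chain` are hypothetical names, and this is not how any particular
search tool works internally.

```python
from collections import deque

# Made-up significant hits: A hits B, B hits C, but A never hits C directly.
hits = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

def find_chain(hits, query, target):
    """Return the shortest chain of significant hits from query to target, or None."""
    queue = deque([[query]])
    visited = {query}
    while queue:
        chain = queue.popleft()
        if chain[-1] == target:
            return chain
        for nxt in sorted(hits.get(chain[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(chain + [nxt])
    return None

chain = find_chain(hits, "A", "C")
print(chain)  # ['A', 'B', 'C']
```

Here B is the intermediate that connects A to C.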

Jong Park and Sarah Teichmann worked on both these ideas, and have created a
family clustering package called GEANFAMMER. Specifically, DIVCLUS breaks up
complex clusters into domain families. Transitivity is implemented (kinda) in
psiblast / hmm models, which are used in PFAM, so you might want to look there
for your families.

Or you could insist your alignments cover 90% of the shorter sequence, and
then cluster using single linkage.
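
A rough sketch of that last suggestion, in Python. The alignment records are
invented, and the field layout (two ids, two sequence lengths, aligned
length) is my assumption rather than any tool's output format. A pair passes
only if the aligned region covers at least 90% of the shorter sequence;
passing pairs are then merged by single linkage, here via union-find.

```python
# Invented alignments: (id1, len1, id2, len2, aligned_length).
alignments = [
    ("A", 300, "B", 310, 290),  # covers ~97% of A: kept
    ("B", 310, "C", 600, 150),  # covers ~48% of B: rejected (partial match)
    ("C", 600, "D", 620, 580),  # covers ~97% of C: kept
]

def single_linkage(alignments, min_coverage=0.9):
    """Cluster sequences linked by alignments covering enough of the shorter one."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for id1, len1, id2, len2, aligned in alignments:
        root1, root2 = find(id1), find(id2)  # registers both ids, even if rejected
        if aligned / min(len1, len2) >= min_coverage:
            parent[root1] = root2  # single linkage: one good pair joins two clusters

    clusters = {}
    for seq in parent:
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(c) for c in clusters.values())

families = single_linkage(alignments)
print(families)  # [['A', 'B'], ['C', 'D']]
```

The B-C match covers only part of B, so it does not chain the two families
together the way a plain score threshold would.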


Zheng Fu wrote:

>Hi everyone,
>Does anyone know how to cluster genes into a gene family based on the
>sequence alignments?
>For two genes, we can define a threshold to separate the homolog and
>non-homolog. But for three or more genes, how to define the homologs? (Such
>as Gene A and Gene B have a high alignment score, A and C also have a high
>score, but B and C don't have a high score; can we say A, B and C are
>homologs?)
>Thank you.
