[BiO BB] About clustering genes to gene family

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Fri Aug 8 07:02:06 EDT 2003


This method uses an all-against-all BLAST comparison as
input to the clustering. Can you really do that 'routinely'
with 500,000 sequences without dedicated hardware?

I guess once you have your initial 'pairs DB' you can then
add new sequences in without much work, and I guess the
actual clustering is the 'efficient' part of the method.

The handling of multidomain proteins is interesting,
but I don't really see how it differs from demanding
a certain length of alignment within the family.
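As a rough sketch of what I mean by demanding a certain alignment length
(the record layout, the name passes_coverage and the 0.8 cutoff are all
made-up assumptions for illustration, not anything taken from TribeMCL):

```python
# Hypothetical hit records:
# (query, subject, alignment_length, query_length, subject_length)
hits = [
    ("p1", "p2", 200, 300, 210),  # long protein vs. a shorter one
    ("p2", "p3", 205, 210, 215),  # two proteins of similar length
]


def passes_coverage(aln_len, q_len, s_len, min_cov=0.8):
    """True if the alignment spans at least min_cov of the longer sequence."""
    return aln_len / max(q_len, s_len) >= min_cov


# Keep only the pairs whose alignment covers enough of both proteins.
kept = [(q, s) for q, s, aln, q_len, s_len in hits
        if passes_coverage(aln, q_len, s_len)]
```

With these toy numbers the p1/p2 hit covers only two thirds of p1 and is
dropped, while the p2/p3 hit is kept.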

Although the technique is mathematically clean,
it is a bit hazy when it comes to the multi-domain
issue. I.e. if we have protein 1 with domains ABC,
what happens to protein 2 with domains AB?

What happens to the 'families' of type 1 and 2
in this strategy?

I love the extension of pairwise similarity to
group similarity using the network of blast
hits - that is really nice, but the biological
significance of the r factor (number of clusters)
is not investigated, which is a shame.
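For anyone who has not seen it, the expansion/inflation idea behind Markov
clustering can be sketched in a few lines of Python. This is a toy version
under my own assumptions (dense lists, unit edge weights, added self-loops,
a fixed iteration count, and the invented names mcl and normalise_columns),
not the TribeMCL code. I am taking the r factor to be the inflation exponent.

```python
def normalise_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(names, edges, inflation=2.0, iterations=50):
    """Toy Markov clustering on an undirected weighted similarity graph."""
    n = len(names)
    index = {name: i for i, name in enumerate(names)}
    # Start from the identity so every node has a self-loop.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b, weight in edges:
        m[index[a]][index[b]] = weight
        m[index[b]][index[a]] = weight
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalise columns.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    # Each row that keeps non-negligible entries lists one cluster.
    clusters = {frozenset(names[j] for j, x in enumerate(row) if x > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# A hits B and A hits C, but B and C do not hit each other; D and E hit each other.
families = mcl(["A", "B", "C", "D", "E"],
               [("A", "B", 1.0), ("A", "C", 1.0), ("D", "E", 1.0)])
```

On this toy input A, B and C end up in one family and D and E in another,
even though B and C share no direct hit. Real data would need sparse
matrices and a proper convergence test rather than a fixed loop.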

Anyone heard of BAG for domain decomposition
from such a network?

Thanks for the info,
Dan.

Marcos Oliveira de Carvalho wrote:

>Hi Carol,
>I use TribeMCL software with good results.
>
>Here is the URL -> http://www.ebi.ac.uk/research/cgg/tribe/
>
>And here is the abstract of the paper about TribeMCL:
>
>TribeMCL is a method for clustering proteins into related groups, which 
>are termed 'protein families'. This clustering is achieved by analysing 
>similarity patterns between proteins in a given dataset, and using these 
>patterns to assign proteins into related groups. In many cases, proteins 
>in the same protein family will have similar functional properties. 
>TribeMCL uses a novel clustering method (Markov Clustering or MCL) which 
>solves problems which normally hinder protein sequence clustering. These 
>problems include: multi-domain proteins, peptide fragments and proteins 
>which possess domains which are very widespread (promiscuous domains). The 
>efficiency of the method makes it applicable to the clustering of very 
>large datasets. We routinely use the algorithm to cluster datasets as 
>large as 500,000 peptides. 
>
>Cheers
>Marcos
>
>On Thu, 7 Aug 2003, Zheng Fu wrote:
>
>>Hi everyone,
>>
>>Does anyone know how to cluster genes into a gene family based on
>>sequence alignments?
>>For two genes, we can define a threshold to separate homologs from
>>non-homologs. But for three or more genes, how do we define the homologs?
>>(For example, if Gene A and Gene B have a high alignment score, and A and C
>>also have a high score, but B and C do not, can we say A, B and C are homologs?)
>>
>>Thank you.
>>
>>Carol
