[BiO BB] redundant data

Fri Jan 9 06:28:03 EST 2004

++ Pankaj--
> hi everybody,
> i have a set for 200 sequences where the sequence similarity varies between
> 28-90%. i want to select a representative set from this bigger set so that i pick
> up sequences which are representative of the whole set. ie from this bigger set i
> want to remove the sequences that are very similar and represent them by just a
> single sequence. ie i want to have a non redundant set. can anyone please tell how
> thanx in advance
> pankaj

One of the quickest and easiest ways to do this is with the excellent program
cd-hit:

http://bioinformatics.ljcrf.edu/cd-hi/

It should run in a couple of seconds for 200 sequences.

I have some Perl scripts that parse the output into tab-delimited form for loading
into MySQL, which makes cluster analysis easy, if you would like them.
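
For anyone who would rather not wait for the Perl scripts, here is a rough Python
sketch of the same idea: turn a cd-hit cluster file into tab-delimited rows. It is
not my actual script. It assumes the cluster file layout used by recent cd-hit
releases (a ">Cluster N" header followed by member lines such as
"0<TAB>120aa, >seqA... *"), and the function names parse_clstr and to_tsv are made
up for this example. Check the layout against your own output before relying on it.

```python
import re

# Assumed member-line layout: "<index>\t<length>aa, ><name>... *"
# or "<index>\t<length>aa, ><name>... at 95%". The "*" marks the representative.
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Yield (cluster_id, name, length, is_representative) for each member line."""
    cluster_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        length, name, tail = match.groups()
        yield cluster_id, name, int(length), tail == "*"


def to_tsv(rows):
    """Format rows as tab-delimited text, one member per line."""
    return "\n".join(
        "\t".join([str(cid), name, str(length), "1" if rep else "0"])
        for cid, name, length, rep in rows
    )


if __name__ == "__main__":
    sample = [
        ">Cluster 0",
        "0\t120aa, >seqA... *",
        "1\t118aa, >seqB... at 95%",
        ">Cluster 1",
        "0\t90aa, >seqC... *",
    ]
    print(to_tsv(parse_clstr(sample)))
```

The rows flagged with 1 in the last column are the cluster representatives, i.e.
the non-redundant set.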

There are a couple of small problems with this software which the author is aware of
but is too busy to fix. It would be nice to make this a project to develop the
software here.

Alternatively, you can use blastclust, which does what its name suggests but has an
extra 'coverage' parameter that is not explicitly present in cd-hit. It is slower,
but on 200 sequences it will still finish in around a minute. blastclust also allows
an arbitrary sequence identity threshold for clustering, whereas cd-hit is limited
to a minimum of 40% identity.
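
To make the identity and coverage thresholds concrete, here is a small Python
sketch of clustering on both criteria. It is not blastclust's code. It assumes you
already have pairwise hits as (name_a, name_b, identity, aligned_length), that
coverage must hold for both sequences, and that clusters are formed by simple
single linkage. All names and default values are made up for the example.

```python
def cluster_by_identity_and_coverage(lengths, hits, min_identity=0.4, min_coverage=0.9):
    """Single-linkage clustering of sequences from precomputed pairwise hits.

    lengths: dict mapping sequence name to sequence length.
    hits: iterable of (name_a, name_b, identity_fraction, aligned_length).
    """
    parent = {name: name for name in lengths}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b, identity, aligned in hits:
        # Assumption: the alignment must cover enough of BOTH sequences.
        covered = min(aligned / lengths[a], aligned / lengths[b])
        if identity >= min_identity and covered >= min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for name in lengths:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def pick_representatives(clusters, lengths):
    """Pick the longest member of each cluster as its representative."""
    return [max(members, key=lambda name: lengths[name]) for members in clusters]


if __name__ == "__main__":
    lengths = {"a": 100, "b": 100, "c": 100, "d": 200}
    hits = [("a", "b", 0.90, 95), ("b", "c", 0.50, 92), ("c", "d", 0.95, 100)]
    clusters = cluster_by_identity_and_coverage(lengths, hits)
    print(clusters)
    print(pick_representatives(clusters, lengths))
```

In the toy data, "c" and "d" are 95% identical over the aligned region, but the
alignment covers only half of "d", so the coverage test keeps them apart. Without
that test, a shared region alone would be enough to merge them.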

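On bigger sequence sets (>5,000), cd-hit is the better choice: blastclust relies on
all-against-all BLAST comparisons, whereas cd-hit clusters incrementally and uses a
fast short-word filter to skip most pairwise alignments.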

With all sequence clustering algorithms you have to worry about 'the domain
problem' (a multi-domain protein can pull otherwise unrelated sequences into the
same cluster), but I am not sure which technique currently deals with this best. I
know of one algorithm (DIVCLUS) which was explicitly designed to handle this
problem:

Park J, Teichmann SA.
DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains
in single- and multi-domain proteins. Bioinformatics. 1998;14(2):144-50.

http://bioinformatics.oupjournals.org/cgi/pmidlookup?view=reprint&pmid=9545446

Ta,
Dan.
