[BiO BB] How to find the same proteins?

Thu Mar 23 09:47:58 EST 2006

If my earlier reply ever gets past the moderator, you will see that NLM
generally supports automated searches via E-utilities (eutils), but it appears
to support BLAST only via a separate utility. The clustering added by your site
is a nice extra feature, but it is quite easy to download clustering
software from many sources and drive it with scripts for any purpose. For
example, I used gene expression array software to organize authors from a
biotech message board.

*************************************************************************
Mike Marchywka
EyeWonder
Instant Streaming, Infinite Results

1447 Peachtree Street
9th Floor
Atlanta, GA 30309

w.678-891-2033
c. 
h.770-565-8101
mmarchywka at eyewonder.com
alt: marchywka at hotmail.com
Instant Streaming, Intelligent results.
*************************************************************************

-----Original Message-----
From: bio_bulletin_board-bounces+mmarchywka=eyewonder.com at bioinformatics.org
[mailto:bio_bulletin_board-bounces+mmarchywka=eyewonder.com at bioinformatics.org]
On Behalf Of Dan Bolser
Sent: Thursday, March 23, 2006 09:18 AM
To: The general forum at Bioinformatics.Org
Subject: Re: [BiO BB] How to find the same proteins?

Semen Esilevsky wrote:
> Dear all,
> I'm a novice in bioinformatics and this question is
> probably stupid, but...
> I have a list of ~200 PDB id's. For each of them I
> have to build a list of all entries in PDB, which
> represent the same protein (say, >99% sequence
> similarity and no large gaps). Could someone suggest
> the least painful way of doing this?
> As far as I understand, all I need is a database
> where all pairwise BLAST alignments of PDB chains are
> stored. I've found one as part of the PISCES server,
> but it is incomplete and contains some internal
> inconsistencies. Could someone suggest a better one,
> or is there a simpler way?

It is not a stupid question, but rather a common problem for the whole 
field! It would be useful if you could describe the problems you are 
having with PISCES, as that is a very popular and commonly used database.

The simplest approach I can think of is to combine your list of proteins 
with a full fasta database of the PDB (unless your proteins are already 
in that fasta file), and then run CD-HIT on the fasta file (with your 
own choice of sequence identity clustering threshold)...

http://bioinformatics.org/cd-hit/

'The same' proteins (defined here by sequence identity) will be found in 
the same CD-HIT clusters.
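
To make the last step concrete, here is a rough sketch (in Python) of how the
cluster file could be read back to look up your ~200 entries. It is only an
illustration under several assumptions: that the FASTA identifiers look like
1ABC_A (PDB ID, underscore, chain), that CD-HIT wrote its usual .clstr text
file next to the output FASTA, and that member lines follow the layout shown
in the comment. The function names and the sample data are made up for this
example.

```python
import re
from collections import defaultdict

# Assumed invocation (file names are hypothetical):
#   cd-hit -i pdb_seqres.fasta -o pdb99 -c 0.99
# which is expected to write clusters to pdb99.clstr.
#
# Assumed member-line layout inside the .clstr file:
#   0<TAB>154aa, >1ABC_A... *
#   1<TAB>154aa, >2XYZ_B... at 99.35%
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_number: [sequence_id, ...]} from .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def same_protein_entries(clusters, query_pdb_ids):
    """Map each query PDB ID to the other PDB IDs found in its clusters."""
    result = {q.upper(): set() for q in query_pdb_ids}
    for members in clusters.values():
        # Assumes IDs of the form PDBID_CHAIN; keep only the PDB ID part.
        pdb_ids = {m.split("_")[0].upper() for m in members}
        for query in result:
            if query in pdb_ids:
                result[query] |= pdb_ids - {query}
    return result


# Made-up sample standing in for a real .clstr file.
SAMPLE = """\
>Cluster 0
0\t154aa, >1ABC_A... *
1\t154aa, >2XYZ_B... at 99.35%
>Cluster 1
0\t98aa, >3DEF_A... *
"""

if __name__ == "__main__":
    clusters = parse_clstr(SAMPLE.splitlines())
    for pdb_id, others in same_protein_entries(clusters, ["1abc", "3def"]).items():
        print(pdb_id, sorted(others))
```

For a real run you would pass the lines of the actual .clstr file instead of
SAMPLE. Note that a multi-chain entry can land in several clusters, so the
sketch simply pools the partners of all its chains.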

Hmm... That reminds me...

> Best,
> Semen
> 
> _______________________________________________
> Bioinformatics.Org general forum  -  BiO_Bulletin_Board at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
