[BiO BB] How to find the same proteins?

Thu Mar 23 08:36:38 EST 2006

(Sorry if this is a little off target; this is my first post to the list,
and I'm cleaning out my mailbox after returning from vacation.)

Funny you should ask: the PubMed webmaster blasted me (pun intended)
for suggesting they don't support automated searches.
See their eutils support page; you can write your own scripts against it.

"Dear NCBI user,

our requirements are described here - basically we ask for 3 sec. delay
between subsequent calls.
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
(see User requirements).

see also:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=coursework.TOC&depth=2

Regards,

NCBI Help desk
"

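The 3-second rule quoted above can be sketched as a small shell helper. This is only an illustration: `run_throttled` and `DELAY` are names made up here, not part of eutils.

```shell
#!/bin/sh
# Delay between calls in seconds; 3 is the figure quoted in the help desk reply.
DELAY="${DELAY:-3}"

# run_throttled CMD ARG...
# Runs CMD once per ARG, sleeping DELAY seconds between (not after) calls.
# run_throttled is a hypothetical helper name.
run_throttled() {
    cmd="$1"
    shift
    first=1
    for arg in "$@"; do
        if [ "$first" -eq 0 ]; then
            sleep "$DELAY"
        fi
        first=0
        "$cmd" "$arg"
    done
}
```

Something like `run_throttled fetch_one ID1 ID2 ID3` would then call a (hypothetical) `fetch_one` function once per ID with the pause in between.
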
Running BLAST locally is not going to fix the hard part:
the job is compute-bound rather than I/O-bound.
I ran into this problem looking for chance epitope matches between
a whole-protein vaccine (a specific phosphatase) and other things,
and now I'm trying to look up some patented peptide sequences
for accidental matches.

I have attempted to clean up a script for illustration that
uses their normal user interface. It is rather cumbersome and
involved and does not use the eutils facility.
It does show that you can take a specific sequence from a PubMed entry,
reformat it for other web services (like epitope prediction), and send those results to
BLAST to look for hits. Then you can use eutils and scripts to filter
as needed at the time.

I tested these using my last name as a sequence (to see what I'm related to :))
and verified that they still work:
$ blast -expect 1000 -format Text MARCHYWKA
$ blast_cleanedup_a_litte  -expect 1000 -format Text MARCHYWKA

Again, this uses their web form as a kluge. I have another script for
using their eutils facility but have never run it on BLAST, just to do bulk
abstract downloads (which, by the way, can be organized with gene
expression array software: it doesn't know genes from keywords or conditions
from documents...).

Let me know if you do an automated search using eutils.
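
For anyone who wants to try the eutils route for bulk abstract downloads, here is a minimal sketch of building an efetch request URL. It is not the script mentioned above. The host and the `db`, `id`, `rettype` and `retmode` parameters are the ones NCBI documents for efetch as far as I know, so check the current eutils help page before relying on them; `efetch_url` and `EUTILS_BASE` are names made up for this example.

```shell
#!/bin/sh
# Base URL of the E-utilities endpoints (assumption: current documented host).
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# efetch_url DB IDS [RETTYPE] [RETMODE]
# Prints an efetch URL for a comma-separated list of record IDs.
# efetch_url is a hypothetical helper, not part of eutils itself.
efetch_url() {
    db="$1"
    ids="$2"
    rettype="${3:-abstract}"
    retmode="${4:-text}"
    printf '%s/efetch.fcgi?db=%s&id=%s&rettype=%s&retmode=%s\n' \
        "$EUTILS_BASE" "$db" "$ids" "$rettype" "$retmode"
}
```

The URL can then be fetched with the same tool as the rest of the script, e.g. `lynx -source "$(efetch_url pubmed ID1,ID2)" > abstracts.txt`, where ID1 and ID2 stand for real PubMed IDs, keeping the 3-second pause between successive requests.
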

(I finally decided not to post the whole, messy kluge, just this part in case you want to
use it right away. I would suggest seeing what eutils has, but this should work
if you have cygwin or linux.)
# Note: SEQ, expect and format are assumed to be set already, presumably by the
# option handling in the full script (not shown). The three hard-coded QUERYSTR
# lines below are earlier variants; the QUERYSTR built from QS1..QS7 replaces them.
QUERYSTR="QUERY=${SEQ}&QUERY_FROM=&QUERY_TO=&DATABASE=nr&ENTREZ_QUERY=&ENTREZ_QUERY=All+organisms&COMPOSITION_BASED_STATISTICS=0&EXPECT=20000&WORD_SIZE=2&MATRIX_NAME=PAM30&GAPCOSTS=9+1&PSSM=&OTHER_ADVANCED=&PHI_PATTERN=&SHOW_OVERVIEW=on&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&MASK_CHAR=0&MASK_COLOR=0&DESCRIPTIONS=100&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY=&FORMAT_ENTREZ_QUERY=All+organisms&EXPECT_LOW=&EXPECT_HIGH=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None&AUTO_FORMAT=Semiauto&PROGRAM=blastp&CLIENT=web&SERVICE=plain&PAGE=Proteins&CMD=Put"
QUERYSTR="QUERY=${SEQ}&QUERY_FROM=&QUERY_TO=&DATABASE=nr&ENTREZ_QUERY=&ENTREZ_QUERY=Homo+sapiens+[ORGN]&COMPOSITION_BASED_STATISTICS=0&EXPECT=2000&WORD_SIZE=2&MATRIX_NAME=PAM30&GAPCOSTS=9+1&PSSM=&OTHER_ADVANCED=&PHI_PATTERN=&SHOW_OVERVIEW=on&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&MASK_CHAR=0&MASK_COLOR=0&DESCRIPTIONS=1000&ALIGNMENTS=1000&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY=&FORMAT_ENTREZ_QUERY=Homo+sapiens+[ORGN]&EXPECT_LOW=&EXPECT_HIGH=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None&AUTO_FORMAT=Semiauto&PROGRAM=blastp&CLIENT=web&SERVICE=plain&PAGE=Proteins&CMD=Put"
QUERYSTR="QUERY=${SEQ}&QUERY_FROM=&QUERY_TO=&DATABASE=nr&ENTREZ_QUERY=&ENTREZ_QUERY=All+organisms&COMPOSITION_BASED_STATISTICS=0&EXPECT=20&WORD_SIZE=2&MATRIX_NAME=PAM30&GAPCOSTS=9+1&PSSM=&OTHER_ADVANCED=&PHI_PATTERN=&SHOW_OVERVIEW=on&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&MASK_CHAR=0&MASK_COLOR=0&DESCRIPTIONS=1000&ALIGNMENTS=1000&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY=&FORMAT_ENTREZ_QUERY=All+organisms&EXPECT_LOW=&EXPECT_HIGH=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None&AUTO_FORMAT=Semiauto&PROGRAM=blastp&CLIENT=web&SERVICE=plain&PAGE=Proteins&CMD=Put"

QS1="QUERY=${SEQ}"
QS2="&QUERY_FROM=&QUERY_TO=&DATABASE=nr&ENTREZ_QUERY=&ENTREZ_QUERY=All+organisms&COMPOSITION_BASED_STATISTICS=0"
QS3="&EXPECT=${expect}&WORD_SIZE=2&MATRIX_NAME=PAM30&GAPCOSTS=9+1&PSSM=&OTHER_ADVANCED=&PHI_PATTERN=&SHOW_OVERVIEW=on"
QS4="&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=${format}&MASK_CHAR=0&MASK_COLOR=0"
QS5="&DESCRIPTIONS=1000&ALIGNMENTS=1000&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY="
QS6="&FORMAT_ENTREZ_QUERY=All+organisms&EXPECT_LOW=&EXPECT_HIGH=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None"
QS7="&AUTO_FORMAT=Semiauto&PROGRAM=blastp&CLIENT=web&SERVICE=plain&PAGE=Proteins&CMD=Put"

QUERYSTR="${QS1}${QS2}${QS3}${QS4}${QS5}${QS6}${QS7}"

RESULTSTR1="FORMAT_PAGE_TARGET=Format_page_919098664&RESULTS_PAGE_TARGET=Blast_Results_for_919098664&RID=1130591558-739-35617443386.BLASTQ3"
RESULTSTR2="&SHOW_OVERVIEW=on&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=${format}&MASK_CHAR=0&MASK_COLOR=0"
RESULTSTR3="&DESCRIPTIONS=100&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY=&FORMAT_ENTREZ_QUERY=All+organisms&EXPECT_LOW="
RESULTSTR4="&EXPECT_HIGH=&RID=1130591558-739-35617443386.BLASTQ3&RTOE=10&CLIENT=web&FORMAT_OBJECT=Alignment&CMD=Get&PAGE=Proteins&_PGR=0"
RESULTSTR5="&PID=739&FORMAT_PAGE_TARGET=&RESULTS_PAGE_TARGET=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None&STEP_NUMBER=1&EXPECT=20000"
RESULTSTR6="&HITLIST_SIZE=100&DESCRIPTIONS=100&ALIGNMENTS=50&AUTO_FORMAT=Semiauto"
POSTURL="http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi"
LYNXCMD="lynx -width=0 -source -accept_all_cookies -dump -post_data "
# Submit the search; the reply page carries the request ID (RID).
echo "$QUERYSTR" | $LYNXCMD "${POSTURL}" >.temp_blast_0
# RID: third field of the last line in the reply that mentions RID.
VAR1=`grep RID .temp_blast_0 | tail -n 1 | awk '{print $3}'`
# Numeric suffix of the Format_page_NNN target name.
VAR2=`grep "_TARGET" .temp_blast_0 | sed -n 's/.*Format.page.\([0-9]*\).*/\1/p'`
echo "$VAR1"
echo "$VAR2"
R1="FORMAT_PAGE_TARGET=Format_page_${VAR2}&RESULTS_PAGE_TARGET=Blast_Results_for_${VAR2}&RID=${VAR1}"
R2="&SHOW_OVERVIEW=on&SHOW_LINKOUT=on&GET_SEQUENCE=on&NCBI_GI=on&FORMAT_OBJECT=Alignment&FORMAT_TYPE=${format}&MASK_CHAR=0&MASK_COLOR=0"
R3="&DESCRIPTIONS=1000&ALIGNMENTS=1000&ALIGNMENT_VIEW=Pairwise&I_THRESH=0.005&FORMAT_ENTREZ_QUERY=&FORMAT_ENTREZ_QUERY=All+organisms&EXPECT_LOW="
R4="&EXPECT_HIGH=&RID=${VAR1}&RTOE=10&CLIENT=web&FORMAT_OBJECT=Alignment&CMD=Get&PAGE=Proteins&_PGR=0"
R5="&PID=739&FORMAT_PAGE_TARGET=&RESULTS_PAGE_TARGET=&LAYOUT=TwoWindows&FORMAT_BLOCK_ON_RESPAGE=None&STEP_NUMBER=1&EXPECT=20000"
R6="&HITLIST_SIZE=1000&DESCRIPTIONS=1000&ALIGNMENTS=1000&AUTO_FORMAT=Semiauto"
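# statusline comes from the full script (not posted); fall back to a plain
# echo on stderr if it is not defined. The fallback is an assumption.
type statusline >/dev/null 2>&1 || statusline() { echo "$*" >&2; }
STATUS="WHERE_IS_POST_TEST"
# Poll until the reply no longer reports Status=WAITING, pausing 3 seconds
# between requests as NCBI asks.
until [ "${STATUS}" = "" ]
do
    sleep 3
    echo "${R1}${R2}${R3}${R4}${R5}${R6}" | $LYNXCMD "${POSTURL}" > .temp_blast_1
    if [ "$?" -ne "0" ]
    then
        echo "Failed to get "
        echo "${R1}${R2}${R3}${R4}${R5}${R6}"
    fi
    STATUS=`grep "Status=WAITING" .temp_blast_1`
    STATUSX=`grep "Status=" .temp_blast_1`
    statusline "$SEQ $STATUSX"
done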
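
The loop above keeps polling for as long as the server answers Status=WAITING. A hedged variant that gives up after a fixed number of tries could look like the sketch below; `poll_status` and its arguments are made up for illustration, and the fetch command is whatever produces the status page (the lynx pipeline above, for instance, wrapped in a function).

```shell
#!/bin/sh
# poll_status FETCH_CMD [MAX_TRIES] [DELAY]
# Calls FETCH_CMD until its output no longer contains "Status=WAITING",
# then prints that output. Returns 1 if MAX_TRIES is used up and 2 if
# FETCH_CMD itself fails. poll_status is a hypothetical helper name.
poll_status() {
    fetch_cmd="$1"
    max_tries="${2:-20}"
    delay="${3:-3}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        page=$("$fetch_cmd") || return 2
        case "$page" in
            *Status=WAITING*) ;;                       # not ready, keep polling
            *) printf '%s\n' "$page"; return 0 ;;      # anything else: done
        esac
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}
```

Bounding the number of tries means a stuck or malformed job cannot keep the script hitting the server indefinitely.
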

*************************************************************************
Mike Marchywka
EyeWonder
Instant Streaming, Infinite Results

1447 Peachtree Street
9th Floor
Atlanta, GA 30309

w.678-891-2033
c. 
h.770-565-8101
mmarchywka at eyewonder.com
alt: marchywka at hotmail.com
Instant Streaming, Intelligent results.
*************************************************************************

-----Original Message-----
From: bio_bulletin_board-bounces+mmarchywka=eyewonder.com at bioinformatics.org
[mailto:bio_bulletin_board-bounces+mmarchywka=eyewonder.com at bioinformatics.org] On Behalf Of Pankaj
Sent: Thursday, March 23, 2006 05:15 AM
To: The general forum at Bioinformatics.Org
Subject: Re: [BiO BB] How to find the same proteins?

For this you can go to the NCBI BLAST page and go to BLASTP. There you can paste your
sequence and select PDB as the database to query. Just click on BLAST and you
get all sequences similar to your sequence. Filter the results to find PDB IDs >99% similar
to your protein.
Since you have 200 proteins, you can also download the NCBI database and run BLAST locally.
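
For the local route, the filtering step can be sketched as follows. The sketch assumes the hits are in BLAST's tab-separated tabular output (`-m 8` in legacy blastall, `-outfmt 6` in BLAST+), where column 1 is the query ID, column 2 the subject ID and column 3 the percent identity; `filter_identity` is a name made up for this example.

```shell
#!/bin/sh
# filter_identity MIN_PCT
# Reads tabular BLAST hits on stdin and prints "query<TAB>subject<TAB>pident"
# for hits at or above MIN_PCT percent identity, skipping self-hits.
# filter_identity is a hypothetical helper name.
filter_identity() {
    awk -F '\t' -v min="$1" '
        $1 != $2 && $3 + 0 >= min + 0 { print $1 "\t" $2 "\t" $3 }
    '
}
```

It would be used as the last stage of a pipeline, e.g. `blastall -p blastp -d pdbaa -i chains.fasta -m 8 | filter_identity 99`, with `chains.fasta` standing for your own query file. This checks identity only; the "no large gaps" condition would need an extra test on the alignment length (column 4) and gap openings (column 6).
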
Cheers 
Pankaj Khurana
Research Scholar
National Institute of Immunology
New Delhi
India

---------- Original Message -----------
From: Semen Esilevsky <yesint4 at yahoo.com>
To: bio_bulletin_board at bioinformatics.org
Sent: Thu, 23 Mar 2006 01:30:58 -0800 (PST)
Subject: [BiO BB] How to find the same proteins?

> Dear all,
> I'm a novice in bioinformatics and this question is
> probably stupid, but...
> I have a list of ~200 PDB IDs. For each of them I
> have to build a list of all entries in PDB which
> represent the same protein (say, >99% sequence
> similarity and no large gaps). Could someone suggest
> the least painful way of doing this?
> As far as I understand, all I need is a database
> where all pairwise BLAST alignments of PDB chains are
> stored. I've found one as part of the PISCES server,
> but it is incomplete and contains some internal
> inconsistencies. Could someone suggest a better one,
> or is there a simpler way out?
> 
> Best,
> Semen
> 
> _______________________________________________
> Bioinformatics.Org general forum  -  BiO_Bulletin_Board at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
------- End of Original Message -------
