[BiO BB] Protein Datatypes for function prediction

Mike Marchywka marchywka at hotmail.com
Tue Jul 24 07:50:50 EDT 2007


Also, while I was going to suggest "ab initio QM" since computing power is
increasing rapidly, don't dismiss even simple text processing. I was trying
to find some simple ways to screen some AFFX probe sets, and I thought I
would try this:

If you wanted to ask, "what is a tyrosine phosphatase?"
First, you could download all the sequences that mention the term
(a few thousand, perhaps enough to cluster):
$ eutilsnew -protein -out ptp -v "protein tyrosine phosphatase"
Extract the FASTA files:
$ /cygdrive/c/mydocs/scripts/cc/affx/file_parsing -fastas ptp ptp_fastas
Use your favorite word finder:
$ /cygdrive/c/mydocs/scripts/cc/affx/string_correlator -motif ptp_fastas 10 >some_ptp_words
And sort the words to find the most common:
$ cat some_ptp_words | awk '{print $3}' | sort | uniq -c | sort -g -r | more

2374 VCLGNICRSP
1678 FVCLGNICRSP
1182 LFVCLGNICRSP
 975 GNICRSPMAE
 904 GNICRSPTAE
 698 VLFVCLGNICRSP
[...]
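
For anyone without a word finder at hand, the counting step can be
approximated in a few lines of Python. This is only a minimal sketch and not
what string_correlator does internally: it counts fixed-length words (k = 10
here), while the listing above clearly includes longer words too. The
function names and the two toy sequences are made up for illustration.

```python
from collections import Counter


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_words(sequences, k=10):
    """Count in how many sequences each length-k word occurs."""
    counts = Counter()
    for seq in sequences:
        # Use a set so a word repeated inside one sequence counts once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


if __name__ == "__main__":
    # Toy input only, not real database entries.
    demo = """>seq1 toy example
AAVCLGNICRSPMAEK
>seq2 toy example
GGVCLGNICRSPTAEW
"""
    seqs = [s for _, s in read_fasta(demo)]
    for word, n in count_words(seqs, k=10).most_common(3):
        print(n, word)
```

Counting each word once per sequence is a design choice in this sketch: it
ranks words by how many sequences share them rather than by raw occurrences.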

I hadn't been able to check this until I got my blast script to find the CDD
database, but it does seem to predict known stuff:

blastnew -out ptp_cdd -hits 50 -cdd -expect 1000 VCLGNICRSP

$ cat ptp_cdd | more
BLASTP 2.2.16 [Mar-25-2007]


                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value
gnl|CDD|65262 pfam01451, LMWPc, Low molecular weight phosphotyro...    25   1.1
gnl|CDD|30743 COG0394, Wzb, Protein-tyrosine-phosphatase [Signal...    25   1.1
gnl|CDD|29014 cd00115, LMWPc, Low molecular weight phosphatase f...    25   1.1
gnl|CDD|47555 smart00226, LMWPc, Low molecular weight phosphatas...    24   1.9
gnl|CDD|68479 pfam04906, Tweety, Tweety. The tweety (tty) gene h...    18   105
gnl|CDD|70330 pfam06856, DUF1251, Protein of unknown function (D...    18   138
gnl|CDD|71435 pfam07999, RHSP, Retrotransposon hot spot protein....    17   306
gnl|CDD|70701 pfam07245, Phlebovirus_G2, Phlebovirus glycoprotei...    17   306
gnl|CDD|72414 pfam08996, zf-DNA_Pol, DNA Polymerase alpha zinc f...    17   400
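
If you want to separate the plausible hits from the noise automatically, a
short Python sketch can pull the score and E value out of each summary line.
It assumes each hit sits on a single line (the listing can wrap in email),
that the last two columns are the bit score and the E value, and that CDD
hit lines start with "gnl|". The cutoff of 10 is an arbitrary choice for
illustration, not something blastnew applies.

```python
def parse_hit_line(line):
    """Split one summary line into (id, description, bit score, E value)."""
    rest, score, evalue = line.rstrip().rsplit(None, 2)
    seq_id, description = rest.split(None, 1)
    if evalue.startswith("e"):
        # Older BLAST output abbreviates 1e-100 as "e-100".
        evalue = "1" + evalue
    return seq_id, description, float(score), float(evalue)


def filter_hits(lines, max_evalue=10.0):
    """Return parsed hits whose E value is at most max_evalue."""
    hits = []
    for line in lines:
        if not line.startswith("gnl|"):
            continue  # skip headers and blank lines
        hit = parse_hit_line(line)
        if hit[3] <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Three lines copied from the listing above.
    report = [
        "gnl|CDD|65262 pfam01451, LMWPc, Low molecular weight phosphotyro...    25   1.1",
        "gnl|CDD|30743 COG0394, Wzb, Protein-tyrosine-phosphatase [Signal...    25   1.1",
        "gnl|CDD|68479 pfam04906, Tweety, Tweety. The tweety (tty) gene h...    18   105",
    ]
    for seq_id, description, score, evalue in filter_hits(report):
        print(seq_id, score, evalue, description)
```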


It wouldn't be difficult to build up vocabulary lists that distinguish
different types of proteins and handle variants the same way you handle
plurals, capitalization, etc. in language processing. This wasn't my
immediate interest, but it is something to consider if you need a
quick-and-easy approach.
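
One way the vocabulary idea could look in Python is sketched below. It is an
illustration under several assumptions rather than a tested method: a
"vocabulary" is taken to be the set of length-k words seen in one class and
in no other, and a "variant" is taken to be a word of the same length with at
most one mismatch (so GNICRSPMAE and GNICRSPTAE would count as the same word,
loosely like a singular and its plural). All names and the toy sequences are
hypothetical.

```python
def words(seq, k=10):
    """Return the set of length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_vocabularies(labelled, k=10):
    """Map each class name to the words found only in that class.

    labelled maps a class name to a list of sequences.
    """
    raw = {
        name: set().union(*(words(s, k) for s in seqs))
        for name, seqs in labelled.items()
    }
    vocab = {}
    for name, own in raw.items():
        others = set().union(*(w for n, w in raw.items() if n != name))
        vocab[name] = own - others
    return vocab


def is_variant(a, b, max_mismatches=1):
    """True if a and b have equal length and differ in few positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches


def score(seq, vocab, k=10, max_mismatches=1):
    """Per class, count the query words that match a vocabulary word or variant."""
    query = words(seq, k)
    return {
        name: sum(any(is_variant(q, w, max_mismatches) for w in ws) for q in query)
        for name, ws in vocab.items()
    }


if __name__ == "__main__":
    # Toy training data; the "other" sequence is arbitrary.
    labelled = {
        "ptp": ["LFVCLGNICRSPMAEA"],
        "other": ["MKTAYIAKQRQISFVKSH"],
    }
    vocab = build_vocabularies(labelled, k=10)
    # The query carries the T variant of the motif instead of M.
    print(score("LFVCLGNICRSPTAEA", vocab, k=10))
```

The all-against-all comparison in score is quadratic and only reasonable for
small vocabularies; a real list of thousands of words would need an index.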


Mike Marchywka

>From: "Iddo Friedberg" <idoerg at gmail.com>
>Reply-To: "General Forum at Bioinformatics.Org" 
><bio_bulletin_board at bioinformatics.org>
>To: "General Forum at Bioinformatics.Org" 
><bio_bulletin_board at bioinformatics.org>
>Subject: Re: [BiO BB] Protein Datatypes for function prediction
>Date: Mon, 23 Jul 2007 15:06:44 +0200
>
>Shameless plug: read my review:
