[BiO BB] looking for conserved domain downloadable databases.

Mike Marchywka marchywka at hotmail.com
Sat Jul 26 15:30:13 EDT 2008

I am trying to find something like the PROSITE rules database that might include more conserved domains.
That is, I have a bunch of short peptides and I want to determine whether any of them have functional
significance. I would imagine that function-prediction servers have such a database, but probably
not in downloadable form. In particular, I took about 3000 short sequences that have something to
do with cell cycle arrest (eutilsnew is my own script, but you get the idea):
eutilsnew -protein -v -out stuff '"cell cycle" arrest'
$progpath/file_parsing -fastas stuff  stuff_fasta

I have a way to get the most frequently occurring short strings. In this case I got some interesting hits
(and also found that "M" occurs at the start quite often, which adds some confidence that the code is running correctly):

$progpath/string_test -fastas stuff_fasta -status -conserved | grep '[A-Z]' | sort -g -r -k 2 > cca_roots
$ head cca_roots
   M 2321
PENL 565
   L 545
FENL 461
YENL 458
   F 456
   W 455
WENL 454
  MS 425
RSPS 396
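
For readers without access to string_test, the counting idea can be approximated with a few lines of Python. This is only a sketch under stated assumptions: it is not the string_test algorithm (which evidently reports strings of varying length), it counts fixed-length 4-mers by the number of sequences that contain them, and the function names and the toy FASTA records are made up for illustration.

```python
from collections import Counter


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_sequence_counts(seqs, k=4):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        # A set ensures a k-mer repeated within one sequence is counted once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


# Toy records (hypothetical, not taken from the real download).
toy_fasta = """>a
MKPENLAA
>b
MSFENLGG
>c
MAAPENLK
"""

seqs = [seq for _, seq in read_fasta(toy_fasta)]
counts = kmer_sequence_counts(seqs, k=4)
for kmer, n in counts.most_common(5):
    print(kmer, n)
```

Counting per sequence rather than per occurrence keeps one repetitive protein from dominating the list, which seems closer to what "conserved" should mean here.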

In any case, I wanted to see whether the regular expression [PFYW]ENL means anything.
First, I got a control group
(I only fetched about the first 1500 records, using ctrl-c to cut the download short),

eutilsnew -v -protein -out some_hydo "hydroxylase"

$ head hydro_roots
   M 1418
GDAA 312
GAGL 308
DAAH 299
AGLL 283
GLLS 266
IGLA 263
PVAG 258
LLSS 253
AGQG 253
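
One way to make the comparison with the control group concrete is to compute the fraction of sequences in each set that contain [PFYW]ENL. The sketch below assumes the sequences are already loaded as plain strings; the two short lists are invented toy data, not the real cell cycle arrest or hydroxylase sets, and fraction_with_motif is a hypothetical helper name.

```python
import re

# The motif under test, written as an ordinary regular expression.
MOTIF = re.compile(r"[PFYW]ENL")


def fraction_with_motif(seqs, pattern=MOTIF):
    """Return the fraction of sequences with at least one motif match."""
    if not seqs:
        return 0.0
    hits = sum(1 for seq in seqs if pattern.search(seq))
    return hits / len(seqs)


# Invented toy sequences standing in for the two downloaded sets.
arrest = ["MKPENLAA", "MSFENLGG", "MAAGDAAH", "MWENLSPS"]
control = ["MGDAAHLL", "MAGLLSSK", "MPVAGQGA", "MKYENLIG"]

f_test = fraction_with_motif(arrest)
f_ctrl = fraction_with_motif(control)
# Guard against division by zero when the control set has no hits at all.
enrichment = f_test / f_ctrl if f_ctrl else float("inf")
print(f_test, f_ctrl, enrichment)
```

A ratio well above 1 on the real data would suggest the motif is specific to the test set rather than a general feature of proteins, although with keyword-based downloads the two sets may also differ in species or family composition.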

The PROSITE rule list that I have shows some "ENL" candidates explicitly (none of which
has P, F, Y, or W as the leading residue) and maybe more that
are more cryptic,

$ grep ENL /cygdrive/c/mydocs/scripts/cc/affx/prosite_rules
P.{2}[LIVMF]{2}[LIVMS].[GDN].{3}[DENL].{3}[LIVM].E.{4}[GNQKRH][LIVM][AP]>rule|216|PEPDTIDE Prosite RIBOSOMAL_S2_2
K[LIVMF]DG[LIVMAS][SAG].{4}Y.{2}[GRD].[LF].{4}[ST]RG[DN]G.{2}G[DE][DENL]>rule|832|PEPDTIDE Prosite DNA_LIGASE_N1
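
Note that both hits above match only because the letters E, N, L happen to be adjacent inside the character class [DENL], so neither rule actually requires the literal tripeptide. A small filter can separate the two cases. This is a sketch that assumes each rule line has the form "regex>rule|id|description", as the two lines above suggest; the third rule in the example is hypothetical and exists only to show a positive case.

```python
import re


def literal_part(pattern):
    """Replace every [...] character class with '.', keeping literal residues."""
    return re.sub(r"\[[^\]]*\]", ".", pattern)


def has_literal_motif(rule_line, motif="ENL"):
    """True if the motif appears in the rule's regex outside any character class."""
    pattern = rule_line.split(">", 1)[0]
    return motif in literal_part(pattern)


rules = [
    r"P.{2}[LIVMF]{2}[LIVMS].[GDN].{3}[DENL].{3}[LIVM].E.{4}[GNQKRH][LIVM][AP]>rule|216|PEPDTIDE Prosite RIBOSOMAL_S2_2",
    r"K[LIVMF]DG[LIVMAS][SAG].{4}Y.{2}[GRD].[LF].{4}[ST]RG[DN]G.{2}G[DE][DENL]>rule|832|PEPDTIDE Prosite DNA_LIGASE_N1",
    r"A.ENL[ST]>rule|0|hypothetical example, not a real PROSITE entry",
]

for line in rules:
    print(has_literal_motif(line), line.split("|")[-1])
```

This only finds rules that spell the motif out literally. Rules that could match ENL through classes or wildcards (the "more cryptic" ones) would need each regex to be run against peptides containing the motif instead.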

but that is all I have to go on. I took a quick look at NCBI CDART and the related Pfam resources but
couldn't figure out how to download anything useful. I couldn't immediately get BLAST to return
any hits on "ENL", and I'm not sure which parameters I'd need to tweak to search with such short queries.


