Mike Marchywka marchywka at hotmail.com
Wed Oct 24 08:27:36 EDT 2007

>Before, you should transform the database file such that

I've taken my local BLAST databases and used their FASTA form for
"grepping" (using my own code that calls either GRETA or Boost regex)
against various genome sequences. It turns out to be too slow for repetitive
usage, but I would comment as follows.

The patterns of biological interest tend to be subsets of regex so you can
implement special code that is a lot faster when your query isn't 
For example, a "conserved" domain may look like "neutral"-many irrelevant-
cysteine-X-cysteine-many irrelevant-H- etc. (I just made this up, but it is
based on many things I've seen in the literature). You may have a hard time
blasting for this, but you can grep for it with something like

If you want a real-life example, here are some from PROSITE using my PROSITE
Perl translation scheme (I hate illustrating with real things that may not
be right):

[LIVM][VIC].[^H]G[DENQTA].[GAC][^L].[LIVMFY]{4}.{2}G >rule|16|PEPDTIDE 
[EQ][^LNYH].[ATV][FY][^LDAM][^T]W[^PG]N >rule|18|PEPDTIDE Prosite ACTININ_1
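
To make the translation step concrete, here is a minimal Python sketch of how PROSITE pattern syntax can be turned into a regex. It is not the Perl scheme mentioned above, only an illustration. The function name prosite_to_regex is made up, the '<' and '>' anchors are not handled, and the PROSITE-style string in the usage lines was reconstructed backwards from the first regex above rather than checked against PROSITE itself.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Handles x (any residue), [ABC] (allowed), {ABC} (forbidden) and
    repeat counts such as (4) or (2,5). Anchors are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional trailing repeat count, e.g. x(2) or [LIVM](4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        # [ABC] classes and single residues pass through unchanged.
        if count:
            core += "{" + count + "}"
        parts.append(core)
    return "".join(parts)

# Hypothetical input: PROSITE-style spelling guessed from the first regex above.
prosite_guess = "[LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G."
regex = prosite_to_regex(prosite_guess)
print(regex)

# Made-up peptide, only to show the compiled regex being used.
hit = re.search(prosite_to_regex("C-x(2)-C-{P}-[LIVM](4)"), "AAACGGCALLIVAA")
print(hit.start(), hit.group())
```

Because the output is an ordinary regex, it can be handed to grep -E or to any regex library without further changes.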

From what I've seen, this is too slow for grep against many genes (or
pre-translated peptides), but you can compile the query and target for much
faster searching (similar to a transient database index). Even literal string
matching can be slow doing this. I have 500k empirically discovered
(highly redundant, lots of junk) repeats that I can now label against 100
sequences of 60 kb each in "reasonable" time, which I could not do before.
This works fine for the 600 or so miRNAs I finally figured out how to
download from Sanger too :)
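
As an illustration of what a "transient index" for literal matching could look like, here is a minimal Python sketch. It is one possible approach (a k-mer seed index built over the target), not a description of my actual code. The names build_kmer_index and find_literal, and the choice of k, are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(target, k):
    """Map every k-mer in target to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index

def find_literal(query, target, index, k):
    """Return every start position of query in target (overlaps included)."""
    if len(query) < k:
        # Too short to seed from the index; fall back to a plain scan.
        return [i for i in range(len(target) - len(query) + 1)
                if target.startswith(query, i)]
    # Only positions that share the query's first k-mer can possibly match,
    # so verify just those candidates instead of scanning the whole target.
    return [i for i in index.get(query[:k], [])
            if target.startswith(query, i)]

# Toy data, made up for the example.
target = "ACGTACGTTTACG"
k = 4
index = build_kmer_index(target, k)
for repeat in ["ACGT", "CGTT", "GGGG"]:
    print(repeat, find_literal(repeat, target, index, k))
```

The index is built once per target and thrown away afterwards. Each query then costs one dictionary lookup plus a check of the few candidate positions, which is why many queries against the same target get cheaper than repeated scans.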

>You could use fgrep. Fgrep is faster than grep.
>Before, you should transform the database file such that
>each sequence takes one line without blank and without
>line breaks (using tr and sed)
>Database files are optimized for hole cards for
>historical reasons. Lines are  wrapped after at least after 72
>characters, preventing the use of fgrep.
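
For reference, the one-line-per-sequence transformation quoted above can also be done without tr and sed. The following is a minimal Python sketch, assuming a standard FASTA file in which header lines start with ">". The function name unwrap_fasta is made up for the example.

```python
def unwrap_fasta(lines):
    """Yield header lines unchanged and each sequence joined onto one line."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # drop blank lines
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)     # flush the previous sequence
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)             # flush the last sequence

# Toy input, made up for the example.
wrapped = [">seq1\n", "ACGT\n", "TTGA\n", "\n", ">seq2\n", "GG\n"]
for out in unwrap_fasta(wrapped):
    print(out)
```

Passing an open file object instead of the list works the same way, since the function only iterates over lines.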
