[Bioclusters] blastmerge available

David Vilanova bioclusters@bioinformatics.org
Tue, 12 Mar 2002 09:12:22 +0100


Hello guys,
I found an interesting post on another list that I would like to
share with the bioclusters list. For those who are beginners (as I am),
some useful scripts are described below.

David Vilanova


Post....

"David Mathog" <mathog@caltech.edu> wrote in message
news:<3C6DA633.71C3401B@caltech.edu>...
> Tired of waiting for an MPI or PVM version of BLAST?  Me too.  So I
> put together a little C program called "blastmerge" that does the last
> step in the following: fragment a big database (nt, nr, Ensembl etc.),
> distribute the BLAST formatted fragments across compute nodes, run the
> same query set on all of them at once, reassemble a single
> blast output file from all of the pieces.  Probably this has been done
> before but I couldn't find an example - so here's mine.  It's available from:
> 
>   ftp://saf.bio.caltech.edu/pub/software/molbio/blastmerge.c
> 
> Here's how you use it:
> 
> 1.  Fragment a fasta file into N pieces and distribute 1 piece to
>     each of N nodes. Here's some code to do the fragmenting part; the
>     distribution part is up to you:
> 
>       ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c
> 
> 2.  Rename each node's fragment nt_lcl (for instance) and then run
>     formatdb (from the NCBI) on each node.
> 3.  Put links from the NFS shared BLAST directory (assuming you have
>     one) that is pointed to by BLASTDB to the actual nt_lcl files.
>     If you don't have such a directory let BLASTDB point to
>     /usr/local/databases.
> 4.  On all nodes "N" run a BLAST job like:
>     blastall -p blastn -d nt_lcl -i dmmrna.nfa -e .000001 >dna0${N}
>     where N might vary from 0 to 8.
> 5.  When they all finish reassemble the output like this:
>     blastmerge outfile dna00 dna01 (etc.) dna08
> 
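
For readers who want to see what the fragmenting in step 1 involves, here
is a minimal Python sketch. It is not fastasplitn.c and does not claim to
reproduce its behaviour; the round-robin assignment of records to pieces
and the name split_fasta are assumptions made for illustration only.

```python
import io

def split_fasta(handle, n):
    """Distribute FASTA records from `handle` round-robin into n text pieces."""
    pieces = [[] for _ in range(n)]
    index = -1  # number of the record currently being read (0-based)
    for line in handle:
        if line.startswith(">"):
            index += 1  # a header line starts a new record
        if index >= 0:  # ignore anything before the first header
            pieces[index % n].append(line)
    return ["".join(p) for p in pieces]

# Tiny made-up input with three records, split into two fragments.
demo = io.StringIO(">a\nACGT\n>b\nGG\nTT\n>c\nAAAA\n")
fragments = split_fasta(demo, 2)
for i, frag in enumerate(fragments):
    print(f"--- fragment {i} ---")
    print(frag, end="")
```

Round-robin keeps the record counts per piece within one of each other,
but it does not balance sequence lengths; a splitter that balances by
residue count would be a reasonable alternative.
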
> The trick of course is to get the database fragments small enough so 
> that they stay in cache.  Even if you can't manage that the IO off of 
> N disks will be N times faster than off of 1.
> 
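
The "small enough to stay in cache" remark can be turned into a rough
estimate of how many fragments are needed. The sketch below assumes that
"cache" means the operating system's file cache in RAM, and the sizes used
are hypothetical figures, not measurements of any real database or node.

```python
def fragments_needed(db_bytes, cache_bytes_per_node):
    """Smallest N such that db_bytes / N fits within cache_bytes_per_node."""
    # Integer ceiling division, avoids floating point rounding.
    return -(-db_bytes // cache_bytes_per_node)

MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical: a 6 GiB formatted database, 768 MiB of RAM per node
# left over for the file cache.
n = fragments_needed(6 * GIB, 768 * MIB)
print(n)
```

The size that matters here is that of the files formatdb produces on each
node, which is not the same as the size of the FASTA fragment.
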
> Caveats.
> 
> A.  limited testing, caveat emptor, first release etc.
> B.  There is a limit of 10 input files.  To do more than that run batches
>     of 10 at a time and then merge those results.
> 
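
The batching described in caveat B can be planned mechanically. The Python
sketch below only builds the command lines, using the
"blastmerge outfile input1 input2 ..." form shown in step 5; it does not
run anything. The intermediate file names (merge_L0_00 and so on) and the
function name plan_merges are hypothetical.

```python
def plan_merges(inputs, final_out, limit=10):
    """Return blastmerge command lines that merge `inputs` into `final_out`
    without passing more than `limit` input files to any one invocation."""
    commands = []
    current = list(inputs)
    level = 0
    while len(current) > limit:
        merged = []
        for i in range(0, len(current), limit):
            batch = current[i:i + limit]
            if len(batch) == 1:
                merged.append(batch[0])  # nothing to merge, carry it forward
                continue
            out = f"merge_L{level}_{i // limit:02d}"  # hypothetical name
            commands.append(["blastmerge", out] + batch)
            merged.append(out)
        current = merged
        level += 1
    commands.append(["blastmerge", final_out] + current)
    return commands

# 25 per-node outputs named as in step 4: dna00 ... dna24
files = [f"dna{n:02d}" for n in range(25)]
plan = plan_merges(files, "outfile")
for cmd in plan:
    print(" ".join(cmd))
```

With 25 inputs this gives three first-level merges (10, 10 and 5 files)
followed by one final merge of the three intermediate results.
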
> And for good measure, here's one more program you can use for running 
> a large number of queries against a small NFS served BLAST database.  
> In that case getting to the "one big output file" just requires 
> concatenating the output files.
> 
>   ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c
> 
> use it like:
>   (on node 0)
>   fastarange 1    1000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
>   (on node 1)
>   fastarange 1001 2000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
>   etc.
> 
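
To illustrate the idea behind selecting a range of queries, here is a
minimal Python sketch. It is not fastarange.c; it assumes the two numbers
are 1-based, inclusive record positions (as the "1 1000" and "1001 2000"
examples suggest), and the name fasta_range is made up for illustration.

```python
import io

def fasta_range(handle, first, last):
    """Yield the lines of FASTA records numbered first..last (1-based, inclusive)."""
    record = 0
    for line in handle:
        if line.startswith(">"):
            record += 1
            if record > last:
                break  # past the requested range, stop reading
        if first <= record <= last:
            yield line

# Tiny made-up query set with four records; pick records 2 and 3.
queries = io.StringIO(">q1\nMK\n>q2\nMA\nLL\n>q3\nMT\n>q4\nMS\n")
subset = "".join(fasta_range(queries, 2, 3))
print(subset, end="")
```

Because each node reads the same query file and keeps only its own slice,
no per-node query files need to be prepared in advance.
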
> Have fun and please report any bugs that turn up.
> 
> Regards,
> 
> David Mathog
> mathog@caltech.edu