[Bioclusters] blastmerge available
David Vilanova
bioclusters@bioinformatics.org
Tue, 12 Mar 2002 09:12:22 +0100
Hello guys,
I found an interesting post on another list that I would like to
share with the bioclusters list. For those who are beginners (as I am),
some useful scripts are described below.
David Vilanova
Post....
"David Mathog" <mathog@caltech.edu> wrote in message
news:<3C6DA633.71C3401B@caltech.edu>...
> Tired of waiting for an MPI or PVM version of BLAST? Me too. So I
> put together a little C program called "blastmerge" that does the last
> step in the following: fragment a big database (nt, nr, Ensembl etc.),
> distribute the BLAST formatted fragments across compute nodes, run the
> same query set on all of them at once, and reassemble a single
> BLAST output file from all of the pieces. Probably this has been done
> before, but I couldn't find an example, so here's mine. It's available
> from:
>
> ftp://saf.bio.caltech.edu/pub/software/molbio/blastmerge.c
>
> Here's how you use it:
>
> 1. Fragment a FASTA file into N pieces and distribute 1 piece to
> each of N nodes. Here's some code to do the fragmenting part; the
> distribution part is up to you:
>
> ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c
>
> 2. Rename each node's fragment nt_lcl (for instance) and then run
> formatdb (from the NCBI) on each node.
> 3. Put links from the NFS shared BLAST directory (assuming you have
> one) that is pointed to by BLASTDB to the actual nt_lcl files. If
> you don't have such a shared directory, let BLASTDB point to
> /usr/local/databases.
> 4. On all nodes "N" run a BLAST job like:
> blastall -p blastn -d nt_lcl -i dmmrna.nfa -e .000001 >dna0${N}
> where N might vary from 0 to 8.
> 5. When they all finish, reassemble the output like this:
> blastmerge outfile dna00 dna01 (etc.) dna08
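
As an illustration of the fragmenting in step 1, here is a minimal Python sketch. It is not fastasplitn and makes no claim about how that program chooses its pieces; the function name split_fasta and the round-robin strategy are assumptions made for this example. It deals whole FASTA records (a ">" header line plus its sequence lines) out to N pieces in turn.

```python
def split_fasta(lines, n):
    """Distribute FASTA records round-robin into n lists of lines.

    Hypothetical helper for illustration; not the algorithm of fastasplitn.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    pieces = [[] for _ in range(n)]
    record = -1  # index of the current record; -1 until the first header
    for line in lines:
        if line.startswith(">"):
            record += 1  # a header line starts a new record
        if record < 0:
            continue  # ignore anything before the first header
        pieces[record % n].append(line)
    return pieces
```

Called as split_fasta(open("nt"), 9), it would return nine lists of lines to write out as one fragment per node. Round-robin evens out the number of records per piece, not the number of residues, so fragments can still differ in size.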
>
> The trick, of course, is to get the database fragments small enough
> that they stay in cache. Even if you can't manage that, the I/O off of
> N disks will be up to N times faster than off of 1.
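
A rough way to pick N for the "stay in cache" goal is to divide the total database size by the memory each node can spare for caching. The Python sketch below only does that arithmetic; the function name and the sizes in the usage note are hypothetical, and the size that matters is that of the formatted database files, which may differ from the size of the FASTA file.

```python
def fragments_needed(db_bytes, cache_bytes_per_node):
    """Smallest number of equal fragments that each fit in one node's cache.

    Both arguments are byte counts supplied by the caller (assumptions,
    not values taken from the original post).
    """
    if db_bytes < 0 or cache_bytes_per_node <= 0:
        raise ValueError("sizes must be non-negative and cache size positive")
    # Integer ceiling division, with a minimum of one fragment.
    return max(1, -(-db_bytes // cache_bytes_per_node))
```

For example, a 10 GiB database with 1 GiB of spare memory per node would call for fragments_needed(10 * 2**30, 2**30) fragments, i.e. at least 10 nodes.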
>
> Caveats.
>
> A. Limited testing, caveat emptor, first release etc.
> B. There is a limit of 10 input files. To do more than that, run batches
> of 10 at a time and then merge those results.
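
The batching described in caveat B can be planned mechanically. The Python sketch below only builds the command lines and does not run anything. It assumes the "blastmerge outfile input1 input2 ..." argument order shown in step 5; the intermediate file names (merge_tmp_...) and the function name merge_plan are made up for this example.

```python
def merge_plan(inputs, final="outfile", limit=10, prefix="merge_tmp"):
    """Return blastmerge command lines that merge inputs in batches.

    Assumes the 'blastmerge outfile in1 in2 ...' form from the post;
    intermediate names are hypothetical.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    current = list(inputs)
    if not current:
        raise ValueError("no input files given")
    commands = []
    round_no = 0
    while len(current) > limit:
        merged = []
        for i in range(0, len(current), limit):
            batch = current[i:i + limit]
            if len(batch) == 1:
                merged.append(batch[0])  # a lone file is carried forward
                continue
            out = f"{prefix}_{round_no}_{i // limit}"
            commands.append(["blastmerge", out] + batch)
            merged.append(out)
        current = merged
        round_no += 1
    # Final pass: at most `limit` files remain.
    commands.append(["blastmerge", final] + current)
    return commands
```

With 25 per-node output files this yields three batch merges (10, 10 and 5 inputs) followed by one final merge of the three intermediate files.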
>
> And for good measure, here's one more program you can use for running
> a large number of queries against a small NFS served BLAST database.
> In that case getting to the "one big output file" just requires
> concatenating the output files.
>
> ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c
>
> Use it like:
> (on node 0)
> fastarange 1 1000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
> (on node 1)
> fastarange 1001 2000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
> etc.
> etc.
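
To make the idea of slicing the query set concrete, here is a minimal Python sketch. It assumes, from the usage above, that the two arguments are 1-based, inclusive record numbers; it is not the source of fastarange, and the names fasta_range and node_ranges are hypothetical.

```python
def fasta_range(lines, first, last):
    """Yield the lines of FASTA records numbered first..last.

    Record numbers are assumed to be 1-based and inclusive.
    """
    record = 0  # number of headers seen so far
    for line in lines:
        if line.startswith(">"):
            record += 1
            if record > last:
                break  # past the requested range, stop reading
        if first <= record <= last:
            yield line


def node_ranges(total, per_node):
    """Return (first, last) ranges covering records 1..total."""
    return [(start, min(start + per_node - 1, total))
            for start in range(1, total + 1, per_node)]
```

For 2500 queries at 1000 per node, node_ranges(2500, 1000) gives the ranges to hand to nodes 0, 1 and 2, with the last node receiving the short remainder.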
>
> Have fun and please report any bugs that turn up.
>
> Regards,
>
> David Mathog
> mathog@caltech.edu