Hello guys,

I've found an interesting post on another list that I would like to
share with the bioclusters list. For those who are beginners (as I am),
you can find some useful programs below.

David Vilanova

Post:

"David Mathog" <mathog@caltech.edu> wrote in message
news:<3C6DA633.71C3401B@caltech.edu>...
> Tired of waiting for an MPI or PVM version of BLAST? Me too. So I
> put together a little C program called "blastmerge" that does the last
> step in the following: fragment a big database (nt, nr, Ensembl etc.),
> distribute the BLAST formatted fragments across compute nodes, run the
> same query set on all of them at once, and reassemble a single
> BLAST output file from all of the pieces. Probably this has been done
> before, but I couldn't find an example - so here's mine. It's
> available from:
>
>   ftp://saf.bio.caltech.edu/pub/software/molbio/blastmerge.c
>
> Here's how you use it:
>
> 1. Fragment a FASTA file into N pieces and distribute 1 piece to
>    each of N nodes. Here's some code to do the fragmenting part; the
>    distribution part is up to you:
>
>      ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c
>
> 2. Rename each node's fragment nt_lcl (for instance) and then run
>    formatdb (from the NCBI) on each node.
> 3. Put links from the NFS shared BLAST directory (assuming you have
>    one) that is pointed to by BLASTDB to the actual nt_lcl files. If
>    you don't have such a directory, let BLASTDB point to
>    /usr/local/databases.
> 4. On all nodes "N" run a BLAST job like:
>
>      blastall -p blastn -d nt_lcl -i dmmrna.nfa -e .000001 >dna0${N}
>
>    where N might vary from 0 to 8.
> 5. When they all finish, reassemble the output like this:
>
>      blastmerge outfile dna00 dna01 (etc.) dna08
>
> The trick of course is to get the database fragments small enough so
> that they stay in cache. Even if you can't manage that, the IO off of
> N disks will be N times faster than off of 1.
>
> Caveats.
>
> A. Limited testing, caveat emptor, first release etc.
> B. There is a limit of 10 input files. To do more than that, run
>    batches of 10 at a time and then merge those results.
>
> And for good measure, here's one more program you can use for running
> a large number of queries against a small NFS served BLAST database.
> In that case getting to the "one big output file" just requires
> concatenating the output files.
>
>   ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c
>
> Use it like:
>
>   (on node 0)
>   fastarange 1 1000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
>   (on node 1)
>   fastarange 1001 2000 <queries | /usr/common/bin/blastall -p blastp -d small (options)
>   etc.
>
> Have fun and please report any bugs that turn up.
>
> Regards,
>
> David Mathog
> mathog@caltech.edu
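
For anyone who wants to see the idea behind step 1 (fragmenting a FASTA file into N pieces) before compiling fastasplitn.c, here is a minimal Python sketch. It is not the fastasplitn algorithm, which I have not reproduced here. It assumes records are dealt round-robin into the fragments, and the names fasta_records and split_fasta are hypothetical, made up for this illustration.

```python
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines, header line first."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line.rstrip("\n"))
    if record:
        yield record


def split_fasta(lines: Iterable[str], n: int) -> List[List[str]]:
    """Deal records round-robin into n fragments (an assumed policy)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    fragments: List[List[str]] = [[] for _ in range(n)]
    for i, record in enumerate(fasta_records(lines)):
        fragments[i % n].extend(record)
    return fragments


if __name__ == "__main__":
    demo = [">seq1", "ACGT", ">seq2", "GGCC", "TTAA", ">seq3", "AAAA"]
    for k, fragment in enumerate(split_fasta(demo, 2)):
        print(f"fragment {k}: {fragment}")
```

Round-robin keeps the number of records per fragment nearly equal, but not the number of residues, so fragments can still differ in size when sequence lengths vary a lot. Each fragment would then be written to its own file and copied to a node before running formatdb there.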
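
The per-node ranges in the fastarange example (1 to 1000 on node 0, 1001 to 2000 on node 1, etc.) follow a simple pattern that can be computed rather than typed by hand. This is only a sketch: it assumes fixed-size blocks of queries, nodes numbered from 0, and 1-based inclusive ranges as the example in the post suggests. node_range and fastarange_command are hypothetical helper names, and the command string just mirrors the one in the post with the "(options)" placeholder left out.

```python
from typing import Tuple


def node_range(node: int, per_node: int, total: int) -> Tuple[int, int]:
    """Return the 1-based, inclusive (first, last) query numbers for a node."""
    if node < 0 or per_node < 1:
        raise ValueError("node must be >= 0 and per_node >= 1")
    first = node * per_node + 1
    if first > total:
        raise ValueError("this node has no queries left to run")
    # The last node may get a short block.
    last = min((node + 1) * per_node, total)
    return first, last


def fastarange_command(node: int, per_node: int, total: int) -> str:
    """Build the pipeline from the post for one node (paths as in the post)."""
    first, last = node_range(node, per_node, total)
    return (
        f"fastarange {first} {last} <queries | "
        "/usr/common/bin/blastall -p blastp -d small"
    )


if __name__ == "__main__":
    # Example: 2500 queries, 1000 per node, so three nodes are needed.
    for node in range(3):
        print(f"(on node {node})")
        print(fastarange_command(node, 1000, 2500))
```

Since every node searches the same small database, the per-node output files can simply be concatenated in node order to get the "one big output file".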