[Bioclusters] NCBI database download and format code

02 May 2003 09:32:51 -0400

On Fri, 2003-05-02 at 09:16, Jeremy Mann wrote:
> > On Thu, 2003-05-01 at 18:29, Jeremy Mann wrote:
> >> I am curious if anyone knows of any commercial or open source solution to
> >> breaking up the NCBI dbs into various sizes. Here, our present
> >> solution is
> >
> > You can use the "formatdb -v N" option to have the database
> > automatically divided into groups of N x 10**6 letters.  I would
> > recommend this route for the database formatting side.  Keep the
> > original db around for the other tools.
> 
> Then how would you tell blastall which nodes have which *piece* of the
> database?

If you want to do MIMD-type processing (e.g. node 1 has db chunk 1, node
2 has db chunk 2, etc.), and have each blast job work on one chunk of
the data, you will need to create a method to:

a) distribute the relevant bits to the compute nodes in question
b) perform the actual run against the smaller data set
c) aggregate the results back to the submitter
d) reassemble the result for final presentation (optional, depending
upon how it is being used)
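
Steps a) through d) can be sketched roughly as follows. This is a minimal
illustration and not how mpiBLAST or any particular product does it. The
function names, node names, and paths are hypothetical. The volume names
(nr.00, nr.01, ...) are an assumption about what "formatdb -v" produces,
and the merge step assumes tabular output from "blastall -m 8", where
column 11 is the e-value and column 12 is the bit score.

```python
def assign_chunks(chunks, nodes):
    """Step a: decide which node holds which chunk (round-robin)."""
    plan = {node: [] for node in nodes}
    for i, chunk in enumerate(chunks):
        plan[nodes[i % len(nodes)]].append(chunk)
    return plan


def blast_command(chunk_db, query, out_path, program="blastp"):
    """Step b: argument list for one blastall run against one chunk.

    The list is meant to be handed to a job launcher or to
    subprocess.run() on the node that holds chunk_db.
    """
    return [
        "blastall",
        "-p", program,        # search program
        "-d", str(chunk_db),  # database chunk local to the node
        "-i", str(query),     # query file
        "-m", "8",            # tabular output, one hit per line
        "-o", str(out_path),  # per-chunk result file
    ]


def merge_tabular(result_texts):
    """Steps c and d: combine per-chunk tabular results.

    Hits are grouped by query id, then ordered by ascending e-value
    and descending bit score.
    """
    rows = []
    for text in result_texts:
        for line in text.splitlines():
            if line.strip() and not line.startswith("#"):
                rows.append(line.split("\t"))
    rows.sort(key=lambda r: (r[0], float(r[10]), -float(r[11])))
    return rows


if __name__ == "__main__":
    # Hypothetical chunk and node names, for illustration only.
    plan = assign_chunks(["nr.00", "nr.01", "nr.02"], ["node1", "node2"])
    for node, chunks in plan.items():
        for chunk in chunks:
            cmd = blast_command("/scratch/" + chunk, "query.fa", chunk + ".out")
            print(node, " ".join(cmd))
```

One caveat with this approach: an e-value computed against a single chunk
reflects that chunk's size, not the size of the full database. The
blastall "-z" option (effective database length) is one way to make the
per-chunk e-values comparable before merging.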

mpiBLAST will do some of this for you.  It expects the database
fragments in a shared storage location, but you could easily push the
"shared" fragments to local storage on each node before running
mpiblast.  You would then only need a modified mpiblast.conf file (an
easy change) pointing to where the fragments sit.

As for how to do this in general for NCBI BLAST (and other codes), some
of us are working on products to enable exactly this across clusters and
similar systems.  To keep this from becoming an advertisement, please
feel free to contact me off-list and I can explain more.

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615