[Bioclusters] Daemonizing blast, ie running many sequences through 1 process
Joe Landman
bioclusters@bioinformatics.org
Fri, 07 Nov 2003 09:03:52 -0500
Hi Michael:
Michael.James@csiro.au wrote:
>We have a problem with 66 nodes becoming NFS bound
> when blasting many (>10,000) sequences
> against the same database set.
>
>One approach (which we are trying) is to cache database files locally,
> so nodes can re-read their files without bottlenecking on NFS.
>
>
This is a good idea. Local reads are faster than remote reads most of
the time on modern hardware. Moreover, if you are hitting your NFS
server that hard (run a 'vmstat 1' on it during a run and compare the
"bi", "in", and "cs" columns against a quiescent state with no runs
going), the bottleneck is likely not the compute nodes but the server
side: its network port, the NFS daemons, wire bandwidth, or overall NFS
throughput. You can tune the number of NFS daemons (under RedHat, look
in the /etc/init.d/nfs file for the variable indicating the number of
nfsd to start) and this might help a little. However, moving the data
to the local machines is likely to have a greater impact. Consider also
database segmentation (use the formatdb "-v" switch); you want each
database segment small enough not to overflow your compute node memory.
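For concreteness, here is a sketch of those diagnostics and knobs. The
daemon count and volume size below are illustrative values, not
recommendations, and the variable name is the one RedHat used at the time:

```shell
# On the NFS server, sample activity during a run and compare it with an
# idle baseline; watch the bi (blocks in), in (interrupts), and cs
# (context switches) columns:
#   vmstat 1

# Under RedHat the nfsd count is a shell variable read by the
# /etc/init.d/nfs script (illustrative value; the shipped default is small):
RPCNFSDCOUNT=16

# Segment the database so each volume fits in compute-node memory;
# formatdb's -v switch takes the volume size in millions of letters:
#   formatdb -i nr -p T -v 2000
```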
>A totally different approach, with even better performance potential,
> would be if a blast process could start up, load its database(s)
> and process multiple queries until told to exit.
>
>
BLAST was not designed for this, but it can be coaxed into something
like it. The "telling it to exit" bit is done through an EOF on the
input file: a single BLAST process loads its database once and works
through every query in that file before exiting. You would need to
partition your input files, submit jobs that transferred only that
sequence data with them, and then performed the BLAST. You would have
to reassemble the output files. This is what I had done with ctblast
(SGI GenomeCluster), and MSC.LIFE a number of years ago.
To do this you would need some tools (if you wanted to hack it
yourself): first, input data segmentation (see
http://scalableinformatics.com/downloads/segment.pl); second, database
segmentation via the formatdb -v switch; and third, some sort of job
scheduler and output reassembler interface.
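Sketched concretely, and assuming NCBI blastall against a nucleotide
database named "mydb" (the chunk size, file names, and the toy input
are all illustrative), the partition/run/reassemble cycle might look like:

```shell
#!/bin/sh
# Sketch of partition -> BLAST -> reassemble; all names are illustrative.
set -e

# Toy query file so the sketch is self-contained; use your real input here.
printf '>q1\nACGTACGT\n>q2\nTTGGCCAA\n' > queries.fa

# 1. Partition: split the multi-FASTA into chunks of $CHUNK sequences.
#    Zero-padded names keep the final cat in the original input order.
CHUNK=1
awk -v size="$CHUNK" '
    /^>/ { if (n % size == 0) { if (file) close(file)
               file = sprintf("chunk_%05d.fa", n / size) }
           n++ }
    { print > file }
' queries.fa

# 2. Run: one blastall process per chunk; each loads the database once
#    and streams every sequence in its chunk, exiting on EOF.
#    (Guarded so the sketch still runs where BLAST is not installed.)
if command -v blastall >/dev/null 2>&1; then
    for f in chunk_*.fa; do
        blastall -p blastn -d mydb -i "$f" -o "${f%.fa}.out"
    done
    # 3. Reassemble the per-chunk reports in input order.
    cat chunk_*.out > all_results.out
fi
```

In a real run each chunk would go out as its own scheduler job, shipped
with only its own sequence data; that scheduling and reassembly is the
part segment.pl plus a queueing system is meant to automate.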
I would recommend also looking at mpiBLAST (http://mpiblast.lanl.gov) as
Aaron Darling did most of this stuff for you. Moreover, you can grab
the run_mpiblast (http://scalableinformatics.com/run_mpiblast.html) tool
from my downloads page (http://scalableinformatics.com/downloads) to
help you use mpiBLAST on your systems.
There are other options for this. Most of them would need to be
purchased. None of them are "cheap", including the tools we are working on.
Joe
>This dilutes the startup cost across all the jobs to be run on that node.
>
>Does NCBI blast do this?
>Is there a blast that does?
>Anyone interested in writing one?
>What's involved?
>
>Thanks for any pointers,
>michaelj
--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615