[Bioclusters] Daemonizing blast, ie running many sequences through 1 process

Joe Landman bioclusters@bioinformatics.org
Fri, 07 Nov 2003 09:03:52 -0500


Hi Michael:

Michael.James@csiro.au wrote:

>We have a problem with 66 nodes becoming NFS bound
> when blasting many (>10,000) sequences
> against the same database set.
>
>One approach (which we are trying) is to cache database files locally,
> so nodes can re-read their files without bottlenecking on NFS.
>  
>

This is a good idea.  Local reads will be faster than remote reads most 
of the time on modern hardware.  Moreover, if you are hitting your NFS 
server that hard (run a 'vmstat 1' on it while doing the run and 
compare the "bi", "in", and "cs" columns against a "quiescent" state 
with no runs going), the bottleneck is likely not the compute nodes 
themselves, but the server side: its network port, NFS daemons, wire 
bandwidth, or overall NFS throughput.  You can tune the number of NFS 
daemons (under RedHat, look in the /etc/init.d/nfs file for the 
variable indicating the number of nfsd to start), and this might help a 
little.  However, moving the data to the local machines is likely to 
have a greater impact.  Consider also database segmentation (use the 
formatdb "-v" switch).  You want the database volumes small enough not 
to overflow your compute node memory.
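As a sketch of the diagnosis and tuning above (paths and variable names
are the Red Hat conventions of this era; check your own distribution):

```shell
# On the NFS server, watch a loaded run and compare against a quiescent
# baseline.  The "bi" (blocks in), "in" (interrupts), and "cs" (context
# switches) columns show whether the server itself is saturating.
vmstat 1

# Find where the nfsd thread count is set so you can raise it; on
# Red Hat systems of this vintage it is the RPCNFSDCOUNT variable in
# the init script or /etc/sysconfig/nfs (default was 8 threads).
grep -n RPCNFSDCOUNT /etc/init.d/nfs /etc/sysconfig/nfs
```

These are live-server diagnostics, so run them during an actual BLAST
sweep to see the contrast with the idle state.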

>A totally different approach, with even better performance potential,
> would be if a blast process could start up, load its database(s)
> and process multiple queries until told to exit.
>  
>
BLAST was not designed for this, but it can be coaxed into something 
like it.  The "telling it to exit" bit is done through an EOF on the 
input file: one BLAST process keeps reading queries until it hits 
end-of-file.  You would need to partition your input files, submit 
jobs which transferred only the relevant sequence data with them, and 
then perform the BLASTs.  You would then have to reassemble the output 
files.  This is what I had done with ctblast (SGI GenomeCluster) and 
MSC.LIFE a number of years ago.
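The partition / run / reassemble cycle can be sketched as below.  The
toy input stands in for your real query set, and the blastall line is
shown commented because it needs a real database on the node; the
program and database names there are illustrative:

```shell
#!/bin/sh
# Toy multi-FASTA input (stands in for >10,000 real queries).
cat > queries.fa <<'EOF'
>seq1
ACGT
>seq2
GGCC
>seq3
TTAA
>seq4
CCGG
>seq5
ATAT
EOF

# Split on '>' record boundaries, 2 sequences per chunk, so no
# sequence is ever cut in half.
awk -v n=2 '/^>/ { if (seq % n == 0) file = sprintf("chunk_%04d.fa", seq / n); seq++ }
            { print > file }' queries.fa

# Each chunk would be shipped to a node and run there, e.g.:
#   blastall -p blastn -d mydb -i chunk_0000.fa -o chunk_0000.out
# One BLAST process handles the whole chunk, since it reads queries
# until EOF -- this is the "daemonized" behavior, amortizing startup.

# Reassemble in chunk order (shown here on the inputs as a stand-in
# for concatenating the per-chunk BLAST reports).
cat chunk_*.fa > reassembled.fa
```

A real run would hand each chunk to the batch scheduler rather than
loop over them locally, but the split and reassembly logic is the same.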

To do this yourself (if you wanted to hack it) you would need a few 
tools.  First, input data segmentation (see 
http://scalableinformatics.com/downloads/segment.pl).  Second, database 
segmentation via the formatdb -v X switch.  Third, some sort of job 
scheduler and output reassembly interface.
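For the database segmentation step, a typical formatdb invocation
looks like the following (the database name and volume size are
illustrative; size the volumes to your nodes' memory):

```shell
# Format a nucleotide database (-p F) into volumes of at most 2000
# million letters each (-v), so each volume fits in node memory;
# -o T parses SeqIds for later retrieval.
formatdb -i nt -p F -o T -v 2000
```

formatdb then writes a series of numbered volume files plus an alias
file that lets BLAST treat them as one database.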

I would recommend also looking at mpiBLAST (http://mpiblast.lanl.gov) as 
Aaron Darling did most of this stuff for you.  Moreover, you can grab 
the run_mpiblast (http://scalableinformatics.com/run_mpiblast.html) tool 
from my downloads page (http://scalableinformatics.com/downloads) to 
help you use mpiBLAST on your systems. 

There are other options for this.  Most of them would need to be 
purchased.  None of them are "cheap", including the tools we are working on.

Joe

>This dilutes the startup cost across all the jobs to be run on that node.
>
>Does NCBI blast do this?
>Is there a blast that does?
>Anyone interested in writing one?
>What's involved?
>
>Thanks for any pointers,
>michaelj
>
>  
>

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615