[Bioclusters] Condor cluster and BLAST

Wed Jan 25 09:51:40 EST 2006

Sandy,

First the nitty gritty:
--------------------------
* I recommend using the binary downloads at
ftp://ftp.ncbi.nih.gov/blast/.  If computation is ever the limiting
factor on your system, then switch over to custom binaries.

* Follow the README with regard to .ncbirc files and the location of
substitution matrices.

* I usually install binaries and BLAST targets on an NFS shared  
directory.  This saves me the trouble of updating binaries on all the  
nodes if anything changes.  When access to the datasets becomes the
performance-limiting factor (if it ever does), I rig a system as
described below.

* Pull down a couple of pre-formatted targets from NCBI
(ftp://ftp.ncbi.nih.gov/blast/db) to demonstrate functionality.  Then
schedule a conversation with your users about what target sets they  
actually want.

* If response time on single queries is ever the limiting factor on  
your system, there are many parallel BLAST solutions available.  If  
it becomes something that people are willing to spend money on, there  
are also some really impressive hardware accelerators out there.   
Don't worry about either of these unless you have a demonstrated need  
for them.
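
As an illustration of the .ncbirc point above, here is a minimal sketch
in Python that writes such a file.  The [NCBI] section with a Data entry
is the layout I remember from the BLAST README; treat it as an
assumption and defer to the README that ships with your binaries.  The
function name and the /shared/blast/data path are made up for the
example.

```python
from pathlib import Path


def write_ncbirc(data_dir, home):
    """Write a minimal .ncbirc in `home` that points at `data_dir`.

    Assumption: the README's layout is an [NCBI] section with a Data
    entry naming the directory that holds the substitution matrices.
    """
    rc = Path(home) / ".ncbirc"
    rc.write_text(f"[NCBI]\nData={data_dir}\n")
    return rc
```

Calling write_ncbirc("/shared/blast/data", Path.home()) would drop the
file in your home directory.  With binaries and data on NFS, the same
path works from every node.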

More detail:
-----------------
Installing and tuning BLAST is a very broad question with lots of  
history and strong opinions surrounding it.  Here are some general  
thoughts:

BLAST is I/O bound on large target sets.  The very best thing you can  
do to improve BLAST performance is to make sure that you have  
sufficient RAM on each compute node to hold the index files for your  
target sets.  Second to that, fast local disk on the nodes is a big  
help.  I've had great luck with software RAID across two internal disks.
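
A rough way to check the RAM point is to compare the on-disk size of a
formatted target against the physical memory on a node.  The sketch
below makes some assumptions: it sums every file sharing the database
prefix (which overestimates what actually needs to stay cached), it
reads RAM through sysconf (Linux and most Unix systems), and the 0.75
headroom factor is an arbitrary guess, not a measured number.

```python
import glob
import os


def database_bytes(db_prefix):
    """Total size of all files named <db_prefix>.<something>."""
    pattern = glob.escape(db_prefix) + ".*"
    return sum(os.path.getsize(path) for path in glob.glob(pattern))


def physical_ram_bytes():
    """Physical RAM as reported by sysconf (not available on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(db_prefix, ram_bytes=None, headroom=0.75):
    """True if the target's files fit within a fraction of RAM.

    `headroom` is a hypothetical safety factor that leaves room for
    the operating system and the BLAST processes themselves.
    """
    if ram_bytes is None:
        ram_bytes = physical_ram_bytes()
    return database_bytes(db_prefix) <= headroom * ram_bytes
```

For example, fits_in_ram("/shared/blast/db/nt") run on a compute node
gives a quick yes or no for that target (the path is a placeholder).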

Once the above are met, the next bottleneck will be getting the  
target set from shared storage out to the nodes.  Most people who are  
building a serious BLAST farm set up some way to synchronize the  
commonly used targets out to the local disk on the nodes.  This raises
the question of how to avoid disrupting running jobs with a
data update.  For small installations, this is most simply handled  
with sociology rather than technology.
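
If you do want a technical guard, one common pattern (not necessarily
what any particular site runs) is to copy each update into a fresh
versioned directory on the node and then atomically repoint a symlink
at it.  Running jobs keep reading the old copy; new jobs pick up the
new one.  The sketch below assumes a POSIX filesystem, and the
function name, directory layout, and "current" link name are all
hypothetical.

```python
import os
import shutil
import time


def publish_targets(source_dir, local_root, version=None, link_name="current"):
    """Copy source_dir to local_root/<version>, then atomically point
    local_root/<link_name> at the new copy.  Old versions are left alone."""
    if version is None:
        version = time.strftime("%Y%m%d-%H%M%S")
    os.makedirs(local_root, exist_ok=True)
    dest = os.path.join(local_root, version)
    shutil.copytree(source_dir, dest)  # fails if this version already exists

    # Build the new link under a temporary name, then rename it over the
    # old one.  rename() replaces the link itself in a single step on POSIX.
    tmp_link = os.path.join(local_root, link_name + ".tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version, tmp_link)  # relative link inside local_root
    os.replace(tmp_link, os.path.join(local_root, link_name))
    return dest
```

Jobs should resolve the link once at startup (os.path.realpath) and use
that fixed path for the whole run.  Deciding when an old version is
safe to delete is still a conversation with your users.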

You're unlikely to achieve super-linear speedups by parallelizing  
BLAST across the nodes.  Parallel BLAST solutions are great for  
improving response time on a single query, but most users  
(particularly those with command line access to a cluster) are not  
interested in just running a single query.  The most common use case  
is the user with thousands of independent queries all to be run as a  
batch.  The most effective way to get this sort of job done is
one-job-per-CPU.  As many folks have pointed out, this is
"high-throughput" computing rather than "high-performance" per se.
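
To make the batch case concrete, here is a minimal sketch that deals a
multi-FASTA query file into chunks and writes a Condor submit
description that runs one blastall process per chunk.  Assumptions: the
chunk files live on the shared filesystem in the submit directory, the
blastall and database paths are placeholders, and the file names
(chunk_N.fa and so on) are invented for the example.

```python
def read_records(text):
    """Split FASTA text into records, one string per sequence."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])          # header starts a new record
        elif records and line.strip():
            records[-1].append(line)        # sequence line
    return ["\n".join(lines) + "\n" for lines in records]


def split_round_robin(records, n_chunks):
    """Deal records into at most n_chunks lists, dropping empty ones."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def condor_submit_text(n_chunks,
                       blastall="/shared/blast/bin/blastall",  # placeholder
                       db="/shared/blast/db/nt",               # placeholder
                       program="blastn"):
    """Submit description: one job per chunk, indexed by $(Process)."""
    return "\n".join([
        "universe = vanilla",
        f"executable = {blastall}",
        f"arguments = -p {program} -d {db} "
        "-i chunk_$(Process).fa -o chunk_$(Process).out",
        "error = chunk_$(Process).err",
        "log = blast.log",
        f"queue {n_chunks}",
    ]) + "\n"
```

Write chunk i to chunk_i.fa, save the submit text to a file, and hand
it to condor_submit.  Choosing the chunk count as a multiple of the
number of CPUs keeps every slot busy without any parallel BLAST
machinery.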

BLAST targets need to be freshened and updated on a regular basis.   
This requires some sort of agreement with the users as to their  
expectations.  If nobody plans to use the WGS dataset, that's around  
56GB of disk space and network bandwidth that can be saved.   Some  
datasets (NR, NT, etc) are published every few months with daily  
updates in between.  Others have different schedules.  Figure out  
what your users need before building a system to try to support it.

Good luck!

-Chris Dwan