[Bioclusters] blastall and SGE

Chris Dwan bioclusters@bioinformatics.org
Wed, 29 Sep 2004 16:04:00 -0400


On Sep 29, 2004, at 3:22 PM, Juan Carlos Perin wrote:

> This is very disappointing considering a single G5 can search the NT 
> database in under 3 minutes, while running on multiple nodes actually 
> takes well over ten minutes.

This seems like a great opportunity to bring up the old parallel 
computing saw:

Parallelizing a computational task adds overhead.  In using multiple 
CPUs on a single problem, you almost always end up doing more work than 
you would have, had you just run the task on a single processor.  The 
parallel cost can include time spent in the scheduler, time spent 
reading files from a shared fileserver, time spent partitioning the 
target set, and the time of merging the results back together.  At 
least in BLAST, there is little to no interprocess communication to 
slow things down, thank goodness.

The classic formulation is due to Gene Amdahl, many years ago.  With 
the overhead described above added as an extra term, it looks like this:

Time to run on one CPU = serial_portion + parallel_portion
Time to run on N CPUs  = serial_portion + (parallel_portion / N)
                         + parallel_cost(N)

Total work done increases, but the time to complete any single job 
drops, as long as parallel_cost(N) stays smaller than the time saved by 
dividing the parallel portion.  Speedup is limited by the 
non-parallelizable portion of the code, in this case partitioning the 
target and merging the results.
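
A minimal Python sketch of the model above, for anyone who wants to plug 
in their own numbers.  The function names, the linear form chosen for 
parallel_cost(N), and every number used here are illustrative 
assumptions, not measurements of any real cluster:

```python
def parallel_time(serial, parallel, n_cpus, cost_per_cpu=0.0):
    """Estimated wall-clock time on n_cpus under the model above.

    cost_per_cpu is a hypothetical linear stand-in for parallel_cost(N);
    real overhead (scheduler, fileserver, merging) need not be linear.
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    # A single-CPU run pays no parallel overhead in this sketch.
    overhead = cost_per_cpu * n_cpus if n_cpus > 1 else 0.0
    return serial + parallel / n_cpus + overhead


def speedup(serial, parallel, n_cpus, cost_per_cpu=0.0):
    """Ratio of the one-CPU time to the n_cpus time."""
    one_cpu = serial + parallel
    return one_cpu / parallel_time(serial, parallel, n_cpus, cost_per_cpu)


if __name__ == "__main__":
    # Made-up job: 20 s serial, 160 s parallelizable, 5 s overhead per CPU.
    for n in (1, 2, 4, 8, 16):
        t = parallel_time(20.0, 160.0, n, cost_per_cpu=5.0)
        s = speedup(20.0, 160.0, n, cost_per_cpu=5.0)
        print(f"{n:2d} CPUs: {t:6.1f} s  speedup {s:.2f}")
```

With these made-up numbers the run on 16 CPUs comes out slower than the 
run on 8, because the overhead term grows faster than the parallel 
portion shrinks.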

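There are lots of exceptions to this rule.  The big ones are all points 
where performance as a function of problem size is discontinuous.  This 
usually happens when the memory requirements cross a hardware boundary: 
Cache -> RAM -> Disk.  A database slice that fits in RAM on each node, 
for example, can be searched far faster than a whole database that has 
to be read from disk.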

Any time that tasks are trivially parallel (a large batch of input 
files to be searched against the same target, for example) it will 
almost always be more efficient (in terms of CPU-minutes spent on the 
problem as a whole) to run each job as a single thread on a single CPU.  
This is easier to implement (submit a bunch of jobs to the queuing 
system), easier to tune (tune once, run everywhere), and easier to 
debug.
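
As a rough illustration of "submit a bunch of jobs to the queuing 
system", here is a Python sketch that builds one SGE qsub command per 
query file, each running a single-threaded blastall.  The helper names, 
the file naming scheme, and the choice of blastn against nt are 
assumptions for the example; adjust the options to your own site before 
using anything like it:

```python
import shlex
from pathlib import Path


def build_qsub_command(query_path, database="nt", program="blastn"):
    """Return the argument list for one single-CPU blastall job under SGE.

    Hypothetical helper: output is written next to the query file as
    <stem>.blast.out, and the job is named after the query file.
    """
    query = Path(query_path)
    output = query.with_name(query.stem + ".blast.out")
    return [
        "qsub", "-b", "y", "-cwd",       # run a binary from the current directory
        "-N", f"blast_{query.stem}",     # job name (prefix avoids a leading digit)
        "blastall", "-p", program, "-d", database,
        "-a", "1",                       # one processor per job
        "-i", str(query), "-o", str(output),
    ]


def build_batch(query_paths, database="nt", program="blastn"):
    """One independent job per query file; the scheduler does the rest."""
    return [build_qsub_command(q, database, program) for q in query_paths]


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be inspected.
    for cmd in build_batch(["queries/seq001.fa", "queries/seq002.fa"]):
        print(shlex.join(cmd))
```

Each job is independent, so a failed query can be resubmitted on its own 
without touching the rest of the batch.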

The vast majority of the users of BLAST farms are more interested in 
throughput than response time.  They have thousands of query sequences, 
and they want results for all of those queries.

There are some users who really want fast response time from BLAST.  
Most users of the NCBI BLAST server fall in this category.  Parallelized 
BLAST is for these folks.  The process of tuning a cluster to run a 
single BLAST job as fast as it possibly can is non-trivial, as lots of 
people on this list know.

So the question really comes down to "what do your users want, batch 
throughput or response time?"
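
To make the tradeoff concrete, here is a small Python sketch comparing 
the two ways of running a batch of queries on the same CPUs.  The 
function names and the timings (180 s per query on one CPU, 80 s per 
query spread over 8 CPUs) are assumptions picked for illustration, not 
measurements:

```python
import math


def batch_as_serial_jobs(n_queries, n_cpus, t_one_cpu):
    """Each query runs as a single-CPU job; n_cpus jobs run at once."""
    waves = math.ceil(n_queries / n_cpus)
    return {
        "first_result": t_one_cpu,          # response time for one query
        "makespan": waves * t_one_cpu,      # time until the whole batch is done
        "cpu_time": n_queries * t_one_cpu,  # total CPU-seconds consumed
    }


def batch_as_parallel_jobs(n_queries, n_cpus, t_n_cpus):
    """Each query is spread over all n_cpus; queries run one after another."""
    return {
        "first_result": t_n_cpus,
        "makespan": n_queries * t_n_cpus,
        "cpu_time": n_queries * n_cpus * t_n_cpus,
    }


if __name__ == "__main__":
    # Made-up batch: 1000 queries on 8 CPUs.
    serial = batch_as_serial_jobs(1000, 8, t_one_cpu=180.0)
    parallel = batch_as_parallel_jobs(1000, 8, t_n_cpus=80.0)
    for name, result in (("serial jobs", serial), ("parallel jobs", parallel)):
        print(name, result)
```

Under these assumed timings the parallel runs return the first answer 
sooner, while the single-CPU jobs finish the whole batch sooner and burn 
fewer CPU-seconds doing it.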

Chris Dwan
The BioTeam