[Bioclusters] Gridlet test of BLAST using datagrid directories.

Chris Dwan (CCGB) bioclusters@bioinformatics.org
Mon, 2 Dec 2002 11:07:06 -0600 (CST)


> This test (quick hack) shows how one can use multiple
> computers, including spare MacOSX, Windows, and Linux
> workstations, to distribute and speed up large biosequence
> analyses, BLAST in this example.  If you can split large data
> sets into small subsets distributed to many computers, analyze
> each subset, and reassemble the subset results into a whole, you
> should be able to trade time for compute nodes.

I apologize if this is obvious or redundant.  

The scheme you've coded (split the target into N chunks, where N is
the number of compute nodes available, ship one chunk to each
compute node, and then assemble results at the end), may well reduce
response time on any single query.  It is also highly susceptible to
transient or permanent failures in the network, compute node, or
re-assembly stages.  It also adds a great deal of overhead to each of
the jobs.  The fact that the target needs to be re-formatted
every time we gain or lose a compute node seems particularly iffy.
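For illustration only, here is a minimal sketch of the splitting step in that scheme: round-robin the FASTA records of the target into N chunks, one per compute node.  The function name and the chunking policy are my own invention, not the code being discussed; a real deployment would also run formatdb on each chunk, ship it out, and merge the per-chunk hit lists afterward.

```python
def split_fasta(fasta_text, n_chunks):
    """Round-robin FASTA records into n_chunks pieces of text.

    A stand-in for the 'split the target into N chunks' step;
    in practice each chunk would be formatted and shipped to a node.
    """
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        records.append("\n".join(current))
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return ["\n".join(c) for c in chunks]
```

Note that N is baked in at split time, which is exactly the fragility mentioned above: gain or lose a node and the whole target must be re-split and re-formatted.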

A simpler model (used by many of the folks on this list) exploits the
parallelism between jobs, rather than within each query:  Any single
search is run on a single computational node.  A queuing system of
your choice is used to schedule jobs onto nodes, manage transient and
permanent failures, stage data, and all that other neat stuff.  Some
sites even share jobs between clusters, idle workstations, and
servers by establishing a common repository of larger, un-split
targets. 
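As a rough sketch of this simpler model (again hypothetical, with a thread pool standing in for a real batch queue such as SGE, PBS, or Condor): each job is one whole query run against the full, un-split target, so a failed job can simply be re-queued with no re-splitting.

```python
from concurrent.futures import ThreadPoolExecutor

def run_blast(query):
    # Placeholder for invoking BLAST on a full, un-split target
    # (e.g. a blastall command line); here we just tag the query
    # so the scheduling flow is demonstrable.
    return f"result:{query}"

def schedule(queries, n_workers=4):
    # The executor stands in for the queuing system: it hands one
    # whole search to one node at a time and collects the results.
    with ThreadPoolExecutor(n_workers) as ex:
        return list(ex.map(run_blast, queries))
```

Because no job depends on any other, adding or removing workers changes only throughput, not correctness.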

The only objection I've run into with this simpler scheme is that it's
not "grid" enough.

-Chris Dwan
 Center for Computational Genomics and Bioinformatics
 University of Minnesota