[Bioclusters] Parallel blast

Ivo Grosse bioclusters@bioinformatics.org
Fri, 07 Jun 2002 11:27:51 -0400

Joe Landman <landman@scientificappliance.com> wrote on Fri, 7 Jun 2002:

> I do not know precisely what Paracel's code does.

I don't know *precisely* what the code does either.  I only know 
*vaguely* that

- it can fragment the query sequence and also the database, and

- it recomputes the final P and E values from the set of P and E 
values obtained for the query-sequence / database fragments.  Paracel 
is proud of the fact that their final P and E values are identical 
(plus/minus epsilon) to the P and E values that would have been 
obtained by running NCBI Blast on the non-fragmented query sequence 
and the non-fragmented database.
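To illustrate the kind of recomputation involved (this is NOT Paracel's 
actual method, which I don't know), here is a minimal sketch.  It 
assumes the textbook Karlin-Altschul form E = K*m*n*exp(-lambda*S), so 
that E is proportional to the database size n, and it ignores the 
effective-length corrections that NCBI Blast applies.  The function 
names and the example sizes are made up for illustration.

```python
import math


def rescale_evalue(e_fragment, fragment_size, full_size):
    """Scale an E-value from one database fragment to the full database.

    Assumption: E is proportional to the database size (in residues),
    as in E = K*m*n*exp(-lambda*S); edge-effect corrections are ignored.
    """
    return e_fragment * full_size / fragment_size


def evalue_to_pvalue(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)


# Hypothetical numbers: a hit found in a fragment holding 1/4 of the database.
e_full = rescale_evalue(1e-5, fragment_size=250_000_000, full_size=1_000_000_000)
p_full = evalue_to_pvalue(e_full)
```

For small E the P value is almost equal to E, so the rescaling matters 
mostly for the ranking and the cutoff, not for the score itself.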

> pathological case (e.g. worst case) was something Ivo Grosse suggested
> with Chr21 vs pufferfish, where I was getting about 8x speedup on 16
> CPUs.  

If I remember correctly, Paracel's Blast had almost exactly the same 
speed, so a speed-loss of about 50% per node (8x speedup on 16 CPUs) 
seems normal for programs that also fragment the database.
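The arithmetic behind that 50% figure, as a tiny sketch (the function 
name is just for illustration):

```python
def parallel_efficiency(speedup, n_cpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_cpus


# Numbers quoted above: about 8x speedup on 16 CPUs.
eff = parallel_efficiency(8, 16)
loss_per_node = 1 - eff
```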

> work by segmenting the input query sequences, optionally segmenting the
> databases (this isnt always a performance win though),

Exactly.  I guess that when fragmenting only the query sequence, but 
not the database, the Blast throughput should scale almost linearly 
with the number of nodes, until the fileserver can no longer handle 
the output.
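A minimal sketch of query fragmentation, assuming a single long query 
(such as a chromosome) is cut into overlapping windows so that hits 
spanning a cut are not lost.  The fragment length and overlap are 
hypothetical parameters, not values used by any particular program, 
and the offsets are kept so that hit coordinates can be mapped back.

```python
def fragment_query(seq, frag_len, overlap):
    """Yield (offset, fragment) pairs covering seq with overlapping windows."""
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frag_len")
    step = frag_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + frag_len]
        if start + frag_len >= len(seq):
            break  # the last window reached the end of the sequence
        start += step


# Toy example; real fragments would be many kilobases long.
fragments = list(fragment_query("ACGTACGTAC", frag_len=4, overlap=2))

# Round-robin assignment of fragments to (hypothetical) compute nodes.
n_nodes = 2
per_node = [fragments[i::n_nodes] for i in range(n_nodes)]
```

Each node then runs Blast on its own fragments against the complete 
database, which is why no recomputation of the database size is needed 
in this case.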

The only problem with not fragmenting the database is that:

- the database may not fit into memory, or

- you may need to buy more memory for *each* of the compute nodes.  If 
you spent that money on additional nodes instead, a Blast program 
that can split the database might run faster on the larger cluster 
than a Blast program that cannot split the database runs on the 
smaller cluster.
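A back-of-the-envelope sketch of that trade-off.  All prices and the 
efficiency values are hypothetical placeholders; the model assumes 
linear scaling without database splitting and a fixed per-node 
efficiency with splitting.

```python
def compare_options(n_nodes, mem_upgrade_cost, node_cost, split_efficiency):
    """Return (throughput_no_split, throughput_split) in units of one node.

    Option A: upgrade memory on every node, do not split the database
              (assumed to scale linearly).
    Option B: spend the same budget on extra nodes and split the database
              (each node assumed to run at split_efficiency).
    """
    budget = n_nodes * mem_upgrade_cost
    extra_nodes = int(budget // node_cost)
    throughput_no_split = float(n_nodes)
    throughput_split = (n_nodes + extra_nodes) * split_efficiency
    return throughput_no_split, throughput_split


# Hypothetical prices: 500 per memory upgrade, 2000 per new node.
at_50_percent = compare_options(16, 500, 2000, split_efficiency=0.5)
at_90_percent = compare_options(16, 500, 2000, split_efficiency=0.9)
```

With these made-up prices the memory upgrade wins at 50% efficiency 
(16 vs. 10), while splitting the database wins at 90% efficiency 
(16 vs. 18), so the answer depends on how much speed the splitting 
costs.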