[Bioclusters] Re: mpiBLAST Performance

bioclusters@bioinformatics.org bioclusters@bioinformatics.org
Tue, 1 Jul 2003 05:50:42 +0200 (CEST)

> From: "landman" <landman@scalableinformatics.com>
> To: bioclusters@bioinformatics.org,
> 	Joe Landman <bioclusters@bioinformatics.org>
> Subject: Re: [Bioclusters] mpiBLAST Performance
> Date: Mon, 30 Jun 2003 00:12:00 -0500
> Reply-To: bioclusters@bioinformatics.org
> I had looked into this a number of years ago for SGI GenomeCluster.  A
> colleague had noticed that he obtained much better load balance for
> parallel ClustalW using a "sort" method (making the chunks more
> uniform), than by leaving the data as found.
> I tried this with the query sequences, and found a little benefit.  I
> did not try with the database.

I tried it. It speeds up parallel blasting.

It seems to be due to the fact that in GenBank the sequences are stored
in deposit-date order, or something like that. This means you can
end up with segments that are, for example, full of bacterial sequences
with significant homology to one another. If your query is a bacterial
fragment, it will tend to have many hits in those db fragments, and some
of them will also have long alignments to your query. So the computation
on such a fragment takes longer than on the other fragments because of
the uneven hit distribution, and load balancing gets screwed up.
Randomize the db entry order and the problem goes away.
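
For illustration, the randomization step could look roughly like the
Python sketch below. It is only a sketch, assuming a plain FASTA input
small enough to hold in memory; the function names read_fasta and
shuffle_records are made up for this example and are not part of
mpiBLAST or the NCBI tools.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shuffle_records(records, seed=None):
    """Return the records in random order without modifying the input.

    Passing a seed makes the shuffle repeatable (hypothetical convenience).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    demo = [">a", "ACGT", "AC", ">b", "GG", ">c", "TTT"]
    for header, seq in shuffle_records(read_fasta(demo), seed=1):
        print(header)
        print(seq)
```

The shuffled records would then be written back out and formatted and
segmented as usual. A real database the size of GenBank would likely
need an index-based shuffle rather than loading everything into memory.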

> Some of the database entries are huge.
> These huge entries pose a problem with the alignment algorithms.  If
> there were a way one could build an approximate function that
> represents the time to calculate an alignment, you might be able to get
> creative with the subdivision.  Even then you would really need to make
> sure the scheduler was aware of the huge bubble.
> The idea is that the load balance gets shot all out of whack when one
> or two database fragments dominate the time due to excessively long
> strings.  The shuffle should try to preserve something like the length
> distribution in the entire database.  Even better would be a simple
> code to scan through the database, make approximate segments, and
> indicate how "close" to the full database distribution they are.
> Joe
> On Sun, 29 Jun 2003 23:45:43 -0400, Lucas Carey wrote
>> Has anyone looked into why there is such a large speedup when
>> shuffling the database? Does this hold for the query as well? Are you
>> just randomizing the db sequence entries?
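
Joe's idea of scanning the database, making approximate segments, and
reporting how "close" each one is to the full length distribution could
be sketched as below. This is only an illustration under stated
assumptions: the input is a list of sequence lengths in database order,
segments are contiguous runs with near-equal entry counts, and
"closeness" is measured as the total variation distance between
power-of-two length histograms. That metric and all the function names
are choices made for this example, not something from the thread.

```python
from collections import Counter


def length_histogram(lengths):
    """Fraction of sequences falling in each power-of-two length bin."""
    counts = Counter(n.bit_length() for n in lengths)
    total = len(lengths)
    return {b: c / total for b, c in counts.items()}


def distribution_distance(segment_lengths, all_lengths):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    seg = length_histogram(segment_lengths)
    full = length_histogram(all_lengths)
    bins = set(seg) | set(full)
    return 0.5 * sum(abs(seg.get(b, 0.0) - full.get(b, 0.0)) for b in bins)


def contiguous_segments(lengths, n_segments):
    """Split lengths into n_segments contiguous runs of near-equal size."""
    size, extra = divmod(len(lengths), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(lengths[start:end])
        start = end
    return segments


def report(lengths, n_segments):
    """Per segment: (entry count, share of all residues, distance)."""
    total = sum(lengths)  # assumes a non-empty database
    return [
        (len(seg), sum(seg) / total, distribution_distance(seg, lengths))
        for seg in contiguous_segments(lengths, n_segments)
    ]


if __name__ == "__main__":
    # Short entries first, huge entries last, as in an unshuffled database.
    ordered = [100] * 4 + [100000] * 4
    for row in report(ordered, 2):
        print(row)
```

A segment whose share of residues is far from 1/n_segments, or whose
distance is far from 0, is a candidate for the kind of "bubble" Joe
describes.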

Ivan Rossi - ivan@biodec.com