[Bioclusters] Questions on mpiBLAST

Thu Feb 3 13:27:40 EST 2005

"Parallelizing" BLAST across cluster nodes only results in significant 
speed gains if you are trying to solve a large problem set or have a 
massive target database that in no way, shape or form can squeeze into 
physical memory on one node.

The performance of BLAST is rate-limited first by how much RAM you have 
and then by how fast your disk I/O system is.
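
As a rough sanity check on the RAM point, a sketch like the one below
compares a database size against the physical memory on a node. It is only
an illustration: the fits_in_ram and fragment_bytes helpers, the 0.8
headroom factor and the 10 GiB database size are assumptions made up for
the example, not anything BLAST or mpiBLAST provides.

```python
GIB = 1024 ** 3


def fits_in_ram(db_bytes: int, ram_bytes: int, headroom: float = 0.8) -> bool:
    """Return True if db_bytes fits in the usable share of ram_bytes.

    headroom is an assumed fraction of physical memory left over for the
    database once the OS and the search process have taken their share.
    """
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in (0, 1]")
    return db_bytes <= ram_bytes * headroom


def fragment_bytes(db_bytes: int, fragments: int) -> int:
    """Approximate size of one fragment if the database is split evenly."""
    if fragments < 1:
        raise ValueError("fragments must be at least 1")
    return -(-db_bytes // fragments)  # ceiling division


ram = 2 * GIB   # per-node memory, as on the cluster described in this thread
db = 10 * GIB   # hypothetical database size, chosen only for illustration

print("whole database fits on one node:", fits_in_ram(db, ram))
print("1/16 fragment fits on one node: ", fits_in_ram(fragment_bytes(db, 16), ram))
```

If the whole database already fits on one node, splitting it buys much
less, which is the case where plain NCBI BLAST tends to be good enough.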

I think Joe Landman has also seen incredible variations in BLAST 
performance by experimenting with non-GNU, architecture-optimized 
compilers like those from IBM, Intel and the Portland Group.

16 machines with 2GB of RAM each, reading database files off of 
Ethernet-based NFS, is a "normal" compute farm config.

Outside of mpiblast you could be seeing performance lags caused by your 
network (if you are reading/writing via NFS or AFP) or by physical memory.

I'm not an expert on mpiblast but hope to start a personal project soon 
to integrate it with Grid Engine, mostly to satisfy my own curiosity.

I agree with what Hrishikesh said about your times -- you are searching 
with a very small query set and you did not mention your target database.
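
To put numbers on the small-query-set point, here is a quick calculation
over the wall-clock times reported in the original message. The to_seconds
helper is just something written for this example; the timings are the
ones posted.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '1m39.054s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# (number of query sequences, reported wall-clock time)
runs = [
    (1, "1m39.054s"),
    (2, "0m11.009s"),
    (5, "0m16.021s"),
    (10, "0m46.591s"),
    (20, "3m7.541s"),
]

# Average wall-clock seconds spent per query sequence in each run.
per_query = {n: to_seconds(stamp) / n for n, stamp in runs}

for n, stamp in runs:
    print(f"{n:>2} sequences: {to_seconds(stamp):8.3f} s total, "
          f"{per_query[n]:7.3f} s per sequence")
```

The per-sequence time jumps around rather than settling to a steady value,
which suggests these runs are too short to say much about how mpiblast
scales.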

You may see better performance using one machine -- the first query will 
be slow but the other queries will come back faster, since most or at 
least part of the target database will still be mmapped or otherwise 
cached in RAM.
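
One way to see the warm-cache effect for yourself is to time two
back-to-back reads of the same file. The sketch below is only an
illustration: timed_read is a made-up helper, and the temporary file is a
stand-in so the script runs anywhere. A freshly written temp file is
usually already in the page cache, so both reads will likely be fast; to
see a real difference, point it at a large database file that has not been
touched recently.

```python
import os
import tempfile
import time


def timed_read(path: str, chunk_size: int = 1 << 20) -> tuple[int, float]:
    """Read a file start to finish; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start


# Stand-in data file (4 MiB of random bytes), used only so this runs as-is.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    path = tmp.name

try:
    first = timed_read(path)    # may have to go to disk
    second = timed_read(path)   # likely served from the OS page cache
finally:
    os.remove(path)

print(f"first read:  {first[0]} bytes in {first[1]:.4f} s")
print(f"second read: {second[0]} bytes in {second[1]:.4f} s")
```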

If you really want to test mpiblast out you need to pick a much larger 
query and target DB set.

-Chris

Hrishikesh Deshmukh wrote:

> Hi,
> I am no authority on BLAST, but I guess you see a linear speedup only
> when the problem is huge. For 20-odd sequences mpiblast doesn't play;
> your NCBI BLAST is good enough! Just curious: do the results from NCBI
> BLAST and mpiblast for the same dataset (input) match exactly?!
> 
> I am trying to get BLAST and mpiBLAST running on Sun Grid; right now
> BLAST works in serial mode and mpiBLAST is kind of stuck!
> 
> Cheers,
> Hrishi
> 
> 
> On Thu, 03 Feb 2005 11:45:45 -0500, Xiaowu Gai <xgai at genome.chop.edu> wrote:
> 
>>Hi Everyone:
>>
>>We have a 16-node Xserve cluster, with 2GB memory on each node and dual
>>processors.  I was able to install mpiBLAST on it, along with LAM/MPI.
>>However, the performance that I saw with some test runs has not been that
>>good and quite confusing.  Here is what I did:
>>
>>1.) I formatted the nt database:
>>
>>mpiformatdb -N 16 -i nt
>>
>>2.) I ran the mpiblast on one, two, five, ten, twenty, and more sequences
>>(about 500bp each) and with the command:
>>
>>time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
>>
>>Here are the numbers:
>>
>>Single: 1m39.054s
>>Two: 0m11.009s
>>Five: 0m16.021s
>>Ten: 0m46.591s
>>twenty: 3m7.541s
>>..
>>
>>I am all confused.  First of all, the performance is not that impressive.
>>Secondly, the numbers are very confusing to me.  Why does a single-sequence
>>query take so much more time than a query of two (BTW, I reran the
>>single-sequence query right after the query of two and got similar results)?
>>And a query of five takes only 5 seconds more than a query of two, and so on.
>>
>>I am afraid that I have done something wrong and would really appreciate any
>>thoughts.
>>
>>Thanks
>>
>>Xiaowu
>>
>>_______________________________________________
>>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>
> 

-- 
Chris Dagdigian, <dag at sonsorol.org>
BioTeam  - Independent life science IT & informatics consulting
Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net