[Bioclusters] Questions on mpiBLAST

Thu Feb 3 14:05:01 EST 2005

Hi Xiaowu:

   mpiblast makes a great deal of sense with large numbers of input 
sequences, or with huge databases (nt).  There are startup costs to 
distributing the database fragments, and typically you will get the best 
performance by amortizing those costs over a large analysis (e.g. many 
sequences).

   Without knowing more about your situation, it is possible that the 
rate-limiting factor for your analysis is the speed of moving the 
database fragments to the remote machines (this is still a serial 
process even in mpiblast).  If you are doing many sequence comparisons, 
you will benefit, as the database fragments need to be moved only 
once.  If you are doing very few (under 100) sequence comparisons, then 
moving the database fragments is liable to dominate your execution time.
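
   To make the amortization argument concrete, here is a rough 
back-of-the-envelope sketch in Python.  The model (a fixed one-time cost 
for moving the fragments plus a constant search cost per query) and the 
numbers 90 s and 2 s are assumptions for illustration only; they are not 
measurements from mpiblast or from your cluster.

```python
# Toy cost model: one-time fragment distribution plus a constant cost per query.
# STARTUP_S and PER_QUERY_S are hypothetical values, not measured timings.

def total_time(n_queries, startup_s, per_query_s):
    """Estimated wall-clock seconds for a run of n_queries sequences."""
    return startup_s + n_queries * per_query_s

def startup_fraction(n_queries, startup_s, per_query_s):
    """Share of the estimated run time spent on the one-time startup cost."""
    return startup_s / total_time(n_queries, startup_s, per_query_s)

STARTUP_S = 90.0    # assumed time to move all fragments once
PER_QUERY_S = 2.0   # assumed search time per query sequence

for n in (1, 10, 100, 1000):
    total = total_time(n, STARTUP_S, PER_QUERY_S)
    frac = startup_fraction(n, STARTUP_S, PER_QUERY_S)
    print(f"{n:5d} queries: {total:8.1f} s total, {frac:.0%} spent on startup")
```

   Under these assumed numbers the startup cost dominates a handful of 
queries and becomes a small fraction of the run once you reach hundreds 
of queries.  Real runs will not follow a straight line this neatly.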

   If you simply need a faster parallel blast, you might look into 
pre-fetching the database fragments to the remote nodes, in which case 
you no longer have that startup cost (though I don't remember whether 
mpiblast works with a prefetched set of databases).  As this effectively 
defeats the mpiblast scheduler (which is one of the very nice features 
of the code), it is not such a good way to use mpiblast, though it 
works nicely with NCBI/WU blast.
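
   As a sketch of what pre-fetching could look like for plain NCBI/WU 
blast, the Python below only builds the rsync command lines that would 
copy a formatted database directory to each node; it does not run them.  
The host names (node01 to node16), both paths, and the helper name 
prefetch_commands are hypothetical and would need to match your own 
cluster layout.

```python
import shlex

def prefetch_commands(hosts, src_dir, dest_dir):
    """Build one rsync command per host (hypothetical helper, nothing is executed).

    The trailing slash on the source makes rsync copy the directory's
    contents into dest_dir rather than creating a nested subdirectory.
    """
    src = src_dir.rstrip("/") + "/"
    return [["rsync", "-a", src, f"{host}:{dest_dir}"] for host in hosts]

# Hypothetical 16-node cluster and paths; adjust to your own setup.
hosts = [f"node{i:02d}" for i in range(1, 17)]
commands = prefetch_commands(hosts, "/shared/blastdb/nt", "/scratch/blastdb/nt")

for cmd in commands:
    # Print a shell-safe version of each command for review before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

   Printing the commands first lets you check them by eye; once they look 
right they could be run from a shell or through subprocess, ideally a 
few nodes at a time so the file server is not saturated.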

   If Aaron is around, hopefully he can give you a more accurate/sound 
answer, and correct any mistakes I may have made in my suppositions.

Joe

Xiaowu Gai wrote:
> Hi Everyone:
> 
> We have a 16-node Xserve cluster, with 2GB memory on each node and dual
> processors.  I was able to install mpiBLAST on it, along with LAM/MPI.
> However, the performance that I saw with some test runs has not been that
> good and quite confusing.  Here is what I did:
> 
> 
> 1.) I formatted the nt database:
> 
> mpiformatdb -N 16 -i nt
> 
> 2.) I ran the mpiblast on one, two, five, ten, twenty, and more sequences
> (about 500bp each) and with the command:
> 
> time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
> 
> Here are the numbers:
> 
> Single: 1m39.054s
> Two: 0m11.009s
> Five: 0m16.021s
> Ten: 0m46.591s
> Twenty: 3m7.541s
> ..
> 
> 
> I am all confused.  First of all, the performance is not that impressive.
> Secondly, the numbers are very confusing to me.  Why does a single-sequence
> query take so much more time than a two-sequence one (BTW, I reran the
> single-sequence query right after the two-sequence query and got similar
> results)?  And the query of five takes only 5 seconds more than the query of
> two, and so on.
> 
> I am afraid that I have done something wrong and would really appreciate any
> thoughts. 
> 
> Thanks
> 
> Xiaowu

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615