[Bioclusters] Questions on mpiBLAST
Joe Landman
landman at scalableinformatics.com
Thu Feb 3 14:05:01 EST 2005
Hi Xiaowu:
mpiblast makes a great deal of sense with large numbers of input
sequences, or with huge databases (nt). There are startup costs to
moving the database, and typically you will get the best performance by
amortizing those costs over a large analysis (e.g. many sequences).
Without knowing more about your situation, it is possible that the
rate-limiting factor for your analysis is the speed of moving the
database fragments to the remote machines (this is still a serial
process even in mpiblast). If you are doing many sequence comparisons,
you will benefit, as the database fragment motion needs to occur only
once. If you are doing very few (under 100) sequence comparisons, then
the database fragment motion is liable to dominate your execution time.
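To make the amortization argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a very simple model: a fixed one-time cost to move the fragments plus a constant search cost per query. The function names and the 90 s / 2 s figures are hypothetical placeholders, not measurements from mpiblast or from the cluster discussed here.

```python
def total_time(n_queries, startup_s, per_query_s):
    """Estimated wall-clock time: one-time fragment copy plus per-query search."""
    return startup_s + n_queries * per_query_s


def startup_fraction(n_queries, startup_s, per_query_s):
    """Fraction of the estimated run spent moving database fragments."""
    return startup_s / total_time(n_queries, startup_s, per_query_s)


# Hypothetical costs, chosen only to illustrate the shape of the curve.
STARTUP_S = 90.0    # assumed one-time cost of distributing the fragments
PER_QUERY_S = 2.0   # assumed search cost per query sequence

for n in (1, 10, 100, 1000):
    total = total_time(n, STARTUP_S, PER_QUERY_S)
    frac = startup_fraction(n, STARTUP_S, PER_QUERY_S)
    print(f"{n:5d} queries: {total:8.1f} s total, {frac:6.1%} spent on startup")
```

Under this model the startup share shrinks steadily as the query count grows, which is the sense in which a large analysis amortizes the copy. A real run will deviate from it, since per-query cost is not constant and caching changes the startup cost on repeat runs.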
If you simply need a faster parallel blast, you might look into
pre-fetching the database fragments to the remote nodes, in which case
you no longer have that startup cost (though I don't remember if
mpiblast works with a prefetched set of databases). As this effectively
defeats the mpiblast scheduler (which is one of the very nice features
of the code), it is not such a good way to use mpiblast,
though it works nicely with NCBI/WU blast.
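As a rough illustration of what pre-fetching could look like, the Python sketch below assigns fragment files to nodes round-robin and prints the rsync commands that would copy them, without running anything. The host names, paths, and fragment file names are all made up for the example and do not reflect how mpiformatdb actually names its output. Whether mpiblast would pick up fragments staged this way is not something this sketch settles.

```python
import shlex


def plan_prefetch(fragments, nodes, dest_dir):
    """Assign fragments to nodes round-robin; return (node, argv) copy commands."""
    plan = []
    for i, frag in enumerate(fragments):
        node = nodes[i % len(nodes)]
        # rsync -a copies the file and preserves times and permissions.
        argv = ["rsync", "-a", frag, f"{node}:{dest_dir}/"]
        plan.append((node, argv))
    return plan


# Hypothetical fragment files and node names, for illustration only.
fragments = [f"/shared/db/nt_fragment_{i:02d}" for i in range(4)]
nodes = ["node01", "node02"]

# Dry run: print the commands instead of executing them.
for node, argv in plan_prefetch(fragments, nodes, "/scratch/blastdb"):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to check the assignment before any data moves. Once the plan looks right, each argv list could be handed to subprocess.run.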
If Aaron is around, hopefully he can give you a more accurate/sound
answer, and correct any mistakes I may have made in my suppositions.
Joe
Xiaowu Gai wrote:
> Hi Everyone:
>
> We have a 16-node Xserve cluster, with 2GB memory on each node and dual
> processors. I was able to install mpiBLAST on it, along with LAM/MPI.
> However, the performance that I saw with some test runs has not been that
> good and is quite confusing. Here is what I did:
>
>
> 1.) I formatted the nt database:
>
> mpiformatdb -N 16 -i nt
>
> 2.) I ran the mpiblast on one, two, five, ten, twenty, and more sequences
> (about 500bp each) and with the command:
>
> time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
>
> Here are the numbers:
>
> Single: 1m39.054s
> Two: 0m11.009s
> Five: 0m16.021s
> Ten: 0m46.591s
> twenty: 3m7.541s
> ..
>
>
> I am all confused. First of all, the performance is not that impressive.
> Secondly, the numbers are very confusing to me. Why is it that a single
> sequence query takes so much more time than a query of two (BTW, I reran the
> query of a single sequence right after the query of two and got similar
> results)? And a query of five takes only 5 seconds more than the query of two,
> and so on.
>
> I am afraid that I have done something wrong and would really appreciate any
> thoughts.
>
> Thanks
>
> Xiaowu
>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615