[Bioclusters] Questions on mpiBLAST

Thu Feb 3 14:47:14 EST 2005

Hrishikesh Deshmukh wrote:
> While this debate is on, could somebody answer my question:
> Say I have thousands of sequences in my input file and want to run
> mpiBLAST. Will mpiBLAST split the sequences, allot them to nodes, and
> then collect the results into, say, one file?

Yes
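
As a toy illustration of the scatter/gather pattern being asked about (a sketch only: the function names `read_fasta`, `scatter` and `gather` are made up for this example, and this is not mpiBLAST's code; mpiBLAST itself also fragments the database, as the `mpiformatdb -N 16` step quoted below shows):

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def scatter(records, n_workers):
    """Deal records round-robin into n_workers lists (one per node)."""
    buckets = [[] for _ in range(n_workers)]
    for i, rec in enumerate(records):
        buckets[i % n_workers].append(rec)
    return buckets


def gather(per_worker_results):
    """Concatenate per-worker result lists into one list (one output file)."""
    merged = []
    for results in per_worker_results:
        merged.extend(results)
    return merged


if __name__ == "__main__":
    fasta = [">seq1", "ACGT", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]
    buckets = scatter(list(read_fasta(fasta)), n_workers=2)
    # Stand-in for a real search: each "worker" just reports sequence lengths.
    results = [[(name, len(seq)) for name, seq in bucket] for bucket in buckets]
    for name, length in gather(results):
        print(name, length)
```

With round-robin dealing the merged output is grouped by worker rather than in input order, so a real tool would have to re-sort or tag the results before writing a single file.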

> Would it help if, say, the DB (Human)
> is installed on every node?

No, mpiblast will handle this for you.

> 
> Thanks,
> Hrishi
> 
> 
> On Thu, 03 Feb 2005 14:05:32 -0500, Joe Landman
> <landman at scalableinformatics.com> wrote:
> 
>>You rang .... :)
>>
>>Brodie, Kent wrote:
>>
>>>Q: can someone point me to the results obtained by Joe Landman?  (web
>>>site, or..?)
>>>
>>>Many thanks,  -- Kent C. Brodie, Medical College of Wisconsin
>>>
>>>
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: bioclusters-bounces+brodie=mcw.edu at bioinformatics.org
>>>>[mailto:bioclusters-bounces+brodie=mcw.edu at bioinformatics.org] On Behalf
>>>>Of Chris Dagdigian
>>>>Sent: Thursday, February 03, 2005 12:28 PM
>>>>To: Hrishikesh Deshmukh; Clustering, compute farming & distributed
>>>>computing in life science informatics
>>>>Subject: Re: [Bioclusters] Questions on mpiBLAST
>>>>
>>>>
>>>>"parallelizing" blast across cluster nodes only results in significant
>>>>speed gains if you are trying to solve a large problem set or have a
>>>>massive target database that in no way shape or form can squeeze into
>>>>physical memory on one node.
>>>>
>>>>The performance of BLAST is rate-limited first by how much RAM you have
>>>>and then by how fast your disk I/O system is.
>>>>
>>>>I think Joe Landman has also seen incredible variations in blast
>>>>performance by experimenting with non-GNU architecture optimized
>>>>compilers like those from IBM, Intel and the Portland Group.
>>>>
>>>>16 machines with 2GB of RAM reading database files off of Ethernet-based
>>>>NFS is a "normal" compute farm config.
>>>>
>>>>Outside of mpiblast you could be seeing performance lags caused by your
>>>>network (if you are reading/writing via NFS or AFP) or by physical memory.
>>>>
>>>>I'm not an expert on mpiblast but hope to start soon a personal project
>>>>to integrate it with grid engine mostly to satisfy my own curiosity.
>>>>
>>>>I agree with what Hrishikesh said about your times -- you are searching with
>>>>a very small query set and you did not mention your target database.
>>>>
>>>>You may see better performance using one machine -- the first query will
>>>>be slow but the other queries will come back faster since most or part
>>>>of the target database will still be mmapped or whatever in RAM.
>>>>
>>>>If you really want to test mpiblast out you need to pick a much larger
>>>>query and target DB set.
>>>>
>>>>-Chris
>>>>
>>>>
>>>>
>>>>
>>>>Hrishikesh Deshmukh wrote:
>>>>
>>>>
>>>>
>>>>>Hi,
>>>>>I am no authority on BLAST, but I guess you see a linear speedup
>>>>>only when the problem is huge. For 20-odd sequences mpiBLAST doesn't
>>>>>come into play; your NCBI BLAST is good enough! Just curious: do the results for
>>>>>NCBI BLAST and mpiBLAST on the same dataset (input) match exactly?
>>>>>
>>>>>I am trying to get BLAST and mpiBLAST running on Sun Grid; right now
>>>>>BLAST works in serial mode and mpiBLAST is kind of stuck!
>>>>>
>>>>>Cheers,
>>>>>Hrishi
>>>>>
>>>>>
>>>>>On Thu, 03 Feb 2005 11:45:45 -0500, Xiaowu Gai <xgai at genome.chop.edu> wrote:
>>>>
>>>>
>>>>>>Hi Everyone:
>>>>>>
>>>>>>We have a 16-node Xserve cluster, with 2GB memory on each node and dual
>>>>>>processors.  I was able to install mpiBLAST on it, along with LAM/MPI.
>>>>>>However, the performance that I saw with some test runs has not been that
>>>>>>good and is quite confusing.  Here is what I did:
>>>>>>
>>>>>>1.) I formatted the nt database:
>>>>>>
>>>>>>mpiformatdb -N 16 -i nt
>>>>>>
>>>>>>2.) I ran the mpiblast on one, two, five, ten, twenty, and more sequences
>>>>>>(about 500bp each) and with the command:
>>>>>>
>>>>>>time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
>>>>>>
>>>>>>Here are the numbers:
>>>>>>
>>>>>>Single: 1m39.054s
>>>>>>Two: 0m11.009s
>>>>>>Five: 0m16.021s
>>>>>>Ten: 0m46.591s
>>>>>>Twenty: 3m7.541s
>>>>>>..
>>>>>>
>>>>>>I am all confused.  First of all, the performance is not that impressive.
>>>>>>Secondly, the numbers are very confusing to me.  Why is it that a single
>>>>>>sequence query takes so much more time than a query of two (BTW, I reran the
>>>>>>query of a single sequence right after the query of two and got similar
>>>>>>results)?  And the query of five takes only 5 seconds more than the query of
>>>>>>two, and so on.
>>>>>>I am afraid that I have done something wrong and would really appreciate any
>>>>>>thoughts.
>>>>>>
>>>>>>Thanks
>>>>>>
>>>>>>Xiaowu
>>>>>>
>>>>>>_______________________________________________
>>>>>>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>>>>>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>>>>>
>>>>>
>>>>
>>>>--
>>>>Chris Dagdigian, <dag at sonsorol.org>
>>>>BioTeam  - Independent life science IT & informatics consulting
>>>>Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
>>>>PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net
>>>
>>
>>--
>>Joseph Landman, Ph.D
>>Founder and CEO
>>Scalable Informatics LLC,
>>email: landman at scalableinformatics.com
>>web  : http://www.scalableinformatics.com
>>phone: +1 734 786 8423
>>fax  : +1 734 786 8452
>>cell : +1 734 612 4615
>>
>>
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
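
To put the timings quoted above on a common footing, here is a small sketch that converts the `time` output to seconds and divides by the number of queries. The helper name `to_seconds` is made up for this example, and the query counts are simply read off the labels in Xiaowu's message; treat the per-query figures as rough arithmetic on five data points, not as a benchmark.

```python
import re

# Wall-clock times quoted in the thread, keyed by number of query sequences.
timings = {
    1: "1m39.054s",
    2: "0m11.009s",
    5: "0m16.021s",
    10: "0m46.591s",
    20: "3m7.541s",
}


def to_seconds(stamp):
    """Convert a `time`-style string such as '1m39.054s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Seconds of wall-clock time per query sequence for each run.
per_query = {n: to_seconds(t) / n for n, t in timings.items()}

if __name__ == "__main__":
    for n in sorted(timings):
        total = to_seconds(timings[n])
        print(f"{n:>2} queries: {total:8.3f} s total, {per_query[n]:7.3f} s/query")
```

On these numbers the cost per query is lowest for the five-sequence run and climbs again at ten and twenty, which is one more reason to retest with a much larger query set before drawing conclusions.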

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615