[Bioclusters] Questions on mpiBLAST

Thu Feb 3 14:33:39 EST 2005

While this debate is on, could somebody answer my question:
Say I have thousands of sequences in my input file and want to run
mpiBLAST. Will mpiBLAST split the sequences, allot them to nodes, and
then collect the results into, say, one file? Would it help if, say,
the DB (Human) is installed on every node of the cluster?
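
For illustration only, splitting a query file by hand might look like
the sketch below. It is not a description of what mpiBLAST does
internally; the helper names (read_fasta, split_round_robin) and the
three example sequences are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks):
    """Deal records out to n_chunks lists, one record at a time."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Hypothetical three-sequence input, split for two nodes.
demo = """>seq1
ACGT
>seq2
GGCC
TTAA
>seq3
ATAT
""".splitlines()

chunks = split_round_robin(read_fasta(demo), 2)
for node, chunk in enumerate(chunks):
    print(node, [header for header, _ in chunk])
```

Each chunk could then be written to its own file and searched as a
separate job, with the per-chunk result files concatenated afterwards.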

Thanks,
Hrishi

On Thu, 03 Feb 2005 14:05:32 -0500, Joe Landman
<landman at scalableinformatics.com> wrote:
> You rang .... :)
> 
> Brodie, Kent wrote:
> > Q: can someone point me to the results obtained by Joe Landman?  (web
> > site, or..?)
> >
> > Many thanks,  -- Kent C. Brodie, Medical College of Wisconsin
> >
> >
> >
> >
> >>-----Original Message-----
> >>From: bioclusters-bounces+brodie=mcw.edu at bioinformatics.org
> >>[mailto:bioclusters-bounces+brodie=mcw.edu at bioinformatics.org] On
> >>Behalf Of Chris Dagdigian
> >>Sent: Thursday, February 03, 2005 12:28 PM
> >>To: Hrishikesh Deshmukh; Clustering, compute farming & distributed
> >>computing in life science informatics
> >>Subject: Re: [Bioclusters] Questions on mpiBLAST
> >>
> >>
> >>"parallelizing" blast across cluster nodes only results in significant
> >>speed gains if you are trying to solve a large problem set or have a
> >>massive target database that in no way shape or form can squeeze into
> >>physical memory on one node.
> >>
> >>The performance of BLAST is rate-limited first by how much RAM you
> >>have and then by how fast your disk I/O system is.
> >>
> >>I think Joe Landman has also seen incredible variations in blast
> >>performance by experimenting with non-GNU, architecture-optimized
> >>compilers like those from IBM, Intel and the Portland Group.
> >>
> >>16 machines with 2 GB of RAM reading database files off of
> >>Ethernet-based NFS is a "normal" compute farm config.
> >>
> >>Outside of mpiblast you could be seeing performance lags caused by
> >>your network (if you are reading/writing via NFS or AFP) or by
> >>physical memory.
> >
> >>I'm not an expert on mpiblast but hope to start a personal project
> >>soon to integrate it with Grid Engine, mostly to satisfy my own
> >>curiosity.
> >>
> >>I agree with what Hrishikesh said about your times -- you are
> >>searching with a very small query set and you did not mention your
> >>target database.
> >>
> >>You may see better performance using one machine -- the first query
> >>will be slow but the other queries will come back faster since most
> >>or part of the target database will still be mmapped or whatever in
> >>RAM.
> >>
> >>If you really want to test mpiblast out you need to pick a much larger
> >>query and target DB set.
> >>
> >>-Chris
> >>
> >>
> >>
> >>
> >>Hrishikesh Deshmukh wrote:
> >>
> >>
> >>>Hi,
> >>>I am no authority on BLAST, but I guess you see a linear speedup
> >>>only when the problem is huge. For 20-odd sequences mpiblast doesn't
> >>>help much; your NCBI BLAST is good enough! Just curious: do the
> >>>results from NCBI BLAST and mpiblast match exactly for the same
> >>>dataset (input)?
> >>>
> >>>I am trying to get BLAST and mpiBLAST running on Sun Grid. Right
> >>>now BLAST works in serial mode and mpiBLAST is kind of stuck!
> >>>
> >>>Cheers,
> >>>Hrishi
> >>>
> >>>
> >>>On Thu, 03 Feb 2005 11:45:45 -0500, Xiaowu Gai
> >>><xgai at genome.chop.edu> wrote:
> >>
> >>>>Hi Everyone:
> >>>>
> >>>>We have a 16-node Xserve cluster, with 2GB memory on each node and
> >>>>dual processors.  I was able to install mpiBLAST on it, along with
> >>>>LAM/MPI.  However, the performance that I saw with some test runs
> >>>>has not been that good and is quite confusing.  Here is what I did:
> >>>>
> >>>>1.) I formatted the nt database:
> >>>>
> >>>>mpiformatdb -N 16 -i nt
> >>>>
> >>>>2.) I ran mpiblast on one, two, five, ten, twenty, and more
> >>>>sequences (about 500bp each) with the command:
> >>>>
> >>>>time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
> >
> >>>>Here are the numbers:
> >>>>
> >>>>Single: 1m39.054s
> >>>>Two: 0m11.009s
> >>>>Five: 0m16.021s
> >>>>Ten: 0m46.591s
> >>>>Twenty: 3m7.541s
> >>>>..
> >>>>
> >>>>I am all confused.  First of all, the performance is not that
> >>>>impressive.  Secondly, the numbers are very confusing to me.  Why
> >>>>is it that a single-sequence query takes so much more time than a
> >>>>two-sequence one (BTW, I reran the query of a single sequence right
> >>>>after the query of two and got similar results)?  And the query of
> >>>>five takes only 5 seconds more than the query of two, and so on.
> >>>>I am afraid that I have done something wrong and would really
> >>>>appreciate any thoughts.
> >>>>
> >>>>Thanks
> >>>>
> >>>>Xiaowu
> >>>>
> >>>>_______________________________________________
> >>>>Bioclusters maillist  -  Bioclusters at bioinformatics.org
> >>>>https://bioinformatics.org/mailman/listinfo/bioclusters
> >>>>
> >>>
> >>
> >>--
> >>Chris Dagdigian, <dag at sonsorol.org>
> >>BioTeam  - Independent life science IT & informatics consulting
> >>Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
> >>PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net
> >
> 
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615
> 
>
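
To compare the timings quoted above on an equal footing, the wall-clock
values can be converted to seconds and divided by the number of query
sequences. A small sketch, assuming the "XmY.YYYs" format printed by the
shell's time; parse_time is a hypothetical helper written for this
example, and the numbers are copied from the quoted message.

```python
import re


def parse_time(text):
    """Convert a string like '1m39.054s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Number of query sequences -> reported wall-clock time.
timings = {
    1: "1m39.054s",
    2: "0m11.009s",
    5: "0m16.021s",
    10: "0m46.591s",
    20: "3m7.541s",
}

per_query = {n: parse_time(t) / n for n, t in timings.items()}

for n, seconds in per_query.items():
    total = parse_time(timings[n])
    print(f"{n:>2} queries: {total:8.3f} s total, {seconds:7.3f} s per query")
```

The per-query figures make the single-sequence run stand out from the
rest of the series.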