[Bioclusters] Questions on mpiBLAST

Fri Feb 4 10:42:02 EST 2005

Do let us know.
Hrishi

On Thu, 03 Feb 2005 16:34:16 -0500, Xiaowu Gai <xgai at genome.chop.edu> wrote:
> Thanks everyone for the quick and excellent responses.  So it does not
> appear that I did something totally wrong here.  I guess my expectations
> were too high to start with, after reading an article saying that mpiBLAST
> is 170 times faster, and I have never seen a single BLAST run take longer
> than 170 x 1.4 minutes.  I played with larger query data sets: the time
> needed does level off dramatically with a smaller database (the yeast
> genome), but this was less obvious for larger data sets like nt/nr or the
> human genome.  The query sequences are not biased, in that they are not
> really repetitive sequences.  But Aaron's suggestion of --disable-mpi-db
> might be the key.  I will try it and let everyone know if I see a
> difference.
> 
> Xiaowu
> 
> 
> On 2/3/05 3:39 PM, "Aaron Darling" <darling at cs.wisc.edu> wrote:
> 
> > I'd like to make a brief addendum to Jason's excellent reply...
> >
> > Jason Gans wrote:
> >
> >> Hello,
> >>
> >> There are a number of reasons for the results you show below.
> >>
> >> 1) Load balancing.
> >>
> >> The latest version of mpiBLAST uses a master node and a scheduler
> >> node. Hence, if you run mpiBLAST on 16 nodes, only 14 worker nodes
> >> will be performing the actual BLAST search (i.e. the heavy lifting).
> >>
> >> If you format your database into 16 fragments, 12 worker nodes will be
> >> assigned 1 fragment each and 2 worker nodes will get 2 fragments. This
> >> is fine for a large query (and may actually improve load balancing),
> >> but for a small query the nodes that must search 2 fragments will be
> >> the rate-limiting step in your calculation.
> >>
> >> You're better off formatting your database into 14 fragments (so that
> >> every worker node searches a single fragment).
> >
> >
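
The fragment arithmetic above can be checked with a short Python sketch. It assumes fragments are spread as evenly as possible across workers and that two processes (master and scheduler) do no searching; the function name and its parameters are illustrative, not part of mpiBLAST.

```python
from collections import Counter

def fragments_per_worker(n_processes, n_fragments, n_non_workers=2):
    """Return the number of fragments each worker searches.

    Assumption: fragments are distributed as evenly as possible, and
    n_non_workers processes (master and scheduler) search nothing.
    """
    n_workers = n_processes - n_non_workers
    if n_workers < 1:
        raise ValueError("need at least one worker process")
    base, extra = divmod(n_fragments, n_workers)
    # 'extra' workers receive one fragment more than the rest
    return [base + 1] * extra + [base] * (n_workers - extra)

# 16 processes with 16 fragments, then with 14 fragments
for frags in (16, 14):
    loads = fragments_per_worker(16, frags)
    print(frags, "fragments:", dict(Counter(loads)), "max per worker:", max(loads))
```

For a small query, the largest per-worker load is what sets the wall-clock time, which is why 14 fragments should beat 16 on a 16-process run.
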
> > The "scheduler" process performs almost no work, so to really optimize
> > performance on a 16 node cluster one could try formatting the database
> > into 15 fragments and running 17 processes.  Of course, care must be
> > taken that the node which runs two processes is running a scheduler and
> > either a worker or output.  The best way I can think of to achieve this
> > would be adding the following at line 178 of mpiblast.cpp (version 1.3.0):
> > scheduler_process = node_count - 1;
> >
> > That will set the last MPI process to be the scheduler.  AFAIK, mpich
> > (and possibly other MPI implementations) will wrap around to the first
> > node when assigning processes beyond the number of nodes given in the
> > mpich configuration.  The net result is that the scheduler process
> > and writer process end up on the same cluster node.
> >
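
A rough Python sketch of the wrap-around placement described above. It assumes strictly round-robin assignment of ranks to nodes (which Aaron flags as "AFAIK" for mpich) and that node_count in mpiblast.cpp is the total number of MPI processes; place_ranks is a hypothetical helper, not an MPI or mpiBLAST call.

```python
from collections import defaultdict

def place_ranks(n_processes, n_nodes):
    """Map node index -> list of MPI ranks, assuming round-robin placement."""
    placement = defaultdict(list)
    for rank in range(n_processes):
        # ranks beyond the node count wrap around to the first node
        placement[rank % n_nodes].append(rank)
    return dict(placement)

n_nodes, n_processes = 16, 17
scheduler_rank = n_processes - 1  # mirrors scheduler_process = node_count - 1
placement = place_ranks(n_processes, n_nodes)

# Nodes that host more than one process
shared = {node: ranks for node, ranks in placement.items() if len(ranks) > 1}
print("scheduler rank:", scheduler_rank)
print("shared nodes:", shared)
```

Under these assumptions only the first node is doubled up, hosting rank 0 and the scheduler, while the other 15 nodes each run a single worker on one of the 15 fragments.
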
> >>
> >> 2) Run time depends not just on the length of the query, but on the
> >> sequence composition of the query as well.
> >>
> >> A query sequence that is "similar" to a large number of database
> >> sequences will take longer to search than a query sequence that is
> >> "similar" to only a small number of database sequences.
> >>
> >> The reason for this is two-fold: (a) The BLAST algorithm only fully
> >> aligns two sequences if it first identifies identical sub-sequences
> >> of length W or greater. (b) The time that mpiBLAST spends formatting
> >> the BLAST output is proportional to the number of database entries
> >> that match the query (not the query length).
> >>
> >
> > One additional factor that can significantly impact the run time is the
> > length of DB sequences that your queries hit.  By default, versions
> > 1.2.x and 1.3.0 of mpiBLAST transmit the *entire* database sequence over
> > the wire, not just the portion of the sequence used in the resulting
> > alignment.  Nucleotide databases like nt or the human chromosome DB
> > contain sequences several MB in length, which can result in LOTS of
> > network traffic.  Long sequences are not usually a problem with protein
> > sequence databases.  Fortunately, a workaround exists for blastn
> > searches.  The command-line option --disable-mpi-db will prevent workers
> > from transmitting sequences over the network.  Instead, the writer
> > process reads only the necessary parts of the sequence from the database
> > on shared storage (e.g. it reads a small amount of data from NFS instead
> > of a large amount of data from worker nodes).
> >
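
To get a feel for the difference, here is a back-of-the-envelope Python sketch comparing the two approaches. The hit list is entirely hypothetical, and one byte per base is assumed for simplicity; real on-the-wire sizes in mpiBLAST will differ.

```python
def bytes_transferred(hits, whole_sequence=True):
    """Sum the sequence bytes moved for (subject_length, aligned_length) hits.

    whole_sequence=True models workers sending entire database sequences;
    whole_sequence=False models reading only the aligned region.
    Assumption: one byte per base.
    """
    return sum(subj if whole_sequence else aln for subj, aln in hits)

# Hypothetical hits: (database sequence length, aligned region length), in bases
hits = [(5_000_000, 400), (3_200_000, 250), (1_200, 300)]

full = bytes_transferred(hits, whole_sequence=True)
partial = bytes_transferred(hits, whole_sequence=False)
print("whole sequences:", full, "bytes")
print("aligned regions only:", partial, "bytes")
```

Even with only three hits, the multi-megabase subjects dominate the total, which is consistent with long nucleotide entries being the problem case and short protein sequences not.
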
> > Summary: to get good performance, always use --disable-mpi-db when
> > performing blastn searches on databases with large sequence entries like
> > nt and human chromosomes.
> >
> > A nice feature for a future mpiBLAST release would be workers
> > transmitting only the aligned portion of the bioseq to the writer
> > instead of the entire bioseq...
> >
> > -Aaron
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
> >