Do let us know.

Hrishi

On Thu, 03 Feb 2005 16:34:16 -0500, Xiaowu Gai <xgai at genome.chop.edu> wrote:
> Thanks everyone for the quick and excellent responses. So, it does not
> appear that I did something totally wrong here. I guess I had expectations
> too high to start with, after reading an article saying that mpiBLAST is 170
> times faster, and I have never had experience of one blast run taking
> more than 170 x 1.4 minutes. I played with larger query data sets: the
> time needed does kinda level off dramatically with a smaller database (the
> yeast genome), but that was not as obvious for larger data sets like nt/nr or
> the human genome. The query sequences are not biased in that they are not
> really repetitive sequences. But Aaron's suggestion of the --disable-mpi-db
> might be the key. I will try it and let everyone know if I do see a
> difference.
>
> Xiaowu
>
>
> On 2/3/05 3:39 PM, "Aaron Darling" <darling at cs.wisc.edu> wrote:
>
> > I'd like to make a brief addendum to Jason's excellent reply...
> >
> > Jason Gans wrote:
> >
> >> Hello,
> >>
> >> There are a number of reasons for the results you show below.
> >>
> >> 1) Load balancing.
> >>
> >> The latest version of mpiBLAST uses a master node and
> >> a scheduler node. Hence if you run mpiBLAST on 16 nodes, only 14 worker
> >> nodes will be performing the actual BLAST search (i.e. the heavy
> >> lifting).
> >>
> >> If you format your database into 16 fragments, 12 worker nodes will be
> >> assigned 1 fragment each and 2 worker nodes will get 2 fragments. This
> >> is fine for a large query (and may actually improve load balancing) but
> >> for a small query the nodes that must search 2 fragments will be the
> >> rate-limiting step in your calculation.
> >>
> >> You're better off formatting your database into 14 fragments (so that
> >> every worker node searches a single fragment).
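
The fragment arithmetic in point 1 can be checked with a small sketch. This is only an illustrative model, not mpiBLAST's actual scheduler: the function name distribute_fragments is hypothetical, and it assumes that exactly two processes (master and scheduler) do no searching and that fragments are dealt out as evenly as possible.

```python
def distribute_fragments(total_processes, num_fragments):
    """Return the number of fragments each worker searches.

    Assumption (from the thread): two processes, the master and the
    scheduler, do not search, so workers = total_processes - 2.
    Fragments are spread as evenly as possible; this is a simplified
    model, not mpiBLAST's real scheduling logic.
    """
    workers = total_processes - 2
    if workers < 1:
        raise ValueError("need at least 3 processes to have one worker")
    base, extra = divmod(num_fragments, workers)
    # The first 'extra' workers each receive one additional fragment.
    return [base + 1 if w < extra else base for w in range(workers)]


if __name__ == "__main__":
    for fragments in (16, 14):
        loads = distribute_fragments(16, fragments)
        print(f"{fragments} fragments: busiest worker searches "
              f"{max(loads)}, idlest searches {min(loads)}")
```

For a small query the run time is roughly set by the busiest worker, which is why 14 fragments (one per worker) beats 16 fragments (two workers with a double load) in this model.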
> >
> >
> > The "scheduler" process performs almost no work, so to really optimize
> > performance on a 16 node cluster one could try formatting the database
> > into 15 fragments and running 17 processes. Of course, care must be
> > taken that the node which runs two processes is running a scheduler and
> > either a worker or output. The best way I can think of to achieve this
> > would be adding the following at line 178 of mpiblast.cpp (version 1.3.0):
> > scheduler_process = node_count - 1;
> >
> > That will set the last MPI process to be the scheduler. AFAIK, mpich
> > (and possibly other MPI implementations) will wrap around to the first
> > node when assigning processes beyond the number of nodes given in the
> > mpich configuration. The net result is that the scheduler process
> > and writer process end up on the same cluster node.
> >
> >>
> >> 2) Run time depends not just on the length of the query, but on the
> >> sequence composition of the query as well.
> >>
> >> A query sequence that is "similar" to a large number of database
> >> sequences will take longer to search than a query sequence that is
> >> "similar" to only a small number of database sequences.
> >>
> >> The reason for this is two-fold: (a) The BLAST algorithm only fully
> >> aligns two sequences if it first identifies identical sub-sequences of
> >> length W or greater. (b) The time that mpiBLAST spends formatting the
> >> BLAST output is proportional to the number of database entries that
> >> match the query (not the query length).
> >>
> >
> > One additional factor that can significantly impact the run time is the
> > length of DB sequences that your queries hit. By default, versions
> > 1.2.x and 1.3.0 of mpiBLAST transmit the *entire* database sequence over
> > the wire, not just the portion of the sequence used in the resulting
> > alignment.
> > Nucleotide databases like nt or the human chromosome DB
> > contain sequences several MB in length, which can result in LOTS of
> > network traffic. Long sequences are not usually a problem with protein
> > sequence databases. Fortunately, a workaround exists for blastn
> > searches. The command-line option --disable-mpi-db will prevent workers
> > from transmitting sequences over the network. Instead, the writer
> > process reads only the necessary parts of the sequence from the database
> > on shared storage (e.g. it reads a small amount of data from NFS instead
> > of a large amount of data from worker nodes).
> >
> > Summary: to get good performance, always use --disable-mpi-db when
> > performing blastn searches on databases with large sequence entries like
> > nt and human chromosomes.
> >
> > A nice feature for a future mpiBLAST release would be workers
> > transmitting only the aligned portion of the bioseq to the writer
> > instead of the entire bioseq...
> >
> > -Aaron
> > _______________________________________________
> > Bioclusters maillist - Bioclusters at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
> >
>