[Bioclusters] Re: new on using clusters: problem running mpiblast (2)

darling at cs.wisc.edu
Fri Sep 21 11:27:49 EDT 2007


Good, it sounds like we're finally getting somewhere, but not quite
all the way there yet...
First, let me apologize for the confusion over the command lines; you
may need to run some of them again.  For the commands that start with
'mpiblast', I assumed you would replace "mpiblast" with
something like "mpirun -np 16 -machinefile ./machines
/home/local/bin/mpiblast ..."

As I recall, running an MPI program without the "mpirun ..." prefix
will start the program on a single node, although it can still spawn
additional processes on other nodes via MPI library calls.
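For reference, here is a minimal sketch of what the ./machines file
might look like for Open MPI's -machinefile option; the hostnames and
slot counts below are placeholders, so substitute your cluster's
actual node names:

```shell
# Hypothetical ./machines hostfile for Open MPI: one hostname per
# line, with an optional per-node slot (process) count.
cat > ./machines <<'EOF'
node001 slots=8
node002 slots=8
EOF

# mpirun reads this file to decide where to launch processes, e.g.:
#   mpirun -np 16 -machinefile ./machines /home/local/bin/mpiblast ...
cat ./machines
```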

So what I can glean from your tests thus far is that mpiblast can
initialize MPI successfully on a single node, is failing somehow when
running on 2 nodes, but passes MPI_Init() on 16 nodes.  That's the
mysterious "orted" problem.  Sounds like Joe Landman has some good advice
on that issue.

Your original problem runs made it further along in the mpiblast
program than any of this current batch.  If you can rerun the series
of commands I previously sent with the appropriate "mpirun ..." prefix,
we can hopefully narrow down the problem further, although I would also
recommend sorting out the "orted: command not found" issue.
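On that orted issue: with Open MPI's rsh/ssh launcher, "orted: command
not found" usually means the Open MPI bin directory is missing from the
PATH that non-interactive shells get on the compute nodes.  A rough
sketch of how I would check and work around it (the node name and
install prefix below are guesses; adjust them to your setup):

```shell
# Check what a non-interactive remote shell can find; this mimics how
# Open MPI starts its orted daemon on each node.
ssh node001 'which orted'

# If that prints nothing, either add Open MPI's bin directory to the
# PATH exported by the nodes' non-interactive shell startup files, or
# tell mpirun where the installation lives with --prefix, which sets
# up PATH and the library path on the remote nodes for you:
mpirun --prefix /usr/local/openmpi -np 2 -machinefile ./machines \
    /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa
```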

Regards,
-Aaron




> Aaron,
>
> Many thanks for your hints.  Below please find what I get on trials of
> your hints:
>
> On Mon, 17 Sep 2007, Aaron Darling wrote:
>
>> Date: Mon, 17 Sep 2007 13:25:31 -0700
>> From: Aaron Darling <darling at cs.wisc.edu>
>> Reply-To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> Subject: Re: [Bioclusters] Re: new on using clusters: problem running mpiblast (2)
>>
>> This seems to be quite an elusive problem...
>> The rank 1 process is crashing with signal 11, which is usually a
>> segmentation fault, indicating an invalid memory access.  Assuming you
>> get the same behavior (no output) when running with --debug, it crashes
>> very early in the program, prior to writing any debug output.  I can see
>> two ways to debug the problem on your cluster, both of which will
>> require some patience.  The red pill would be running mpiblast in an MPI
>> debugger and seeing where the process crashes.  I'm unsure how the Open MPI
>> debugger works, but there should be some mechanism to attach to the rank
>> 1 process.
>> The blue pill involves running mpiblast with a few different command
>> line options to see how far along in the program it gets before
>> crashing.  That might narrow down the crash point enough to give a clue
>> for solving the problem.  If you take the blue pill, run the
>> following mpiblast commands:
>>
>> mpiblast --version
>> (this prints the version and is the first thing the program does at
>> startup.  if the program doesn't get that far then something is very
>> wrong.)
>
> Yeap, it gives version 1.4.0.
>
>> mpiblast
>> (run with no arguments, this causes the program to exit before parsing
>> the command-line and print an error message.  if the program doesn't get
>> that far then something is very wrong.)
>
> Indeed it responded with option suggestions:
> -------
> mpiBLAST requires the following options: -d [database] -i [query file] -p
> [blast program name]
>
>
>> mpiblast -a blah -b blah -c blah blah blah
>> (run with bogus arguments.  the program should exit with "mpiBLAST
>> requires the following options: -d [database] -i [query file] -p [blast
>> program name]".
>> This check happens after initializing the MPI libraries, so if you get
>> this error, then the mpi libs were init'ed successfully )
>
> Same as above.
>
>> mpirun -np 2 -machinefile ./machines /home/local/bin/mpiblast -p blastp
>> -i ./bait.fasta -d ecoli.aa
>> (mpiblast should report that it needs to be run on at least three nodes)
>
> Here is the error -- not as you expected:
> ---------------------------------------
> bash: orted: command not found
> bash: orted: command not found
> [ansci.iastate.edu:03916] ERROR: A daemon on node node001 failed to
> start as expected.
> [ansci.iastate.edu:03916] ERROR: There may be more information
> available from
> [ansci.iastate.edu:03916] ERROR: the remote shell (see above).
> [ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly
> with status 127.
> [ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1164
> [ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> errmgr_hnp.c at line 90
> [ansci.iastate.edu:03916] ERROR: A daemon on node node002 failed to
> start as expected.
> [ansci.iastate.edu:03916] ERROR: There may be more information
> available from
> [ansci.iastate.edu:03916] ERROR: the remote shell (see above).
> [ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly
> with status 127.
> [ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [nagrp2.ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job. Returned
> value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
>
>> mpiblast --copy-via=none  -p blastp -i ./bait.fasta -d ecoli.aa
>> (this should exit with the error message "Error: Shared and Local
>> storage must be identical when --copy_via=none")
>
> Here is the error -- not as expected:
> --------------------------------------
> Sorry, mpiBLAST must be run on 3 or more nodes
> [ansci.iastate.edu:04099] MPI_ABORT invoked on rank 0 in
> communicator MPI_COMM_WORLD with errorcode 0
>
>
>> mpiblast --pro-phile=asdfasdf --debug=logfile.txt  -p blastp -i
>> ./bait.fasta -d ecoli.aa
>> (this should write out "WARNING: --pro-phile is no longer supported" and
>> "logging to logfile.txt")
>
> Here is the error -- not as expected:
> --------------------------------------
> Sorry, mpiBLAST must be run on 3 or more nodes
> [ansci.iastate.edu:04102] MPI_ABORT invoked on rank 0 in
> communicator MPI_COMM_WORLD with errorcode 0
>
>
>> So, depending on how far through the list of commands you're able to get
>> error messages, we should be able to pin down where the program crashes.
>> Let me know how it goes.
>>
>> -aaron
>
> I noted my errors are a little different from what you expected, so I
> repeated them and made sure I copied the errors to the right trial
> commands.
>
> I hope these errors make some sense to you, so you can come up with
> more ideas on what to try next...
>
> Best regards,
>
> Zhiliang
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>



