[Bioclusters] Re: new on using clusters: problem running mpiblast (2)

Zhiliang Hu hu at animalgenome.org
Thu Sep 20 15:21:55 EDT 2007


Aaron,

Many thanks for your hints.  Below is what I got when I tried them:

On Mon, 17 Sep 2007, Aaron Darling wrote:

> Date: Mon, 17 Sep 2007 13:25:31 -0700
> From: Aaron Darling <darling at cs.wisc.edu>
> Reply-To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
> Subject: Re: [Bioclusters] Re: new on using clusters: problem running mpiblast (2)
> 
> This seems to be quite an elusive problem...
> The rank 1 process is crashing with signal 11, which is usually a
> segmentation fault, indicating an invalid memory access.  Assuming you
> get the same behavior (no output) when running with --debug, it crashes
> very early in the program, prior to writing any debug output.  I can see
> two ways to debug the problem on your cluster, both of which will
> require some patience.  The red pill would be running mpiblast in an mpi
> debugger and seeing where the process crashes.  I'm unsure how the openmpi
> debugger works, but there should be some mechanism to attach to the rank
> 1 process.
> The blue pill involves running mpiblast with a few different command
> line options to see how far along in the program it gets before
> crashing.  That might narrow down the crash point enough to give a clue
> for solving the problem.  If you take the blue pill, run the
> following mpiblast commands:
>
> mpiblast --version
> (this prints the version and is the first thing the program does at
> startup.  if the program doesn't get that far then something is very wrong.)

Yep, it reports version 1.4.0.

> mpiblast
> (run with no arguments, this causes the program to exit before parsing
> the command-line and print an error message.  if the program doesn't get
> that far then something is very wrong.)

Indeed, it responded with the required-options message:
-------
mpiBLAST requires the following options: -d [database] -i [query file] -p [blast program name]
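
For reference, and assuming the ecoli.aa database has already been formatted for 
mpiBLAST, my understanding is that the smallest launch that supplies all three 
required options (and enough processes) would look something like the following, 
where ./machines is a plain Open MPI hostfile listing my compute nodes (the 
slots=2 values are only a guess at the per-node core count):

  # ./machines
  node001 slots=2
  node002 slots=2

  mpirun -np 3 -machinefile ./machines /home/local/bin/mpiblast \
      -p blastp -i ./bait.fasta -d ecoli.aa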


> mpiblast -a blah -b blah -c blah blah blah
> (run with bogus arguments.  the program should exit with "mpiBLAST
> requires the following options: -d [database] -i [query file] -p [blast
> program name]".
> This check happens after initializing the MPI libraries, so if you get
> this error, then the mpi libs were init'ed successfully )

Same as above.

> mpirun -np 2 -machinefile ./machines /home/local/bin/mpiblast -p blastp
> -i ./bait.fasta -d ecoli.aa
> (mpiblast should report that it needs to be run on at least three nodes)

Here is the error -- not as you expected:
---------------------------------------
bash: orted: command not found
bash: orted: command not found
[ansci.iastate.edu:03916] ERROR: A daemon on node node001 failed to start as expected.
[ansci.iastate.edu:03916] ERROR: There may be more information available from
[ansci.iastate.edu:03916] ERROR: the remote shell (see above).
[ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly with status 127.
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ansci.iastate.edu:03916] ERROR: A daemon on node node002 failed to start as expected.
[ansci.iastate.edu:03916] ERROR: There may be more information available from
[ansci.iastate.edu:03916] ERROR: the remote shell (see above).
[ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly with status 127.
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[nagrp2.ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
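
If I read the "status 127" and "orted: command not found" messages correctly, 
the Open MPI install directory is not on the PATH of the non-interactive shell 
that ssh starts on the compute nodes, so mpirun cannot launch its orted daemon 
there.  I plan to check and work around this roughly as follows (assuming Open 
MPI lives under /opt/openmpi here; I will substitute our actual install path):

  # does a non-interactive shell on a compute node find orted?
  ssh node001 which orted

  # if not, tell mpirun where Open MPI is installed on the nodes
  mpirun --prefix /opt/openmpi -np 3 -machinefile ./machines \
      /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa

(Exporting PATH and LD_LIBRARY_PATH for Open MPI in the shell startup files on 
each node should also work.)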


> mpiblast --copy-via=none  -p blastp -i ./bait.fasta -d ecoli.aa
> (this should exit with the error message "Error: Shared and Local
> storage must be identical when --copy_via=none")

Here is the error -- not as expected:
--------------------------------------
Sorry, mpiBLAST must be run on 3 or more nodes
[ansci.iastate.edu:04099] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
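
(As far as I understand, this check fires because running mpiblast directly, 
without mpirun, starts only a single MPI process, while mpiBLAST reserves a 
process or two for scheduling and output and needs at least one worker on top 
of that, hence the three-node minimum.  Once the orted/PATH problem above is 
sorted out, I assume this test should be repeated under mpirun, e.g.:

  mpirun -np 3 -machinefile ./machines /home/local/bin/mpiblast \
      --copy-via=none -p blastp -i ./bait.fasta -d ecoli.aa

and likewise for the --pro-phile/--debug test below.)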


> mpiblast --pro-phile=asdfasdf --debug=logfile.txt  -p blastp -i
> ./bait.fasta -d ecoli.aa
> (this should write out "WARNING: --pro-phile is no longer supported" and
> "logging to logfile.txt")

Here is the error -- not as expected:
--------------------------------------
Sorry, mpiBLAST must be run on 3 or more nodes
[ansci.iastate.edu:04102] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0


> So, depending on how far through the list of commands you're able to get
> error messages, we should be able to pin down where the program crashes.
> Let me know how it goes.
>
> -aaron

I noticed that my errors are a little different from what you expected, so I 
repeated the trials and made sure I matched each error to the right command.

I hope these errors make some sense to you and that you can suggest what to 
try next...

Best regards,

Zhiliang

