[Bioclusters] Re: new on using clusters: problem running mpiblast (2)

Aaron Darling darling at cs.wisc.edu
Mon Sep 17 16:25:31 EDT 2007


This seems to be quite an elusive problem...
The rank 1 process is crashing with signal 11, which is usually a 
segmentation fault, indicating an invalid memory access.  Assuming you 
get the same behavior (no output) when running with --debug, it crashes 
very early in the program, prior to writing any debug output.  I can see 
two ways to debug the problem on your cluster, both of which will 
require some patience.  The red pill would be to run mpiblast under an 
MPI-aware debugger and see where the process crashes.  I'm not sure how 
debugging works with OpenMPI, but there should be some mechanism to 
attach to the rank 1 process.
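For example, with OpenMPI one commonly used trick (just a sketch -- it 
assumes gdb and xterm are available and that X display forwarding to the 
nodes works) is to launch each rank inside its own gdb session:

mpirun -np 3 -machinefile ./machines xterm -e gdb --args /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa

Then type "run" in each gdb window and, once rank 1 dies with the 
segfault, type "bt" to get a backtrace showing where it crashed.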
The blue pill involves running mpiblast with a few different command 
line options to see how far along in the program it gets before 
crashing.  That might narrow down the crash point enough to give a clue 
for solving the problem.  If you take the blue pill, run the 
following mpiblast commands:

mpiblast --version
(this prints the version and is the first thing the program does at 
startup.  If the program doesn't get that far then something is very wrong.)

mpiblast
(run with no arguments, this causes the program to print an error 
message and exit before parsing the command line.  If the program 
doesn't get that far then something is very wrong.)

mpiblast -a blah -b blah -c blah blah blah
(run with bogus arguments.  The program should exit with "mpiBLAST 
requires the following options: -d [database] -i [query file] -p [blast 
program name]".
This check happens after initializing the MPI libraries, so if you get 
this error, then the MPI libraries were initialized successfully.)

mpirun -np 2 -machinefile ./machines /home/local/bin/mpiblast -p blastp 
-i ./bait.fasta -d ecoli.aa
(mpiblast should report that it needs to be run on at least three nodes)

mpiblast --copy-via=none  -p blastp -i ./bait.fasta -d ecoli.aa
(this should exit with the error message "Error: Shared and Local 
storage must be identical when --copy_via=none")

mpiblast --pro-phile=asdfasdf --debug=logfile.txt  -p blastp -i 
./bait.fasta -d ecoli.aa
(this should write out "WARNING: --pro-phile is no longer supported" and 
"logging to logfile.txt")


So, depending on how far down the list of commands you still get the 
expected error messages, we should be able to pin down where the program 
crashes.
Let me know how it goes.

-aaron



Zhiliang Hu wrote:
> Thanks Aaron,
>
> Indeed I got it compiled before (and now again, without my last 
> reported "CC/CPP" exports, and with or without the generic "export 
> CC=mpicc" and "export CXX=mpicxx" suggested by Zhao Xu).
>
> The problem is that when I run mpiblast with:
>   /opt/openmpi.gcc/bin/mpirun -np 16 -machinefile ./machines
>        /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa
>
> I got the following error, and I don't have a clue where to look for 
> the cause:
>
> 1       0.095628        Bailing out with signal 11
> [node001:13406] MPI_ABORT invoked on rank 1 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 0       0.101815        Bailing out with signal 15
> [node001:13405] MPI_ABORT invoked on rank 0 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 15      0.157852        Bailing out with signal 15
> [node001:13420] MPI_ABORT invoked on rank 15 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 2       0.105103        Bailing out with signal 15
> [node001:13407] MPI_ABORT invoked on rank 2 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 3       0.109706        Bailing out with signal 15
> [node001:13408] MPI_ABORT invoked on rank 3 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 4       0.114032        Bailing out with signal 15
> [node001:13409] MPI_ABORT invoked on rank 4 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 5       0.117891        Bailing out with signal 15
> [node001:13410] MPI_ABORT invoked on rank 5 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 6       0.122292        Bailing out with signal 15
> [node001:13411] MPI_ABORT invoked on rank 6 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 7       0.125675        Bailing out with signal 15
> [node001:13412] MPI_ABORT invoked on rank 7 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 8       0.129363        Bailing out with signal 15
> [node001:13413] MPI_ABORT invoked on rank 8 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 9       0.134528        Bailing out with signal 15
> [node001:13414] MPI_ABORT invoked on rank 9 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 10      0.138087        Bailing out with signal 15
> [node001:13415] MPI_ABORT invoked on rank 10 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 11      0.141622        Bailing out with signal 15
> [node001:13416] MPI_ABORT invoked on rank 11 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 12      0.145868        Bailing out with signal 15
> [node001:13417] MPI_ABORT invoked on rank 12 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 13      0.149375        Bailing out with signal 15
> [node001:13418] MPI_ABORT invoked on rank 13 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 14      0.152966        Bailing out with signal 15
> [node001:13419] MPI_ABORT invoked on rank 14 in communicator 
> MPI_COMM_WORLD with errorcode 0
>
> [As related information, mpirun works fine when tested with a small 
> "hello" program that showed responses from all nodes].
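>
> (For reference, the hello test was just a minimal MPI hello-world along 
> these lines -- a sketch, not necessarily the exact program used:)
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv) {
>       int rank, size, len;
>       char name[MPI_MAX_PROCESSOR_NAME];
>       MPI_Init(&argc, &argv);                /* start MPI */
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
>       MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
>       MPI_Get_processor_name(name, &len);    /* node this rank runs on */
>       printf("hello from rank %d of %d on %s\n", rank, size, name);
>       MPI_Finalize();
>       return 0;
>   }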
>
> -- 
> Zhiliang
>
>
> On Sun, 9 Sep 2007, Aaron Darling wrote:
>
>> Date: Sun, 09 Sep 2007 08:04:14 +1000
>> From: Aaron Darling <darling at cs.wisc.edu>
>> Reply-To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> Subject: Re: [Bioclusters] Re: new on using clusters: problem running 
>> mpiblast
>>      (2)
>>
>> Hi Zhiliang
>>
>> For reasons that are beyond me, the version of autoconf that we used to
>> package mpiBLAST 1.4.0 does not approve of setting CC and/or CXX to
>> mpicc or mpicxx.  Doing so results in the autoconf error you have
>> observed.  For that reason we added the --with-mpi=/path/to/mpi
>> configure option.  It should be sufficient to use that option alone to
>> set the preferred compiler path.  If not, then it's a bug in the
>> mpiblast configure system.
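>>
>> For example (just a sketch, reusing the OpenMPI prefix that appears in 
>> your mpirun path), something along the lines of
>>     ./configure --with-mpi=/opt/openmpi.gcc
>> should pick up the MPI compilers without exporting CC/CXX yourself.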
>>
>> In response to your other query, I personally have not used mpiblast
>> with OpenMPI but I believe others have.  The 1.4.0 release was tested
>> against mpich1/2 and LAM.
>>
>> Regards,
>> -Aaron
>>
>>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters


