[Bioclusters] mpiBLAST errors

Neil Saunders bioclusters@bioinformatics.org
Tue, 3 Aug 2004 11:59:32 +1000

We are running mpiBLAST (1.2.1) on 3 different clusters, with LAM-MPI 
and openPBS.

Two of the clusters are fine, but one has recently developed some rather
bizarre output errors.  Small BLAST jobs (tens of sequences versus the protein
nr database) run fine, but larger jobs (e.g. all proteins from a typical
microbial genome versus nr) have problems.  The BLAST output file starts to
write, but is truncated.  The nodes appear to run lamboot and lamhalt 
fine, but one node seems to stall and we see this kind of error message:

Unknown message tag (-32766) received by 1
UUnknUkUnknknonwonwonwown nm nm nm mesessessesssagaegaegaege t at at
atag g( g( g( (-3-23-23-232767667667666) )r )r )r
reececeeceeceiivevievievedd b db db byy 2 y3 y4 5
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 8303 failed on node n0 ( due to signal 9.

Has anyone seen anything like this before or have any ideas what the 
error signifies?  I suspect the head node of this cluster may have a 
different version of LAM-MPI to the slaves - could this be an issue?  
mpiBLAST seemed to compile cleanly with lam 7.0.6.
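
One way to test the version-mismatch suspicion is to note the LAM-MPI
version string reported on each node and compare them.  The sketch below
is only an illustration of that comparison: the function names are
hypothetical, and the node names and the "7.0.4" value are made-up
placeholders (only 7.0.6 comes from the build mentioned above).  It
assumes the version strings have already been collected by hand from
each node.

```python
from collections import defaultdict


def group_nodes_by_version(versions):
    """Group node names by the version string each node reported."""
    groups = defaultdict(list)
    for node, version in versions.items():
        groups[version.strip()].append(node)
    return {version: sorted(nodes) for version, nodes in groups.items()}


def find_mismatches(versions):
    """Return {} if all nodes agree, otherwise the version -> nodes grouping."""
    groups = group_nodes_by_version(versions)
    return groups if len(groups) > 1 else {}


# Placeholder data for illustration only, not real measurements.
reported = {"n0": "7.0.6", "n1": "7.0.4", "n2": "7.0.4"}

mismatches = find_mismatches(reported)
if mismatches:
    for version, nodes in sorted(mismatches.items()):
        print("LAM %s: %s" % (version, ", ".join(nodes)))
else:
    print("All nodes report the same LAM version.")
```

If the grouping comes back with more than one version, the head node and
slaves are not running the same LAM-MPI, which would be the first thing
to rule out.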

thanks for any ideas,

Neil Saunders
 School of Biotechnology and Biomolecular Sciences,
 The University of New South Wales,
 Sydney 2052,