[Bioclusters] mpiBLAST errors
Neil Saunders
bioclusters@bioinformatics.org
Tue, 3 Aug 2004 11:59:32 +1000
We are running mpiBLAST (1.2.1) on 3 different clusters, with LAM-MPI
and openPBS.
Two of the clusters are fine, but one has recently developed some rather
bizarre output errors. Small BLAST jobs (tens of sequences versus the protein
nr database) run fine, but larger jobs (e.g. all proteins from a typical
microbial genome v. nr) have problems. The BLAST output file starts to
write, but is truncated. The nodes appear to run lamboot and lamhalt
fine, but one node seems to stall and we see this kind of error message:
----------------------------------------------------------------------------
Unknown message tag (-32766) received by 1
UUnknUkUnknknonwonwonwown nm nm nm mesessessesssagaegaegaege t at at
atag g( g( g( (-3-23-23-232767667667666) )r )r )r
reececeeceeceiivevievievedd b db db byy 2 y3 y4 5
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 8303 failed on node n0 (10.0.92.100) due to signal 9.
-----------------------------------------------------------------------------
Has anyone seen anything like this before or have any ideas what the
error signifies? I suspect the head node of this cluster may have a
different version of LAM-MPI to the slaves - could this be an issue?
mpiBLAST seemed to compile cleanly with lam 7.0.6.
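To test the version-mismatch suspicion, one could note the LAM-MPI version string on each machine and compare them. The following is only a minimal sketch: the node names and version strings are hypothetical placeholders, and the helper names (group_nodes_by_version, is_consistent) are made up for illustration. It does not query LAM itself; it only compares strings that have been read off each node by hand.

```python
from collections import defaultdict


def group_nodes_by_version(versions):
    """Group node names by the version string each node reported.

    `versions` maps a node name to the version string noted for it.
    Surrounding whitespace is ignored when comparing.
    """
    groups = defaultdict(list)
    for node, version in versions.items():
        groups[version.strip()].append(node)
    return {version: sorted(nodes) for version, nodes in groups.items()}


def is_consistent(versions):
    """Return True when every node reported the same version string."""
    return len(group_nodes_by_version(versions)) <= 1


if __name__ == "__main__":
    # Hypothetical values for illustration only, not taken from our cluster.
    reported = {"head": "7.0.4", "node1": "7.0.6", "node2": "7.0.6"}
    for version, nodes in sorted(group_nodes_by_version(reported).items()):
        print(version, ", ".join(nodes))
    print("consistent" if is_consistent(reported) else "MISMATCH")
```

If more than one group is reported, the nodes in the smaller group are the first place to look.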
Thanks for any ideas,
Neil Saunders
--
School of Biotechnology and Biomolecular Sciences,
The University of New South Wales,
Sydney 2052,
Australia
http://psychro.bioinformatics.unsw.edu.au/neil/index.php