Hi Neil:

  If you simply skip the one node that seems to stall, does the run 
work?  For PBS, you might need to create a queue that excludes this node, 
or simply mark the node offline using pbsnodes.
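
  A rough sketch of the pbsnodes route, in case it helps.  The hostname 
"node07" is only a placeholder for whichever node stalls, and the run, 
offline_node and online_node helpers are names I made up for 
illustration.  It prints the commands rather than running them unless 
you set RUN=1, and it assumes you have PBS manager/operator rights on 
the server.

```shell
#!/bin/sh
# Sketch: take a suspect node out of PBS scheduling, then bring it back.
# "node07" is a hypothetical hostname; substitute the node that stalls.
# Dry run by default: commands are printed, not executed, unless RUN=1.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# pbsnodes -o marks the node OFFLINE so no new jobs are scheduled on it.
offline_node() { run pbsnodes -o "$1"; }

# pbsnodes -c clears the OFFLINE flag once the node is healthy again.
online_node() { run pbsnodes -c "$1"; }

offline_node node07
```

  With the node offline, resubmit the large job; if it completes, that 
points at the node itself rather than mpiBLAST.  Use online_node to put 
it back in service afterwards.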


Neil Saunders wrote:

>We are running mpiBLAST (1.2.1) on 3 different clusters, with LAM-MPI 
>and openPBS.
>2 of the clusters are fine, but one has recently developed some rather 
>bizarre output errors.  Small BLAST jobs (10s of sequences versus protein 
>nr database) run fine, but larger jobs (e.g. all proteins from a typical 
>microbial genome v. nr) have problems.  The BLAST output file starts to 
>write, but is truncated.  The nodes appear to run lamboot and lamhalt 
>fine, but one node seems to stall and we see this kind of error message:
>Unknown message tag (-32766) received by 1
>UUnknUkUnknknonwonwonwown nm nm nm mesessessesssagaegaegaege t at at
>atag g( g( g( (-3-23-23-232767667667666) )r )r )r
>reececeeceeceiivevievievedd b db db byy 2 y3 y4 5
>One of the processes started by mpirun has exited with a nonzero exit
>code.  This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>PID 8303 failed on node n0 ( due to signal 9.
>Has anyone seen anything like this before or have any ideas what the 
>error signifies?  I suspect the head node of this cluster may have a 
>different version of LAM-MPI to the slaves - could this be an issue?  
>mpiBLAST seemed to compile cleanly with lam 7.0.6.
>thanks for any ideas,
>Neil Saunders

