Hi Neil:

If you simply skip the one node that seems to stall, does the run work? For PBS, you might need to create a queue that excludes this node, or simply mark the node down using pbsnodes.

Joe

Neil Saunders wrote:

>We are running mpiBLAST (1.2.1) on 3 different clusters, with LAM-MPI
>and openPBS.
>
>2 of the clusters are fine, but one has recently developed some rather
>bizarre output errors. Small BLAST jobs (10s of sequences versus protein
>nr database) run fine, but larger jobs (e.g. all proteins from a typical
>microbial genome v. nr) have problems. The BLAST output file starts to
>write, but is truncated. The nodes appear to run lamboot and lamhalt
>fine, but one node seems to stall and we see this kind of error message:
>
>----------------------------------------------------------------------------
>Unknown message tag (-32766) received by 1
>UUnknUkUnknknonwonwonwown nm nm nm mesessessesssagaegaegaege t at at
>atag g( g( g( (-3-23-23-232767667667666) )r )r )r
>reececeeceeceiivevievievedd b db db byy 2 y3 y4 5
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 8303 failed on node n0 (10.0.92.100) due to signal 9.
>-----------------------------------------------------------------------------
>
>
>Has anyone seen anything like this before or have any ideas what the
>error signifies? I suspect the head node of this cluster may have a
>different version of LAM-MPI to the slaves - could this be an issue?
>mpiBLAST seemed to compile cleanly with lam 7.0.6.
>
>thanks for any ideas,
>
>Neil Saunders
>
>

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615
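
P.S. If changing the queue or marking the node down is not convenient, one rough way to try the "skip the stalling node" test is to filter the host list before booting LAM. The sketch below is only an illustration under assumptions: it supposes the job's hosts are available one per line (as in the file PBS names in $PBS_NODEFILE), and the function name filter_nodes and the hostnames n0, n1, n2 are made up for the example. It is written in Python and is not part of mpiBLAST, LAM or PBS.

```python
def filter_nodes(lines, skip):
    """Return the hostnames found in `lines`, in order, minus any in `skip`.

    `lines` is any iterable of text lines holding one hostname each
    (for example an open node file). Blank lines are dropped. Repeated
    hostnames are kept as they are, so per-host repeat counts are unchanged.
    `skip` is a set of hostnames to leave out (hypothetical names here).
    """
    kept = []
    for raw in lines:
        host = raw.strip()
        if not host:
            continue  # ignore empty lines
        if host in skip:
            continue  # leave out the suspect node
        kept.append(host)
    return kept


# Example with made-up hostnames: drop the node suspected of stalling.
example_lines = ["n0\n", "n1\n", "n1\n", "\n", "n2\n"]
for host in filter_nodes(example_lines, {"n1"}):
    print(host)
```

Writing the surviving hostnames to a file, one per line, and passing that file to lamboot would then boot LAM without the suspect node. If the large job completes on the reduced host list, that points at the excluded node rather than at mpiBLAST itself.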