Debugging odd failures on clusters can be really hard.

For SGE clusters the best source of debug/failure info is the STDOUT/STDERR files produced by the jobs themselves. Nine times out of ten this is where you'll find the most useful info.

Since it seems that you are not getting anything useful from those files, the next place to look is the sge_execd logs from the machines where the array tasks ran. The execd spool files will either be local to the compute node or under your "$SGE_ROOT/<cell>/spool/<machineName>" directory if you are running everything off of a shared filesystem.

After the execd spool logs, the qmaster and schedd messages files may also be of use, although they rarely give good info on job-level issues.

Another place to look is "/tmp" on the compute nodes -- when all else fails and Grid Engine is in a panic situation and unable to spool normally, it will log to /tmp/ on the host.

Something you should also try:

 - Alter the value for "loglevel" in your Grid Engine configuration -- you may want to temporarily set "loglevel=log_info"

This was discussed recently on the SGE users mailing list. The thread is here:

http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137

The sge_conf man page has this to say about loglevel:

> loglevel
>     This parameter specifies the level of detail that Grid Engine
>     components such as sge_qmaster(8) or sge_execd(8) use to produce
>     informative, warning or error messages which are logged to the
>     messages files in the master and execution daemon spool
>     directories (see the description of the execd_spool_dir
>     parameter above). The following message levels are available:
>
>     log_err
>         All error events being recognized are logged.
>
>     log_warning
>         All error events being recognized and all detected signs of
>         potentially erroneous behavior are logged.
>
>     log_info
>         All error events being recognized, all detected signs of
>         potentially erroneous behavior and a variety of informative
>         messages are logged.

The final troubleshooting step is to look into the Grid Engine "KEEP_ACTIVE" execd parameter setting -- this will temporarily disable deletion of the active_jobs/ directories that Grid Engine uses to stage info while the job is active. Normally these directories are deleted when the job drains from the system. Quite a bit of useful environment, pid, trace and other information can be found in these directories. This is one you'll have to watch out for though -- disabling the cleanup function could consume disk space rapidly.

Regards,
Chris


On Mar 2, 2006, at 5:12 PM, Shane Brubaker wrote:

> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an
> array job never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as
> finished with an exit status of 0, and they have a valid start and
> end time for the task.
>
> However, in the output log, the output clearly stops in between two
> print statements near the top of the script.
>
> Has anyone seen this? Any ideas?
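
As a rough illustration of the log hunt described in the reply above, the following Python sketch collects every line that mentions a given job ID from a set of log files (execd messages files, qmaster messages, stray files under /tmp, and so on). It assumes nothing about the Grid Engine log format and only does a whole-number text match. The function name find_job_lines and the command-line usage are made up for this example and are not part of Grid Engine.

```python
import re
import sys
from pathlib import Path


def find_job_lines(paths, job_id):
    """Return (path, line_number, text) for every line mentioning job_id."""
    # \b keeps job 42 from matching inside 420 or 1042.
    pattern = re.compile(r"\b" + re.escape(str(job_id)) + r"\b")
    hits = []
    for path in paths:
        p = Path(path)
        if not p.is_file():
            # Spool locations differ from site to site; skip what is absent.
            continue
        with p.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern.search(line):
                    hits.append((str(p), number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python find_job_lines.py JOB_ID FILE [FILE ...]
    if len(sys.argv) < 3:
        sys.exit("usage: find_job_lines.py JOB_ID FILE [FILE ...]")
    for path, number, text in find_job_lines(sys.argv[2:], sys.argv[1]):
        print(f"{path}:{number}: {text}")
```

Pass it the messages files from whichever spool directories apply at your site. A plain number match can still pick up unrelated lines (timestamps, pids), so treat the output as a starting point rather than a definitive list.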