[Bioclusters] SGE Array Job tasks mysteriously disappear

Thu Mar 2 19:03:00 EST 2006

Debugging odd failures on clusters can really be hard.

For SGE clusters, the best source of debug/failure info is always going
to be the STDOUT/STDERR files produced by the jobs themselves.

Nine times out of ten this is where you'll find the most useful info.

Since it seems that you are not getting anything useful from those
files, the next place to look is the sge_execd logs from the machines
where the array tasks ran. The execd spool files will either be local
to the compute node or under your "$SGE_ROOT/<cell>/spool/<machineName>"
directory if you are running everything off of a shared filesystem.
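
If you have a lot of nodes to check, a small script can do the digging
for you. The following is only a rough sketch in Python: it assumes the
shared-filesystem layout described above and that the log file inside
each host directory is named "messages". The function names are made up
for illustration, and the match is a crude text search on the job ID, so
it may return unrelated lines that happen to contain the same number.

```python
import os
import re

def execd_messages_path(sge_root, cell, host):
    """Build the assumed location of one node's execd messages file."""
    # Assumed layout: $SGE_ROOT/<cell>/spool/<machineName>/messages
    return os.path.join(sge_root, cell, "spool", host, "messages")

def lines_mentioning_job(messages_path, job_id):
    """Return log lines containing job_id as a standalone number."""
    # The lookarounds keep job 4711 from matching inside 14711 or 47110.
    pattern = re.compile(r"(?<!\d)%s(?!\d)" % re.escape(str(job_id)))
    hits = []
    with open(messages_path, errors="replace") as handle:
        for line in handle:
            if pattern.search(line):
                hits.append(line.rstrip("\n"))
    return hits
```

You would call execd_messages_path() once per host that ran a missing
task and pass the result to lines_mentioning_job() along with the job ID.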

After the execd spool logs, the qmaster and schedd messages files may
also be of use, although they rarely give good info on job-level issues.

A third place to look is "/tmp" on the compute nodes -- when all else
fails and Grid Engine is in a panic situation and unable to spool
normally, it will log to /tmp/ on the host.
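
A quick way to check for such fallback logs is to list whatever changed
recently in /tmp on a node. This is a minimal Python sketch, not anything
that ships with Grid Engine. The default pattern "*messages*" is purely a
guess on my part at what the fallback files might be called, so widen it
to "*" if nothing turns up.

```python
import glob
import os
import time

def recent_files(directory="/tmp", pattern="*messages*",
                 max_age_hours=24, now=None):
    """List matching files modified within the window, newest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    found = []
    for path in glob.glob(os.path.join(directory, pattern)):
        if os.path.isfile(path):
            mtime = os.path.getmtime(path)
            if mtime >= cutoff:
                found.append((mtime, path))
    # Sort on modification time so the freshest file comes first.
    return [path for mtime, path in sorted(found, reverse=True)]
```

Run it on the compute node shortly after a task goes missing, so the time
window still covers the failure.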

Something you should also try:

  - Alter the value for "loglevel" in your grid engine configuration
    -- you may want to temporarily set "loglevel=log_info"

This was discussed in a recent thread on the SGE users mailing list.
The thread is here:
http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137

The sge_conf man page has this to say about loglevel:

> loglevel
>        This parameter specifies the level of detail that Grid Engine
>        components such as sge_qmaster(8) or sge_execd(8) use to
>        produce informative, warning or error messages which are
>        logged to the messages files in the master and execution
>        daemon spool directories (see the description of the
>        execd_spool_dir parameter above). The following message
>        levels are available:
>
>        log_err
>               All error events being recognized are logged.
>
>        log_warning
>               All error events being recognized and all detected
>               signs of potentially erroneous behavior are logged.
>
>        log_info
>               All error events being recognized, all detected signs
>               of potentially erroneous behavior and a variety of
>               informative messages are logged.
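
Before and after changing it, it is worth confirming which level is
actually in effect. Here is a small Python sketch that pulls the loglevel
out of the configuration text, for example the output of "qconf -sconf"
saved to a string. It assumes the configuration is printed as one
whitespace-separated "name value" pair per line. The function name and the
sample text are made up for illustration.

```python
# The three levels listed in the sge_conf man page excerpt above.
KNOWN_LEVELS = ("log_err", "log_warning", "log_info")

def current_loglevel(sconf_text):
    """Return the loglevel value from configuration text, or None."""
    for line in sconf_text.splitlines():
        fields = line.split(None, 1)  # split name from value on whitespace
        if len(fields) == 2 and fields[0] == "loglevel":
            return fields[1].strip()
    return None

# Hypothetical sample, not copied from a real cluster.
sample = "execd_spool_dir   /opt/sge/default/spool\nloglevel   log_warning\n"
level = current_loglevel(sample)
print(level, level in KNOWN_LEVELS)
```

The actual change is made in the configuration editor that "qconf -mconf"
opens. Remember to set the level back once you have the logs you need.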

The final troubleshooting step is to look into the Grid Engine
"KEEP_ACTIVE" execd parameter setting -- while it is enabled, it
disables deletion of the active_jobs/ directories that Grid Engine
uses to stage info while the job is active. Normally these
directories are deleted when the job drains from the system. Quite a
bit of useful environment, pid, trace and other information can be
found in these directories. This is one you'll have to watch out for,
though -- disabling the cleanup function could consume disk space
rapidly.
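
To keep an eye on both the contents and the disk usage while cleanup is
disabled, something like the following Python sketch may help. It assumes
that active_jobs/ holds one subdirectory per job or task and makes no
assumption about how those are named or which files appear inside. The
function name is made up for illustration.

```python
import os

def summarize_active_jobs(active_jobs_dir):
    """Map each subdirectory name to (sorted file names, total bytes)."""
    summary = {}
    for entry in sorted(os.listdir(active_jobs_dir)):
        task_dir = os.path.join(active_jobs_dir, entry)
        if not os.path.isdir(task_dir):
            continue  # only per-job directories are of interest
        names, total = [], 0
        for root, _dirs, files in os.walk(task_dir):
            for name in files:
                full = os.path.join(root, name)
                names.append(os.path.relpath(full, task_dir))
                total += os.path.getsize(full)
        summary[entry] = (sorted(names), total)
    return summary
```

Summing the byte counts across all entries gives a rough idea of how fast
the spool area is growing, which tells you when it is time to turn the
cleanup back on.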

Regards,
Chris

On Mar 2, 2006, at 5:12 PM, Shane Brubaker wrote:

> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an  
> array job never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as  
> finished with an exit status of 0, and they
> have a valid start and end time for the task.
>
> However, in the output log, the output clearly stops in between two  
> print statements near the top of the script.
>
>
> Has anyone seen this?  Any ideas?