[Bioclusters] SGE Array Job tasks mysteriously disappear
Chris Dagdigian
dag at sonsorol.org
Thu Mar 2 19:03:00 EST 2006
Debugging odd failures on clusters can really be hard.
For SGE clusters the best source of debug/failure info is always going
to be the STDOUT/STDERR files produced by the jobs themselves.
Nine times out of ten this is where you'll find the most useful info.
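For an array job those files usually land in your home directory (or
the job's working directory if you submitted with -cwd), one .o/.e pair
per task, named after the job. A quick way to spot tasks whose output
stops early is to sort them by size -- the job name and id below are
just placeholders:

   # per-task output files for array job 123456, smallest listed last
   ls -lS myjob.o123456.* myjob.e123456.* | tail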
Since it seems that you are not getting anything useful from those
files, the next place to look is the sge_execd logs from the machines
where the array tasks ran. The execd spool files will either be local
to the compute node or under your $SGE_ROOT/<cell>/spool/<machineName>/
directory if you are running everything off of a shared filesystem.
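For example, assuming a cell named "default" and a compute node called
"node42" (both placeholders), grepping the execd messages file for the
job number is usually the quickest way in:

   grep 123456 $SGE_ROOT/default/spool/node42/messages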
After the execd spool logs, the qmaster and schedd messages files may
also be of use, although they rarely give good info on job-level issues.
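With the default spool layout both live under the qmaster spool
directory (cell name and job id are again placeholders):

   grep 123456 $SGE_ROOT/default/spool/qmaster/messages
   grep 123456 $SGE_ROOT/default/spool/qmaster/schedd/messages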
A third place to look is "/tmp" on the compute nodes -- when all else
fails and grid engine is in a panic situation and unable to spool
normally, it will log to /tmp/ on the host.
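The exact file names it uses in that situation vary, so the simplest
check is just to look for anything recently written there by the grid
engine admin user, e.g.:

   ls -lt /tmp | head -20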
Something you should also try:
- Alter the value for "loglevel" in your grid engine configuration --
  you may want to temporarily set "loglevel=log_info" (there is a qconf
  example after the man page excerpt below).
This was discussed in a recent thread on the SGE users mailing list:
http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137
The sge_conf man page has this to say about loglevel:
> loglevel
>      This parameter specifies the level of detail that Grid Engine
>      components such as sge_qmaster(8) or sge_execd(8) use to produce
>      informative, warning or error messages which are logged to the
>      messages files in the master and execution daemon spool
>      directories (see the description of the execd_spool_dir parameter
>      above). The following message levels are available:
>
>      log_err
>           All error events being recognized are logged.
>
>      log_warning
>           All error events being recognized and all detected signs of
>           potentially erroneous behavior are logged.
>
>      log_info
>           All error events being recognized, all detected signs of
>           potentially erroneous behavior and a variety of informative
>           messages are logged.
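Loglevel lives in the cluster configuration, so assuming you have
manager privileges something like this should let you bump it up
temporarily (note the old value first so you can put it back later):

   # show the current global configuration value
   qconf -sconf | grep loglevel

   # open the global configuration in an editor and change the
   # loglevel line to:  loglevel  log_info
   qconf -mconf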
The final troubleshooting step is to look into the Grid Engine
"KEEP_ACTIVE" execd parameter setting -- this will temporarily
disable deletion of the active_jobs/ directories that Grid Engine
uses to stage info while the job is active. Normally these
directories are deleted when the job drains from the system. Quite a
bit of useful environment, pid, trace and other information can be
found in these directories. This is one you'll have to watch out for
though -- disabling the cleanup function could consume disk space
rapidly.
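KEEP_ACTIVE goes into the execd_params line of the same configuration,
and the directories to dig through afterwards look roughly like this
(host, job and task ids are placeholders, and the path assumes your
execd spool lives under $SGE_ROOT):

   # in the global (or per-host) configuration set:
   #   execd_params  KEEP_ACTIVE=true
   qconf -mconf

   # then, on the node that ran a suspect task, poke around in the
   # active_jobs directory kept for that job/task
   ls  $SGE_ROOT/default/spool/node42/active_jobs/123456.7/
   cat $SGE_ROOT/default/spool/node42/active_jobs/123456.7/trace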
Regards,
Chris
On Mar 2, 2006, at 5:12 PM, Shane Brubaker wrote:
> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an
> array job never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as
> finished with an exit status of 0, and they
> have a valid start and end time for the task.
>
> However, in the output log, the output clearly stops in between two
> print statements near the top of the script.
>
>
> Has anyone seen this? Any ideas?