[Bioclusters] SGE Array Job tasks mysteriously disappear

James Cuff jcuff at broad.mit.edu
Thu Mar 2 18:34:40 EST 2006


Hi Shane,

So you might want to give us a bit more information.

As to seeing weird stuff on clusters, yeah we see a lot of it, *way* too 
much of it sometimes :)

Here are a bunch of questions I would ask myself if it happened to me:

Did I isolate it down to just an issue with the job array?
Does this only happen with this program or all programs I execute?
What is the code doing?
Are there "core" files in my output directory?
Are the binaries on an NFS server?  If so, is it having issues?  Check the 
logs for NFS timeouts.
Is a directory filling up (/tmp, /scratch, whatever)?
What do the syslogs on the remote machine say?
Is there a network issue that I've caused by running too much stuff at the 
same time, such as broken NIS/NFS?
Is the OOM killer running on the remote node?  Have I filled up all the 
memory?
Is it happening on just one node, or on a subset of nodes?
Am I writing to a database and not catching an error?
Does it happen with a really simple example?
Does it only happen on a Tuesday evening (system maintenance, for example)?

etc. etc.  It is a pain to debug things like this on a cluster; I feel 
your pain.
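
A few of the checks above can be scripted.  What follows is only a rough 
sketch in Python, not a definitive tool.  It assumes the task logs use the 
default SGE array naming of <job_name>.o<job_id>.<task_id>.  It also 
assumes, hypothetically, that your script prints a "HOST=<hostname>" line 
at the top and a "FINISHED" line at the very end; neither marker comes 
from your post, so adjust them to whatever your script really prints.

```python
import os
import re
from collections import Counter

# Default SGE stdout name for an array task: <job_name>.o<job_id>.<task_id>
LOG_NAME = re.compile(r"^(?P<name>.+)\.o(?P<job>\d+)\.(?P<task>\d+)$")


def scan_task_logs(log_dir, done_marker="FINISHED", host_prefix="HOST="):
    """Return (incomplete, hosts).

    incomplete: sorted task IDs whose log never shows done_marker.
    hosts: Counter of the hosts those incomplete tasks reported, if any.
    Both markers are hypothetical conventions, not something SGE writes.
    """
    incomplete = []
    hosts = Counter()
    for entry in os.listdir(log_dir):
        match = LOG_NAME.match(entry)
        if match is None:
            continue  # not an array-task log
        with open(os.path.join(log_dir, entry), errors="replace") as handle:
            lines = handle.read().splitlines()
        if any(done_marker in line for line in lines):
            continue  # this task reached the end of the script
        incomplete.append(int(match.group("task")))
        for line in lines:
            if line.startswith(host_prefix):
                hosts[line[len(host_prefix):].strip()] += 1
                break
    return sorted(incomplete), hosts


def find_core_files(output_dir):
    """List files named 'core' or 'core.<pid>' anywhere under output_dir."""
    cores = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name == "core" or re.fullmatch(r"core\.\d+", name):
                cores.append(os.path.join(root, name))
    return sorted(cores)
```

If the incomplete tasks all report the same one or two hosts, that points 
at the node rather than at the code, and the syslogs on those machines are 
the next place to look.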

Maybe have another look at what is going wrong and post back with some 
more information.  There are lots of people who can probably help, but at 
the moment there is not really enough for us to go on.  As you can see, it 
could be lots of things.

Best,

J.

On Thu, 2 Mar 2006, Shane Brubaker wrote:

> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an array job 
> never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as finished 
> with an exit status of 0, and they have a valid start and end time for 
> the task.
>
> However, in the output log, the output clearly stops in between two print 
> statements near the top of the script.
>
>
> Has anyone seen this?  Any ideas?
>
>
> Thanks,
> Shane
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>


