[Bioclusters] OpenPBS problems

Tim Cutts bioclusters@bioinformatics.org
Tue, 2 Dec 2003 14:23:43 +0000


On 02-Dec-03, Ron Chen wrote:
> > We also need a "swap out queue", i.e. a low priority
> > queue that will suspend running jobs (and swap them
> > out to disk) in case some jobs in other queues
> > need its CPU. Does such a feature exist under the PBS
> > system?
> 
> The closest thing you can use is checkpointing. I
> don't think a batch system can tell the OS to "swap
> processes to disk".

LSF can, although it is of course another example that costs $$.
Different queues can be configured to 'preempt' each other.  A
preempted job in a low priority queue is sent a SIGSTOP by LSF, and
will get a corresponding resume signal once the higher priority job has
finished.
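
To make the signal mechanics concrete, here is a minimal sketch of
suspending and resuming a child process by hand on a POSIX system.  It
is not how LSF is implemented; the helper names (suspend, resume, demo)
are made up for illustration, and I am assuming SIGCONT as the resume
signal.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Send SIGSTOP and wait until the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid return for stopped children, not only exited ones.
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(pid):
    """Send SIGCONT (assumed resume signal) and wait for the continue report."""
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


def demo():
    """Start a sleeping child (a stand-in for a batch job), stop it, resume it."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        stopped = suspend(child.pid)
        resumed = resume(child.pid)
    finally:
        # Clean up the stand-in job whatever happened above.
        child.kill()
        child.wait()
    return stopped, resumed


if __name__ == "__main__":
    print(demo())
```

Note that the stopped child keeps its memory, file descriptors and
sockets; SIGSTOP only takes it off the CPU.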

However, you should take great care with this sort of thing; suspending
lots of jobs can cause you to run out of resources really quickly.  We
got bitten by this: several hundred jobs with open connections to a
MySQL instance get suspended because they are preempted by a higher
priority set of jobs.  A suspended process keeps its connections open,
so none of them are released.  The higher priority set of several
hundred jobs starts, and tries to connect to the same instance.  *BOOM*.
The MySQL instance runs out of connections, or the hosting OS runs out
of threads, or whatever, and all the high priority jobs fail.  Or if
you're really unlucky the server falls over altogether.
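
The arithmetic of that failure can be sketched with a toy model.  This
is only an illustration: ToyServer, ConnectionLimitError and the numbers
used are hypothetical, and it models nothing about MySQL beyond a fixed
connection limit.

```python
class ConnectionLimitError(Exception):
    """Raised when the toy server has no free connection slots."""


class ToyServer:
    """Hypothetical stand-in for a database with a hard connection limit."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self):
        if self.open_connections >= self.max_connections:
            raise ConnectionLimitError("too many connections")
        self.open_connections += 1


def count_failed_high_priority_jobs(max_connections, low_jobs, high_jobs):
    """Return how many high priority jobs fail to connect.

    Assumes low_jobs <= max_connections, i.e. the low priority jobs all
    connected successfully before they were suspended.
    """
    server = ToyServer(max_connections)

    # Low priority jobs connect, then get suspended.  A suspended job
    # never disconnects, so its slot stays occupied.
    for _ in range(low_jobs):
        server.connect()

    # High priority jobs start and try to connect to the same server.
    failed = 0
    for _ in range(high_jobs):
        try:
            server.connect()
        except ConnectionLimitError:
            failed += 1
    return failed


if __name__ == "__main__":
    print(count_failed_high_priority_jobs(500, 300, 300))
```

With a limit of 500 connections, 300 suspended jobs leave room for only
200 of the 300 high priority jobs, so 100 of them fail.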

Users like preemption of jobs, but it's really risky.  We have it
switched off now, on almost all queues.  There is *one* queue which can
preempt others, and it's for use in dire emergencies only.

The rest of the time, we rely on the fact that when a job slot becomes
available, LSF will always start a job from the highest priority queue
that it can, given other resource requirements.
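
As a rough sketch of that dispatch rule (not LSF's actual scheduler):
when a slot frees up, walk the queues from highest priority down and
start the first pending job whose requirements fit.  The Job and
JobQueue classes, and the use of memory as the only resource
requirement, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    mem_mb: int  # hypothetical resource requirement


@dataclass
class JobQueue:
    name: str
    priority: int  # larger number means higher priority (assumption)
    pending: list = field(default_factory=list)


def pick_next_job(queues, free_mem_mb):
    """Return (queue name, job) for the next job to start, or None.

    Nothing running is suspended; we only choose which pending job
    gets the slot that has just become free.
    """
    for queue in sorted(queues, key=lambda q: q.priority, reverse=True):
        for index, job in enumerate(queue.pending):
            if job.mem_mb <= free_mem_mb:
                # Remove the job from its queue and hand it to the slot.
                return queue.name, queue.pending.pop(index)
    return None


if __name__ == "__main__":
    queues = [
        JobQueue("normal", 10, [Job("blast-1", 1000)]),
        JobQueue("urgent", 50, [Job("assembly-1", 8000)]),
    ]
    # Only 2000 MB free: the urgent job does not fit, so the normal one runs.
    print(pick_next_job(queues, free_mem_mb=2000))
```

The high priority work still gets to the front of the line, it just has
to wait for a slot to free up instead of taking one by force.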

Tim

-- 
Dr Tim Cutts
Informatics Systems Group
Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK