[Bioclusters] sharing an SGE cluster

Thu May 5 13:24:19 EDT 2005

Hi All,

I've scanned the SGE documentation and user groups, and have not found
an answer to this question.  I got such good service last time I asked a
question here, I thought I'd try again!

I have a 10 node cluster (soon to grow), with SGE.  Two groups
contributed funds for the hardware.  Both groups have periods of heavy
use, and periods of very light use.  Hence, I'd like the following usage
model:

* If group A (or B) is the only one using it, they get all 10
  machines.
* If group A and group B are both using it, they effectively get 5
  machines each.
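
Stated as a rule, the model above is an even split of the nodes among
whichever groups currently have work.  A minimal Python sketch of that
rule follows.  The function name, the group labels and the handling of
a remainder are assumptions made for illustration; this is not an SGE
feature, only a statement of the intended allocation.

```python
def node_share(total_nodes, active_groups):
    """Split nodes evenly among the groups that currently have jobs.

    Returns a dict mapping group name to node count. Any remainder is
    handed out one node at a time in sorted group order (an assumption;
    the post only covers the even 10-node, 2-group case).
    """
    groups = sorted(active_groups)
    if not groups:
        return {}
    base, extra = divmod(total_nodes, len(groups))
    return {g: base + (1 if i < extra else 0) for i, g in enumerate(groups)}


print(node_share(10, {"A"}))        # {'A': 10}
print(node_share(10, {"A", "B"}))   # {'A': 5, 'B': 5}
```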

The jobs submitted tend to be very big array jobs, each part of the
array job taking 5 or 10 minutes.

It is easy enough to set up one queue on each machine for each group
(i.e. each machine has two queues), and control access by user ID.

But how to configure the queues?   Imagine group A is running on all 10
nodes, and group B submits.  What I would like to see, on the 5 group B
machines, is the group B jobs starting, the group A jobs completing, and
no more group A jobs being started (on the B machines).

I can't see how to do this.  The subordinate queue mechanism would
suspend the A queues, which kills the jobs; I'd need to modify all the
scripts that combine the results of array jobs to know how to deal with
killed pieces of array jobs.  What I think I need is an equivalent to
subordinate queues, but instead of suspending, it should disable the
queues to allow the jobs to complete.
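
To make the "disable instead of suspend" idea concrete, here is a rough
Python sketch of the decision an external helper (for example a cron
job) might make.  It assumes SGE's "qmod -d" and "qmod -e", which
disable and enable a queue; a disabled queue starts no new jobs while
jobs already running in it continue.  The queue and host names are
hypothetical, and the sketch only prints the commands rather than
running them.  Whether SGE can do this natively is the open question.

```python
def drain_plan(b_nodes, b_has_work):
    """Return (action, queue) pairs for group A's queues on B's nodes.

    Queue names like "A.node06" are hypothetical. "disable" is meant in
    the sense of `qmod -d`: running jobs finish, but no new jobs are
    started in that queue. "enable" corresponds to `qmod -e`.
    """
    action = "disable" if b_has_work else "enable"
    return [(action, f"A.{node}") for node in b_nodes]


def as_commands(plan):
    """Render a plan as qmod command lines (printed, not executed)."""
    flag = {"disable": "-d", "enable": "-e"}
    return [f"qmod {flag[action]} {queue}" for action, queue in plan]


b_nodes = [f"node{n:02d}" for n in range(6, 11)]   # hypothetical host names
for cmd in as_commands(drain_plan(b_nodes, b_has_work=True)):
    print(cmd)
```

Detecting that group B has pending or running work is left out; it
would have to come from parsing scheduler output, which is not shown.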

My solution right now is to set "nice" priorities, so that the A jobs
largely get out of the way of the B jobs on the B machines.  This is not
perfect; you end up with many processes running, and you end up with an
imbalance in how long a piece of an array job takes, depending on where
it is running, which can substantially lengthen overall run times (due
to some pieces being "stuck" on low-priority processes).
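
The "stuck" effect can be illustrated with a small simulation.  This
is a sketch under simplifying assumptions, not a model of SGE's
scheduler: each array task goes to whichever slot frees up first, and
a niced slot is represented by a relative speed below 1.0.  The speed
of 0.2 and the task count of 100 are made-up values.

```python
import heapq


def makespan(num_tasks, task_minutes, slot_speeds):
    """Greedy dispatch: each task goes to whichever slot frees up first.

    slot_speeds holds one relative speed per slot (1.0 = full speed).
    Returns the time in minutes at which the last task finishes.
    """
    # Heap of (time the slot becomes free, slot index).
    free_at = [(0.0, i) for i in range(len(slot_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        start, slot = heapq.heappop(free_at)
        end = start + task_minutes / slot_speeds[slot]
        finish = max(finish, end)
        heapq.heappush(free_at, (end, slot))
    return finish


# 100 five-minute tasks on 10 full-speed slots.
print(makespan(100, 5, [1.0] * 10))
# Same job when 5 of the slots run at an assumed 20% speed.
print(makespan(100, 5, [1.0] * 5 + [0.2] * 5))
```

The second run finishes later than the combined speed of the slots
alone would suggest, because the last tasks handed to slow slots keep
running after the fast slots have gone idle.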

This method doesn't scale nicely either: adding another group could
result in even more processes running on each node.

Thanks for any pointers,

Peter
