[Bioclusters] sharing an SGE cluster
Chris Dagdigian
dag at sonsorol.org
Thu May 5 13:45:58 EDT 2005
The easiest way to do this is not with queues but with the built-in
Grid Engine resource-allocation and policy mechanisms. There are
many to choose from, assuming you are using SGE 5.3.x Enterprise
Edition or any version of Grid Engine 6.x.
You don't mention which version of SGE you are using, so I'll assume 6.x.
I'd also suggest that carving up the cluster machine by machine is
less efficient than allocating cluster resources on a percentage
basis (group A gets 50%; group B gets 50%; either group gets 100% when
the cluster is otherwise idle).
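To make that concrete, here is a rough sketch of the functional-share setup on SGE 6.x. The userset names (groupA, groupB), the user names, and the share values are invented for illustration; the attribute names are from the stock 6.x scheduler and userset configurations.

```shell
# Enable the functional policy: qconf -msconf opens the scheduler
# configuration in an editor; give the functional policy some weight,
# e.g.:
#   weight_tickets_functional  10000
qconf -msconf

# Put each group's users into their own userset (the list is created
# if it does not already exist; the user names here are placeholders):
qconf -au alice,bob groupA
qconf -au carol,dave groupB

# Edit each userset (qconf -mu) and mark it as a department with an
# equal number of functional shares, so the two groups split the
# cluster 50/50 whenever both have pending work:
#   type    ACL DEPT
#   fshare  1000
qconf -mu groupA
qconf -mu groupB
```

Because shares are ratios rather than machine counts, either department automatically expands to the whole cluster when the other one goes idle.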
I've done this many times using many different policy mechanisms. One
of the complexities of grid engine is that there are many ways you
can tackle a problem.
A whitepaper/howto that I wrote a while back may help you. It goes
step by step through the process of using the Grid Engine functional
share policy to allocate cluster resources on a percentage basis
between user groups (in that paper I did it by "department").
The URL is here:
http://bioteam.net/dag/sge6-funct-share-dept.html
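If you follow the howto, two commands are handy for checking that the policy is doing what you expect (both are stock SGE 6.x commands; the grep pattern is just a convenience):

```shell
# Confirm the functional policy is weighted in the scheduler config:
qconf -ssconf | grep -i functional

# The extended qstat view adds ticket columns; ftckt shows the
# functional tickets each job holds, which is how the percentage
# split between departments is actually enforced:
qstat -ext
```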
Hope it proves useful. All of the stuff I've promised people I'd put
online is linked from this generic URL:
http://bioteam.net/dag/
Regards,
Chris
On May 5, 2005, at 1:24 PM, <peter_webb at agilent.com> wrote:
> Hi All,
>
> I've scanned the SGE documentation and user groups, and have not found
> an answer to this question. I got such good service last time I
> asked a
> question here, I thought I'd try again!
>
> I have a 10-node cluster (soon to grow), with SGE. Two groups
> contributed funds for the hardware. Both groups have periods of heavy
> use and periods of very light use. Hence, I'd like the following use
> model:
>
> * If group A (or B) is the only one using it, they get all 10
> machines.
> * If group A and group B are both using it, they effectively get 5
> machines each.
>
> The jobs submitted tend to be very big array jobs, each part of the
> array job taking 5 or 10 minutes.
>
> It is easy enough to set up one queue on each machine for each group
> (i.e. each machine has two queues), and control access by user ID.
>
> But how to configure the queues? Imagine group A is running on all 10
> nodes, and group B submits. What I would like to see, on the 5 group B
> machines, is the group B jobs starting, the group A jobs completing,
> and no more group A jobs being started (on the B machines).
>
> I can't see how to do this. The subordinate queue mechanism would
> suspend the A queues, which kills the jobs; I'd need to modify all the
> scripts that combine the results of array jobs to know how to deal
> with killed pieces of array jobs. What I think I need is an equivalent
> to subordinate queues, but instead of suspending, it should disable
> the queues to allow the running jobs to complete.
>
> My solution right now is to set "nice" priorities, so that the A jobs
> largely get out of the way of the B jobs on the B machines. This is
> not perfect; you end up with many processes running, and with an
> imbalance in how long a piece of an array job takes, depending on
> where it is running, which can substantially lengthen overall run
> times (due to some pieces being "stuck" on low-priority processes).
>
> This method doesn't scale nicely either; adding another group could
> result in even more processes running on each node.
>
> Thanks for any pointers,
>
> Peter
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>