[Bioclusters] sharing an SGE cluster
Chris Dagdigian
dag at sonsorol.org
Thu May 5 13:45:58 EDT 2005
The easiest way to do this is not with queues but with the built-in
Grid Engine resource-allocation and policy mechanisms. There are
many to choose from, assuming you are using SGE 5.3.x Enterprise
Edition or any version of Grid Engine 6.x.
You don't mention which version of SGE you are using, so I'll assume 6.x.
I'd also suggest that carving up the cluster machine by machine is
less efficient than allocating cluster resources on a percentage
basis (group A gets 50%; group B gets 50%; either group gets 100% when
the cluster is otherwise idle).
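To make that concrete, here is a rough sketch of the functional-share setup on SGE 6.x. The userset names (groupA, groupB), the user names, and the share values are invented for illustration; the attribute names are from the stock 6.x scheduler and userset configurations.

```shell
# Enable the functional policy: qconf -msconf opens the scheduler
# configuration in an editor; give the functional policy some weight,
# e.g.:
#   weight_tickets_functional  10000
qconf -msconf

# Put each group's users into their own userset (the list is created
# if it does not already exist; the user names here are placeholders):
qconf -au alice,bob groupA
qconf -au carol,dave groupB

# Edit each userset (qconf -mu) and mark it as a department with an
# equal number of functional shares, so the two groups split the
# cluster 50/50 whenever both have pending work:
#   type    ACL DEPT
#   fshare  1000
qconf -mu groupA
qconf -mu groupB
```

Because shares are ratios rather than machine counts, either department automatically expands to the whole cluster when the other one goes idle.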
I've done this many times using many different policy mechanisms. One
of the complexities of grid engine is that there are many ways you
can tackle a problem.
A whitepaper/howto that I wrote a while back may help you. It goes
step by step through the process of using the Grid Engine functional
share policy to allocate cluster resources on a percentage basis
between user groups (in that paper I did it by "department").
The URL is here:
http://bioteam.net/dag/sge6-funct-share-dept.html
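If you follow the howto, two commands are handy for checking that the policy is doing what you expect (both are stock SGE 6.x commands; the grep pattern is just a convenience):

```shell
# Confirm the functional policy is weighted in the scheduler config:
qconf -ssconf | grep -i functional

# The extended qstat view adds ticket columns; ftckt shows the
# functional tickets each job holds, which is how the percentage
# split between departments is actually enforced:
qstat -ext
```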
Hope it proves useful. All of the stuff I've promised people I'd put
online is linked from this generic URL:
http://bioteam.net/dag/
Regards,
Chris
On May 5, 2005, at 1:24 PM, <peter_webb at agilent.com> wrote:
> Hi All,
>
> I've scanned the SGE documentation and user groups, and have not found
> an answer to this question. I got such good service last time I
> asked a
> question here, I thought I'd try again!
>
> I have a 10-node cluster (soon to grow), with SGE. Two groups
> contributed funds for the hardware. Both groups have periods of heavy
> use and periods of very light use. Hence, I'd like the following use
> model:
>
> * If group A (or B) is the only one using it, they get all 10
> machines.
> * If group A and group B are both using it, they effectively get 5
> machines each.
>
> The jobs submitted tend to be very big array jobs, each part of the
> array job taking 5 or 10 minutes.
>
> It is easy enough to set up one queue on each machine for each group
> (i.e. each machine has two queues), and control access by user ID.
>
> But how to configure the queues? Imagine group A is running on all 10
> nodes, and group B submits. What I would like to see, on the 5 group B
> machines, is the group B jobs starting, the group A jobs completing,
> and no more group A jobs being started (on the B machines).
>
> I can't see how to do this. The subordinate queue mechanism would
> suspend the A queues, which kills the jobs; I'd need to modify all the
> scripts that combine the results of array jobs to know how to deal
> with killed pieces of array jobs. What I think I need is an equivalent
> to subordinate queues, but instead of suspending, it should disable
> the queues to allow the running jobs to complete.
>
> My solution right now is to set "nice" priorities, so that the A jobs
> largely get out of the way of the B jobs on the B machines. This is
> not perfect; you end up with many processes running, and with an
> imbalance in how long a piece of an array job takes, depending on
> where it is running, which can substantially lengthen overall run
> times (due to some pieces being "stuck" on low-priority processes).
>
> This method doesn't scale nicely either; adding another group could
> result in even more processes running on each node.
>
> Thanks for any pointers,
>
> Peter
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>