[Bioclusters] Admins / node?

Andrew D. Fant fant at pobox.com
Thu Feb 2 12:44:56 EST 2006


Christopher Dwan wrote:
> 
> Question on the state of the art in cluster management:
> 
> Approximately what level of dedicated support do folks on this list 
> have / wish you have for your clusters?  Obviously, there are a lot  of
> free variables, including but not limited to:
> 
> * Does the support person also do development, parallelization, or 
> otherwise *use* the cluster?
> 
> * Do their other responsibilities come from the IT side or the  research
> side (i.e:  Are we dedicating half of a unix admin, or half  of a postdoc?)
> 
> * How many users are being supported, and to what level?
> 
> Setting aside these and similar details that would make for valid, 
> comparable numbers, my gut feel is that a reasonable guess is one  half
> time IT person for the first fifty nodes or so.  I think it  scales
> logarithmically from there:   So, go up to an entire full time  person
> for 50 - 150, and add support staff incrementally as the  cluster
> becomes more huge.
> 
> -Chris Dwan
> 
> 
Chris,
   You're hitting on a subject near and dear to my heart.  I have some religion
on the subject, based in part on a few informal site visits and conversations
with others at Supercomputing and other meetings.  Let me preface this all with
a statement that while opinions are partially based on experiences at my
current employer, they are strictly mine, and not theirs.  Also, these opinions
are about groups/centers/facilities that are shared resources, and not in
support of one research group reporting to a locally omnipotent PI.

    First off, I think that for shared HPC/research computing resources to
succeed, they need to exist as a coherent entity.  I've seen attempts to
create groups in a matrix organizational scheme where you take someone from the
sysadmins and someone from networking and part of someone from the CS
department (for example) and have them remain administratively part of their
line organizations, remain sited there, and maybe even nominally only
communicate via email, phone, and occasional meetings.  This can work, but I
think it's a bad way to proceed.  It dilutes the focus and the effort of the
team, and makes it hard for cross-training in the unique aspects of the systems
and the process to happen.  This is not to say that I think that all functions
of a cluster support group need to be under one roof and/or manager.  If there
is a functional help desk in the organization, it's probably far better to
provide them with the tools they need for front-line support and a good API to
call specialized support as needed instead of building up a first-tier support
organization from scratch for the cluster.

     Next, I am not a big fan of rigidly dividing cluster support staff into
classes too soon.  In general principle, I think that anyone who administers a
biocluster ought to at least be able to describe what the main applications do
on a somewhat technical level and run simple jobs with the major applications
in use.  They may not know all the ins and outs of BLAST or GROMOS, but they
should know enough to at least be sure things aren't completely munged after a
system upgrade, or to calculate pi by a Monte Carlo method in MPI.  Likewise, someone supporting
applications or development might not need to be a senior-level unix
administrator, but probably ought to be skilled enough to administer their own
Linux desktop or to be trusted with the root password in a crisis situation
where all hands are needed.
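
    For what it's worth, the sort of post-upgrade sanity check meant here can
be very small.  What follows is only a sketch under assumptions: it uses
Python with the mpi4py bindings (not something named above), falls back to a
single process if mpi4py can't be imported, and the names count_hits and
estimate_pi are made up for illustration.

```python
import random

try:
    # Assumption: mpi4py is installed on the cluster; otherwise run serially.
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    comm, rank, size = None, 0, 1


def count_hits(samples, seed):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(samples_per_rank=200_000, base_seed=12345):
    """Each rank samples independently; rank 0 combines the counts."""
    # A different seed per rank keeps the ranks from repeating the same points.
    local_hits = count_hits(samples_per_rank, base_seed + rank)
    if comm is not None:
        total_hits = comm.reduce(local_hits, op=MPI.SUM, root=0)
    else:
        total_hits = local_hits
    if rank != 0:
        return None  # only rank 0 holds the combined result
    return 4.0 * total_hits / (samples_per_rank * size)


if __name__ == "__main__":
    pi_estimate = estimate_pi()
    if rank == 0:
        print(f"ranks={size} pi~{pi_estimate:.4f}")
```

Launched across a few nodes through the batch system (for example with
mpirun and a hypothetical file name such as pi_check.py), a wrong answer or a
hang says something about the MPI stack and interconnect rather than about
any one application.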

    As a rule of thumb, I think that a rough one-to-one parity between
"systems" and "applications" is a good thing.  This, of course, means that
until the group grows to a significant size, it will be hard to make hard and
fast barriers between the two teams, and that they will have to have a certain
level of cross-training and cross-coverage.  Obviously, not everyone can do
every job equally well, but if the sysadmin is in bed with the flu, it is good
for someone else to be able to restart the batch queues if they need to be.
Likewise, there are times when a user's issues may be closer to the operating
system and the hardware than to a given application, and having the admin
capable of interacting with end-users in a professional, consulting role is
sometimes the most effective solution.

     Assuming that we are talking about production clusters of 50 CPUs each or
more, with their own associated testbed systems, the architecture well defined,
and datacenter staff support, I would think that the first admin ought to be
able to handle 2 clusters without a total meltdown.  If they are responsible
for hardware issues like RAM upgrades and dead drive replacements, I would
probably put a cap of about 150 systems under their direct care at first, and
only add one new system class at a time.  If the group/unit/center is expected
to run its own infrastructure (DNS/Email/File/Web/LDAP), I would say a
second admin is really essential from the beginning.  Likewise, if this is a
totally new group providing centralized cluster services, I'd say a second
admin is needed.  They might not be needed once things are up and running, so
it's a good candidate for a consultant position, but asking one admin to do the
architectural work in a relative vacuum is a bad thing in my book.  Once a
facility has more than 2 clusters in production, it's definitely time to look
at adding another admin.  I guess that Chris is right that it could be roughly
logarithmic in scaling, though I think that strongly varies depending on
whether admins do hardware and networking support, or are strictly systems
support.  In the former case, I would say that the relevant metric is number of
CPUs managed (or possibly number of hard drives, in deference to the
no-moving-parts-on-compute-nodes contingent).  In the latter, I would say that it's
easier to say one admin for 2-3 well-defined clusters and 0.75 - 1.25 FTE of an
admin for a new cluster with a new architecture (system or cluster) being
brought on-line.
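
    To put rough numbers on the two rules of thumb in this thread, here is a
back-of-the-envelope sketch in Python.  Treat it as one possible reading and
not a formula either of us actually proposed: the base-3 logarithm in
dwan_fte is an assumption chosen only because it reproduces Chris's two
anchor points (0.5 FTE at 50 nodes, 1 FTE at 150), and both function names
are hypothetical.

```python
import math


def dwan_fte(nodes):
    """Chris's guess: half an admin up to 50 nodes, a full one at 150.

    Assumption: past 50 nodes, add 0.5 FTE for every tripling of node count.
    """
    if nodes <= 0:
        return 0.0
    if nodes <= 50:
        return 0.5
    return 0.5 + 0.5 * math.log(nodes / 50, 3)


def fant_fte_range(established_clusters, new_architecture_clusters=0):
    """Systems-only support: one admin per 2-3 well-defined clusters,
    plus 0.75 - 1.25 FTE for each cluster with a new architecture.

    Returns a (low, high) pair of admin FTEs.
    """
    low = established_clusters / 3 + 0.75 * new_architecture_clusters
    high = established_clusters / 2 + 1.25 * new_architecture_clusters
    return low, high


if __name__ == "__main__":
    for n in (50, 150, 450):
        print(n, "nodes:", round(dwan_fte(n), 2), "FTE")
    print("3 established + 1 new:", fant_fte_range(3, 1))
```

Under this reading, three established clusters plus one new architecture come
out at 1.75 to 2.75 admin FTEs.  It deliberately ignores the other conditions
above (hardware duties, running your own infrastructure, a brand-new group),
each of which argues for more staff.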

    *whew*  Having said all that, I could go into greater depth on the mix of
user support skills involved, but I suspect that I've said enough for the
minute.  If anyone wants to go into the user support side, speak up with your
views, and I'll find time to pontificate on it later today. 8-)

Andy

-- 
Andrew Fant    | And when the night is cloudy    | This space to let
Molecular Geek | There is still a light          |----------------------
fant at pobox.com | That shines on me               | Disclaimer:  I don't
Boston, MA     | Shine until tomorrow, Let it be | even speak for myself


