Christopher Dwan wrote:
>
> Question on the state of the art in cluster management:
>
> Approximately what level of dedicated support do folks on this list
> have / wish you have for your clusters? Obviously, there are a lot of
> free variables, including but not limited to:
>
> * Does the support person also do development, parallelization, or
> otherwise *use* the cluster?
>
> * Do their other responsibilities come from the IT side or the research
> side (i.e: Are we dedicating half of a unix admin, or half of a postdoc?)
>
> * How many users are being supported, and to what level?
>
> Setting aside these and similar details that would make for valid,
> comparable numbers, my gut feel is that a reasonable guess is one half
> time IT person for the first fifty nodes or so. I think it scales
> logarithmically from there: So, go up to an entire full time person
> for 50 - 150, and add support staff incrementally as the cluster
> becomes more huge.
>
> -Chris Dwan
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>

Chris,

You're hitting on a subject near and dear to my heart. I have some religion on the subject, based in part on a few informal site visits and conversations with others at Supercomputing and other meetings. Let me preface this all with a statement that while these opinions are partially based on experiences at my current employer, they are strictly mine, and not theirs. Also, these opinions are about groups/centers/facilities that are shared resources, and not in support of one research group reporting to a locally omnipotent PI.

First off, I think that for shared HPC/research computing resources to succeed, they need to exist as a coherent entity.
I've seen attempts to create groups in a matrix organizational scheme where you take someone from the sysadmins and someone from networking and part of someone from the CS department (for example) and have them remain administratively part of their line organizations, remain sited there, and maybe even nominally only communicate via email, phone, and occasional meetings. This can work, but I think it's a bad way to proceed. It dilutes the focus and the effort of the team, and makes it hard for cross-training in the unique aspects of the systems and the process to happen.

This is not to say that I think that all functions of a cluster support group need to be under one roof and/or manager. If there is a functional help desk in the organization, it's probably far better to provide them with the tools they need for front-line support and a good API to call specialized support as needed instead of building up a first-tier support organization from scratch for the cluster.

Next, I am not a big fan of rigidly dividing cluster support staff into classes too soon. As a general principle, I think that anyone who administers a biocluster ought to at least be able to describe what the main applications do on a somewhat technical level and run simple jobs with the major applications in use. Maybe not know all the ins and outs of BLAST or GROMOS, but enough to at least be sure things aren't completely munged after a system upgrade, or to calculate pi by a Monte Carlo method in MPI. Likewise, someone supporting applications or development might not need to be a senior-level unix administrator, but probably ought to be skilled enough to administer their own Linux desktop or to be trusted with the root password in a crisis situation where all hands are needed.

As a rule of thumb, I think that a rough one-to-one parity between "systems" and "applications" is a good thing.
This, of course, means that until the group grows to a significant size, it will be hard to make hard and fast barriers between the two teams, and that they will have to have a certain level of cross-training and cross-coverage. Obviously, not everyone can do every job equally well, but if the sysadmin is in bed with the flu, it is good for someone else to be able to restart the batch queues if they need to be. Likewise, there are times when a user's issues may be closer to the operating system and the hardware than to a given application, and having the admin capable of interacting with end-users in a professional, consulting role is sometimes the most effective solution.

Assuming that we are talking about production clusters of 50 CPUs each or more, with their own associated testbed systems, the architecture well defined, and datacenter staff support, I would think that the first admin ought to be able to handle 2 clusters without a total meltdown. If they are responsible for hardware issues like RAM upgrades and dead drive replacements, I would probably put a cap of about 150 systems under their direct care at first, and only add one new system class at a time.

If the group/unit/center is expected to run its own infrastructure (DNS/Email/File/Web/LDAP), I would say a second admin is really essential from the beginning. Likewise, if this is a totally new group providing centralized cluster services, I'd say a second admin is needed. They might not be needed once things are up and running, so it's a good candidate for a consultant position, but asking one admin to do the architectural work in a relative vacuum is a bad thing in my book.

Once a facility has more than 2 clusters in production, it's definitely time to look at adding another admin. I guess that Chris is right that it could be roughly logarithmic in scaling, though I think that strongly varies depending on whether admins do hardware and networking support, or are strictly systems support.
In the former case, I would say that the relevant metric is the number of CPUs managed (or possibly the number of hard drives, in deference to the no-moving-parts-on-compute-nodes contingent). In the latter, I would say that it's easier to say one admin for 2-3 well-defined clusters and 0.75 - 1.25 FTE of an admin for a new cluster with a new architecture (system or cluster) being brought on-line.

*whew* Having said all that, I could go into greater depth on the mix of user support skills involved, but I suspect that I've said enough for the minute. If anyone wants to go into the user support side, speak up with your views, and I'll find time to pontificate on it later today. 8-)

Andy

--
Andrew Fant       | And when the night is cloudy    | This space to let
Molecular Geek    | There is still a light          |----------------------
fant at pobox.com | That shines on me               | Disclaimer: I don't
Boston, MA        | Shine until tomorrow, Let it be | even speak for myself