On 2 Feb 2006, at 2:50 pm, Christopher Dwan wrote:

> Question on the state of the art in cluster management:
>
> Approximately what level of dedicated support do folks on this list
> have / wish you have for your clusters?

Sanger is an interesting case, which I am happy to describe.

The Systems team consists of about 24 admins. These are divided into
four teams:

1) Ops. Handles backups, account administration, Linux desktops (of
   which there are about 300), and front-line helpdesk. 6 admins.

2) PC/Mac. Handles all Windows and most Mac OS X machines (a few
   hundred of those), and shares front-line helpdesk with Ops.
   8 admins.

3) Special Projects. Handles network and SAN infrastructure, as well
   as core services like mail, DHCP and NIS. Second-line support.
   5 admins.

4) Informatics Systems Group (ISG). Handles the high-performance
   computing clusters, technology evals (along with SPT), and internal
   consultancy for scientific computing. 4 admins (including me).

Other bioclusters list regulars are past and present members of this
group, notably James Cuff, who started the whole ISG group in the
first place, and Guy Coates, who runs it now. The rest of what I'm
going to say applies your questions to ISG specifically.

> Obviously, there are a lot of free variables, including but not
> limited to:
>
> * Does the support person also do development, parallelization, or
>   otherwise *use* the cluster?

We certainly develop tools to help manage our cluster, and we also
help users develop their code to use the cluster effectively. But we
don't actually perform any bioinformatics research ourselves, as such.
Cuffy used to, but we don't have the time any more; things have got a
lot larger and more diverse since his day. :-)

> * Do their other responsibilities come from the IT side or the
>   research side (i.e. are we dedicating half of a unix admin, or
>   half of a postdoc?)

ISG team members have usually been both.
I'm a biologist by training (PhD in cell biology), but I have been
doing informatics more or less continuously since I finished my PhD.
Guy, similarly, is a PhD protein chemist who got into the IT side that
way. James is a PhD bioinformatician, I think; I'm sure he'll correct
me. Our other two current members are a PhD computer scientist and a
very experienced Linux admin. But as far as the Institute views us
politically, we're IT people, not scientists, and we report to the
head of IT.

> * How many users are being supported, and to what level?

Sanger has about 700 users all told, and ISG in particular also
supports a couple of dozen people from the EBI who use the same
compute farm. Most of these users, though, are in labs, and their IT
needs are handled by teams other than ISG. The current number of
compute farm users is (rummages through the project accounting) 147.

> Setting aside these and similar details that would make for valid,
> comparable numbers, my gut feel is that a reasonable guess is one
> half time IT person for the first fifty nodes or so. I think it
> scales logarithmically from there: So, go up to an entire full
> time person for 50-150, and add support staff incrementally as
> the cluster becomes more huge.

Generally speaking, I'd say that's about right. Once you get past
really large numbers (say 1000 nodes or so), I don't think you need
many more people. By that stage you've probably got automated
monitoring and fault reporting sorted out, so you just need enough
people to go into the machine room and physically replace parts as
and when required. Depending on how much else the admin has to do,
and how much you are willing to pay for really well-designed
hardware, you can actually reach quite high node/admin ratios.
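For what it's worth, the rule of thumb above (half an admin for the
first fifty nodes, a full admin by around 150, logarithmic growth
beyond that) can be sketched as a toy model. The constants below are
illustrative guesses chosen to match those two data points, not
anything we've actually measured:

```python
import math

def admin_fte(nodes):
    """Toy staffing model: 0.5 FTE covers the first 50 nodes, then
    FTE grows logarithmically, reaching ~1.0 at 150 nodes.  The
    base-3 log is an arbitrary choice that fits the 50 -> 150 step."""
    if nodes <= 0:
        return 0.0
    if nodes <= 50:
        return 0.5
    return 0.5 + 0.5 * math.log(nodes / 50) / math.log(3)

for n in (50, 150, 450, 1350):
    print(n, round(admin_fte(n), 2))
```

Each tripling of the node count adds only half an FTE in this model,
which is roughly the "add support staff incrementally as the cluster
becomes more huge" behaviour, and it flattens out nicely past 1000
nodes.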
When we bought our RLX blades in 2002, we had a pretty tough
procurement requirement: we wanted around 700 nodes, and they were to
require no more than half a full-time engineer to run (ISG only had
two members at the time, James and me, and we were fairly well
stretched). The RLX blades managed it, too. I used to spend a single
day, once a month, in the machine room replacing broken blades and
sending them back to RLX. We didn't need to spend $$$ on an expensive
support contract; warranty replacement was fine. When you've got
nearly 800 nodes, a failed node per day can be ignored for a couple of
weeks, because the loss in capacity is so small. The rest of the time
they required no attention at all, other than occasionally
distributing new BLAST indexes to them, perhaps once a week.

It's why I'm so sold on blades now. Managing them is *such* a doddle
compared with pizza boxes. You pay for it up front when you buy the
machines, but I think it's worth it.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233