On 2 Feb 2006, at 2:50 pm, Christopher Dwan wrote:

> Question on the state of the art in cluster management:
>
> Approximately what level of dedicated support do folks on this list
> have / wish you have for your clusters?

Sanger is an interesting case, which I am happy to describe.

The Systems team consists of about 24 admins. These are divided into
four teams:

1) Ops. Handles backups, account administration, Linux desktops (of
   which there are about 300), and front-line helpdesk. 6 admins.

2) PC/Mac. Handles all Windows and most Mac OS X machines (a few
   hundred of those), and shares front-line helpdesk with Ops.
   8 admins.

3) Special Projects. Handles network and SAN infrastructure, as well
   as core services like mail, DHCP and NIS. Second-line support.
   5 admins.

4) Informatics Systems Group (ISG). Handles the high-performance
   computing clusters, technology evals (along with SPT), and internal
   consultancy for scientific computing. 4 admins (including me).

Other bioclusters list regulars are past and present members of this
group, notably James Cuff, who started the whole ISG group in the
first place, and Guy Coates, who runs it now. The rest of what I'm
going to say applies your questions to ISG specifically.

> Obviously, there are a lot of free variables, including but not
> limited to:
>
> * Does the support person also do development, parallelization, or
>   otherwise *use* the cluster?

We certainly develop tools to help manage our cluster, and we also
help users develop their code to use the cluster effectively. But we
don't actually perform any bioinformatics research ourselves, as such.
Cuffy used to, but we don't have the time any more; things have got a
lot larger and more diverse since his day. :-)

> * Do their other responsibilities come from the IT side or the
>   research side (i.e. are we dedicating half of a unix admin, or
>   half of a postdoc?)

ISG team members have usually been both.
I'm a biologist by training (PhD in cell biology), but I have been
doing informatics more or less continuously since I finished my PhD.
Guy, similarly, is a PhD protein chemist who got into the IT side that
way. James is a PhD bioinformatician, I think; I'm sure he'll correct
me. Our other two current members are a PhD computer scientist and a
very experienced Linux admin. But as far as the Institute views us
politically, we're IT people, not scientists, and we report to the
head of IT.

> * How many users are being supported, and to what level?

Sanger has about 700 users all told, and ISG in particular also
supports a couple of dozen people from the EBI who use the same
compute farm. Most of these users, though, are in labs, and their IT
needs are handled by teams other than ISG. The current number of
compute farm users is (rummages through the project accounting) 147.

> Setting aside these and similar details that would make for valid,
> comparable numbers, my gut feel is that a reasonable guess is one
> half time IT person for the first fifty nodes or so. I think it
> scales logarithmically from there: So, go up to an entire full
> time person for 50-150, and add support staff incrementally as
> the cluster becomes more huge.

Generally speaking, I'd say that's about right. Once you get past
really large numbers (say 1000 nodes or so), I don't think you need
many more people. By that stage you've probably got automated
monitoring and fault reporting sorted out, so you just need enough
people to go into the machine room and physically replace parts as
and when required. Depending on how much else the admin has to do,
and how much you are willing to pay for really well-designed
hardware, you can actually reach quite high node/admin ratios.
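For what it's worth, the rule of thumb above (half an admin for the
first fifty nodes, a full admin by around 150, logarithmic growth
beyond that) can be sketched as a toy model. The constants below are
illustrative guesses chosen to match those two data points, not
anything we've actually measured:

```python
import math

def admin_fte(nodes):
    """Toy staffing model: 0.5 FTE covers the first 50 nodes, then
    FTE grows logarithmically, reaching ~1.0 at 150 nodes.  The
    base-3 log is an arbitrary choice that fits the 50 -> 150 step."""
    if nodes <= 0:
        return 0.0
    if nodes <= 50:
        return 0.5
    return 0.5 + 0.5 * math.log(nodes / 50) / math.log(3)

for n in (50, 150, 450, 1350):
    print(n, round(admin_fte(n), 2))
```

Each tripling of the node count adds only half an FTE in this model,
which is roughly the "add support staff incrementally as the cluster
becomes more huge" behaviour, and it flattens out nicely past 1000
nodes.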
When we bought our RLX blades in 2002, we had a pretty tough
procurement requirement: we wanted around 700 nodes, and they were to
require no more than half a full-time engineer to run (ISG only had
two members at the time, James and me, and we were fairly well
stretched). The RLX blades managed it, too. I used to spend a single
day, once a month, in the machine room replacing broken blades and
sending them back to RLX. We didn't need to spend $$$ on an expensive
support contract; warranty replacement was fine. When you've got
nearly 800 nodes, a failed node per day can be ignored for a couple of
weeks, because the loss in capacity is so small. The rest of the time
they required no attention at all, other than occasionally
distributing new BLAST indexes to them, perhaps once a week.

It's why I'm so sold on blades now. Managing them is *such* a doddle
compared with pizza boxes. You pay for it up front when you buy the
machines, but I think it's worth it.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233