Hi Chris,

I heard your talk in San Diego this year, where I signed up for the
bioclusters list. Glad I did now!

>These people understand that their cluster is a discovery research
>system and (occasional) downtime is not going to be unusual. Most people
>I know consider downtimes of less than 4 hours or so (in hardware
>failure cases) to be par for the course.
>

That's what I figured. Unfortunately, we just had a visit from a Sun rep
who considered our setup too risky with the single master node / cluster
approach (I think they were a bit upset that we went with a Linux cluster
bought from another source), and he spread a little FUD around, which
made some impact on my boss. Not that we have ever had (or needed) even
two nines from our Sunfires, but...

>I have 2 suggestions for you to consider that won't come close to
>getting you to 100% uptime but they will get you closer and you won't
>have to spend tons of $$$ on special filesystems or cross-connected
>storage, IP switches etc.
>
>(1) Purchase a cold/warm spare head node; configure it to be ready to go
>or perhaps keep a set of clone disks on hand that you can throw in.
>

This is the current plan, and I think part of the appeal is that it will
allow me to experiment a bit with different configuration options on the
"spare node". We can also keep the spare node while we are at it...

Is there any reason this spare head node needs to have the same
CPU/kernel/disk arrangement as the "old" master node? I was thinking of
building the new head node the way I want it (for one, I don't like the
partitioning scheme on the current master node), and then having the old
node mirror the critical and changing parts of the active master node.
Another factor is that the current master node runs on AMDs that are hot
and now somewhat slow (though they were the right choice at the time),
and I'd rather jump to Intel for the new master node.

>(2) I like this solution best -- why don't you configure multiple head
>nodes?
>It is trivial to add N more multi-homed servers to your cluster
>and the DRM software layers like Grid Engine and Platform LSF can be
>

I like this idea too, but truthfully I'm just not there yet (I'm
primarily a developer, not a sysadmin), and our only experience has been
with PBS, which has worked well enough so far. I think I should probably
install SGE on the spare for kicks and see how it goes. From there I can
worry more about HA, shadow masters, and other higher-end stuff.

Thanks for the advice,

-John