[Bioclusters] High Availability Clustering

Wed, 23 Jul 2003 11:16:03 -0400

Hi Chris,

I heard your talk in San Diego this year, where I signed up for the
bioclusters list.  Glad I did now!

>These people understand that their cluster is a discovery research 
>system and (occasional) downtime is not going to be unusual. Most people 
>I know consider downtimes of less than 4 hours or so (in hardware 
>failure cases) to be par for the course.
>
That's what I figured.  Unfortunately, we just had a visit from a Sun rep
who considered our setup too risky because of the single master node per
cluster approach (I think they were a bit upset that we went with a Linux
cluster bought from another source), and he spread a little FUD around, which
made some impact on my boss.  Not that we have ever had (or needed) even two
9's from our Sunfires, but...

>I have 2 suggestions for you to consider that won't come close to 
>getting you to 100% uptime but they will get you closer and you won't 
>have to spend tons of $$$ on special filesystems or cross-connected 
>storage, IP switches etc.
>
>(1) Purchase a cold/warm spare head node; configure it to be ready to go 
>or perhaps keep a set of clone disks on hand that you can throw in.
>
This is the current plan, and I think part of the appeal is that it will
allow me to experiment a bit with different configuration options on the
"spare node".  We can also keep the spare node while we are at it...  Is
there any reason the spare head node needs to have the same CPU/kernel/disk
arrangement as the "old" master node?  I was thinking of building the new
head node the way I want it (for one, I don't like the partitioning scheme on
the current master node), and then having the old node mirror the critical
and changing parts of the active master node.  Another factor is that the
current master node is running hot (and now somewhat slow, though they were
the right choice at the time) AMDs, and I'd rather jump to Intel for the new
master node.
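
A minimal sketch of that mirroring idea, assuming rsync over ssh is
available on both nodes and the script runs on the active master.  The host
name "old-head" and the directory list are hypothetical placeholders, not
our actual layout, and the script only prints the commands (with --dry-run)
rather than running them:

```python
import shlex

# Hypothetical values for illustration only; adjust to the real cluster.
OLD_HEAD = "old-head"
MIRROR_DIRS = ["/etc", "/home", "/var/spool/pbs"]


def rsync_command(src_dir, host, dry_run=True):
    """Build an rsync command that mirrors src_dir to the same path on host."""
    # A trailing slash makes rsync copy the directory contents, not the
    # directory itself, so source and destination paths line up.
    src = src_dir.rstrip("/") + "/"
    # -a preserves permissions, owners and timestamps; --delete removes
    # files on the mirror that no longer exist on the master.
    cmd = ["rsync", "-a", "--delete", "-e", "ssh"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, f"{host}:{src}"]
    return cmd


if __name__ == "__main__":
    for directory in MIRROR_DIRS:
        print(shlex.join(rsync_command(directory, OLD_HEAD)))
```

Something like this could run from cron on the active master once the dry
runs look sane.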

>(2) I like this solution best -- why don't you configure multiple head 
>nodes? It is trivial to add N more multi-homed servers to your cluster 
>and the DRM software layers like Grid Engine and Platform LSF can be 
>
I like this idea too, but truthfully I'm just not there yet (I'm primarily a
developer, not a sysadmin) and our only experience has been with PBS (which
has worked well enough so far).  I think I should probably install SGE on
the spare for kicks, and see how it goes.  From there I can worry more about
HA, shadow masters and other higher-end stuff.

Thanks for the advice,

 -John