[Bioclusters] High Availability Clustering
Chris Dagdigian
bioclusters@bioinformatics.org
Tue, 22 Jul 2003 17:29:33 -0400
Hi John,
I've seen, toured or been involved with a bunch of clustering projects
over the years and I've never seen anyone really shoot for 5 nines
uptime or whatever for their clusters. In most cases these are research
systems and the owners have decided to forgo expensive HA clustering in
favor of either (a) saving money or (b) plowing more money into storage,
network or raw CPU power. This is true in my experience across academic,
biotech and big pharma settings.
These people understand that their cluster is a discovery research
system and (occasional) downtime is not going to be unusual. Most people
I know consider downtimes of less than 4 hours or so (in hardware
failure cases) to be par for the course.
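To put rough numbers on the gap between "5 nines" and a 4 hour outage, here is a small arithmetic sketch. The function names are made up for illustration and the figures are examples, not measurements from any real cluster. Five nines leaves a budget of a bit over 5 minutes of downtime per year, while a single 4 hour outage in a year still works out to roughly 99.95% availability.

```python
# Rough availability arithmetic; the figures are illustrative, not measurements.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def availability_from_outages(outage_hours):
    """Fraction of the year the system was up, given outage durations in hours."""
    return 1.0 - sum(outage_hours) / HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"five nines budget: {downtime_budget_minutes(0.99999):.2f} min/year")
    print(f"one 4-hour outage: {availability_from_outages([4]):.5%} available")
```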
You will also find that hardware is not the most common failure case.
Many times the cluster goes down because a user crashed the cluster or
the DRM -- not a hardware cause at all.
Joe hit it on the head -- there is a whole body of best practices in
the application server and RDBMS space that you can
probably draw upon to learn what people are doing for HA. It
typically involves shared storage, IP failover and some sort of
heartbeat mechanism between machines. I've heard good things about the
Linux HA project but have never actually used it.
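The heartbeat idea can be sketched in a few lines. This is a generic timeout-based failure detector, not the Linux HA project's implementation; the class name, the peer names and the 3 second timeout are all assumptions made for the example. Each machine records when it last heard from its peers and treats a peer as dead once it has been silent for longer than the timeout.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer and flags silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock       # injectable so the logic can be tested
        self.last_seen = {}      # peer name -> timestamp of last heartbeat

    def beat(self, peer):
        """Record that a heartbeat just arrived from `peer`."""
        self.last_seen[peer] = self.clock()

    def is_alive(self, peer):
        """A peer is alive if it has been heard from within the timeout."""
        seen = self.last_seen.get(peer)
        return seen is not None and (self.clock() - seen) <= self.timeout_s

    def dead_peers(self):
        """Peers heard from at least once that have since gone silent."""
        return sorted(p for p in self.last_seen if not self.is_alive(p))


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.beat("head1")  # hypothetical peer name
    print(monitor.is_alive("head1"), monitor.dead_peers())
```

In a real setup the standby would react to a dead peer by taking over the shared storage and the service IP; that part is deliberately left out here.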
Please keep the list informed; I'd be interested in seeing how this
project goes.
I have 2 suggestions for you to consider. Neither will get you to
100% uptime, but they will get you closer and you won't
have to spend tons of $$$ on special filesystems, cross-connected
storage, IP switches etc.
(1) Purchase a cold/warm spare head node; configure it to be ready to go
or perhaps keep a set of clone disks on hand that you can throw in. If
your cluster storage is separate (i.e. NAS or external fileserver) then
you can bring up a new head node in a few minutes and reimage it to
bring it up to date in another couple of minutes. You may find your
management and users are willing to put up with an hour or two of
downtime in case of head node failure. This will save you time, $$ and
complexity at the cost of some absolute downtime if the head node goes down.
(2) I like this solution best -- why don't you configure multiple head
nodes? It is trivial to add N more multi-homed servers to your cluster.
DRM software layers like Grid Engine and Platform LSF can be
configured to fail over the scheduling and resource allocation daemons;
all they need between themselves is a common NFS filesystem. Grid
Engine has "shadow masters" that will activate upon failure of the
master node and Platform LSF has a mechanism whereby the cluster will
select, elect and promote a new machine to be the cluster master.
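As a rough illustration of the promotion step, here is a generic sketch that picks the first live machine from a priority-ordered candidate list. It is not how Grid Engine's shadow masters or Platform LSF's election actually work internally; the function name and host names are hypothetical, and the liveness check is assumed to come from something like a heartbeat.

```python
def elect_master(candidates, is_alive):
    """Return the first live host in priority order, or None if all are down.

    candidates: host names, highest priority first (hypothetical names).
    is_alive:   callable(host) -> bool, e.g. backed by a heartbeat check.
    """
    for host in candidates:
        if is_alive(host):
            return host
    return None


if __name__ == "__main__":
    # Pretend the primary head node has just failed.
    status = {"head1": False, "head2": True, "head3": True}
    print(elect_master(["head1", "head2", "head3"], status.get))
```

Using a fixed priority order keeps the outcome predictable: every surviving machine that runs the same check against the same list arrives at the same new master.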
If you have multiple head nodes that are each capable of acting as
the cluster scheduler and gateway then you can just put one of those
simple load balancer boxes that companies sell into the web farm space
in front of them -- these boxes do round-robin DNS or load-based
balancing of IP traffic between boxes on the same subnet.
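For a feel of what the round-robin part amounts to, here is a minimal sketch that rotates through head nodes and skips any that are marked down. It models no particular vendor's box; the class and host names are invented for the example, and a real balancer would do its own health checks rather than being told.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in rotation, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        """Stop sending traffic to `backend`."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Resume sending traffic to a known backend."""
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        """Return the next healthy backend, or None if none are healthy."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        return None


if __name__ == "__main__":
    lb = RoundRobinBalancer(["head1", "head2", "head3"])  # hypothetical hosts
    lb.mark_down("head2")
    print([lb.next_backend() for _ in range(4)])
```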
-Chris
Osborne, John wrote:
> Hello,
>
> I'm the unofficial admin for a 20 node (40 CPU) linux cluster here at the
> CDC and I'm looking for some advice. Our setup here relies upon a *single*
> master node which acts as a gateway to the internal cluster network. If
> something were to happen to the master node, we'd be in serious trouble if
> we are aiming for 100% uptime. So far we aren't that serious about 100%
> uptime (although we've had it for this master node thus far) but as the
> popularity of the cluster grows it is becoming more important. I am
> wondering what is the best way to ensure failover for a master node in a
> cluster. Right now I just write out a master node image to network storage
> every night and if something goes wrong, the cluster is effectively down and
> it could take hours to get it fixed.
>
> Is it possible to have 2 master nodes with a single virtual IP address? How
> are other people solving this problem?
>
> -John
--
Chris Dagdigian, <dag@sonsorol.org>
BioTeam Inc. - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net