[Bioclusters] resources on administering clusters

Jeff Layton bioclusters@bioinformatics.org
Tue, 26 Mar 2002 06:14:18 -0500

Joe Landman wrote:

> On Mon, 2002-03-25 at 13:23, Jeff Layton wrote:
> > > Remote power control is nice because I can remotely kill or reboot nodes
> > > that are misbehaving and I can also turn on and turn off the entire
> > > cluster in a staged manner (so you don't blow your power circuits!)
> > >
> > > With these 2 tools in hand, this is what my admin philosophy becomes:
> > >
> > > (1) If a node is behaving, don't touch it
> > > (2) If a node acts strangely use systemImager to automatically wipe the
> > > disk and reinstall the OS from scratch (remotely)
> >
> > I usually try to debug a node first before re-imaging it. I also plug into
> > a node that is locked up to see if I can find out anything (Linux doesn't
> > behave well under heavy memory pressure - "swapping itself to death").
> The swapping-death comes from a number of places: VM issues in
> pre-2.4.16 kernels and poor swap layout.  Generally speaking, swapping is
> not a good thing to do.  But sometimes good apps swap, so you should
> make sure they can do it reasonably well.

Most of our kernel panics from swapping came with 2.2 kernels. We
have seen some kernel panics with early 2.4 kernels on some test
boxes, but with 2.4.16 upwards, these panics have been happening
less frequently.

We don't like to swap if we don't have to :) We have a pretty good rule
of thumb relating memory usage to the size of the problem. However,
occasionally we have to run pretty close to the limit it gives us, and we
end up swapping.

> First off, spread the swap to as many spindles as you can.  Under Linux,
> you can "stripe" swap across multiple partitions.  If you have 4 disks,
> then look at the possibility of using 4 equal-sized partitions (one per
> disk) for swap.  This needs to be done at system build time.  Never ever
> put all your swap on a single partition.  This is "A Bad Thing(TM)" and
> leads to swap-death.

Can you put swap on RAID-0? I've never tried that before.

The other option is to take n disks, pull out a few Gigs from
each disk for /boot, /, /opt, /usr/local/, and swap, and use the
remaining portions of the disks for RAID-0. This way you could
put n swap partitions on the disks.
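
The one-swap-partition-per-disk layout discussed above can be sketched
as a small Python helper that writes the /etc/fstab entries. Linux
allocates swap pages round-robin across swap areas that share the same
priority, so giving every partition an identical pri= value is what
produces the "striping" without any RAID-0 layer. The device names and
the priority value below are hypothetical placeholders, not taken from
any real node.

```python
def striped_swap_fstab(partitions, priority=1):
    """Return /etc/fstab lines that give every swap partition the same priority.

    Equal priorities make the kernel spread swap pages round-robin
    across the listed partitions.
    """
    if not 0 <= priority <= 32767:
        raise ValueError("swap priority must be between 0 and 32767")
    if not partitions:
        raise ValueError("at least one swap partition is required")
    return [
        f"{dev}\tnone\tswap\tsw,pri={priority}\t0\t0"
        for dev in partitions
    ]


# Hypothetical example: one swap partition on each of four IDE disks.
example_devices = ["/dev/hda2", "/dev/hdb2", "/dev/hdc2", "/dev/hdd2"]
for line in striped_swap_fstab(example_devices, priority=1):
    print(line)
```

The generated lines would still have to be merged into the node image
by hand or by whatever imaging tool is in use.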

> Second off, arrange the swap to the outermost cylinders of the disk.
> From the various benchmarks on places like Tom's Hardware and others, it
> seems that you will get the highest I/O rates at the lower number
> cylinders.  Even with small 18 GB disks, lopping off 0.5 GB per disk is
> not terribly difficult.

This is the part I usually forget. If I'm partitioning a new HD
with /dev/hda1, /dev/hda2, etc., do I put the swap partition on a
low-numbered or a high-numbered partition? I always get this backward.

> Third off, buy enough RAM.  RAM is cheap.  Far too many groups make the
> often painful decision that aggregate memory is important, and per CPU
> memory is not.  This is not true for a memory hungry application (like
> BLAST with large databases).  The time spent in swapping on a memory
> starved system can often increase the runtime an order of magnitude or
> more.  If you convert that into opportunity cost of being unable to use
> the resource for other jobs while it is grinding away at yours, well,
> you get the idea that the RAM pays for itself over and over again.

Amen. Now if I can convince management that swapping is a bad thing and
not a cheap good thing.
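
The opportunity-cost argument above is simple arithmetic, sketched here
in Python. The 2-hour job length and the 10x slowdown are illustrative
assumptions only (the 10x stands in for the "order of magnitude" figure
quoted above); they are not measurements from any cluster.

```python
def hours_lost_to_swapping(in_ram_hours, slowdown):
    """Extra node-hours consumed when a job swaps instead of fitting in RAM."""
    if in_ram_hours < 0:
        raise ValueError("in_ram_hours must be non-negative")
    if slowdown < 1:
        raise ValueError("slowdown must be at least 1 (1 means no slowdown)")
    return in_ram_hours * (slowdown - 1)


def jobs_displaced(in_ram_hours, slowdown):
    """How many same-sized in-RAM jobs could have run in the lost time."""
    if in_ram_hours <= 0:
        raise ValueError("in_ram_hours must be positive")
    return hours_lost_to_swapping(in_ram_hours, slowdown) / in_ram_hours


# Hypothetical numbers: a job that needs 2 hours when it fits in RAM,
# slowed down 10x by swapping.
lost = hours_lost_to_swapping(2.0, 10)
displaced = jobs_displaced(2.0, 10)
print(f"node-hours lost: {lost}, in-RAM jobs displaced: {displaced}")
```

Multiplying the lost node-hours by whatever a node-hour is worth locally
gives a figure to set against the price of the extra RAM.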

<rant> It's amazing that so-called IT experts really have no clue about
computers at all. They claim to have experience with computers (running
Windows perhaps), but they are really clueless in many instances about
what the issues are and how to solve them. In fact, I had a Unix Sys Admin
manager, who was over all Unix servers and workstations, who did not have
a college degree, never took a college course, had never ever used Unix,
barely knew how to spell Unix, and had never been any kind of manager
before - not even a team lead.

> Fourth off, if you have the choice of buying a single big disk, or more
> smaller disks, think of the calculus this way.  1 I/O pipe per disk, and
> I can stripe my file systems.  So more smaller disks means more I/O
> (local) bandwidth.  This is "A Good Thing(TM)".  Yes, you may argue that
> it increases your risk and reduces the node's MTBF.  A good node
> regeneration and some spares cure that issue rather quickly.

Agreed. Plus, running more disks allows you to use RAID-0 for speed
and even RAID-10 (or RAID-01) to get some redundancy.
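
The bandwidth-versus-MTBF trade-off in this exchange can be put into
rough numbers with a short Python sketch. It rests on idealized
assumptions that are not from the thread: striped bandwidth scales
linearly with the number of disks, disks fail independently at a
constant rate, and (as with plain RAID-0) the loss of any one disk
takes the node down. The per-disk figures in the example are
hypothetical.

```python
def stripe_tradeoff(n_disks, disk_mb_per_s, disk_mtbf_hours):
    """Return (aggregate MB/s, node-level disk MTBF in hours) for n striped disks.

    Idealized model: bandwidth adds up across disks, and because any
    single failure is fatal the failure rates add, so MTBF divides by n.
    """
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    aggregate_bandwidth = n_disks * disk_mb_per_s
    node_mtbf = disk_mtbf_hours / n_disks
    return aggregate_bandwidth, node_mtbf


# Hypothetical per-disk figures: 30 MB/s sustained, 500,000 hour MTBF.
for n in (1, 2, 4):
    bandwidth, mtbf = stripe_tradeoff(n, 30.0, 500000.0)
    print(f"{n} disk(s): {bandwidth} MB/s, MTBF {mtbf} hours")
```

Under this model the bandwidth gain and the MTBF loss are both linear in
the disk count, which is why spares and fast node regeneration are the
natural counterweight.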

> There are other points that could be made, but swapping need not be
> deadly.  If it is, you have a local disk I/O issue that desperately
> needs to be solved.  Local I/O is very important to certain
> applications.  You really do not want to be hitting a set of files hard
> over an NFS mount.  That does not scale.

Bingo. We are looking at putting PVFS into production. We have tested it
on and off over two years and it looks promising in our tests. We just have
to modify our app to use MPI-IO/ROMIO/PVFS and we're off to the

Thanks for the advice! It's much appreciated.


> --
> Joseph Landman, Ph.D.
> Senior Scientist,
> MSC Software High Performance Computing
> email           : joe.landman@mscsoftware.com
> messaging       : page_joe@mschpc.dtw.macsch.com
> Main office     : +1 248 208 3312
> Cell phone      : +1 734 612 4615
> Fax             : +1 714 784 3774