[Bioclusters] Request for discussions-How to build a biocluster Part 3 (the OS)

Mike Coleman bioclusters@bioinformatics.org
18 May 2002 20:32:00 -0500


Donald Becker <becker@scyld.com> writes:
> > Beowulf software, though it may require special versions of libraries and
> > patched kernels, is, from the distribution perspective, just a set of
> > packages.  As far as I can see, the pieces could be packaged for the
> > different distributions with little difficulty.
> 
> Viewed that way, there is little difference between Linux distributions.
> They are just a set of packages with an installation program. They all
> use approximately the same kernels, libraries, compilers and utilities.

True, in a way.  But I think there are important differences.  Debian's public
bug tracking system is really, really nice, for example.

Text editors are much the same, too, but I still strongly prefer to use
software (e.g., mail readers) that lets me use the editor that best meets my
requirements, rather than the editor that the authors of the software have
decided would be best.

> But that discounts the value of a distribution.  Unless you have an
> integrated distribution, you can't provide a complete, tested solution.  LFS
> large file support is an example.  Two years ago we were the first to ship a
> distribution with tested LFS, which workstation-oriented distributions
> didn't see as a priority.  That wouldn't have been feasible with add-on
> tools for arbitrary distributions.

This is a good point, and a significant challenge for anyone who wants to
sell clustering software on Linux.  You could simply stipulate that the
distribution must correctly support LFS.  That's not an entirely satisfactory
solution, though.
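
Just to make concrete what "correctly support LFS" means from an
application's point of view, here's a minimal sketch (the filename and
offsets are arbitrary, and nothing distribution- or Scyld-specific is
assumed):

/* lfs_check.c -- a minimal sketch of what working LFS looks like to a
 * program.  Build with:
 *     gcc -D_FILE_OFFSET_BITS=64 -o lfs_check lfs_check.c
 * With that define, off_t is 64 bits and open()/lseek()/write() can address
 * offsets past 2GB -- but only if the C library, kernel and filesystem
 * underneath all cooperate, which is exactly the integration question.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    /* Should print 8 when LFS is in effect, 4 otherwise. */
    printf("sizeof(off_t) = %lu\n", (unsigned long)sizeof(off_t));

    int fd = open("bigfile.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Seek past the old 2^31 limit and write one byte there. */
    off_t three_gb = (off_t)3 * 1024 * 1024 * 1024;
    if (lseek(fd, three_gb, SEEK_SET) == (off_t)-1) {
        perror("lseek past 2GB");
        return 1;
    }
    if (write(fd, "x", 1) != 1) {
        perror("write at 3GB offset");
        return 1;
    }
    close(fd);
    return 0;
}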

> Not at all.  In a cluster, compute nodes exist to run jobs on behalf of
> the master systems.  Putting a full installation on a compute node
> increases the complexity, administrative burden, and opportunity for
> failure.
> 
> With the Scyld system, compute nodes are dramatically simplified.  They
> run a fully capable standard kernel with extensions, and start out with
> no file system (actually a RAM-based filesystem).
> 
> There are many advantages of this approach.
>   Adding new compute nodes is fast and automatic
>   The system is easily scalable to over a thousand nodes
>   Single-point updates for kernel, device drivers, libraries and applications
>   Jobs run faster on compute nodes than a full installation

For the sake of argument, I'm comparing this with an nfsroot setup (with a
common /etc for all slaves, /var on a ramdisk, /dev on devfs, everything else
mounted read-only straight off the master, except for application files
mounted r/w).  It looks like points 1 and 3 (fast node addition and
single-point updates) would work the same, and I don't see why point 4
(faster job execution) would be true.

It does seem like scalability is something you'd have to keep an eye on.  In
the comparison setup, you're read-only mounting a lot of files off of the
master.  I'm not sure how many hosts you can do this with before you start
running into trouble, but it does seem like it should scale somewhat (at least
with parameter tweaking, as you pointed out).

> Presenting a simple model to the user is a very important thing.  Using
> a NFS root makes it simple for the person installing the system, but
> that is a hack, not an architected system.  Doing system administration
> will require detailed knowledge of what types of files to put on which
> file systems, NFS has significant performance and scaling bottlenecks,
> and the users will have to deal with NFS consistency and caching quirks.

I agree that keeping things simple is a very important thing.  One of the
things I like about the nfsroot setup is that the story for admins and users
is pretty simple.  It's basically a network of fairly vanilla Linux boxes,
except that the OS filesystems of the slaves are read-only or in RAM and all
admin tasks need to be done on the master.  That set of workstations can then
be used in the obvious way or with MPI, etc.

To me, BProc seems considerably more complex.  You pretty much have to
understand BProc.  You can pretend that you're running everything on a single
system, but you stand to get bitten a lot if you really believe it.  ("One of
my processes wrote a file to /tmp; why can't my other processes see it?"
"Why doesn't shared memory work?")
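
To make that concrete, here's a toy sketch that uses plain fork(2) as a
stand-in for a process that, under BProc, would actually land on a compute
node.  On a single machine it works; if the worker were on another node with
its own RAM-based /tmp, the parent's read would fail:

/* tmp_surprise.c -- why "I wrote a file to /tmp; why can't my other
 * processes see it?" happens.  Plain fork(2) is used here in place of a
 * process that a cluster system would run on a remote node.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* "Worker": write a scratch file to /tmp. */
        FILE *f = fopen("/tmp/scratch.out", "w");
        if (!f) { perror("worker fopen"); _exit(1); }
        fprintf(f, "result: 42\n");
        fclose(f);
        _exit(0);
    }

    waitpid(pid, NULL, 0);

    /* "Master": read the worker's output back.  This only works if both
     * processes really do share the same /tmp. */
    FILE *f = fopen("/tmp/scratch.out", "r");
    if (!f) {
        perror("master fopen");
        return 1;
    }
    char line[64];
    if (fgets(line, sizeof line, f))
        printf("read back: %s", line);
    fclose(f);
    return 0;
}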

> Our system is much more than BProc, although many of the tools are built
> using BProc as a base.  The system has dozens of integrated pieces
> such as a cluster name service, status monitoring tools, integrated MPI,
> and a unified administration system.

What can I say?  That sounds really complicated...  :-)

> The user sees BProc as the unified process space over the cluster.  They
> can see and control all processes of their job using Unix tools they
> already know, such as 'top', 'ps', 'suspend' and 'kill'.

Yes, and this is really nice.  But BProc also seems to be stretching POSIX
pretty hard.  In normal Linux, when I kill(2) a process, failure (as in
communications failure) is not a possibility.  It seems like a lot of subtle
failure modes could be hiding here.
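
To put it another way: the complete failure vocabulary of a local kill(2) is
a handful of errno values, none of which can mean "the node that process
lives on is unreachable".  A sketch of what careful single-system code checks
today:

/* kill_errors.c -- the complete failure vocabulary of a local kill(2).
 * There is no errno that could mean "the machine holding that process is
 * down or unreachable", which is the new failure mode a cluster-wide
 * process space has to express somehow.
 */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 2;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    if (kill(pid, SIGTERM) == 0) {
        printf("SIGTERM sent to %d\n", (int)pid);
        return 0;
    }

    switch (errno) {
    case ESRCH:   /* no such process */
        perror("kill: no such process");
        break;
    case EPERM:   /* not allowed to signal it */
        perror("kill: permission denied");
        break;
    case EINVAL:  /* bad signal number (can't happen with SIGTERM) */
        perror("kill: invalid signal");
        break;
    default:
        perror("kill");
    }
    return 1;
}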

> You should look at the Scyld system

I will.  Thanks for the info!

Mike