[Bioclusters] Request for discussions-How to build a biocluster Part 3 (the OS)

20 May 2002 23:26:24 -0500

Donald Becker <becker@scyld.com> writes:
> Setting up a NFS root isn't quite as simple as that.  There are many
> exceptions: most of /etc/ can be shared, but a few files (e.g. /etc/fstab)
> must exist per-node.  There are enough exceptions in different directories
> that most NFS root systems use an explicitly created file tree on the file
> server for each node in the cluster.

It's a thought experiment for me, to understand what's really involved.  I do
realize that careful consideration has to go into the decisions regarding how
each file/directory is replicated.  My hope is that, building on the work many
previous people have done (making /usr shareable, for example), it might
be possible to create a fairly concise and simple (!) design.

(I had kind of hoped that I might even be able to just use the master's / and
/etc directly, but it does seem that way too much hackery would be required to
make that work.)

I must be slow tonight, but I can't see why /etc/fstab cannot be shared among
all nodes, given my assumption that all slaves are hardware-identical.  What
am I missing?

> >  It looks like points 1 and 3 would work the same and I don't
> > see why point #4 would be true.

[If I'm reading your reply correctly, you are, below, only addressing the
question of whether or not the original point #4 ("Jobs run faster on compute
nodes than a full installation") is true.  I would still be interested in your
thoughts on original points #1 ("Adding new compute nodes is fast and
automatic") and #3 ("Single-point updates for kernel, device drivers,
libraries and applications").]

> The faster execution is the effect of
>     - cluster directory (name) services
>     - wired-down core libraries
>     - no housekeeping daemons
>     - integrated MPI and PVM
>     - fast BProc job initiation
> 
> Examples of point #1 are
>    Using cluster node names such as ".23".  Destination addresses can
>    be calculated by knowing the IP address of ".0".
>    Sending user names with processes to avoid /etc/passwd look-ups or
>    network traffic
> 
> The last two points relate only to job start-up time.
> Cluster directory services and the clean BProc semantics make MPI and
> PVM job start-up much simpler.  BProc reduces the node process creation
> time to about 1/10 and 1/20 the time of 'rsh' and 'ssh'.
> 
> Job start-up time isn't important for long-running jobs, but it is
> important if you want to use a cluster as an interactive machine rather
> than a batch system.

For the compute-bound problem domain I'm thinking of, startup time (within
reason) isn't an issue.  I can imagine that if your job stream looked more
like Google's, you probably wouldn't want to use ssh.

I'm not really following the cluster node name idea.  Is this really faster
than just looking up names in /etc/hosts?  And likewise with /etc/passwd.  It
doesn't seem like this would take any noticeable fraction of total job time.
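To make the question concrete, here is a minimal sketch of how I read the
".23" naming scheme.  It assumes that node addresses are contiguous and start
at the address of ".0"; the function name and the 10.0.0.100 base address are
made up for illustration and are not taken from any Scyld tool.

```python
import ipaddress


def node_address(base_ip: str, node_name: str) -> str:
    """Return the IPv4 address of node '.N', given the address of node '.0'.

    Assumption: nodes are numbered contiguously from the base address.
    """
    if not node_name.startswith("."):
        raise ValueError("expected a node name like '.23'")
    index = int(node_name[1:])
    if index < 0:
        raise ValueError("node index must be non-negative")
    # Pure arithmetic: no /etc/hosts scan and no network name lookup.
    return str(ipaddress.IPv4Address(base_ip) + index)


if __name__ == "__main__":
    # Hypothetical cluster whose node '.0' lives at 10.0.0.100.
    print(node_address("10.0.0.100", ".23"))
```

The lookup reduces to one addition, so whatever it saves over /etc/hosts
would be saved per lookup, not per job.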

Eliminating unneeded daemons seems smart.  It seems like one could do as well
with an nfsroot system on that point.

I don't really understand what wired-down libraries or integrated MPI means.
How does this save execution time?

> > It does seem like scalability is something you'd have to keep an eye on.  In
> > the comparison setup, you're read-only mounting a lot of files off of the
> > master.  I'm not sure how many hosts you can do this with before you start
> > running into trouble, but it does seem like it should scale somewhat
> > (at least with parameter tweaking, as you pointed out).
> 
> Clusters are a much different situation than workstations -- all nodes
> want the same services at the same time.
> Using NFS root with default parameters scales very poorly.  You can do
> much better by using multiple NFS mounts with different attributes and
> caching parameters.  But this requires expertise, and on-going
> monitoring.  Just managing NFS quickly becomes complex, both for users
> and administrators.

You probably know a lot more about this than I do, but I certainly concede
that this could take some expertise to design correctly.  With luck, it might
make a fairly simple HOWTO, once designed, however.  (Or not.)

I'm ready to concede that the design I have in mind will require knowledgeable
setup and administration.  I guess I see that as coming with the territory at
this point.  If someone told me they were going to sell me a Beowulf that
didn't require a knowledgeable person to admin, I wouldn't find that claim
credible.  Maybe, if you're only running one well-defined and well-behaved
application, you can create a Beowulf toaster.  Maybe.

> > To me, BProc seems considerably more complex.  You pretty much have to
> > understand BProc.  (*)
> 
> You only need to understand BProc if you are writing system tools.  End
> users, administrators and MPI developers don't need to learn new
> concepts.

The lines you edited out at (*) were

   >>> You can pretend that you're running everything on a single system, but
   >>> you stand to get bit a lot if you really believe so.  ("One of my
   >>> processes wrote a file to /tmp; why can't my other processes see it?"
   >>> "Why doesn't shared memory work?")

To me, this seems like a counterexample to what you're saying.  If a user
writes a shell script that writes a file to /tmp, and another that expects to
read that file from /tmp, with the expectation that it'll just work because
it's a single system, they're going to get bit, right?  The two files may be
on two different hosts.  Or am I missing something?

As I said elsewhere, I don't see this as a showstopper.  But I don't think
it's correct to say that users or application writers don't need to know
anything about BProc.

> > > The user sees BProc as the unified process space over the cluster.  They
> > > can see and control all processes of their job using Unix tools they
> > > already know, such as 'top', 'ps', 'suspend' and 'kill'.
> >
> > Yes, and this is really nice.  But BProc also seems to be stretching POSIX
> > pretty hard.
> 
> In what way is it stretching POSIX?  Are there any semantics that are
> questionable?
> 
> > In normal Linux, when I kill(2) a process, failure (as in
> > communications failure) is not a possibility.  It seems like a lot of subtle
> > failure modes could be hiding here.
> 
> [[ Please avoid innuendo such as that last statement.  You are implying
> a problem without being specific what it might be or standing behind the
> claim. ]]

Don, I've reread that and I don't believe it's innuendo.  I'm saying that to
me it looks like there could be a problem there, and I've given a concrete
example of the kind of problem I have in mind (i.e., delivery failure of a
signal due to network failure).

I could certainly be wrong, but I think it's the kind of question that would
occur to most engineers (I know you're sharp and I'm confident that you've
considered it yourself).  I'm not trying to impugn BProc--I'm trying to get a
good feeling for what its limitations are.

Here's the example I had in mind: Process A (not on slave 1) does a kill,
sending a signal to process B on slave 1.  At just that moment, the network
BProc uses fails in some way.  Let's say that process B is writing to some
sort of network file (on NFS or whatever) on another network unaffected by the
above network failure.  Process B has not received the signal and continues
merrily along.  What happens?  Will process B be writing to that file while
process A has every right to assume that it is not (because it's dead)?

From what you've said, it looks like the master counts process B as dead once
the network fails.  This can cause trouble, though--remember that process B is
still running and writing to that file.  This is the kind of thing I'm
thinking of when I talk about stretching POSIX (I should have said "stretching
expected Unix behavior").

Thinking a little more on this, I guess maybe the monitor on the slave can
also notice the network failure (assuming some heartbeat traffic) and kill
process B in that case.  So maybe it just works.
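A minimal sketch of the slave-side monitor I'm imagining follows.  This is my
own guess at a policy, not a description of how BProc actually behaves; the
class name, the timeout value, and the idea of a list of "managed" pids are
all assumptions.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the slave last heard from the master (hypothetical)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the policy can be tested
        self.last_seen = clock()

    def heartbeat(self):
        """Record that a heartbeat from the master just arrived."""
        self.last_seen = self.clock()

    def master_lost(self):
        """True once no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_seen > self.timeout_s


def pids_to_kill(watchdog, managed_pids):
    """Policy: if the master is unreachable, every managed process must die,
    so the slave's view agrees with the master's view that they are dead."""
    return sorted(managed_pids) if watchdog.master_lost() else []
```

A real monitor would then signal each returned pid.  Note that this only
narrows the window: process B can still write to the file for up to
timeout_s seconds after the master has declared it dead.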

How about username/uid mapping?  If /etc/passwd changes after a program has
migrated, does the program see the changes?

How about setuid programs?  Do they work correctly?  I mentioned shared
memory; that seems like an issue.

Do non-BProc processes on a slave see the BProc slave processes?  If so, and
one sends a signal to a slave process, where does the slave process see the
signal come from (si_pid in siginfo_t)?  The real sender simply doesn't exist
from the slave's point of view.  You could disallow the sending of such
signals, but this breaks the expectations of non-BProc processes on the slave.
Or maybe non-BProc processes on the slave can't see BProc processes at all.
This seems problematic, though, because those processes can hold resources
(e.g., they can sit on a mounted filesystem, which will thus be mysteriously
unmountable).  So it seems like there's maybe some subtlety lurking here.
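For reference, this is the si_pid field I mean, shown on an ordinary
single-host Unix system with Python's standard signal module (nothing here is
BProc-specific).  The process signals itself as a stand-in for a sender; the
helper name is made up.

```python
import os
import signal


def sender_of_next_signal(signum=signal.SIGUSR1):
    """Send signum to this process and return the si_pid the kernel reports."""
    # Block the signal first so it stays pending instead of triggering the
    # default action (which would terminate the process for SIGUSR1).
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        os.kill(os.getpid(), signum)         # stand-in for the sender
        info = signal.sigwaitinfo({signum})  # consume the pending signal
        return info.si_pid
    finally:
        # Restore whatever mask was in effect before the call.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print("signal came from pid", sender_of_next_signal())
```

On a single host the reported pid always names a process the receiver could
look up.  My question is what a slave-local process gets in that field when
the sender lives on another node.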

I can imagine that there's much more that I haven't thought of.  This should
give you a taste of my concerns, anyway.

Again, I'm not saying that BProc is bad.  I just want to understand what its
limitations are.  I think that anyone that's really going to use it needs to
know this stuff, both to use it effectively and to weigh it against the
alternatives.

> I am often asked about the semantics of node failure.
> First, much like a segmentation violation in a local child process,
> that's not the normal or typical operation.
> 
> Process control takes place over reliable streams, and failure policy is
> implemented in a user-level control process.  When the master can no
> longer communicate with a slave process, the process is considered to
> have died abnormally.  You won't be able to get process exit
> accounting information, but there are few other differences from a local
> process.
> 
> Application recovery after a node failure still has to be handled on a
> case-by-case basis.  But this is always true.  Most HPTC users select
> the tradeoff of running at full speed, and re-running jobs that are
> affected by the rare case of a node failure.  Long running jobs do their
> own application-specific checkpointing, which is much more efficient
> than doing full-memory-image checkpointing.

Thanks, that is helpful.

Mike