On 18 May 2002, Mike Coleman wrote:

> Donald Becker <becker@scyld.com> writes:
> > With the Scyld system, compute nodes are dramatically simplified.  They
> > run a fully capable standard kernel with extensions, and start out with
> > no file system (actually a RAM-based filesystem).
> >
> > There are many advantages to this approach.
> >   Adding new compute nodes is fast and automatic
> >   The system is easily scalable to over a thousand nodes
> >   Single-point updates for kernel, device drivers, libraries and applications
> >   Jobs run faster on compute nodes than on a full installation
>
> For the sake of argument, I'm comparing this with an nfsroot setup (with a
> common /etc for all slaves, /var on ramdisk, /dev on devfs, everything else
> mounted read-only straight off the master, except for application files
> mounted r/w).

Setting up an NFS root isn't quite as simple as that.  There are many
exceptions: most of /etc/ can be shared, but a few files (e.g. /etc/fstab)
must exist per-node.  There are enough exceptions in different directories
that most NFS root systems use an explicitly created file tree on the file
server for each node in the cluster.

> It looks like points 1 and 3 would work the same and I don't
> see why point #4 would be true.

The faster execution is the effect of
  - cluster directory (name) services
  - wired-down core libraries
  - no housekeeping daemons
  - integrated MPI and PVM
  - fast BProc job initiation

Examples of the first point, cluster directory services, are
  - using cluster node names such as ".23" -- destination addresses can be
    calculated just by knowing the IP address of ".0" (a sketch of this
    calculation appears below)
  - sending user names with processes, to avoid /etc/passwd look-ups or
    network traffic

The last two points relate only to job start-up time.  Cluster directory
services and the clean BProc semantics make MPI and PVM job start-up much
simpler.  BProc reduces node process creation time to about 1/10 and 1/20 of
the time taken by 'rsh' and 'ssh' respectively.  Job start-up time isn't
important for long-running jobs, but it is important if you want to use a
cluster as an interactive machine rather than a batch system.

> It does seem like scalability is something you'd have to keep an eye on.  In
> the comparison setup, you're read-only mounting a lot of files off of the
> master.  I'm not sure how many hosts you can do this with before you start
> running into trouble, but it does seem like it should scale somewhat
> (at least with parameter tweaking, as you pointed out).

Clusters are a much different situation than workstations -- all nodes want
the same services at the same time.  Using NFS root with default parameters
scales very poorly.  You can do much better by using multiple NFS mounts with
different attributes and caching parameters, but this requires expertise and
on-going monitoring.  Just managing NFS quickly becomes complex, both for
users and administrators.

> To me, BProc seems considerably more complex.  You pretty much have to
> understand BProc.

You only need to understand BProc if you are writing system tools.  End
users, administrators and MPI developers don't need to learn new concepts.

> > The user sees BProc as the unified process space over the cluster.  They
> > can see and control all processes of their job using Unix tools they
> > already know, such as 'top', 'ps', 'suspend' and 'kill'.
>
> Yes, and this is really nice.  But BProc also seems to be stretching POSIX
> pretty hard.

In what way is it stretching POSIX?  Are there any semantics that are
questionable?
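To make the address calculation mentioned above concrete, here is a rough
sketch in C.  The node-0 address and the simple "base + node number" rule are
illustrative assumptions, not the exact Scyld mapping; the point is only that
a node name like ".23" resolves with a little arithmetic instead of
/etc/hosts, NIS or DNS traffic.

    /* Sketch: derive a compute node's address from the address of node ".0".
     * Assumes (for illustration only) that node addresses are consecutive
     * on the cluster-internal network. */
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>

    int main(void)
    {
        const char *node0 = "10.0.0.10";  /* hypothetical address of ".0" */
        int node = 23;                    /* cluster node name ".23" */
        struct in_addr base, addr;

        if (inet_aton(node0, &base) == 0) {
            fprintf(stderr, "bad base address\n");
            return 1;
        }
        /* Add the node number to the host part of the base address. */
        addr.s_addr = htonl(ntohl(base.s_addr) + (unsigned int)node);
        printf("node .%d -> %s\n", node, inet_ntoa(addr));
        return 0;
    }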
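The process-control side is just as plain.  Because a job's processes show up
as ordinary PIDs in the master's process table, a control tool needs nothing
beyond standard POSIX signals.  A minimal sketch (the PID list here is
hypothetical; a real tool would collect it from 'ps' or from the job
launcher):

    /* Sketch: suspend, then terminate, the processes of a job from the
     * master node using plain kill(2).  Under BProc the remote processes
     * are visible as normal PIDs, so no cluster-specific call is needed. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    int main(void)
    {
        pid_t job[] = { 1234, 1235, 1236 };   /* hypothetical job PIDs */
        size_t i, n = sizeof(job) / sizeof(job[0]);

        for (i = 0; i < n; i++)               /* suspend the job (SIGSTOP) */
            if (kill(job[i], SIGSTOP) == -1)
                perror("kill(SIGSTOP)");

        /* ... later, clean up the whole job ... */
        for (i = 0; i < n; i++)
            if (kill(job[i], SIGTERM) == -1)
                perror("kill(SIGTERM)");

        return 0;
    }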
> In normal Linux, when I kill(2) a process, failure (as in
> communications failure) is not a possibility.  It seems like a lot of subtle
> failure modes could be hiding here.

[[ Please avoid innuendo such as that last statement.  You are implying a
problem without being specific about what it might be or standing behind the
claim. ]]

I am often asked about the semantics of node failure.  First, much like a
segmentation violation in a local child process, that's not the normal or
typical operation.  Process control takes place over reliable streams, and
failure policy is implemented in a user-level control process.  When the
master can no longer communicate with a slave process, the process is
considered to have died abnormally.  You won't be able to get process exit
accounting information, but there are few other differences from a local
process.

Application recovery after a node failure still has to be handled on a
case-by-case basis, but this is always true.  Most HPTC users select the
tradeoff of running at full speed and re-running the jobs that are affected
by the rare case of a node failure.  Long-running jobs do their own
application-specific checkpointing, which is much more efficient than doing
full-memory-image checkpointing.

-- 
Donald Becker                              becker@scyld.com
Scyld Computing Corporation                http://www.scyld.com
410 Severn Ave. Suite 210                  Second Generation Beowulf Clusters
Annapolis MD 21403                         410-990-9993