On 13 May 2002, Mike Coleman wrote:
> Donald Becker <becker@scyld.com> writes:
> > The Scyld distribution includes features such as unified process space,
> > single point / single binary application installation, cluster directory
> > services and zero-administration compute nodes.  These features require
> > kernel and library support.
>
> I think we're using the word 'distribution' in slightly different
> ways.

I don't think so -- the Scyld distribution is a complete OS installation
with integrated cluster tools.  The CD set is a full distribution.

> Beowulf software, though it may require special versions of libraries
> and patched kernels, is, from the distribution perspective, just a set
> of packages.  As far as I can see, the pieces could be packaged for the
> different distributions with little difficulty.

Viewed that way, there is little difference between Linux distributions:
they are just sets of packages with an installation program.  They all
use approximately the same kernels, libraries, compilers, and utilities.

But that view discounts the value of a distribution.  Unless you have an
integrated distribution, you can't provide a complete, tested solution.
Large file support (LFS) is an example: two years ago we were the first
to ship a distribution with tested LFS, which workstation-oriented
distributions didn't see as a priority.  That wouldn't have been
feasible with add-on tools for arbitrary distributions.

> It does seem that the only reasonable way to structure a cluster of
> any size is for the compute nodes to somehow mirror some canonical
> copy of their filesystems.

Not at all.  In a cluster, compute nodes exist to run jobs on behalf of
the master systems.  Putting a full installation on a compute node
increases the complexity, the administrative burden, and the opportunity
for failure.  With the Scyld system, compute nodes are dramatically
simplified.
They run a fully capable standard kernel with extensions, and start out
with no file system (actually, a RAM-based root filesystem).  This
approach has many advantages:
  - Adding new compute nodes is fast and automatic.
  - The system scales easily to over a thousand nodes.
  - Kernel, device driver, library, and application updates are made at
    a single point.
  - Jobs run faster on a compute node than on a full installation.

Presenting a simple model to the user is very important.  Using an NFS
root makes things simple for the person installing the system, but that
is a hack, not an architected system: system administration requires
detailed knowledge of which types of files go on which file systems, NFS
has significant performance and scaling bottlenecks, and the users have
to deal with NFS consistency and caching quirks.

> The Bproc stuff looks interesting.  I worry, though, about the amount
> of brain surgery being done to the Linux/Unix model (which goes back
> to that simplicity thing).

Our system is much more than BProc, although many of the tools are built
using BProc as a base.  The system has dozens of integrated pieces, such
as a cluster name service, status monitoring tools, integrated MPI, and
a unified administration system.

The user sees BProc as the unified process space over the cluster.  They
can see and control all the processes of their job using Unix tools they
already know, such as 'top', 'ps', 'suspend', and 'kill'.

> As before, since I have yet to build my first Beowulf, take all this
> with an extra large grain of salt.

You should look at the Scyld system -- we have innovative approaches to
solving problems that people have accepted as inherent in building
clusters.  You can buy a low-cost CD (note that it is an older version
and comes with no support) from Linux Central.  Or you can buy
integrated clusters from about a dozen vendors, including HP-Compaq.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993
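[The point about the unified process space is that a job's processes,
wherever they run, appear in the master's own process table, so ordinary
Unix tools act on the whole job.  A minimal sketch of that workflow,
assuming a running job whose name -- "my_solver" here -- is hypothetical:]

```shell
# Under BProc, processes on compute nodes show up in the master's
# process table, so one 'ps' listing covers the whole cluster job.
# The job name "my_solver" is a hypothetical example.
JOB=my_solver
PIDS=$(pgrep "$JOB" || true)    # find every process of the job

ps -ef | head -n 5              # a single listing spans all nodes

if [ -n "$PIDS" ]; then
    kill -STOP $PIDS            # suspend the job cluster-wide
    kill -CONT $PIDS            # resume it
fi
```

The same signals reach remote processes because BProc forwards them
through the master's process space; no per-node login is required.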