Donald Becker <becker@scyld.com> writes:

> > Beowulf software, though it may require special versions of libraries
> > and patched kernels, is, from the distribution perspective, just a
> > set of packages.  As far as I can see, the pieces could be packaged
> > for the different distributions with little difficulty.
>
> Viewed that way, there is little difference between Linux
> distributions.  They are just a set of packages with an installation
> program.  They all use approximately the same kernels, libraries,
> compilers and utilities.

True, in a way.  But I think there are important differences.  Debian's
public bug tracking system is really, really nice, for example.

Text editors are much the same, too, but I still strongly prefer to use
software (e.g., mail readers) that lets me use the editor that best
meets my requirements, rather than the editor that the authors of the
software have decided would be best.

> But that discounts the value of a distribution.  Unless you have an
> integrated distribution, you can't provide a complete, tested
> solution.  LFS large file support is an example.  Two years ago we
> were the first to ship a distribution with tested LFS, which
> workstation-oriented distributions didn't see as a priority.  That
> wouldn't have been feasible with add-on tools for arbitrary
> distributions.

This is a good point, and a significant challenge for anyone who wants
to sell clustering software on Linux.  You could simply stipulate that
the distribution must correctly support LFS.  That's not an entirely
satisfactory solution, though.

> Not at all.  In a cluster, compute nodes exist to run jobs on behalf
> of the master systems.  Putting a full installation on a compute node
> increases the complexity, administrative burden, and opportunity for
> failure.
>
> With the Scyld system, compute nodes are dramatically simplified.
> They run a fully capable standard kernel with extensions, and start
> out with no file system (actually a RAM-based filesystem).
> There are many advantages of this approach:
>
>  1. Adding new compute nodes is fast and automatic
>  2. The system is easily scalable to over a thousand nodes
>  3. Single-point updates for kernel, device drivers, libraries and
>     applications
>  4. Jobs run faster on compute nodes than on a full installation

For the sake of argument, I'm comparing this with an nfsroot setup
(with a common /etc for all slaves, /var on a ramdisk, /dev on devfs,
everything else mounted read-only straight off the master, except for
application files mounted r/w).  It looks like points 1 and 3 would
work the same, and I don't see why point 4 would be true.

It does seem like scalability is something you'd have to keep an eye
on.  In the comparison setup, you're read-only mounting a lot of files
off of the master.  I'm not sure how many hosts you can do this with
before you start running into trouble, but it does seem like it should
scale somewhat (at least with parameter tweaking, as you pointed out).

> Presenting a simple model to the user is a very important thing.
> Using an NFS root makes it simple for the person installing the
> system, but that is a hack, not an architected system.  Doing system
> administration will require detailed knowledge of what types of files
> to put on which file systems, NFS has significant performance and
> scaling bottlenecks, and the users will have to deal with NFS
> consistency and caching quirks.

I agree that keeping things simple is very important.  One of the
things I like about the nfsroot setup is that the story for admins and
users is pretty simple.  It's basically a network of fairly vanilla
Linux boxes, except that the OS filesystems of the slaves are read-only
or in RAM, and all admin tasks need to be done on the master.  That set
of workstations can then be used in the obvious way, or with MPI, etc.

To me, BProc seems considerably more complex.  You pretty much have to
understand BProc.
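To make the comparison concrete, a slave's fstab in the nfsroot setup
I'm describing might look something like this (the master name and
export paths are made up, and I'm using tmpfs to stand in for the
ramdisk):

```
# hypothetical slave fstab for the nfsroot comparison setup
master:/export/slave-root  /      nfs    ro,hard,intr  0 0
master:/export/usr         /usr   nfs    ro,hard,intr  0 0
master:/export/home        /home  nfs    rw,hard,intr  0 0
none                       /var   tmpfs  defaults      0 0
none                       /dev   devfs  defaults      0 0
none                       /proc  proc   defaults      0 0
```

The point being that nearly everything is shared read-only, so slaves
can't drift out of sync, and only the application areas are writable
over NFS.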
You can pretend that you're running everything on a single system, but
you stand to get bitten a lot if you really believe it.  ("One of my
processes wrote a file to /tmp; why can't my other processes see it?"
"Why doesn't shared memory work?")

> Our system is much more than BProc, although many of the tools are
> built using BProc as a base.  The system has dozens of integrated
> pieces such as a cluster name service, status monitoring tools,
> integrated MPI, and a unified administration system.

What can I say?  That sounds really complicated...  :-)

> The user sees BProc as the unified process space over the cluster.
> They can see and control all the processes of their job using Unix
> tools they already know, such as 'top', 'ps', 'suspend' and 'kill'.

Yes, and this is really nice.  But BProc also seems to be stretching
POSIX pretty hard.  In normal Linux, when I kill(2) a process, failure
(as in communications failure) is not a possibility.  It seems like a
lot of subtle failure modes could be hiding here.

> You should look at the Scyld system

I will.  Thanks for the info!

Mike