On 6 Feb 2005, at 11:04 am, Tony Travis wrote:

> Hello, Tim.
>
> We only have a 'small' 64-node cluster here :-)
>
> However, I've opted to use BOBCAT architecture:
>
> http://www.epcc.ed.ac.uk/bobcat/
>
> Although the original EPCC BOBCAT no longer exists, its spirit lives
> on in our RRI/BioSS cluster:
>
> http://bobcat.rri.sari.ac.uk
>
> The important thing is to have TWO completely separate private network
> fabrics: one for DHCP/NFS, the other for IPC. The main problem we have
> is that IPC (i.e. Inter-Process Communication) can swamp the bandwidth
> of a single network fabric and you rapidly lose control of the
> cluster.

We don't have any IPC. We don't run any parallel code; each job runs on a
single CPU. And NFS *still* causes problems, occasionally. It really isn't
a myth at this scale; it's unusable. For example, we have to make separate
copies of the LSF binaries on all of the machines, because doing it the
Platform-endorsed way, with everything NFS-mounted, is a bit flaky. The
NFS contention from LSF's housekeeping alone can be enough to break the
cluster.

I suspect that if you're running large parallel jobs, the number of NFS
operations involved is relatively low. The issue for us is sometimes
hundreds of jobs completing every minute, all trying to read some data
files and then create three or four output files on an NFS-mounted disk.
That's a lot of separate NFS operations, a large proportion of which are
the particularly painful directory operations. I plead with the users not
to write code like this, but you know what users are like.

> I think there are some MYTHS about NFS and clusters around because of
> the bandwidth contention on a single network fabric. The NFS network
> traffic on our cluster is completely segregated from the IPC traffic,
> which is throttled by the bandwidth of its own network fabric. The
> switches on the two network fabrics are NOT connected in any way...
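(For the curious, the kind of job wrapper we ask users to write instead
looks roughly like the sketch below. All paths are made up for
illustration, and a local directory stands in for the NFS mount; the only
real LSF detail is the LSB_JOBID environment variable.)

```shell
#!/bin/sh
# Sketch of a friendlier batch job: do all intermediate I/O on
# node-local disk, then copy the results to the shared filesystem in
# one pass at the end.  NFS_DIR stands in for the real NFS mount and
# all paths here are hypothetical.
set -e

JOB_ID="${LSB_JOBID:-demo}"           # LSF sets LSB_JOBID for real jobs
NFS_DIR="${NFS_DIR:-/tmp/nfs-demo}"   # would be the NFS-mounted disk
SCRATCH="/tmp/scratch.$JOB_ID"        # node-local scratch space

mkdir -p "$SCRATCH"
cd "$SCRATCH"

# ... the real analysis would run here, writing its three or four
# output files locally instead of straight onto the NFS mount ...
echo "result" > output1.txt
echo "log"    > output2.txt

# One mkdir and a few sequential copies hit the NFS server, rather
# than a stream of small create/write/directory operations spread
# over the whole lifetime of the job.
mkdir -p "$NFS_DIR/job.$JOB_ID"
cp output*.txt "$NFS_DIR/job.$JOB_ID/"

cd /
rm -rf "$SCRATCH"
```

With hundreds of jobs finishing a minute, collapsing each job's NFS
footprint to one directory creation and a couple of writes makes a real
difference to the server.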
Our approach is actually similar to yours; we're moving towards cluster
filesystems like GPFS and Lustre, and in those cases we run the cluster
filesystem traffic over a second network. It's actually a VLAN on the same
switches, but that's not the performance problem you might think, because
the Extreme switches we use are fully non-blocking. You can throw an
absolutely obscene number of packets at them and they cope fine, even when
a Ganglia bug caused a machine to emit thousands of multicast packets to
all 1000 machines every second. The Ganglia daemons went to 100% CPU coping
with the incoming packets, which made the cluster almost unusable, but the
network itself was still going strong.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233  FE3D 6C73 BBD6 726A  A3F5 860B 3CDD 3F56  E313 4233