On 6 Feb 2005, at 11:04 am, Tony Travis wrote:

> Hello, Tim.
>
> We only have a 'small' 64-node cluster here :-)
>
> However, I've opted to use BOBCAT architecture:
>
> http://www.epcc.ed.ac.uk/bobcat/
>
> Although the original EPCC BOBCAT no longer exists, its spirit lives
> on in our RRI/BioSS cluster:
>
> http://bobcat.rri.sari.ac.uk
>
> The important thing is to have TWO completely separate private network
> fabrics: one for DHCP/NFS, the other for IPC. The main problem we have
> is that IPC (i.e. Inter-Process Communication) can swamp the bandwidth
> of a single network fabric and you rapidly lose control of the
> cluster.

We don't have any IPC. We don't run any parallel code; each job runs on a
single CPU. And NFS *still* causes problems, occasionally. It really isn't
a myth at this scale; it's unusable. For example, we have to make separate
copies of the LSF binaries on all of the machines, because doing it the
Platform-endorsed way, with everything NFS-mounted, is a bit flaky. The
NFS contention from LSF's housekeeping alone can be enough to break the
cluster.

I suspect that if you're running large parallel jobs, the number of NFS
operations involved is relatively low. The issue for us is sometimes
hundreds of jobs completing every minute, all trying to read some data
files and then create three or four output files on an NFS-mounted disk.
That's a lot of separate NFS operations, a large proportion of which are
the particularly painful directory operations. I plead with the users not
to write code like this, but you know what users are like.

> I think there are some MYTHS about NFS and clusters around because of
> the bandwidth contention on a single network fabric. The NFS network
> traffic on our cluster is completely segregated from the IPC traffic,
> which is throttled by the bandwidth of its own network fabric. The
> switches on the two network fabrics are NOT connected in any way...
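(For the curious, the kind of job wrapper we ask users to write instead
looks roughly like the sketch below. All paths are made up for
illustration, and a local directory stands in for the NFS mount; the only
real LSF detail is the LSB_JOBID environment variable.)

```shell
#!/bin/sh
# Sketch of a friendlier batch job: do all intermediate I/O on
# node-local disk, then copy the results to the shared filesystem in
# one pass at the end.  NFS_DIR stands in for the real NFS mount and
# all paths here are hypothetical.
set -e

JOB_ID="${LSB_JOBID:-demo}"           # LSF sets LSB_JOBID for real jobs
NFS_DIR="${NFS_DIR:-/tmp/nfs-demo}"   # would be the NFS-mounted disk
SCRATCH="/tmp/scratch.$JOB_ID"        # node-local scratch space

mkdir -p "$SCRATCH"
cd "$SCRATCH"

# ... the real analysis would run here, writing its three or four
# output files locally instead of straight onto the NFS mount ...
echo "result" > output1.txt
echo "log"    > output2.txt

# One mkdir and a few sequential copies hit the NFS server, rather
# than a stream of small create/write/directory operations spread
# over the whole lifetime of the job.
mkdir -p "$NFS_DIR/job.$JOB_ID"
cp output*.txt "$NFS_DIR/job.$JOB_ID/"

cd /
rm -rf "$SCRATCH"
```

With hundreds of jobs finishing a minute, collapsing each job's NFS
footprint to one directory creation and a couple of writes makes a real
difference to the server.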
Our approach is actually similar to yours; we're moving towards cluster
filesystems like GPFS and Lustre, and in those cases we run the cluster
filesystem traffic over a second network. It's actually a VLAN on the same
switches, but that's not the performance problem you might think, because
the Extreme switches we use are fully non-blocking. You can throw an
absolutely obscene number of packets at them and they cope fine, even when
a Ganglia bug caused a machine to emit thousands of multicast packets to
all 1000 machines every second. The Ganglia daemons went to 100% CPU coping
with the incoming packets, which made the cluster almost unusable, but the
network itself was still going strong.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233  FE3D 6C73 BBD6 726A  A3F5 860B 3CDD 3F56  E313 4233