[Bioclusters] NFS / SMC switches / GFS

DGS dgs at gs.washington.edu
Mon Aug 29 00:47:41 EDT 2005


> 
> Now that we have all 63 up and running it looks like we are
> getting performance issues with NFS much in the same way
> that others have reported here. Even moderate job loads
> produce trouble (nfsstat -c shows lots of retransmissions),

Are you using NFS over TCP?  If not, you probably should.  The
one reliability trade-off is that the TCP connection is stateful:
if the file server goes down, clients may hang.  But since your
file server is your head node, it's mostly a moot point.  Lose
the head node, and you lose the cluster anyway.
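
For reference, switching is just a mount option.  Something like
the following /etc/fstab entry on each compute node would do it
("headnode:/export/home" is a placeholder for your own export):

    # NFS over TCP, hard mount, interruptible
    headnode:/export/home  /home  nfs  tcp,hard,intr  0 0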

> grid engine execds don't report back in, so qhost shows nodes not
> responding, though eventually they will return. On occasion one of
> the switches stops and that whole "side" of the cluster disappears,
> so we reboot the switch and are back in action. Anyway, here are my
> questions (thanks for your patience in reading through this)
> 
> Has anyone had similar problems with these SMC switches?
> I'm not accustomed to having the switches die like this.
> 
> In terms of improving NFS performance, I've already
> put the SGE spool onto the local nodes to try to improve things,
> but it only helps a little. There are various NFS tuning
> documents with respect to clusters (using the tcp, atime, rsize,
> wsize, etc. options to mount). I've experimented with a few of
> these (rsize, wsize), though with only very marginal positive impact.
> For those with larger clusters and similar issues, have you found
> a subset of these options to be more key or influential than others?

If you use NFS/TCP, the "rsize" and "wsize" parameters are
irrelevant.  The Linux NFS how-to suggests raising the 'sysctl'
values of "net.core.rmem_max" and "net.core.rmem_default" above
their usual values of 64k.  You should also pay attention to the
number of 'nfsd' processes running on your server.  The rule of
thumb is eight per CPU.  In principle, the more clients you have,
the more 'nfsd' processes you want.  But multiple server processes
contend for resources themselves, so you reach a point of
diminishing returns in starting more.
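
As a rough sketch (the buffer sizes are just the how-to's
suggested values, and RPCNFSDCOUNT assumes a Red Hat-style
/etc/sysconfig/nfs; adjust for your distribution):

    # /etc/sysctl.conf on the file server: raise socket buffer limits
    net.core.rmem_max = 262144
    net.core.rmem_default = 262144

    # apply without rebooting
    sysctl -p

    # /etc/sysconfig/nfs: e.g. eight threads per CPU on a two-CPU box
    RPCNFSDCOUNT=16

The "th" line in /proc/net/rpc/nfsd gives a rough picture of how
busy the existing threads are, which helps in deciding whether
starting more is worth it.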

> 
> One scenario that has been discussed is bonding two NICs
> on the v40z in conjunction with switch trunking. Does anyone
> have any opinions or ideas on this?


If your switch can trunk, go ahead.  I trunk together gigabit
Ethernet interfaces on a FreeBSD file server.  I've heard rumours
to the effect that a four-way trunk on Linux can be slower than
a two-way, due to problems in the bonding driver.  Regard that
as hearsay, however, because I don't have any experience with
such things on Linux.  You might also consider using jumbo
frames, if your switches support them.
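
On the Linux side, a minimal sketch, assuming 802.3ad (LACP) mode
with the switch ports configured to match; the interface names
and address below are only placeholders:

    # load the bonding driver in 802.3ad mode
    modprobe bonding mode=802.3ad miimon=100

    # bring up the bond and enslave the two gigabit interfaces
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1

    # jumbo frames: only if every NIC and switch in the path supports them
    ifconfig bond0 mtu 9000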

> Lastly, is it even worth
> it to keep messing with NFS?  And maybe go for GFS?

There are a number of parallel or cluster file systems in
addition to GFS, like PVFS2 (free), Lustre (sort of free),
GPFS (free to universities), TeraFS (commercial), and Ibrix
(commercial).  They may not work well for hosting home
directories, because they're not optimized for that sort
of I/O load.  They're also, in my experience, rather less
than stable.  We built a fifty-node cluster with just GPFS,
no NFS, and very little local disk.  The results were quite
disappointing.

File I/O is one of the major unsolved problems of cluster
computing.  Anybody who tells you otherwise is trying to
sell you something.

David S.
