[Bioclusters] NFS / SMC switches / GFS

Joe Landman landman at scalableinformatics.com
Sun Aug 28 23:42:40 EDT 2005


Hi Steve:

Steve Pittard wrote:

> Now that we have all 63 up and running it looks like we are
> getting performance issues with NFS much in the same way
> that others have reported here. Even moderate job loads
> produce trouble (nfsstat -c shows lots of retransmissions),

Ok.  You need to isolate whether it is the single spindle, the file
system, or the network that is maxing out.

Get atop (http://www.atcomputing.nl/Tools/atop) and see whether the disk
is maxed out (100% utilization), the network is maxed out, or both.
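
If atop is not handy, a quick pass with the standard tools will usually
tell you which side is saturated.  This is only a sketch; the device and
interface names will differ on your systems:

    # on the NFS server: per-disk utilization and throughput
    iostat -x 5
    # network throughput per interface
    sar -n DEV 5
    # on a client: RPC retransmissions and timeouts
    nfsstat -c

If iostat shows the exported disk pinned near 100% utilization while the
NIC is well below wire speed, the spindle is the bottleneck; if the NIC
sits at roughly 100 MB/s while the disk is mostly idle, the pipe is.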

> grid engine execds don't report back in, so qhost shows nodes not
> responding, though eventually they will return. On occasion one of
> the switches stops and that whole "side" of the cluster disappears,
> so we reboot the switch and are back in action. Anyway, here are my
> questions (thanks for your patience in reading through this):
> 
> Has anyone had similar problems with these SMC switches?
> I'm not accustomed to having switches die like this.

We haven't had much luck when pushing SMC and other lower-end switches
very hard.  We have been using (and spec'ing out) HP Procurves for the
last few years for smaller clusters (up to about 64 nodes).

It also sounds like your network design is a tree: not a fat tree, but a
basic tree.  This could explain the collisions, since bandwidth up the
tree is a limited, consumable resource.  We usually spec this design for
a maximum of about 16 CPUs in moderate IO situations.

> In terms of improving NFS performance I've already
> put the SGE spool onto the local nodes to try to improve things,
> but it only helps a little. There are various NFS tuning
> documents with respect to clusters (using tcp, atime, rsize,
> wsize, etc. options to mount). I've experimented with a few of
> these (rsize, wsize) though with only very marginal positive impact.
> For those with larger clusters and similar issues, have you found
> a subset of these options to be more key or influential than others?

Well, until we know exactly where the bottleneck is, the best we can do 
is guess.  So here are a few suggestions.

The first thing would be to divide and conquer your traffic.  A single
gigabit pipe can handle a maximum of about 100 MB/s in practice.  If you
have even marginal utilization of the remote-mounted disk, say each
machine uses 5% of the available bandwidth for IO, then after 20
machines you have filled the pipe.  With 63 machines, you have at least
a 3x oversubscription of the pipe.  I would (at minimum) get one of
those quad gigabit ethernet adaptors for your head node, and either
channel bond the ports or divide the traffic among 4 network addresses
for your servers.
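
For the channel bonding route, a minimal sketch on Linux looks roughly
like the following (Red Hat style config layout; the interface names,
addresses, and bonding mode are assumptions, and 802.3ad needs matching
link aggregation configured on the switch ports):

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=802.3ad miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.1.1
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0  (repeat for eth1-eth3)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

Note that bonding does not make any single client faster than one link;
it spreads the aggregate load from many clients across the four ports.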

Second, a single spindle can (at best) read about 70 MB/s for large
blocks.  Even marginal utilization from the compute nodes could hammer
that disk.  You really need multiple spindles, not hanging off an FC
adaptor, but with a high speed connection to the same backplane the
network cards are on.  FC won't get you there.  U320 could, for a well
designed and implemented set of spindles.  SATA would be the most
cost-effective, best price/performance route you could go.  Some of the
SATA file systems we have on clusters at our customer sites sustain in
excess of 300 MB/s under heavy load, and much higher performance units
can be constructed.  Using 10 GbE or IB, we could source (with some
effort) about 800 MB/s sustained from a server.  If you need really
scalable IO, you need to look at tools like Panasas, Ibrix, etc.  Be
prepared for significant costs.  Then again, these tools scale (unlike a
fair number of other solutions such as NFS).  Panasas is especially
interesting in that you can keep scaling it, and it is quite easy to
use.

> One scenario that has been discussed is bonding two NICs
> on the v40z in conjunction with switch trunking. Does anyone
> have any opinions or ideas on this? Lastly, is it even worth
> it to keep messing with NFS, or should we maybe go for GFS?

I think you might be network pipe and spindle limited, and possibly file
system limited as well (ext3 has a number of bottlenecks in its
journaling code).  If you have the opportunity to switch file systems,
add network cards, and add spindles, that combination is what we would
recommend.
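
As a rough sketch of that route (software RAID across SATA spindles, XFS
on the server, tuned NFS mounts on the clients), something along these
lines; the device names, stripe level, and rsize/wsize values are
assumptions to test against your own workload:

    # server: stripe several spindles together (RAID0 shown for
    # simplicity; use RAID5 or RAID10 if you need redundancy)
    mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[abcd]1
    mkfs.xfs /dev/md0
    mount /dev/md0 /export

    # server: /etc/exports
    # (async is faster but risks data loss if the server crashes)
    /export  192.168.1.0/24(rw,async,no_subtree_check)

    # client: /etc/fstab entry with the sort of options you were testing
    server:/export  /data  nfs  tcp,hard,intr,noatime,rsize=32768,wsize=32768  0 0

The file system and RAID layer matter at least as much as the mount
options; measure with your own job mix before settling on numbers.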

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


