Hi Steve:

Steve Pittard wrote:

> Now that we have all 63 up and running it looks like we are
> getting performance issues with NFS much in the same way
> that others have reported here. Even moderate job loads
> produce trouble - (nfsstat -c shows lots of retransmissions),

Ok. You need to isolate whether it is the single spindle, the file
system, or the network that is maxing out. Get atop
(http://www.atcomputing.nl/Tools/atop) and see if the disk is maxed
out (100% utilization) or the network is maxed out. Or both. I have
sketched a few example commands further down.

> grid engine execds don't report back in so qhost shows nodes not
> responding though eventually they will return. On occasion one of
> the switches stops and that whole "side" of the cluster disappears,
> so we reboot the switch and are back in action. Anyway here are my
> questions (thanks for your patience in reading through this)
>
> Has anyone had similar problems with these SMC switches ?
> I'm not accustomed to having the switches die like this.

We haven't had much luck when pushing SMC and other lower end switches
very hard. We have been using (and specing out) HP ProCurves for the
last few years for smaller clusters (through about 64 nodes).

It sounds like your network design is also a tree, and a basic tree at
that, not a fat tree. This could explain the collisions: uplink
bandwidth is a limited, consumable resource. We usually spec this
design for a maximum of 16 CPUs in moderate IO situations.

> In terms of improving NFS performance I've already
> put SGE spool onto the local nodes to try to improve things
> but it only helps a little. There are various NFS tuning
> documents with respect to clusters (using tcp, atime, rsize,
> wsize, etc options to mount). I've experimented with a few of
> these (rsize, wsize) though with only very marginal positive impact.
> For those with larger clusters and similar issues, have you found
> a subset of these options to be more key or influential than others ?

Well, until we know exactly where the bottleneck is, the best we can
do is guess. So here are a few suggestions.

First, divide and conquer your traffic. A single gigabit pipe can
handle a maximum of roughly 100 MB/s. If you have even marginal
utilization of the remote mounted disk, say each machine uses 5% of
the available bandwidth (about 5 MB/s) for IO, then after 20 machines
you have filled the pipe. With 63 machines that is on the order of
315 MB/s of demand against a 100 MB/s link, so you have at least a 3x
oversubscription of the pipe. I would (at minimum) get one of those
quad gigabit ethernet adaptors for your head node, and either channel
bond the ports or divide the traffic among 4 network addresses for
your servers (a rough bonding sketch is below).

Second, a single spindle can (at best) read about 70 MB/s for large
blocks, so even marginal utilization from the compute nodes could
hammer that disk. You really need multiple spindles; not spindles
behind an FC adaptor, but multiple spindles with a high speed
connection to the same backplane the network cards are on. FC won't
get you there. U320 could, for a well designed/implemented set of
spindles. SATA would be the most cost effective, best price/performance
route you could go. Some of the SATA file systems we have on clusters
at our customer sites sustain in excess of 300 MB/s under heavy load,
and much higher performance units can be constructed. Using 10 GbE or
IB, we could source (with some effort) about 800 MB/s sustained from a
server.

If you need really scalable IO, you need to look at tools like
Panasas, Ibrix, etc. Be prepared for significant costs. Then again,
these tools scale, unlike a fair number of other solutions such as
NFS. Panasas is especially interesting in that you can keep scaling
it, and it is not that hard to use; quite easy, in fact.
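To make the diagnosis concrete, here is a minimal sketch of what I
would run, assuming Linux on both the compute nodes and the NFS
server (iostat comes from the sysstat package):

    # On a compute node: client-side RPC statistics. A retrans count
    # that climbs under load points at the network, or at a server
    # too busy to answer in time.
    nfsstat -c

    # On the NFS server: watch disk and network together, updating
    # every 5 seconds. A disk line pinned at 100% busy means the
    # spindle is the wall.
    atop 5

    # Per-disk view of the same thing: a %util column at or near 100
    # means that spindle is maxed out.
    iostat -x 5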
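On the mount options themselves: given the retransmissions you are
seeing, moving from UDP to TCP mounts is usually the single biggest
win. A sketch of a client fstab entry, assuming NFSv3; the server
name and paths are placeholders, and the 32k rsize/wsize values are
starting points to test against your workload, not gospel:

    # /etc/fstab on a compute node (hypothetical names)
    # tcp        : avoids UDP retransmission storms on a loaded net
    # hard,intr  : don't silently corrupt IO on server hiccups, but
    #              allow jobs to be killed if the server goes away
    # rsize/wsize: more data per RPC; try 8k, 16k, 32k and measure
    # noatime    : skip access-time updates, one less bit of traffic
    server:/export/data  /data  nfs  tcp,hard,intr,rsize=32768,wsize=32768,noatime  0 0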
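And on channel bonding (this bears on your trunking question below as
well), with the Linux bonding driver it looks roughly like the
following. The interface names and address are hypothetical, and
802.3ad mode needs matching trunk/LACP support configured on the
switch:

    # /etc/modprobe.conf (modules.conf on 2.4 kernels): load the
    # bonding driver. mode=4 is 802.3ad (LACP) link aggregation;
    # miimon=100 checks link state every 100 ms.
    alias bond0 bonding
    options bond0 mode=4 miimon=100

    # Bring up the bond and enslave the four ports of the quad card.
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth1 eth2 eth3 eth4

The alternative of four separate addresses is simpler on the switch
side: give each port its own IP and point a quarter of the compute
nodes at each one.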
> One scenario that has been discussed is bonding two NICs
> on the v40z in conjunction with switch trunking. Does anyone
> have any opinions or ideas on this ? Lastly is it even worth
> it to keep messing with NFS ? And maybe go for GFS.

I think you might be network pipe and spindle limited, and possibly
file system limited as well (ext3 has a number of bottlenecks in its
journaling code). If you have the opportunity to switch file systems,
add network cards, and add spindles, that is the direction we would
recommend.

Joe

--
Joseph Landman, Ph.D
Founder and CEO, Scalable Informatics LLC
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615