Which of these (PVFS2, Lustre, GPFS) have some level of redundancy?

===============================================
David Coornaert (dcoorna at dbm.ulb.ac.be)
Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles
Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041 Gosselies
BELGIQUE
Tél: +3226509975   Fax: +3226509998
===============================================

DGS wrote:

>> Now that we have all 63 up and running, it looks like we are
>> getting performance issues with NFS much in the same way
>> that others have reported here. Even moderate job loads
>> produce trouble (nfsstat -c shows lots of retransmissions).
>
> Are you using NFS over TCP? If not, you probably should. That
> introduces some reliability problems, in that NFS/TCP is no
> longer stateless. If the file server goes down, clients may
> hang. But since your file server is your head node, it's mostly
> a moot point. Lose the head node, and you lose the cluster
> anyway.
>
>> grid engine execds don't report back in, so qhost shows nodes not
>> responding, though eventually they will return. On occasion one of
>> the switches stops and that whole "side" of the cluster disappears,
>> so we reboot the switch and are back in action. Anyway, here are my
>> questions (thanks for your patience in reading through this):
>>
>> Has anyone had similar problems with these SMC switches?
>> I'm not accustomed to having the switches die like this.
>>
>> In terms of improving NFS performance, I've already put the SGE
>> spool onto the local nodes to try to improve things, but it only
>> helps a little. There are various NFS tuning documents with
>> respect to clusters (using the tcp, atime, rsize, wsize, etc.
>> options to mount). I've experimented with a few of these (rsize,
>> wsize), though with only very marginal positive impact. For those
>> with larger clusters and similar issues, have you found a subset
>> of these options to be more key or influential than others?
>
> If you use NFS/TCP, the "rsize" and "wsize" parameters are
> irrelevant. The Linux NFS how-to suggests raising the 'sysctl'
> values of "net.core.rmem_max" and "net.core.rmem_default" higher
> than their usual values of 64k. You should also pay attention
> to the number of 'nfsd' processes running on your server. The
> rule of thumb is eight per CPU. In principle, the more clients
> you have, the more 'nfsd' processes you want. But multiple server
> processes contend for resources themselves, so you reach a point
> of diminishing returns in starting more.
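For reference, this is roughly what those suggestions translate to on a
Linux client and server. The export path, mount point and numeric values
below are purely illustrative, not settings anyone in this thread has
actually tested:

  # client side: mount home directories over TCP
  # (rsize/wsize shown only for completeness; with NFS/TCP they matter far less)
  mount -t nfs -o tcp,hard,intr,rsize=32768,wsize=32768 head:/export/home /home

  # server side: raise the socket buffer limits, as the Linux NFS how-to suggests
  sysctl -w net.core.rmem_max=262144
  sysctl -w net.core.rmem_default=262144

  # server side: run more nfsd threads (rule of thumb: about eight per CPU);
  # on Red Hat-style systems this is RPCNFSDCOUNT in /etc/sysconfig/nfs
  rpc.nfsd 16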
>> One scenario that has been discussed is bonding two NICs
>> on the v40z in conjunction with switch trunking. Does anyone
>> have any opinions or ideas on this?
>
> If your switch can trunk, go ahead. I trunk together gigabit
> ethernet interfaces on a FreeBSD file server. I've heard rumours
> to the effect that a four-way trunk on Linux can be slower than
> a two-way one, due to problems in the bonding driver. Regard that
> as just hearsay, however, because I don't have any experience
> with such things on Linux. You might consider using jumbo
> frames, if your switches support that.
>
>> Lastly, is it even worth it to keep messing with NFS, or should
>> we maybe just go for GFS?
>
> There are a number of parallel or cluster file systems in
> addition to GFS, like PVFS2 (free), Lustre (sort of free),
> GPFS (free to universities), TeraFS (commercial), and Ibrix
> (commercial). They may not work well for hosting home
> directories, because they're not optimized for that sort of
> I/O load. They're also, in my experience, rather less than
> stable. We built a fifty-node cluster with just GPFS, no NFS,
> and very little local disk. The results were quite disappointing.
>
> File I/O is one of the major unsolved problems of cluster
> computing. Anybody who tells you otherwise is trying to
> sell you something.
>
> David S.
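On the bonding question, a minimal sketch of what two-NIC bonding looks
like on a Red Hat-style Linux node, purely as a starting point; the
interface names, addresses and the 802.3ad mode are assumptions, and
802.3ad in particular needs the matching trunk (LACP) configuration on
the switch ports:

  # /etc/modprobe.conf -- load the bonding driver in LACP mode with link monitoring
  alias bond0 bonding
  options bonding mode=802.3ad miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded interface carries the IP
  DEVICE=bond0
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 -- enslave the physical NIC
  # (ifcfg-eth1 is identical apart from DEVICE)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

If the SMC switches turn out not to support LACP, mode=balance-alb needs no
switch-side configuration at all; and jumbo frames are just a matter of adding
MTU=9000 to the same ifcfg files, provided every port along the path is set
the same way.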