Hi Carlos,

If it's any help, we had similar problems with our cluster. Our
solution was to train the users to include code in their scripts that
creates local directories on the compute node (in /tmp), copies the
files they need into those directories, does the computing locally, and
then copies the results back. This greatly reduced the NFS load (we have
a 120-node G5 compute cluster with 4 NFS servers serving directories
from an Xsan array to all the compute nodes).

Otherwise our setup is similar to yours: OS X 10.3.8 with SGE 6.0u4,
dual-processor compute nodes with at least 2 GB of RAM each, and a mix
of home-grown C and Fortran codes that do a lot of I/O, plus Perl and
awk scripts that read large text files.

> I had a question to see if anyone had any knowledge of a problem we've
> been encountering. It seems our Apple cluster is crashing due to NFS.
> When we run large batch jobs that frequently access an NFS mount, the
> system ends up accumulating 'stuck' processes. If the job is able to
> finish it eventually cleans the 'stuck' processes, and all is well.
> But, if the job continues to allow accumulation of these stuck
> processes, if a given job runs long enough, the system slowly
> deteriorates and becomes less and less responsive, eventually freezing
> up and not allowing anything to function at all.
>
> We started the maximum number of NFS servers (20) and this improved
> things, but didn't fix them. We also limited the jobs to 10 nodes (20
> processors) to theoretically allow one node to access one NFS pipeline
> at any given time. I'm not sure if anyone has run into this before, or
> if anyone has ideas on how to approach fixing this problem. The only
> errors we're seeing otherwise are in the system log, complaining about
> PasswordService not matching the clients response.
>
> We're still running OSX 10.3.8 and our jobs are running through SGE
> 5.3.
> And we've got a 16 node (32 processor G5 system) with at least 2gb
> RAM per node. The programs running are a mixture of text mining
> algorithms in both Perl and Java. Both requiring frequent reads on
> large .txt files residing on NFS shared directories.
>
> Thanks in advance, for any ideas or suggestions.
>
> *****************************************

-- 
M. Michael Barmada, Ph.D.
Associate Professor of Human Genetics
Graduate School of Public Health, University of Pittsburgh
=====================================================================
There are three kinds of people in this world: those that can count,
and those that can't...

The requirements said: Windows 2000 or better. So I got a Macintosh.

To know the mighty works of God; to comprehend His wisdom and majesty
and power; to appreciate, in degree, the wonderful working of His laws,
surely all this must be a pleasing and acceptable mode of worship to the
Most High, to whom ignorance can not be more grateful than knowledge.
~Copernicus
=====================================================================
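
P.S. In case a concrete starting point is useful, here is a rough sketch
of the staging pattern described at the top of this message: copy the
inputs from NFS to a local scratch directory, compute there, copy the
results back, and clean up. It is only an illustration, not what our
users actually run. The function name run_staged, the compute callback,
and the "in"/"out" subdirectory layout are all made up for the example,
and it assumes the input files have distinct base names.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(inputs, results_dir, compute, scratch_root=None):
    """Copy inputs to node-local scratch, compute there, copy results back.

    inputs       -- paths on the shared (NFS) filesystem
    results_dir  -- shared directory that receives the output files
    compute      -- hypothetical callback: compute(local_inputs, out_dir)
    scratch_root -- node-local directory such as "/tmp"; None lets
                    tempfile pick the system default
    """
    scratch = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    try:
        in_dir = scratch / "in"
        out_dir = scratch / "out"
        in_dir.mkdir()
        out_dir.mkdir()

        # One bulk copy per input file instead of many small NFS reads.
        # Assumes distinct base names; duplicates would overwrite each other.
        local_inputs = [Path(shutil.copy2(src, in_dir)) for src in inputs]

        # All of the heavy I/O happens on the local disk from here on.
        compute(local_inputs, out_dir)

        # One bulk copy per result file back to the shared directory.
        results_dir = Path(results_dir)
        results_dir.mkdir(parents=True, exist_ok=True)
        copied = []
        for result in sorted(out_dir.iterdir()):
            if result.is_file():
                copied.append(Path(shutil.copy2(result, results_dir)))
        return copied
    finally:
        # Always remove the scratch directory so local disk does not fill up.
        shutil.rmtree(scratch, ignore_errors=True)
```

A job script would call run_staged with its NFS input paths, a results
directory on the shared filesystem, and a function that does the real
work on the local copies. The try/finally matters on a busy cluster: a
job that dies halfway should still leave the node's scratch space clean.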