[Bioclusters] OS X and NFS

M. Michael Barmada barmada at pitt.edu
Wed Jul 13 14:01:49 EDT 2005


Hi Carlos,

If it's any help, we had similar problems with our cluster. Our solution
was to train users to add code to their job scripts that creates a local
directory on the compute node (in /tmp), copies the files they need into
it, does the computing locally, and then copies the results back. This
greatly reduced the NFS load (we have a 120-compute-node G5 cluster with
4 NFS servers serving directories from an Xsan array to all the compute
nodes). Otherwise our setup is similar to yours: OS X 10.3.8 with SGE
6.0u4, dual-processor compute nodes with at least 2 GB of RAM each, and
programs that are a mixture of home-grown C and Fortran codes that do a
lot of I/O, plus Perl and awk scripts that read large text files.
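
For anyone who wants a starting point, the stage-in / compute / stage-out
pattern described above might look roughly like the following in Python.
This is only a sketch, not the code our users actually run: run_staged,
count_lines, and the "in"/"out" directory names are hypothetical, and it
assumes the scratch area is either $TMPDIR (which a scheduler may set per
job) or /tmp.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(input_files, results_dir, compute, scratch_root=None):
    """Copy inputs to node-local scratch, compute there, copy results back.

    Hypothetical helper: `compute` is any callable taking
    (list_of_local_input_paths, local_output_dir).
    """
    # scratch_root=None lets tempfile honour $TMPDIR and fall back to /tmp.
    scratch = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    try:
        in_dir = scratch / "in"
        out_dir = scratch / "out"
        in_dir.mkdir()
        out_dir.mkdir()

        # Stage in: one sequential copy per input file from the shared mount.
        local_inputs = [Path(shutil.copy2(src, in_dir)) for src in input_files]

        # Repeated reads and intermediate writes now hit local disk only.
        compute(local_inputs, out_dir)

        # Stage out: copy the result files back to the shared directory.
        results_dir = Path(results_dir)
        results_dir.mkdir(parents=True, exist_ok=True)
        copied = []
        for item in sorted(out_dir.iterdir()):
            if item.is_file():
                copied.append(Path(shutil.copy2(item, results_dir)))
        return copied
    finally:
        # Always remove the scratch directory so local disk does not fill up.
        shutil.rmtree(scratch, ignore_errors=True)


def count_lines(local_inputs, out_dir):
    """Toy stand-in for a real job: write a line count for each input file."""
    for path in local_inputs:
        with open(path) as fh:
            n = sum(1 for _ in fh)
        (out_dir / (path.name + ".count")).write_text(f"{n}\n")
```

A job would call something like run_staged(["/shared/data/big.txt"],
"/shared/results", count_lines), where both paths are placeholders for
directories on the NFS mount. The point of the design is that the shared
filesystem sees one bulk read at the start and one bulk write at the end,
rather than a steady stream of small reads for the life of the job.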

> I had a question to see if anyone had any knowledge of a problem we've
> been encountering.  It seems our Apple cluster is crashing due to NFS.
> When we run large batch jobs that frequently access an NFS mount, the
> system ends up accumulating 'stuck' processes.  If the job is able to
> finish, it eventually cleans up the 'stuck' processes, and all is well.
> But if a job runs long enough and keeps accumulating these stuck
> processes, the system slowly deteriorates and becomes less and less
> responsive, eventually freezing up and not allowing anything to
> function at all.
> 
> We started the maximum number of NFS servers (20) and this improved
> things, but didn't fix them.  We also limited the jobs to 10 nodes (20
> processors) to theoretically allow one node to access one NFS pipeline
> at any given time.  I'm not sure if anyone has run into this before, or
> if anyone has ideas on how to approach fixing this problem.  The only
> errors we're seeing otherwise are in the system log, complaining about
> PasswordService not matching the client's response.
> 
> We're still running OS X 10.3.8, and our jobs run through SGE 5.3.  We
> have a 16-node (32-processor) G5 system with at least 2 GB of RAM per
> node.  The programs are a mixture of text-mining algorithms in Perl and
> Java, both of which require frequent reads of large .txt files residing
> on NFS-shared directories.
> 
> Thanks in advance, for any ideas or suggestions.
 
-- 
M. Michael Barmada, Ph.D.
Associate Professor of Human Genetics
Graduate School of Public Health, University of Pittsburgh




