[Bioclusters] OS X and NFS

Wed Jul 13 14:18:56 EDT 2005

> 
> Does anyone have experience with a problem we've been encountering?
> Our Apple cluster appears to be crashing because of NFS.  When we run
> large batch jobs that frequently access an NFS mount, the system
> accumulates 'stuck' processes.  If the job manages to finish, the
> stuck processes are eventually cleaned up and all is well.  But if a
> job runs long enough for the stuck processes to keep accumulating,
> the system becomes less and less responsive and eventually freezes
> completely.

Though I have little experience with OS X, I'd guess that you're
reaching the limits of NFS performance on your system.  There are
some things you can try to improve matters:

   - Copy the data files to the execution nodes' local disks,
     and read them there rather than from NFS.

   - Get more NFS servers - the machines themselves, not just the 
     server processes running on them - so that you decrease the I/O 
     load on each.

   - Add enough RAM to the execution nodes to cache the data file
     in the file system buffer, so that you read from memory rather 
     than NFS. 

   - Get a specialized storage appliance that can handle a higher
     concurrent I/O load than your NFS server.  I don't have any
     experience with these, or know which ones will work with OS X,
     so I can't make any recommendations.

   - If there are any sort of parallel or cluster file systems
     available for OS X, try one.  My experience with these on Linux
     with heavy I/O processing - BLAST and megaBLAST - isn't 
     encouraging.
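
As a rough illustration of the first suggestion, here is a minimal
sketch of staging a data file onto a node's local disk before reading
it.  It is written in Python purely for illustration (your jobs are in
Perl and Java, where the same idea applies).  The function name
stage_to_local and the scratch directory are hypothetical, and the
sketch assumes each node has a local scratch area with enough room
for the file.

```python
import os
import shutil
import tempfile


def stage_to_local(nfs_path, scratch_dir):
    """Copy nfs_path into scratch_dir (if needed) and return the local path.

    Hypothetical helper: the name and the scratch location are
    assumptions for illustration, not part of SGE or OS X.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    local_path = os.path.join(scratch_dir, os.path.basename(nfs_path))

    # One stat call against NFS; raises if the source file is missing.
    src = os.stat(nfs_path)

    # Reuse an existing local copy when size and modification time match,
    # so repeated jobs on the same node do not copy the file again.
    try:
        dst = os.stat(local_path)
        if (dst.st_size == src.st_size
                and int(dst.st_mtime) == int(src.st_mtime)):
            return local_path
    except FileNotFoundError:
        pass

    # Copy to a temporary name first, then rename into place, so that
    # another job on the same node never opens a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=scratch_dir)
    os.close(fd)
    try:
        shutil.copy2(nfs_path, tmp_path)  # copy2 preserves the mtime
        os.replace(tmp_path, local_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return local_path
```

A job would call this once at startup, for example
stage_to_local("/nfs/data/corpus.txt", "/tmp/scratch") (both paths
made up), and then do all of its frequent reads against the returned
local path.  Each node then reads the file from NFS once per update
instead of on every access.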

Concurrent I/O is a serious weakness of cluster systems.  I'm not
aware of any magic solution to the problem.

David S.

> We started the maximum number of NFS servers (20) and this improved 
> things, but didn't fix them.  We also limited the jobs to 10 nodes (20 
> processors) to theoretically allow one node to access one NFS pipeline 
> at any given time.  I'm not sure if anyone has run into this before, or 
> if anyone has ideas on how to approach fixing this problem.  The only 
> errors we're seeing otherwise are in the system log, complaining about 
> PasswordService not matching the client's response.
> 
> We're still running OS X 10.3.8, and our jobs run through SGE 5.3.
> We have a 16-node G5 system (32 processors) with at least 2 GB of RAM
> per node.  The programs are a mixture of text mining algorithms in
> Perl and Java, both of which require frequent reads of large .txt
> files residing on NFS-shared directories.
> 
> Thanks in advance for any ideas or suggestions.
> 
> Juan Perin
> Children's Hospital of Philadelphia