> Respectfully, and at the risk of sounding ridiculously naive --
> why not consider upgrading the I/O switching technology to Myrinet
> or Infiniband for higher bandwidth and ultra-low latency, before
> buying more servers?

A good question, but it addresses the wrong layer in this particular
conversation.  The network filesystem rests on top of the network
fabric.  The problems in this case are things like the number of
concurrent accesses allowed through a single server, and thrashing at
the disk level.  Even given a perfect network (zero latency, infinite
bandwidth) we would still need to have a conversation about the
scalability of the file server.  In fact, we would need to have that
conversation a lot sooner.

Other respondents have hit it on the head: there are two basic
approaches.  Modify the algorithm, or beef up the server.

Algorithm mods include copying or staging files to local space first,
operating on disks local to the nodes as much as possible, and
batching I/O rather than the classic "open FILEHANDLE; do_everything;
close FILEHANDLE" approach.

For beefing up the fileserver, we've had good results with Apple's
Xsan, price/performance-wise.  It's common knowledge that an
enterprise-scale file server can easily cost more than the cluster
it's supposed to serve.

-Chris Dwan
 The BioTeam

> Quoting Juan Carlos Perin <bic at genome.chop.edu>:
>
>> I had a question to see if anyone had any knowledge of a problem
>> we've been encountering.  It seems our Apple cluster is crashing
>> due to NFS.  When we run large batch jobs that frequently access
>> an NFS mount, the system ends up accumulating 'stuck' processes.
>> If the job is able to finish, it eventually cleans up the 'stuck'
>> processes, and all is well.
>> But if the job continues to accumulate these stuck processes and
>> runs long enough, the system slowly deteriorates and becomes less
>> and less responsive, eventually freezing up and not allowing
>> anything to function at all.
>>
>> We started the maximum number of NFS server daemons (20) and this
>> improved things, but didn't fix them.  We also limited the jobs to
>> 10 nodes (20 processors) to, in theory, allow each node to access
>> one NFS pipeline at any given time.  I'm not sure if anyone has
>> run into this before, or if anyone has ideas on how to approach
>> fixing this problem.  The only errors we're seeing otherwise are
>> in the system log, complaining about PasswordService not matching
>> the client's response.
>>
>> We're still running OS X 10.3.8 and our jobs are running through
>> SGE 5.3.  We've got a 16-node (32-processor) G5 system with at
>> least 2 GB of RAM per node.  The programs running are a mixture of
>> text-mining algorithms in both Perl and Java, both requiring
>> frequent reads on large .txt files residing in NFS-shared
>> directories.
>>
>> Thanks in advance for any ideas or suggestions.
>>
>> Juan Perin
>> Children's Hospital of Philadelphia
>> _______________________________________________
>> Bioclusters maillist - Bioclusters at bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters
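The algorithm-side fix mentioned above (stage the big input to local
disk once, do all the repeated reads there, write results back in one
pass) can be sketched as an SGE-style job script.  This is only an
illustrative fragment: the file names and the grep step are
hypothetical stand-ins for the Perl/Java text-mining work, and on a
real node the input would live on the NFS mount while $TMPDIR (which
SGE sets to per-job local scratch) would sit on the node's own disk.

```shell
#!/bin/sh
# Sketch of the "stage to local space first" pattern for an SGE job.
# All paths are hypothetical stand-ins; on a real cluster NFS_DATA
# would be on the NFS mount and SCRATCH on node-local disk.

# Stand-in for a large input file on the shared NFS directory:
NFS_DATA=/tmp/fake_nfs_corpus.txt
printf 'gene alpha\nnothing here\ngene beta\n' > "$NFS_DATA"

SCRATCH="${TMPDIR:-/tmp}/stage.$$"    # node-local scratch space
mkdir -p "$SCRATCH"

# One bulk copy across NFS instead of thousands of small reads:
cp "$NFS_DATA" "$SCRATCH/corpus.txt"

# All of the heavy, repeated I/O now hits the local disk only
# (this grep stands in for the real text-mining step):
grep -c 'gene' "$SCRATCH/corpus.txt" > "$SCRATCH/count.txt"

# One bulk write back to the shared filesystem at the end:
cp "$SCRATCH/count.txt" /tmp/fake_nfs_count.txt

rm -rf "$SCRATCH"
cat /tmp/fake_nfs_count.txt
```

The point of the pattern is that the NFS server sees exactly two bulk
transfers per job, rather than a long stream of small reads from every
node competing for the same daemons and disk heads.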