> I had a question to see if anyone had any knowledge of a problem we've
> been encountering. It seems our Apple cluster is crashing due to NFS.
> When we run large batch jobs that frequently access an NFS mount, the
> system ends up accumulating 'stuck' processes. If the job is able to
> finish it eventually cleans the 'stuck' processes, and all is well.
> But, if the job continues to allow accumulation of these stuck
> processes, if a given job runs long enough, the system slowly
> deteriorates and becomes less and less responsive, eventually freezing
> up and not allowing anything to function at all.

Though I have little experience with OS X, I'd guess that you're
reaching the limits of NFS performance on your system. There are some
things you can try to improve matters:

- Copy the data files to the execution nodes' local disks, and read
  them there rather than from NFS.

- Get more NFS servers - the machines themselves, not just the server
  processes running on them - so that you decrease the I/O load on
  each.

- Add enough RAM to the execution nodes to cache the data file in the
  file system buffer, so that you read from memory rather than NFS.

- Get a specialized storage appliance that can handle a higher
  concurrent I/O load than your NFS server. I don't have any
  experience with these, and don't know which ones will work with
  OS X, so I can't make any recommendations.

- If there are any sort of parallel or cluster file systems available
  for OS X, try one. My experience with these on Linux with heavy I/O
  processing - BLAST and megaBLAST - isn't encouraging.

Concurrent I/O is a serious weakness of cluster systems. I'm not aware
of any magic solution to the problem.

David S.

> We started the maximum number of NFS servers (20) and this improved
> things, but didn't fix them. We also limited the jobs to 10 nodes (20
> processors) to theoretically allow one node to access one NFS pipeline
> at any given time.
> I'm not sure if anyone has run into this before, or
> if anyone has ideas on how to approach fixing this problem. The only
> errors we're seeing otherwise are in the system log, complaining about
> PasswordService not matching the clients response.
>
> We're still running OSX 10.3.8 and our jobs are running through SGE
> 5.3. And we've got a 16 node (32 processor G5 system) with at least 2gb
> RAM per node. The programs running are a mixture of text mining
> algorithms in both Perl and Java. Both requiring frequent reads on
> large .txt files residing on NFS shared directories.
>
> Thanks in advance, for any ideas or suggestions.
>
> Juan Perin
> Children's Hospital of Philadelphia
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
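
For what it's worth, the first suggestion in the reply above (copying the
data files to local disk before reading them) might be sketched roughly as
follows. This is only an illustration under assumptions: it is written in
Python for brevity even though the jobs in question are Perl and Java, and
the function name `stage_to_local` and the example paths are hypothetical,
not anything from the cluster described in this thread.

```python
import os
import shutil
import tempfile


def stage_to_local(nfs_path, scratch_dir):
    """Copy nfs_path into scratch_dir if needed; return the local path.

    Hypothetical helper: each job calls this once per input file and then
    does all of its repeated reads against the returned local copy.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    local_path = os.path.join(scratch_dir, os.path.basename(nfs_path))

    src = os.stat(nfs_path)
    if os.path.exists(local_path):
        dst = os.stat(local_path)
        # Reuse the local copy if size and modification time still match.
        if dst.st_size == src.st_size and int(dst.st_mtime) == int(src.st_mtime):
            return local_path

    # Copy to a temporary name first so another job on the same node never
    # sees a half-written file, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=scratch_dir)
    os.close(fd)
    try:
        shutil.copy2(nfs_path, tmp_path)  # copy2 also preserves the mtime
        os.replace(tmp_path, local_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return local_path
```

A job would then call something like
`stage_to_local("/path/on/nfs/corpus.txt", "/path/to/local/scratch")` once
at startup (both paths are placeholders) and open the returned path for
every later read, so NFS sees one sequential copy per node instead of many
concurrent small reads. Whether this helps depends on the files fitting on
the nodes' local disks.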