On 7 Mar 2006, at 8:05 pm, Shane Brubaker wrote:

> Hi, Shane here from the JGI, I wanted to post back and attempt to
> answer some of these questions about our "disappearing" array job
> tasks.
> I don't know the answer to all these, but the question about NIS
> errors pops out.  We have been having NIS and NFS problems quite a
> bit, so I suspect that could be why.
>
> Soon we will be moving our cluster onto a better network switch,
> and also have increased a cache size on our LDAP server.  We've been
> working to improve our NFS problems too.  It seems like that may
> help - lately the problems seem to have gone away.  I've also
> implemented a "cleanup" step in our workflow system which re-submits
> missing tasks one at a time just in case.
>
> [ snip ]
> Is there a network issue that I've caused by running too much stuff
> at the same time, broken NIS/NFS?
> Yes

Are your Linux nodes running the Name Service Cache Daemon (nscd)?
We found that running it on all of our cluster nodes quite
drastically reduces the pounding the NIS servers receive.  It's not
without its problems though; because it's a cache, the nodes will
sometimes take a while to notice any NIS map updates.

Is it not also possible to replicate your LDAP server, so that the
load from the cluster nodes is distributed over more than one server?

Regards,

Tim
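
To illustrate the cache trade-off mentioned above, here is a minimal
Python sketch of a time-to-live cache sitting in front of a slow name
lookup.  It is only an illustration of the general idea and says
nothing about how nscd is actually implemented; the names TTLCache,
server_lookup and demo, the 60 second lifetime and the sample map
entry are all made up for the example.

```python
import time


class TTLCache:
    """Remember lookup results for `ttl` seconds (hypothetical sketch)."""

    def __init__(self, lookup, ttl, clock=time.monotonic):
        self.lookup = lookup    # function that queries the real server
        self.ttl = ttl          # how long a cached answer stays valid
        self.clock = clock      # injectable so the demo can fake time
        self.entries = {}       # key -> (value, expiry time)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]       # served locally, the server is not contacted
        value = self.lookup(key)
        self.entries[key] = (value, now + self.ttl)
        return value


def demo():
    directory = {"shane": 1001}  # stands in for a NIS map (made-up data)
    queries = []                 # every query that reaches the "server"
    now = [0.0]                  # fake clock, in seconds

    def server_lookup(name):
        queries.append(name)
        return directory[name]

    cache = TTLCache(server_lookup, ttl=60.0, clock=lambda: now[0])

    first = cache.get("shane")   # cache miss: goes to the server
    directory["shane"] = 2002    # the map is updated on the server
    now[0] = 30.0
    stale = cache.get("shane")   # still inside the TTL: old answer
    now[0] = 61.0
    fresh = cache.get("shane")   # entry expired: server is asked again
    return first, stale, fresh, len(queries)


if __name__ == "__main__":
    print(demo())
```

Three lookups cost the server only two queries, which is the load
reduction, but the second lookup still returns the old value even
though the map has changed, which is the delay in noticing updates.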