Grid Engine 6.0 through 6.0u1 currently has a bit of a catch-22 functionality hole when it comes to building fault-tolerant SGE installations:

 o You can use shared NFS with the time-tested shadow master feature, but you have to use "classic" spooling instead of berkeleydb spooling. Classic spooling is slower and less scalable. Honestly though, for many people classic spooling is not going to make much of a throughput or performance difference.

 o You can get around the "can't write berkeleydb files to shared NFS mount" problem by running the berkeley RPC spooling server. In this mode, spooling is done over the network to a remote RPC server -- this allows shadow masters to pick up the pieces after the qmaster falls over. You get the "fast/new" spooling technology with shadow master functionality, but...

The trouble with RPC server spooling (besides it being characterized as incredibly insecure) is that you can currently have only one RPC server. This effectively makes the use of shadow masters quite silly, as you still have a single point of failure (the RPC server is now your critical failure point).

Rayson mentioned one possible workaround -- use NFSv4 and berkeleydb spooling. I have not tested this myself yet.

Another approach that we have tested and seen work is to use a shared SAN volume between SGE master hosts. Our testbed for this was a 100+ node Apple G5 Xserve cluster in which the 4x "head nodes" shared an SGE 6.0u1 spool volume via Apple's XSAN software. Failover worked fine when we knocked over head nodes. Our setup used a beta release of the XSAN product, so I would not call this 100% rock solid or production-ready yet.

This functionality hole is just a byproduct of the new adoption of berkeleydb under the hood. I'm guessing NFSv4, and hopefully some way to run multiple RPC servers in the future, will make this a non-issue.
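To put rough numbers on the single-point-of-failure argument, here is a back-of-the-envelope sketch in Python. It is not anything SGE provides: the function names, the 0.99 per-host availability figure, and the assumption that hosts fail independently are all made up for illustration.

```python
def any_up(p_up: float, n: int) -> float:
    """Probability that at least one of n independent hosts is up."""
    return 1.0 - (1.0 - p_up) ** n


def cluster_availability(p_host: float, n_masters: int,
                         n_spool_servers: int = 1) -> float:
    """Availability of the qmaster service under a simple model.

    Assumption (hypothetical model): the service works only if at least
    one master/shadow host is up AND at least one spool server is up.
    """
    return any_up(p_host, n_masters) * any_up(p_host, n_spool_servers)


if __name__ == "__main__":
    p = 0.99  # assumed per-host availability, illustrative only
    for masters in (1, 2, 4):
        # With a single RPC spool server the result can never exceed p,
        # no matter how many shadow masters are added.
        print(masters, cluster_availability(p, masters))
```

Under this toy model, adding shadow masters only pushes the first factor toward 1; the overall figure stays capped by the availability of the lone RPC server.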
-chris

Rayson Ho wrote:
> It's rather a limitation of NFS -- but if you are using NFSv4 for the
> spool directory, you can use shadow master with Berkeley spooling.
>
> Rayson
>
> --- Steve <slitster@rcn.com> wrote:
>
>>Not positive on this, but, I think 6.1 may allow the use of a shadow
>>master in combination with the Berkeley database functionality.
>>
>>
>>Steve
>
> https://bioinformatics.org/mailman/listinfo/bioclusters