Hi, I would like to follow up on this topic and let you know what we tried and what conclusions we reached. First, I would really like to thank everyone who provided feedback on this problem. Especially as we move from a home-grown cluster management system to SGE, it makes a huge difference to know that we can get so much expert help on our questions.

As you know, we were trying to run large numbers of short array jobs. Our workflow system was using files to read back the return codes. This did not scale and ran into NFS issues. We brainstormed, tried several solutions, and went back over the emails we received from various people.

First I tried an array job that concatenates the results into a pair of master stdout and stderr files. This approach lost about 5% of the writes (they simply disappeared).

Next I tried logging directly to the database. At the end of each array task, a script called db_logger.pl logs the return code, hostname, etc. to the database. This approach was much better and is the one we went with. So far I am very satisfied with it, and it scales to 100,000 jobs quite nicely.

The db_logger approach still had a write loss rate of about 0.1%. It leaves a sqlnet log file that mentions a "Fatal NI connect error", which is related to the SQLNET.INBOUND_CONNECT_TIMEOUT parameter: some jobs were failing to get a connection. We discussed this with Oracle - this timeout is "indefinite" by default on Oracle 10g, and we changed it to 5 seconds, though it is not clear whether that helped. By using our whole cluster (100 nodes) restricted to 1 slot per node, or by using less than the whole cluster, I can sometimes get the write loss rate down to 0%. In any case, the workflow system now falls back to a serial qsub mode at the end to clean up the remaining 0.1% of jobs.
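To make the logging step concrete, here is a rough sketch of what the end-of-task logging amounts to. The table and column names here are illustrative only (our actual db_logger.pl is a Perl script with its own schema), and for the sake of the example the INSERT statement is printed rather than piped to the database client:

```shell
#!/bin/sh
# Sketch of a db_logger-style epilogue. Hypothetical table/columns:
#   job_log(task_id, return_code, hostname, finished_at)
# In production the printed INSERT would be piped to sqlplus (or similar)
# instead of going to stdout.

run_and_log() {
    task_id="$1"; shift
    "$@"                      # run the actual task command
    rc=$?
    host=$(hostname)
    stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf "INSERT INTO job_log (task_id, return_code, hostname, finished_at) VALUES (%s, %s, '%s', '%s');\n" \
        "$task_id" "$rc" "$host" "$stamp"
    return "$rc"
}

# Example: record task 42 running the command "true"
run_and_log 42 true
```

The key point is that the return code is captured on the execution host itself, immediately after the command finishes, rather than being written to a file on NFS and read back later.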
The workflow system clears error states on SGE automatically using qmod -cj, and db_logger now takes care of resubmissions by exiting with code 99 (SGE's re-queue exit code) when a job fails and has not yet reached its maximum allowed number of resubmissions. Between these layers of protection, runs of 10,000 or more jobs complete with a 100% success rate.

We also needed stdout and stderr to go to a variety of specific places, which SGE does not currently provide for array jobs. Our array job script therefore takes a command list, an seflist (stderr file list), and an soflist (stdout file list). For each task it runs head -$SGE_TASK_ID on each file and pipes the result to tail -1 to get the line belonging to that SGE_TASK_ID, then evals the command with stdout and stderr redirected to the specific files desired. If no file list is given, we still use the master stdout and stderr files, with the understanding that the database is the more reliable record of the results. The qsub command itself uses -e /dev/null -o /dev/null.

For 10,000 1-second jobs this process takes about 1.5 hours; for 100,000 jobs it takes about 15 hours.

As an aside, we discovered an interesting way to make these jobs run very, very fast on SGE. Normally these jobs produce essentially no load across the cluster, but by making a queue of 10 nodes with 100 slots each and submitting 1,000 jobs at a time, we were able to put a significant load on the servers. However, this method produces write errors of over 60%, even with the database, so it will not work when detailed logging is required - though it might be useful for someone who just wants to run an array job without the logging.

So, it seems we have found a fairly workable solution. I want to thank all of you again for all the input and help - it was invaluable!

Sincerely,
Shane Brubaker
Joint Genome Institute
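P.S. In case it helps anyone replicate the per-task redirection described above, here is a minimal sketch of the head/tail line-extraction trick. The file names (cmdlist, soflist, seflist) and function names are shorthand for this example, not our actual script, and the demo sets SGE_TASK_ID by hand where SGE would normally set it:

```shell
#!/bin/sh
# Each list file has one line per task; line $SGE_TASK_ID belongs to this task.

line_for_task() {
    # Pull line number $SGE_TASK_ID out of a file: head down to that
    # line, then tail -1 keeps only the last line of that head.
    head -n "$SGE_TASK_ID" "$1" | tail -n 1
}

run_task() {
    cmd=$(line_for_task cmdlist)
    sof=$(line_for_task soflist)
    sef=$(line_for_task seflist)
    # eval so quoting and pipes inside the command line still work,
    # with stdout/stderr redirected to this task's specific files.
    eval "$cmd" > "$sof" 2> "$sef"
}

# Demo with two tasks; on the cluster, SGE sets SGE_TASK_ID itself.
workdir=$(mktemp -d)
cd "$workdir"
printf 'echo task-one\necho task-two\n' > cmdlist
printf 'out.1\nout.2\n'                 > soflist
printf 'err.1\nerr.2\n'                 > seflist

SGE_TASK_ID=2
run_task
cat out.2     # prints: task-two
```

With -e /dev/null -o /dev/null on the qsub line, this leaves only the per-task files the lists point at, and nothing on the master output path.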