[Bioclusters] SGE array job produces too many files and concatenation is slow

Shane Brubaker brubaker2 at llnl.gov
Mon Feb 13 14:48:08 EST 2006


Hi, I would like to follow up on this topic in the interest of letting you 
know what we tried and what conclusions we reached.

First, I would really like to thank everyone who provided feedback on this problem. Especially as we move from a home-grown cluster management system to SGE, it makes a huge difference to know that we can get so much expert help from people on questions like this.

As you know, we were trying to run large numbers of short array jobs. We had a workflow system that read each job's return code from a file; this did not scale and ran into NFS issues.

We brainstormed and tried several solutions, and looked over the emails we 
got back from various people.

First I tried an array job that concatenates its results into a single pair of master stdout and stderr files. This approach lost about 5% of the writes (they simply never made it into the files).
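
For what it is worth, the pattern was roughly the sketch below (the task command and file paths are illustrative, not our actual script). Every task of the array job appends to the same pair of master files on a shared NFS directory, and concurrent appends from many hosts are not atomic over NFS, which is the likely source of the lost writes.

    #!/bin/sh
    #$ -t 1-10000
    # Illustrative only: my_task_command and the master file paths are made up.
    # Every task appends its output to the same two files on shared NFS storage.
    my_task_command "$SGE_TASK_ID" >> /shared/master.stdout 2>> /shared/master.stderr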

Next I tried logging directly to the database. At the end of each array task, a script called db_logger.pl is run, which logs the return code, hostname, etc. to the database.

This approach was much better and is the approach we went with.  So far I 
am very satisfied with it and it
scales to 100,000 jobs quite nicely.
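
Roughly, the tail end of each array task now looks like this (the db_logger.pl argument names shown here are only illustrative; the script name and the fields it records are the real ones):

    # Run this task's command (my_task_command is a stand-in), then record the
    # outcome in the database. The db_logger.pl arguments are illustrative,
    # not its actual interface.
    my_task_command "$SGE_TASK_ID"
    rc=$?
    db_logger.pl --job "$JOB_ID" --task "$SGE_TASK_ID" --rc "$rc" --host "$(hostname)"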

The db_logger approach still had a write loss rate of about 0.1%. It leaves behind a sqlnet log file that mentions a fatal NI error; this is related to the SQLNET INBOUND_CONNECT_TIMEOUT parameter. Basically, some jobs were failing to get a connection. We discussed this with Oracle - this parameter is "indefinite" by default on Oracle 10g, but we changed it to 5 seconds. It is not clear whether this helped. By using our whole cluster (100 nodes) but restricting to 1 slot per node, or by using fewer nodes than the whole cluster, I can sometimes get the write loss rate down to 0%.
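
For reference, the change amounts to a one-line timeout setting on the database server, something along these lines (the parameter name is the standard sqlnet.ora one and the file location varies by installation, so treat this as illustrative rather than our exact change):

    # Illustrative only: time out incomplete inbound connections after 5 seconds.
    echo "SQLNET.INBOUND_CONNECT_TIMEOUT = 5" >> $ORACLE_HOME/network/admin/sqlnet.ora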

In any case, the workflow system now falls back to a serial qsub mode at the end to clean up the remaining 0.1% of jobs. The workflow system clears error states in SGE automatically using qmod -cj, and db_logger now takes care of resubmits by exiting with 99 (the exit status that tells SGE to rerun a job) if the job has failed and has not yet reached its maximum allowed number of resubmissions. Between these different layers of protection, 10,000 or more jobs can be run with an assured 100% success rate.
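
In shell terms the resubmission check amounts to something like the following (in our setup it actually lives inside db_logger.pl, and the retry bookkeeping shown here is only a sketch; the part that matters is the exit status, since SGE re-queues a task that exits with 99):

    # Sketch: resubmit a failed task by exiting 99, up to a maximum retry count.
    if [ "$rc" -ne 0 ] && [ "$attempt" -lt "$max_resubmits" ]; then
        exit 99    # SGE re-runs a task that exits with status 99
    fi
    exit "$rc"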

We also needed stdout and stderr to be able to go to a variety of specific places, which SGE currently does not provide. So our array job script now takes a command list, an seflist, and an soflist (one line per task); for each task it runs head on each file and pipes to tail -1 to pull out the line belonging to that SGE_TASK_ID, then evals the command with stdout and stderr redirected to the specific files desired. If no file lists are given, we still use the master stdout and stderr files, with the understanding that the database is the more accurate record of the results. The qsub command uses -e /dev/null -o /dev/null.
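
Put together, the per-task wrapper amounts to something like this (file names and argument handling are illustrative; each list holds one entry per line, with line N belonging to task N):

    #!/bin/sh
    # Illustrative wrapper: pick out this task's command and output destinations.
    cmdlist=$1; soflist=$2; seflist=$3
    cmd=$(head -n "$SGE_TASK_ID" "$cmdlist" | tail -1)
    sof=$(head -n "$SGE_TASK_ID" "$soflist" | tail -1)   # stdout file for this task
    sef=$(head -n "$SGE_TASK_ID" "$seflist" | tail -1)   # stderr file for this task
    eval "$cmd" > "$sof" 2> "$sef"

and it is submitted along these lines, with SGE's own per-task output discarded since the wrapper does the redirection:

    qsub -t 1-10000 -o /dev/null -e /dev/null wrapper.sh cmdlist soflist seflist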

For 10,000 1-second jobs this process takes about 1.5 hours.  For 100,000 
jobs it takes about 15 hours.

As an aside, we discovered an interesting way to make these jobs run very, very fast on SGE. Normally these jobs produce essentially no load across the cluster, but by making a queue with 10 nodes and 100 slots we were able to submit 1000 jobs at a time, which does put a significant load on the servers. However, this method produces a write error rate of over 60% even when logging to the database, so it will not work when detailed logging is required, but it might be useful for someone who just wants to run an array job without the logging.
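
For anyone who wants to try it, the setup was along these lines (the queue name and slot count here are illustrative; qconf -mattr is one way to raise the slot count on an existing queue):

    # Illustrative only: allow far more concurrent tasks than CPUs on the queue's
    # hosts, then submit a large batch at once.
    qconf -mattr queue slots 100 burst.q
    qsub -q burst.q -t 1-1000 -o /dev/null -e /dev/null wrapper.sh cmdlist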

So, it seems we have found a fairly workable solution. I want to thank all of you again for all the input and help - it was invaluable!


Sincerely,

Shane Brubaker
Joint Genome Institute
