[Bioclusters] SGE array job produces too many files andconcatenation is slow

Mon Jan 30 16:37:25 EST 2006

Hi Shane, Hi Joe, Hi Chris

>> We are running large numbers (10,000 to 100,000) jobs that are very 
>> short (1 second).

(KE)I concur with Joe and Chris.  By taking a "processing-centric" approach,
only data and results would need to be moved through the system and you'd
only accrue a start-up and tear-down overhead time once. 

Ditto to what Joe said.  In this case scheduling overhead is outstripping
your performance by an order of magnitude or so.  You're on the right track
with batching the jobs.  Nobody ever creates 10,000 truly unique jobs.  They
create 10,000 unique inputs.

In one case, I found it useful to impose an artificial limitation of 500
tasks for any batch of work.  I then scaled my chunk size
(granularity) to fit that.  This turned out to be much simpler than playing
the never ending "explore the limits of my queuing system" game.

>> Admittedly, one second is too short for a job and will produce a lot 
>> of overhead no matter what, but there are times when it is difficult 
>> to change our code to produce longer jobs, and we'd like to provide 
>> some facility to do this with at least minimal overhead.

(KE) If the data sets are of a similar size and use the same algorithms and
resource sets, it is possible to share the build up and tear down time so
that the only overhead you accrue is one build-up, one tear-down and then
the overhead per expansion (data expansion out to nodes) timestep.  This
would significantly reduce your overhead.  If you need a variable cluster
geometry that requires a "job-centric" approach, because the data sets are
not of a similar size and don't use the same algorithms, you'll need a
communications model that supports expansion to the maximum number of nodes
in the shortest amount of time.  Do you know what your expansion ratio is
(number of nodes/time step)?  Also, would you say that your jobs are
typically "processing-centric" or "job-centric"?

Sounds like you're looking for a way to schedule and dispatch several jobs
at once as a single unit.  You would run the jobs in the same sandbox on the
node and share the build up and tear down costs?

I've never heard of any queuing system with this functionality.

>> Also, when our file systems have more than a few thousand files in 
>> one directory things slow down tremendously, and it becomes 
>> impossible to even ls the directory.  It also can crash our file 
>> servers.  We are using NFS.

I've seen this too.  It's a real pain.  Again, the answer is to find the
limits of your system and batch things up so as to work around them.  When
dealing with EST projects, we would make directory trees such that we could
get below about 1,000 files per directory.  It added complexity, but yielded
a working system.  Most filesystems just don't stand up that sort of flat
topology.  Besides, once you conquer that, there is an endless pit of trivia
behind it, starting with "too many arguments on the command line" for all
your favorite shell tools.

Watch your inode counts.

One behavior that's bitten me more than a couple of times is the fact that
NFS servers can be overwhelmed by too many concurrent writes to the same
file.  So, one can either write a truly large number of STDERR and STDOUT
files, or else risk having jobs die on file write errors.

In theory, this is a bug in NFS, since it should scale "indefinitely."  In
reality, even a moderate size compute farm can quickly overwhelm even the
most robust filesystem.  (braces for marketing claims to the contrary)

>> However, when I went to 100,000 jobs the number of files grew faster 
>> than they could be concatenated, and the system is now slowly going 
>> through that huge directory and trying to append the smaller files, 
>> even though the array job is long since finished.
>
> ... stuff like this happens.

...

> Have a single process handle appending.  Write the append meta 
> information into a queue (a database), and have the single process 
> walk the database. This way you are updating a database and not 
> dealing with file locking issues.

The only thing I would add to this suggestion is to take a look at what
you're going to use to consume these results.  It's possible that you could
feed directly into that instead of saving intermediary files.  Ensembl does
this (database writes directly from the compute nodes as they finish their
work) and it's pretty nifty.

-Chris Dwan

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters