[Bioclusters] SGE array job produces too many files and concatenation is slow

Joe Landman landman at scalableinformatics.com
Mon Jan 30 13:31:45 EST 2006


Hi Shane

On Mon, 30 Jan 2006, Shane Brubaker wrote:

> Hi, my name is Shane Brubaker and I work at the Joint Genome Institute.
>
> We are facing a problem with scalability on large numbers of short jobs 
> involving SGE and a workflow system which we wrote.
>
> We are running large numbers (10,000 to 100,000) jobs that are very short (1 
> second).  Admittedly, one second is too short
> for a job and will produce a lot of overhead no matter what, but there are 
> times when it is difficult to change our code to
> produce longer jobs, and we'd like to provide some facility to do this with 
> at least minimal overhead.

Hmmm... for something like this, you will likely have 10-20 seconds of 
overhead per job.  This is not very efficient, and it will only get worse 
as the number of jobs increases.  There are ways around this ...

>
> Also, when our file systems have more than a few thousand files in one 
> directory things slow down tremendously, and it becomes impossible to
> even ls the directory.  It also can crash our file servers.  We are using 
> NFS.

This could be a significant problem.  What is the file system underlying 
the NFS server?  I am going to guess that you are experiencing lots of 
locking issues among other things.  How many nodes are in your cluster?

The fastest file system access will (with very rare exceptions) be local 
access.  The better file systems use B-tree (or similarly indexed) 
directory structures, so lookups stay fast even when a directory holds 
tens of thousands of entries.
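
For example, if your nodes have scratch disk, something along these lines
keeps the small-file traffic off NFS entirely.  Untested sketch: the paths
and the my_analysis command are placeholders, and it assumes a stock SGE
setup where TMPDIR points at per-job local scratch.

#!/usr/bin/perl
# Do all of the small-file I/O on node-local scratch, then push a single
# result file back to NFS at the end of the task.
#$ -S /usr/bin/perl
#$ -cwd
use strict;
use warnings;
use File::Copy qw(copy);

my $scratch = $ENV{TMPDIR} || "/tmp";        # SGE sets TMPDIR per job
my $task    = $ENV{SGE_TASK_ID} || 1;
my $local   = "$scratch/task.$task.out";
my $nfs_dir = "/nfs/project/results";        # placeholder shared area

# run the real work, writing only to local disk
system("my_analysis --task $task > $local") == 0
    or die "task $task failed: $?";

# one write to NFS per task instead of many small ones
copy($local, "$nfs_dir/task.$task.out")
    or die "copy back to NFS failed: $!";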

> I have come up with a strategy of using an array job and having the workflow 
> system, which is written in perl, concatenate the

An array job is still multiple jobs.  What you want to do is launch a 
worker process on each node and stream your runs through it, so you don't 
pay the startup/teardown cost for every one-second task.
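
One way to approximate that while still using an array job: make each
array task stream through a chunk of the work list, so the scheduler
overhead is paid once per few hundred runs instead of once per run.
Sketch only; the tasklist.txt format (one work item per line), the chunk
size, and the my_short_job command are stand-ins for your workflow.

#!/usr/bin/perl
# One SGE array task works through $chunk items from tasklist.txt instead
# of running a single one-second job.
#$ -S /usr/bin/perl
#$ -cwd
use strict;
use warnings;

my $chunk = 500;                             # work items per array task
my $id    = $ENV{SGE_TASK_ID} or die "not running as an SGE array task";
my ($first, $last) = (($id - 1) * $chunk + 1, $id * $chunk);

open my $list, '<', 'tasklist.txt' or die "tasklist.txt: $!";
while (my $item = <$list>) {
    next if $. < $first;                     # $. is the current line number
    last if $. > $last;
    chomp $item;
    system("my_short_job $item") == 0 or warn "failed: $item\n";
}
close $list;

Submitted as something like qsub -t 1-200 worker.pl, that covers 100,000
one-second runs with only 200 trips through the scheduler.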

Also, how many nodes are running at once?  This will be critical to the 
NFS performance.  What is the network/processor/memory config on the NFS 
server?

> smaller task files to the end of a set of master logs and then remove the 
> smaller files, using system calls, as I go.  This actually worked
> quite well for 10,000 jobs, keeping the directory from growing and greatly 
> improving performance.

Hmmm... you might want to queue up the removal, and make this process 
asynchronous and non-blocking (see the sketch further down), otherwise ...

>
> However, when I went to 100,000 jobs the number of files grew faster than 
> they could be concatenated, and the system is now slowly
> going through that huge directory and trying to append the smaller files, 
> even though the array job is long since finished.

... stuff like this happens.
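
One cheap way to get the removals out of the critical path, assuming you
can spare a "trash" directory on the same filesystem: have the workflow do
a rename() into the trash area (a fast metadata operation) and let a
separate reaper process do the slow unlinks.  The paths below are
placeholders.

#!/usr/bin/perl
# Reaper: the workflow renames finished fragments into $trash and moves
# on; this loop does the actual unlinks so the pipeline never blocks on
# them.
use strict;
use warnings;

my $trash = "/nfs/project/.trash";           # placeholder path

while (1) {
    opendir my $dh, $trash or die "$trash: $!";
    my @victims = grep { -f "$trash/$_" } readdir $dh;
    closedir $dh;

    unlink map { "$trash/$_" } @victims;     # failures just retry next pass
    sleep 30 unless @victims;                # idle politely when caught up
}

On the workflow side the existing system("rm ...") call becomes a
rename("$dir/$file", "$trash/$file"), which returns almost immediately.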

>
> I am wondering if anyone has experience with this and has a recommended 
> solution.

Yes.  I ran into this ~6 years ago with SGI GenomeCluster and then later 
with MSC.Life.  I had to make the concatenation and the temp file removal 
asynchronous (I had a queue that was processed by one or more queue 
daemons).

>  I am also curious if the SGE folks have any plans to
> add a master log capability for array jobs.  Finally, if you have any general 
> advice on fast ways to append files and ways to deal with large directories,
> I would really appreciate any advice.

Have a single process handle the appending.  Write the append metadata 
into a queue (a database), and have that single process walk the queue. 
This way you are updating a database rather than fighting file locking 
issues.
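
A minimal sketch of that, assuming DBD::SQLite is available and a queue
table like (id, src, dest, done) that the workflow fills in as each task
finishes.  The database file name and table layout are my own invention;
the point is that only this one daemon ever touches the master logs.

#!/usr/bin/perl
# The only process that appends to the master logs.  Tasks (or the
# workflow) INSERT one row per finished fragment; this loop appends the
# fragment to its master log, removes it, and marks the row done.
use strict;
use warnings;
use DBI;

# CREATE TABLE queue (id INTEGER PRIMARY KEY, src TEXT, dest TEXT,
#                     done INTEGER DEFAULT 0);
my $dbh = DBI->connect("dbi:SQLite:dbname=append_queue.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

while (1) {
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, src, dest FROM queue WHERE done = 0 ORDER BY id LIMIT 100");
    if (!@$rows) { sleep 10; next; }

    for my $row (@$rows) {
        my ($id, $src, $dest) = @$row;
        my ($in, $out);
        unless (open $in, '<', $src) {
            warn "skipping $src: $!";
            next;
        }
        open $out, '>>', $dest or die "$dest: $!";
        while (my $line = <$in>) { print {$out} $line; }
        close $out;
        close $in;
        unlink $src;                         # fragment no longer needed
        $dbh->do("UPDATE queue SET done = 1 WHERE id = ?", undef, $id);
    }
}

The workflow side then does one short INSERT per finished task instead of
an append plus an unlink over NFS while the whole cluster is hammering the
server.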

Joe

>
>
> Thanks,
> Shane Brubaker
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>

