[Bioclusters] SGE array job produces too many files and concatenation is slow
Shane Brubaker
brubaker2 at llnl.gov
Mon Jan 30 13:08:48 EST 2006
Hi, my name is Shane Brubaker and I work at the Joint Genome Institute.
We are facing a scalability problem with large numbers of short jobs,
involving SGE and a workflow system that we wrote.
We are running large numbers (10,000 to 100,000) of jobs that are very short
(about 1 second each). Admittedly, one second is too short for a job and will
produce a lot of overhead no matter what, but there are times when it is
difficult to change our code to produce longer jobs, and we would like to
provide some facility to run them with at least minimal overhead.
Also, when our file systems have more than a few thousand files in one
directory, things slow down tremendously and it becomes impossible to
even ls the directory. It can also crash our file servers. We are using NFS.
I have come up with a strategy of using an array job and having the
workflow system, which is written in Perl, concatenate the
smaller task files onto the end of a set of master logs and then remove the
smaller files, using system calls, as it goes. This actually worked
quite well for 10,000 jobs, keeping the directory from growing and greatly
improving performance.
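
To make that concrete, here is roughly what the cleanup pass looks like. This
is a simplified sketch: the real code uses our own file names, does more error
handling, and shells out to cat and rm rather than appending in Perl, and I am
assuming SGE's usual array-task output naming of <jobname>.o<jobid>.<taskid>
with a placeholder job name of "mytask":

    #!/usr/bin/perl
    # Simplified sketch of the cleanup pass.  Directory, log name and the
    # "mytask" job name are placeholders; the real code removes files via
    # system calls and does more error handling.
    use strict;
    use warnings;

    my $dir        = "/path/to/task/output";   # hypothetical output directory
    my $master_log = "$dir/master.log";        # single master log we append to

    opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
    # SGE array tasks normally write stdout as <jobname>.o<jobid>.<taskid>
    my @task_files = grep { /^mytask\.o\d+\.\d+$/ } readdir($dh);
    closedir($dh);

    open( my $out, '>>', $master_log ) or die "Cannot append to $master_log: $!";
    for my $file (@task_files) {
        open( my $in, '<', "$dir/$file" ) or next;   # skip anything we can't read yet
        print {$out} "==== $file ====\n";
        while ( my $line = <$in> ) {
            print {$out} $line;
        }
        close($in);
        unlink("$dir/$file");                        # remove the small file as we go
    }
    close($out);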
However, when I went to 100,000 jobs the number of files grew faster than
they could be concatenated, and the system is now slowly
going through that huge directory and trying to append the smaller files,
even though the array job is long since finished.
I am wondering if anyone has experience with this and has a recommended
solution. I am also curious if the SGE folks have any plans to
add a master log capability for array jobs. Finally, I would really
appreciate any general advice on fast ways to append files and on dealing
with large directories.
Thanks,
Shane Brubaker