[Bioclusters] SGE array job produces too many files and concatenation is slow

Tim Cutts tjrc at sanger.ac.uk
Tue Jan 31 07:18:12 EST 2006


On 30 Jan 2006, at 7:20 pm, Christopher Dwan wrote:

>>> Also, when our file systems have more than a few thousand files  
>>> in one directory things slow down tremendously, and it becomes  
>>> impossible to
>>> even ls the directory.  It also can crash our file servers.  We  
>>> are using NFS.
>
> I've seen this too.  It's a real pain.  Again, the answer is to  
> find the limits of your system and batch things up so as to work  
> around them.  When dealing with EST projects, we would make  
> directory trees such that we could get below about 1,000 files per  
> directory.  It added complexity, but yielded a working system.   
> Most filesystems just don't stand up to that sort of flat topology.   
> Besides, once you conquer that, there is an endless pit of trivia  
> behind it, starting with "too many arguments on the command line"  
> for all your favorite shell tools.
>
> Watch your inode counts.

That's a really significant issue, even if you use the sort of
directory hashing that Chris is suggesting.  The hashing is easy code
to write - I have compatible examples in Bourne shell script and perl
which use the MD5 checksum of the filename to create a directory
structure two or three levels deep; that can easily handle millions
of files, and your filesystem and/or backups will break first.
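
A rough sketch of the idea in Bourne shell (illustrative only - the
two-level layout and the md5sum call are assumptions here, not the
exact scripts referred to above):

    #!/bin/sh
    # Hash a filename into a two-level directory tree so that no
    # single directory ever has to hold more than a small fraction
    # of the files.  Prints the path to use, e.g.  d4/1d/myfile.fa

    FILE="$1"

    # MD5 of the *name*, not the contents; take the first four hex digits
    SUM=`echo "$FILE" | md5sum | cut -c1-4`
    D1=`echo "$SUM" | cut -c1-2`
    D2=`echo "$SUM" | cut -c3-4`

    mkdir -p "$D1/$D2"
    echo "$D1/$D2/$FILE"

Two hex digits per level gives you 65,536 leaf directories, so even
ten million files works out at only around 150 per directory.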

Once inode counts get huge, you will find that your backups take  
forever, and may even fail completely.

You may want to consider making the jobs completely silent, and  
storing the results in a database instead.  This is just moving the  
contention somewhere else, but database systems, if designed well,  
can handle the contention better than filesystems, in my experience.

Joe and Chris have both said it, and I'll third it: block up your
jobs until they run for a significant amount of time.  I always
suggest that our users aim for an individual job run time of at least
10 minutes.  It's not hard to do; even at the most basic level, just
create a shell script which runs the command repeatedly over 600
input files.  It won't help the NFS contention, but it will get rid
of most of the scheduler overhead.
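
For instance, something like this for an SGE array job (the file
list, block size and blast command line are illustrative, not a
recipe):

    #!/bin/sh
    #$ -S /bin/sh
    #$ -cwd
    # Each array task chews through a block of BLOCK input files, so
    # it runs for minutes rather than seconds.

    BLOCK=600
    LIST=inputs.txt    # one input filename per line

    START=`expr \( $SGE_TASK_ID - 1 \) \* $BLOCK + 1`
    END=`expr $START + $BLOCK - 1`

    sed -n "${START},${END}p" "$LIST" | while read f
    do
        blastall -p blastx -d /data/blastdb/nr -i "$f" -o "$f.out"
    done

Submit it with something like "qsub -t 1-17 run_block.sh" and you get
17 decent-length tasks instead of ten thousand ten-second ones.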

> One behavior that's bitten me more than a couple of times is the  
> fact that NFS servers can be overwhelmed by too many concurrent  
> writes to the same file.  So, one can either write a truly large  
> number of STDERR and STDOUT files, or else risk having jobs die on  
> file write errors.
>
> In theory, this is a bug in NFS, since it should scale  
> "indefinitely."  In reality, even a moderate size compute farm can  
> quickly overwhelm even the most robust filesystem.  (braces for  
> marketing claims to the contrary)

I've seen NFS break spectacularly with as few as 10 simultaneous  
clients.  Admittedly that was about 6 years ago, on the first BLAST  
cluster I ever built.  I learned the "NFS is bad" lesson very  
early.  :-)

>> Have a single process handle appending.  Write the append meta  
>> information into a queue (a database), and have the single process  
>> walk the database. This way you are updating a database and not  
>> dealing with file locking issues.

This was Incyte's approach with their pipeline for LifeSeq Foundation  
(RIP).  It worked very well, since it also got rid of the database  
contention.  However, it essentially meant they wrote their own job  
scheduler, talking to the database.  The LifeSeq Foundation pipeline  
was a beautiful piece of work, and I don't often say that about  
bioinformatics software.  It's a crying shame that it's locked away  
under tonnes of IP, and will probably never see the light of day again.
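
For illustration only (this has nothing to do with Incyte's actual
code), a single-writer consumer of that kind might look roughly like
this, with sqlite3 standing in for the queue database:

    #!/bin/sh
    # Compute jobs INSERT one row per finished result file; this one
    # process walks the queue and does all the appending, so nothing
    # else ever touches the output file.
    # Assumed table:  queue(id INTEGER PRIMARY KEY, path TEXT)

    DB=/data/results/queue.db
    OUT=/data/results/all_hits.out

    while :
    do
        ROW=`sqlite3 "$DB" "SELECT id, path FROM queue ORDER BY id LIMIT 1;"`
        if [ -z "$ROW" ]; then
            sleep 30    # queue empty; wait for more jobs to finish
            continue
        fi
        ID=`echo "$ROW" | cut -d'|' -f1`
        FILE=`echo "$ROW" | cut -d'|' -f2`
        cat "$FILE" >> "$OUT" &&
            sqlite3 "$DB" "DELETE FROM queue WHERE id = $ID;" &&
            rm -f "$FILE"
    done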

> The only thing I would add to this suggestion is to take a look at  
> what you're going to use to consume these results.  It's possible  
> that you could feed directly into that instead of saving  
> intermediary files.  Ensembl does this (database writes directly  
> from the compute nodes as they finish their work) and it's pretty  
> nifty.

Yup.  Database contention raises its head eventually, but not until
you reach the 500-node mark or so, and even then it can be managed if
you get the granularity right.  Finally, there's the Big Stick
approach: schedule jobs according to the current load on the database
they need to access, so that the cluster never fills up with jobs
waiting for the database.
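
A crude illustration of that last idea (the MySQL status check, the
host name and the threshold are all assumptions, not how Ensembl
actually schedules anything):

    #!/bin/sh
    # Before this job starts hitting the database, check how busy it
    # is and back off while too many connections are already open.

    MAXCONN=200
    while :
    do
        CONN=`mysql -h dbhost -u reader -s -N -e "SHOW STATUS LIKE 'Threads_connected';" | awk '{print $2}'`
        [ -n "$CONN" ] && [ "$CONN" -lt "$MAXCONN" ] && break
        sleep 60    # database busy; wait and try again
    done

    exec ./run_analysis "$INPUT"    # hypothetical analysis step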

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233


