On 30 Jan 2006, at 7:20 pm, Christopher Dwan wrote:

>>> Also, when our file systems have more than a few thousand files
>>> in one directory things slow down tremendously, and it becomes
>>> impossible to even ls the directory. It also can crash our file
>>> servers. We are using NFS.
>
> I've seen this too. It's a real pain. Again, the answer is to
> find the limits of your system and batch things up so as to work
> around them. When dealing with EST projects, we would make
> directory trees such that we could get below about 1,000 files per
> directory. It added complexity, but yielded a working system.
> Most filesystems just don't stand up to that sort of flat topology.
> Besides, once you conquer that, there is an endless pit of trivia
> behind it, starting with "too many arguments on the command line"
> for all your favorite shell tools.
>
> Watch your inode counts.

That's a really significant issue, even if you use the sort of directory hashing that Chris is suggesting (and it's easy code to write - I have compatible examples in Bourne shell script and Perl which use the MD5 checksum of the filename to create a directory structure two or three levels deep; that can easily handle millions of files - your filesystem and/or backups will break first).

Once inode counts get huge, you will find that your backups take forever, and may even fail completely. You may want to consider making the jobs completely silent and storing the results in a database instead. This just moves the contention somewhere else, but in my experience a well-designed database system can handle the contention better than a filesystem can.

Joe and Chris have both said it, and I will third it: block up your jobs until they run for a significant amount of time. I always suggest that our users aim for an individual job run time of at least 10 minutes. It's not hard to do; even at the most basic level, just create a shell script which runs the command repeatedly on 600 input files.
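The directory-hashing scheme mentioned above really is easy code to write. A minimal Bourne shell sketch along those lines (illustrative only, not the script referred to above; it assumes `md5sum` from GNU coreutils - on BSD systems the command is `md5` and the plumbing differs slightly):

```shell
#!/bin/sh
# Hash a *filename* (not the file contents) into a two-level
# directory tree, so no single directory accumulates more than a
# few thousand entries. Sketch only; assumes GNU md5sum.

hashed_path() {
    name=$1
    sum=$(printf '%s' "$name" | md5sum | cut -c1-32)
    d1=$(printf '%s' "$sum" | cut -c1-2)   # first hex byte -> level 1
    d2=$(printf '%s' "$sum" | cut -c3-4)   # second hex byte -> level 2
    printf '%s/%s/%s\n' "$d1" "$d2" "$name"
}

# Move a file into its hashed location under $2.
store() {
    src=$1; root=$2
    dest="$root/$(hashed_path "$(basename "$src")")"
    mkdir -p "$(dirname "$dest")"
    mv "$src" "$dest"
}
```

Two hex characters per level gives 256 subdirectories per level, so 65,536 leaf directories; a million files averages around 15 per directory, comfortably below the point where NFS and `ls` start to hurt.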
Batching your jobs won't help the NFS contention, but it will get rid of the scheduler overhead to a great extent.

> One behavior that's bitten me more than a couple of times is the
> fact that NFS servers can be overwhelmed by too many concurrent
> writes to the same file. So, one can either write a truly large
> number of STDERR and STDOUT files, or else risk having jobs die on
> file write errors.
>
> In theory, this is a bug in NFS, since it should scale
> "indefinitely." In reality, even a moderate size compute farm can
> quickly overwhelm even the most robust filesystem. (braces for
> marketing claims to the contrary)

I've seen NFS break spectacularly with as few as 10 simultaneous clients. Admittedly that was about six years ago, on the first BLAST cluster I ever built, so I learned the "NFS is bad" lesson very early. :-)

>> Have a single process handle appending. Write the append meta
>> information into a queue (a database), and have the single process
>> walk the database. This way you are updating a database and not
>> dealing with file locking issues.

This was Incyte's approach with their pipeline for LifeSeq Foundation (RIP). It worked very well, since it also got rid of the database contention. However, it essentially meant they wrote their own job scheduler, talking to the database. The LifeSeq Foundation pipeline was a beautiful piece of work, and I don't often say that about bioinformatics software. It's a crying shame that it's locked away under tonnes of IP and will probably never see the light of day again.

> The only thing I would add to this suggestion is to take a look at
> what you're going to use to consume these results. It's possible
> that you could feed directly into that instead of saving
> intermediary files. Ensembl does this (database writes directly
> from the compute nodes as they finish their work) and it's pretty
> nifty.

Yup.
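The single-appender pattern is simple to sketch. Purely for illustration, a spool directory stands in here for the database queue described above (my simplification, to keep the sketch self-contained): each job drops its result as a uniquely named file, nobody ever appends to a shared file, and one consumer process does all the appending.

```shell
#!/bin/sh
# Single-appender sketch: jobs never write to a shared file.
# Each job deposits its output as a uniquely named file in a spool
# directory; exactly one consumer process folds the spool into the
# master results file. The spool directory is a stand-in for the
# database queue discussed above.

SPOOL=${SPOOL:-/tmp/results.spool}
MASTER=${MASTER:-/tmp/results.out}

# Called by each job, reading the result from stdin. mktemp gives a
# collision-free name; the final rename is atomic within a filesystem,
# so the consumer never sees a half-written file.
enqueue() {
    mkdir -p "$SPOOL/tmp" "$SPOOL/new"
    t=$(mktemp "$SPOOL/tmp/job.XXXXXX")
    cat > "$t"
    mv "$t" "$SPOOL/new/$(basename "$t")"
}

# The one and only appender: walk the spool, append, delete.
drain() {
    for f in "$SPOOL/new"/job.*; do
        [ -f "$f" ] || continue
        cat "$f" >> "$MASTER" && rm -f "$f"
    done
}
```

The tmp/new two-directory dance is the same trick Maildir uses; with a real database in place of the spool, `enqueue` becomes an INSERT and `drain` becomes the single process walking the table.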
Database contention raises its head eventually, but not until you reach the 500-node mark or so, and even then, if you get the granularity right, it can be managed. And finally you can use the Big Stick approach: schedule jobs according to the current load on the database they need to access, so that it never causes the cluster to fill with jobs waiting for the database.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233