[Bioclusters] SGE array job produces too many files and

Adam S. Moskowitz adamm at menlo.com
Mon Jan 30 18:45:18 EST 2006


I can't help with the SGE or job side of things, but I do know a bit
about NFS . . .

> Also, when our file systems have more than a few thousand files in one 
> directory things slow down tremendously, and it becomes impossible to
> even ls the directory. It also can crash our file servers. We are using NFS.

NFS is far from the world's most efficient protocol, and it performs
worst with lots of small files. Directory operations on *nix
filesystems also tend to be a bottleneck, so you're hitting the two
weakest spots of a fileserver at once. Several people have already made
good suggestions for getting around it, but if you can't do that . . .

One thing that can help is to optimize the directory operations. For
example, process files one at a time rather than reading all the file
names into an array before processing them. Another is to not have ls
sort the entries; some versions of ls have a switch for this (-f on
most, -U on GNU ls), and if yours doesn't, it's a very short C program
to write.
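
To illustrate both suggestions, here is a minimal sketch in Python
rather than C (a C version would loop over opendir()/readdir() in the
same way). It walks a directory in whatever order the filesystem
returns entries, with no sorting, and hands each file to a callback as
it goes instead of building a list first. The names process_entries
and handle are made up for this example.

```python
import os


def process_entries(directory, handle):
    """Call handle(path) on each regular file in directory, unsorted.

    os.scandir() yields entries lazily in directory order, so nothing
    is sorted and the full list of names is never held in memory.
    Returns the number of files handled.
    """
    count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_file() can usually be answered from the directory entry
            # itself, but on some filesystems it costs an extra stat call.
            if entry.is_file(follow_symlinks=False):
                handle(entry.path)
                count += 1
    return count
```

Something like process_entries("/path/to/output", print) would then
behave roughly like an unsorted ls, one name at a time.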

As for your fileserver crashing, well, my standard answer to that is to
buy a NetApp box. Even as far back as 10 years ago I had NetApp boxes
handling over 75,000 files per directory without a single crash. No,
they're not cheap, but if you can't get around storing lots of little
files, a NetApp box will help.

If you can't buy a NetApp, consider switching the filesystem on your NFS
servers. Some filesystems are better than others at small files and/or
large directories; I haven't kept up with which ones do best for which
types of operations, but you may be able to gain back some speed that
way (at the cost of a simple back-up and restore).

Finally, lots of little files in a single directory is often a good
indication that you should be using a database. Since it sounds like
you're going to change your program anyway, consider that when doing so.
If you don't need SQL, look at Berkeley DB -- it can be *very* fast, and
it is fairly simple to use.
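
As a rough sketch of the key/value idea, the following uses the dbm
module from Python's standard library as a stand-in; it is not
Berkeley DB itself, though the interface is similar in spirit. Each
small output that would have been its own file becomes one record in
a single database. The function names and the task-style keys are
hypothetical, and the sketch assumes a single writer; many array
tasks writing to the same database at once would need more care.

```python
import dbm


def store_results(db_path, results):
    """Write each name/value pair as one record in a single database."""
    # "c" opens the database for read/write, creating it if needed.
    with dbm.open(db_path, "c") as db:
        for key, value in results.items():
            db[key.encode()] = value.encode()


def load_result(db_path, key):
    """Fetch one record by key without touching any directory listing."""
    with dbm.open(db_path, "r") as db:
        return db[key.encode()].decode()
```

A collector step could call store_results("results", {"task.1": "..."})
once per batch, and later readers would look up individual records
with load_result("results", "task.1").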

AdamM

