Hi Shane, Hi Joe, Hi Chris >> We are running large numbers (10,000 to 100,000) jobs that are very >> short (1 second). (KE)I concur with Joe and Chris. By taking a "processing-centric" approach, only data and results would need to be moved through the system and you'd only accrue a start-up and tear-down overhead time once. Ditto to what Joe said. In this case scheduling overhead is outstripping your performance by an order of magnitude or so. You're on the right track with batching the jobs. Nobody ever creates 10,000 truly unique jobs. They create 10,000 unique inputs. In one case, I found it useful to impose an artificial limitation of 500 tasks for any batch of work. I then scaled my chunk size (granularity) to fit that. This turned out to be much simpler than playing the never ending "explore the limits of my queuing system" game. >> Admittedly, one second is too short for a job and will produce a lot >> of overhead no matter what, but there are times when it is difficult >> to change our code to produce longer jobs, and we'd like to provide >> some facility to do this with at least minimal overhead. (KE) If the data sets are of a similar size and use the same algorithms and resource sets, it is possible to share the build up and tear down time so that the only overhead you accrue is one build-up, one tear-down and then the overhead per expansion (data expansion out to nodes) timestep. This would significantly reduce your overhead. If you need a variable cluster geometry that requires a "job-centric" approach, because the data sets are not of a similar size and don't use the same algorithms, you'll need a communications model that supports expansion to the maximum number of nodes in the shortest amount of time. Do you know what your expansion ratio is (number of nodes/time step)? Also, would you say that your jobs are typically "processing-centric" or "job-centric"? Sounds like you're looking for a way to schedule and dispatch several jobs at once as a single unit. You would run the jobs in the same sandbox on the node and share the build up and tear down costs? I've never heard of any queuing system with this functionality. >> Also, when our file systems have more than a few thousand files in >> one directory things slow down tremendously, and it becomes >> impossible to even ls the directory. It also can crash our file >> servers. We are using NFS. I've seen this too. It's a real pain. Again, the answer is to find the limits of your system and batch things up so as to work around them. When dealing with EST projects, we would make directory trees such that we could get below about 1,000 files per directory. It added complexity, but yielded a working system. Most filesystems just don't stand up that sort of flat topology. Besides, once you conquer that, there is an endless pit of trivia behind it, starting with "too many arguments on the command line" for all your favorite shell tools. Watch your inode counts. One behavior that's bitten me more than a couple of times is the fact that NFS servers can be overwhelmed by too many concurrent writes to the same file. So, one can either write a truly large number of STDERR and STDOUT files, or else risk having jobs die on file write errors. In theory, this is a bug in NFS, since it should scale "indefinitely." In reality, even a moderate size compute farm can quickly overwhelm even the most robust filesystem. (braces for marketing claims to the contrary) >> However, when I went to 100,000 jobs the number of files grew faster >> than they could be concatenated, and the system is now slowly going >> through that huge directory and trying to append the smaller files, >> even though the array job is long since finished. > > ... stuff like this happens. ... > Have a single process handle appending. Write the append meta > information into a queue (a database), and have the single process > walk the database. This way you are updating a database and not > dealing with file locking issues. The only thing I would add to this suggestion is to take a look at what you're going to use to consume these results. It's possible that you could feed directly into that instead of saving intermediary files. Ensembl does this (database writes directly from the compute nodes as they finish their work) and it's pretty nifty. -Chris Dwan _______________________________________________ Bioclusters maillist - Bioclusters at bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters