[Bioclusters] Daemonizing blast, ie running many sequences through 1 process

Tim Cutts bioclusters@bioinformatics.org
Fri, 7 Nov 2003 15:31:47 +0000


On 07-Nov-03, Chris Dwan (CCGB) wrote:
> It may be that my experience with Solaris is out of date, or that I failed
> to properly parameterize it, but I remember there being a limit on the
> volume of data that CacheFS would accept (the cache size, as it were).
> That limit was well below the size of any of the larger target sets we
> deal with, so using cachefs as a solution to data staging led to
> thrashing, particularly when we started splitting up the targets to better
> parallelize our searches.
> 
> I'm curious to know if this is still the case.
> 
> Of course, a truly brilliant resource scheduler would take into account
> the contents of the file cache when deciding where to run a particular
> job...

Quite.  CacheFS seems a bit pointless; the OS usually caches disk access
anyway.  I have to say we've always gone with distributing the data set
to all the machines; NFS, or relying on caching at all, only helps if
users arrange their work in a way that takes advantage of caching, and
in my experience that's not the case.  They tend to do this:

# one blastall run per sequence/database pair, with sequences outermost
foreach my $seq (@sequence) {
    foreach my $db (@dbs) {
        system('blastall', '-d', $db, '-i', $seq);   # other flags elided
    }
}

which totally wrecks caching, rather than:

# one pass per database, running every sequence before moving on
foreach my $db (@dbs) {
    foreach my $seq (@sequence) {
        system('blastall', '-d', $db, '-i', $seq);   # other flags elided
    }
}

which we all know runs much more efficiently, since each database only
has to be read from disk (or over the network) once and then stays in
the cache for the remaining sequences (especially important on sites
with blastable databases on shared storage).
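
Better still, drop the inner loop altogether: blastall will take a
multi-FASTA query file, so you can bundle all the sequences up and make
a single blastall call per database, which is what the subject line is
really getting at.  A rough sketch of the idea; the file names, database
names and flags below are only placeholders, not anything we actually
run here:

#!/usr/bin/perl
use strict;
use warnings;

my @dbs     = qw(swissprot trembl);    # placeholder database names
my @queries = glob('queries/*.fa');    # placeholder query files

# Bundle every query sequence into one multi-FASTA file.
system("cat @queries > all_queries.fa") == 0
    or die "couldn't build the query file: $?";

# One blastall process per database; blastall iterates over the query
# sequences itself, so each database is paged into the cache only once.
foreach my $db (@dbs) {
    system('blastall', '-p', 'blastp',
           '-d', $db,
           '-i', 'all_queries.fa',
           '-o', "$db.out") == 0
        or warn "blastall against $db failed: $?";
}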

Tim

-- 
Dr Tim Cutts
Informatics Systems Group
Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK