[Bioclusters] NCBI database download and format code
Chris Dagdigian
bioclusters@bioinformatics.org
Fri, 02 May 2003 09:29:35 -0400
Jeremy Mann wrote:
> Then how would you tell blastall which nodes have which *piece* of the
> database?
>
Depends. If all your nodes have all the pieces then you just submit
multiple blastall searches to your cluster, each search specifying only
the database segment you want to query against. Easy. The harder part is
getting the multiple responses back and merging them into something
sensible.
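
A minimal sketch of that workflow, under some loudly stated assumptions: the segment names (nt.00, nt.01) and function names are hypothetical, the searches use blastall's tabular output (-m 8), and "merging" simply means pooling hits per query and sorting by bit score (the 12th tabular column). E-values computed against one segment reflect that segment's size, so the sketch optionally passes blastall's -z (effective database length) to keep them comparable.

```python
from typing import Dict, List, Optional


def blastall_commands(query: str, fragments: List[str],
                      program: str = "blastn",
                      effective_db_size: Optional[int] = None) -> List[List[str]]:
    """Build one blastall command line per database segment."""
    commands = []
    for frag in fragments:
        cmd = ["blastall", "-p", program, "-d", frag, "-i", query,
               "-m", "8",                    # tabular output, one hit per line
               "-o", f"{frag}.hits.tsv"]     # hypothetical output naming
        if effective_db_size is not None:
            # Assumption: caller knows the size of the whole database.
            cmd += ["-z", str(effective_db_size)]
        commands.append(cmd)
    return commands


def merge_tabular_hits(outputs: List[str]) -> Dict[str, List[List[str]]]:
    """Pool -m 8 results from several segments, best bit score first."""
    merged: Dict[str, List[List[str]]] = {}
    for text in outputs:
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split("\t")
            merged.setdefault(fields[0], []).append(fields)
    for hits in merged.values():
        hits.sort(key=lambda f: float(f[11]), reverse=True)
    return merged
```

Each command list can be handed to the cluster scheduler as a separate job; once the jobs finish, read the per-segment output files and pass their contents to merge_tabular_hits.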
If your nodes do not have all the fragments on hand then you don't tell
blastall at all. Instead you tell your cluster load management system
(PBS, GridEngine, LSF, etc.) to run your searches on a specific machine, queue or
consumable/static resource. There are lots of ways to do this -- you can
manually tell GridEngine or LSF to run job X on host Y or you can make
this a bit more abstract by making your cluster job scheduler aware of
which nodes have which pieces. This can be done by configuring LSF or
GridEngine with custom static or dynamic resource attributes. Once that
is done you can tell LSF, for instance, to "run this blast job on any
machine in my cluster that has the attribute NCBI-GENBANK-PART-1 set to
'true'" and so on.
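
A rough sketch of what such a resource-based submission might look like, assuming an administrator has already defined a boolean resource per database segment on the nodes that hold it. The resource naming scheme (blastdb_nt_00) and both function names are hypothetical; underscores are used instead of hyphens on the assumption that schedulers can be picky about characters in resource names. The request syntax shown is GridEngine's "qsub -l name=value" and LSF's "bsub -R select[...]"; check your site's configuration before relying on either.

```python
from typing import List


def resource_name(fragment: str) -> str:
    """Map a segment name to a scheduler resource name (hypothetical scheme)."""
    safe = "".join(c if c.isalnum() else "_" for c in fragment)
    return f"blastdb_{safe}"


def submit_command(scheduler: str, fragment: str, job: List[str]) -> List[str]:
    """Wrap a job so it only runs on hosts advertising the segment's resource."""
    res = resource_name(fragment)
    if scheduler == "gridengine":
        # -b y: treat the job as a binary rather than a script
        return ["qsub", "-b", "y", "-l", f"{res}=true", *job]
    if scheduler == "lsf":
        return ["bsub", "-R", f"select[{res}]", *job]
    raise ValueError(f"unsupported scheduler: {scheduler}")


job = ["blastall", "-p", "blastn", "-d", "nt.00", "-i", "query.fa"]
print(" ".join(submit_command("gridengine", "nt.00", job)))
```

The point of the indirection is that the submitter names the data it needs, not the host; which machines carry the resource is kept in the scheduler configuration.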
Back in my Blackstone Computing days we had a cool solution to this
called smartcache. We basically added "data aware" scheduling
capabilities to LSF or GridEngine. The end result was that the scheduler
"knew" where the database pieces were and could allocate jobs
accordingly to the proper machine or queue.
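
To make the "data aware" idea concrete, here is a toy placement function. It is not how smartcache worked (its internals are not described here); it only illustrates the general principle of a scheduler consulting a map of which hosts hold which database pieces. The host names, segment names, and the least-loaded tie-breaking rule are all assumptions.

```python
from typing import Dict, Set


def pick_host(fragment: str,
              placement: Dict[str, Set[str]],
              running: Dict[str, int]) -> str:
    """Choose the least-loaded host that already holds the given segment."""
    candidates = [host for host, frags in placement.items() if fragment in frags]
    if not candidates:
        raise LookupError(f"no host holds {fragment}")
    # Fewest running jobs wins; host name breaks ties deterministically.
    return min(candidates, key=lambda h: (running.get(h, 0), h))


placement = {"node1": {"nt.00", "nt.01"}, "node2": {"nt.01"}}
running = {"node1": 3}
print(pick_host("nt.01", placement, running))
```

A real scheduler would also have to keep the placement map current as segments are copied or evicted, which this sketch leaves out.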
-Chris
--
Chris Dagdigian, <dag@sonsorol.org>
BioTeam Inc. - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net