Pull-based job scheduling (was: [Bioclusters] Best ways to tackle migration to dedicated cluster/Farm)

Joseph Murray bioclusters@bioinformatics.org
Sun, 28 Mar 2004 01:20:48 -0800


New to the list; my two cents.  One thing that I've ended up stitching
together is a system for managing the download, conversion, management, and
storage of biology-related data sources.  For example, we've all used
bioperl (or its kin) to grab and parse some NCBI, SwissProt and GFF data
(perhaps even from DAS), only to realize... "To make this system
'effective', I will need to more fully incorporate all the data resources
locally."  It depends on the end-user experience, but sometimes you need to
store GenBank, LocusLink, EnsEMBL, GO, and a slew of other sources locally.
(We use Oracle in-house, so a standard MySQL-to-Oracle conversion would be
helpful as well!)

Here's where a dedicated system--instead of myriad scripts--would enable
more effective ends to research means.  As far as industry goes, Lion
Bioscience has created a tool called Prisma that attempts to address this
problem, I believe, but it's not OSS, nor is it accessible to enough folks.

I know we've all stitched up systems--programmers are most often writing
the crucial "glue" tying resources together--but perhaps a "supported
system" would help us all deal with the data wrangling.  There are many
efforts to enable information access, but many are a little too
web-service based.  When you have everything from simple FASTA to ASN.1
(and all the XML in between) to deal with, a common DRM/DBMS-related
tool--sufficiently DRM/DBMS agnostic--would go a long way, IMHO.

I've seen requests concerning downloads, parses, and database loads--of
course all as cron scripts--on several lists.
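To make the pattern concrete, here is a minimal sketch of what those scattered cron scripts tend to have in common: a single download -> parse -> load pass per source.  This is illustrative only--the FASTA parser is a toy, and `fetch`/`load` are hypothetical placeholders for real NCBI/SwissProt fetchers and whatever DBMS loader a site actually uses:

```python
def parse_fasta(text):
    """Toy parser: split FASTA text into (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith('>'):
            if header is not None:
                records.append((header, ''.join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, ''.join(seq)))
    return records

def run_pipeline(fetch, parse, load):
    """One download -> parse -> load pass; cron would invoke this per
    data source.  fetch/parse/load are supplied per format, which is
    the part a 'supported system' would standardize."""
    raw = fetch()          # e.g. FTP/HTTP pull from the provider
    records = parse(raw)   # e.g. FASTA, ASN.1, or XML parser
    load(records)          # e.g. bulk insert into Oracle/MySQL
    return len(records)
```

A real system would add the pieces the glue scripts usually lack: change detection so unchanged releases aren't re-fetched, and logging of which release of each source is currently loaded.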

Thanks for any input on these issues; apologies if I'm off-topic,

AGY Therapeutics, Inc.
290 Utah Ave
South San Francisco
V: (650) 228-1146
F: (650) 228-1180

-----Original Message-----
From: bioclusters-admin@bioinformatics.org
[mailto:bioclusters-admin@bioinformatics.org] On Behalf Of Chris Dwan (CCGB)
Sent: Saturday, March 27, 2004 8:44 AM
To: 'bioclusters@bioinformatics.org'
Subject: RE: Pull-based job scheduling (was: [Bioclusters] Best ways to
tackle migration to dedicated cluster/Farm)

> I have also successfully implemented a RDBMS-based pull system that worked
> very well.

Having written one myself, I have respect for all the home-grown workload
management tools out there.  I've seen them vary from combinations of cron
and "at", with shared files for job allocation, all the way up to
RDBMS / thin-client solutions.  Mine was an rsh-based push scheduler
implemented mostly in Tcl, using flat files for state.  It kept 5 quad
processor P-II's busy pretty much full time running BLASTs for about three

Here's a question for those who have created homemade workload managers:
Would you do it again, today?  Why or why not?  Personally, I would try
every other avenue before writing another scheduler.  Home-grown systems
tend to make the developer a critical resource and a single point of
failure.  It's sort of like implementing your very own database management
system: maybe fun for the developer, but bad and wasteful for the
organization.  You can get all the power you want out of commercial and
open source solutions for DRM and DBMS problems.

That said, there are still problems where home-grown is the best way to
go.  In my opinion, one of them is stitching together compute resources
across organizational and administrative boundaries.  For which other
common tasks do people think it's still cost- and time-effective to build
homebrew solutions?  What homegrown software do you rely on, but dream of
replacing with someone else's supported code?

-Chris Dwan
Bioclusters maillist  -  Bioclusters@bioinformatics.org