[Bioclusters] Is the "OR" job dependency useful??

Malay mbasu at mail.nih.gov
Fri Jan 7 13:46:13 EST 2005

Tim Cutts wrote:
> On 6 Jan 2005, at 5:49 pm, Malay wrote:
>> Rayson Ho wrote:
>>> Gridengine currently has the "AND" operator job dependency:
>>> A,B -> C
>>> i.e. we need to wait for jobs A and B to finish before we start job C.
>>> There are discussions on the SGE dev mailing list about adding the OR
>>> job dependency:
>>> A|B -> C
>>> So job C will start as soon as job A or job B finishes.
>>> I am wondering if this is useful in bioinformatics job flows??
>> As far as bioinformatics goes I am afraid most of the bioinformatics 
>> applications are embarrassingly independent :) Such dependency 
>> resolution issues will have their niche applications, but I guess they 
>> are very limited as far as bioinformatics goes.
> I don't think that's true - when you consider something like a gene 
> annotation process, there are lots of dependencies.  Consider what goes 
> on with Ensembl; before any analyses are performed, the sequences have 
> to be dusted and RepeatMasked.  After that raw features such as blast 
> hits, ab initio gene predictors and EST alignments can be calculated.  
> Once the BLAST hits have been done, genewise alignments can be performed 
> (using the BLAST results to narrow down the areas genewise needs to 
> analyse). Only once the EST alignments, ab initio predictors and 
> genewise are complete can the code be run to combine these into a 
> coherent set of gene structures.

A pipeline of any kind by its nature depends on the previous process.

A -> B -> C

I don't understand what you mean by jobs here. These rules can't be
hardcoded in the scheduler, or can they?

In bioinformatics each of these steps is actually not a job at all; they
are what are called "steps". Each of these steps, like A, is composed of
1,000,000 BLAST jobs which have no dependency on each other.
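
To make the step/job distinction concrete, here is a minimal sketch in
Python. It assumes a "step" is just a batch of independent jobs, and that
the next step may only start once every job in the previous step has
returned. The names run_step, fake_blast and fake_genewise are
hypothetical stand-ins, not part of any real pipeline or scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(job, inputs, workers=4):
    """Run one independent job per input; return only when all have finished."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in input order; list() waits for every job.
        return list(pool.map(job, inputs))


# Hypothetical stand-ins for the real analyses.
def fake_blast(seq):
    return f"hits({seq})"


def fake_genewise(hit):
    return f"genes({hit})"


def run_pipeline(seqs):
    # Step A: many BLAST jobs with no dependency on each other.
    hits = run_step(fake_blast, seqs)
    # Step B starts only after the whole of step A is complete.
    return run_step(fake_genewise, hits)
```

Within a step the jobs can run in any order; the only ordering is
between steps.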

> Although each of these processes consists of thousands of independent 
> jobs, each type of analysis is dependent on the completion of the 
> previous ones.

As I said. But are you actually suggesting completing a "job" pipeline
before a "step" pipeline? Do you actually carry out the analysis of a
small region of genome sequence and finish it to the end, or finish the
BLAST searches for the whole genome at one time?

> As it happens, all of these dependencies are handled in the Ensembl 
> RuleManager rather than by the scheduling system.

That's what I meant! The whole dependency issue is in user space, and can
be very well maintained by user software. In the software world,
"unnecessary" means "the thing can be managed in an easier way".

> They're all AND dependencies as far as I can tell, and I've never needed 
> anything other than AND dependencies in my own pipelines, but I wouldn't 
> like to claim that OR dependencies aren't useful to someone.

You are an expert, Tim. But the majority of cluster users are not like
you, doing genome pipelines. While I can't speak for all of them,
what I can say is that I have never used any dependency resolution system
on any scheduler so far. I never felt the need for it. All the rules I
made are in the software. But maybe I am stretching my own experience to
others.

