Ivo,

I will need to refer you to an SGE expert for this. These are
SGE-specific questions, and I don't know SGE well enough to comment.

Joe

On Wed, 2002-05-15 at 13:34, Ivo Grosse wrote:
> Hi Joe and others,
>
> in our case of running 30,000 Blast jobs on a 100-CPU cluster you
> recommended not writing the output directly to the central file
> server, but writing it to the local node and collecting it at the
> end in a non-random manner, in order to avoid NFS server hiccups
> and the like.
>
> I love that idea, but people from Germany have the strange habit of
> always trying to think of the worst possible scenario before accepting
> a new idea, so here comes a set of German questions:
>
> Assume one slave node (A) dies. I suppose that SGE will restart the
> unfinished jobs X from node A on a new node B.
>
> Question 1: Is that correct?
>
> Assume the dead node (A) comes back to life at some point.
>
> Question 2: Is SGE smart enough to notice that jobs X that were started
> before node A went down have been restarted on node B, and is SGE smart
> enough to remove the old (and useless) output of jobs X on node A?
>
> Question 3: Alternatively, can SGE be told to try to restart jobs X on
> node A after that node is back to life? How?
>
> Question 4: If the answer to Q3 is yes, can SGE restart jobs X at the
> point where they stopped, or does SGE always restart jobs from the
> beginning? I mean: does SGE support checkpointing? How?
>
> Best regards, Ivo
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> http://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Joe Landman,
email: landman@scientificappliance.com
web  : http://scientificappliance.com