[Bioclusters] OpenPBS problems

Donald Becker bioclusters@bioinformatics.org
Tue, 2 Dec 2003 20:06:35 -0500 (EST)


On Tue, 2 Dec 2003, Ron Chen wrote:

> For those who are interested in using checkpointing, the
> place to start is to link your applications against a
> checkpointing library:

That doesn't address the challenge in the previous message:
checkpointing a pipeline.  The Scyld cluster system has built-in
process checkpointing (process migration and remote fork are
implemented by checkpointing down a socket and restarting on the remote
machine) and a single cluster-wide process space with remote signal
forwarding.  But even with that core functionality, an arbitrary
process pipeline can't be checkpointed in the general case.

> http://www.checkpointing.org/
> 
> For SGE, follow the steps here:
> 
> http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html

Condor implements checkpointing by using a special library that records
calls.  To oversimplify: when it sees foofd = open("/foo"), it
remembers the path name "/foo".  While this frequently works, it is
easily misled.  Anonymous scratch files (open() then unlink()) and
ioctl() calls are two obvious examples.
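
To make the scratch-file case concrete, here is a minimal sketch in
Python on a POSIX system.  It is not Condor's code: the function names
are made up for illustration, and "reopen the recorded path" is only the
oversimplified restart strategy described above.

```python
import os
import tempfile


def make_anonymous_scratch(directory):
    """Create a scratch file, then unlink it while keeping the descriptor open."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.unlink(path)  # the name is gone; the data lives until fd is closed
    return fd, path


def reopen_by_recorded_path(path):
    """Hypothetical restart step: reopen the path name remembered at open() time."""
    try:
        return os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return None  # nothing left on disk under that name


def demo_scratch_file():
    with tempfile.TemporaryDirectory() as scratch_dir:
        fd, recorded_path = make_anonymous_scratch(scratch_dir)
        os.write(fd, b"intermediate results")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)  # the running process can still read its data
        restored = reopen_by_recorded_path(recorded_path)  # a restart cannot
        os.close(fd)
        return data, restored
```

The running process reads its data back through the open descriptor,
but a restart that only knows the path name has nothing to reopen.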

Back to the core point: to checkpoint a pipeline, the in-pipe data has
to be either
   throttled and drained, or
   extracted and stored.
This goes beyond checkpointing a single process.  And a pipeline
spanning machines is even more interesting.
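
As a rough illustration of the "extracted and stored" option, the
sketch below (Python, POSIX, a single machine, one pipe) pulls the
unread bytes out of a pipe and later writes them back.  The helper
names are hypothetical, and it assumes the producer has already been
stopped, which is exactly the part that is hard for an arbitrary
pipeline.

```python
import os


def drain_pipe(read_fd):
    """Read whatever is currently buffered in the pipe without blocking."""
    os.set_blocking(read_fd, False)
    chunks = []
    while True:
        try:
            chunk = os.read(read_fd, 4096)
        except BlockingIOError:
            break  # pipe is empty for now
        if not chunk:
            break  # EOF: every writer has closed its end
        chunks.append(chunk)
    os.set_blocking(read_fd, True)
    return b"".join(chunks)


def demo_in_flight():
    read_fd, write_fd = os.pipe()
    # Producer output that the consumer has not read yet.  These bytes sit
    # in the kernel pipe buffer, not in either process's memory image.
    os.write(write_fd, b"record 1\nrecord 2\n")
    # Checkpoint side (producer assumed throttled): extract and store.
    saved = drain_pipe(read_fd)
    # Restart side: put the stored bytes back before the consumer resumes.
    os.write(write_fd, saved)
    replayed = os.read(read_fd, 4096)
    os.close(read_fd)
    os.close(write_fd)
    return saved, replayed
```

Checkpointing the producer and the consumer separately would miss those
buffered bytes entirely, since they belong to neither process image.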


-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993