[Bioclusters] OpenPBS problems
Donald Becker
bioclusters@bioinformatics.org
Tue, 2 Dec 2003 20:06:35 -0500 (EST)
On Tue, 2 Dec 2003, Ron Chen wrote:
> For those who are interested in using checkpointing, the
> place to start is to link your applications against a
> checkpointing library:
That doesn't address the challenge in the previous message:
checkpointing a pipeline. The Scyld cluster system has built-in
process checkpointing (process migration and remote fork are
implemented by checkpointing down a socket and restarting on the remote
machine) and a single cluster-wide process space with remote signal
forwarding. But even with that core functionality, an arbitrary
process pipeline can't be checkpointed in the general case.
> http://www.checkpointing.org/
>
> For SGE, follow the steps here:
>
> http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html
Condor implements checkpointing by using a special library that records
calls. To oversimplify: when it sees foofd = open("/foo"), it
remembers the path name "/foo". While this frequently works, it can be
easily misled. Anonymous scratch files (open() then unlink()) and
ioctl() calls are two obvious examples.
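To make the scratch-file case concrete, here is a minimal sketch in
Python on a POSIX system. It is not Condor's code; the function name
anonymous_scratch_demo is made up for illustration. It shows that once a
file is unlinked, the descriptor still works but the recorded path name
no longer refers to anything, so reopening by path on restart cannot
recover the data.

```python
import os
import tempfile


def anonymous_scratch_demo():
    """Create a scratch file, unlink it, and keep using the descriptor.

    Hypothetical helper for illustration; assumes POSIX unlink semantics.
    """
    fd, path = tempfile.mkstemp()
    os.unlink(path)                  # the name is gone, the open file is not
    os.write(fd, b"scratch data")    # the descriptor is still fully usable
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 100)
    path_still_exists = os.path.exists(path)
    os.close(fd)                     # the storage is released only here
    return path, data, path_still_exists


if __name__ == "__main__":
    path, data, exists = anonymous_scratch_demo()
    print("recorded path:", path)
    print("data read back through the descriptor:", data)
    print("path still exists:", exists)
```

A library that only remembered the path at open() time would have
nothing valid to reopen here; the file contents would have to be copied
into the checkpoint itself.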
Back to the core point: to checkpoint a pipeline, the in-pipe data has to
be either throttled and drained, or extracted and stored.
This goes beyond checkpointing a single process. And a pipeline
spanning machines is even more interesting.
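As a rough illustration of the "extracted and stored" option, here is a
Python sketch for a single local pipe on a POSIX system. The names
drain_pipe and checkpoint_and_restore_demo are hypothetical, and the
sketch assumes the writer has already been stopped and that the saved
data fits in the kernel's pipe buffer. It is not how Scyld or Condor do
it, and it sidesteps the hard parts (stopping an arbitrary writer, and
pipes that cross machines).

```python
import os


def drain_pipe(read_fd):
    """Read whatever is currently buffered in the pipe without blocking."""
    os.set_blocking(read_fd, False)
    chunks = []
    while True:
        try:
            chunk = os.read(read_fd, 4096)
        except BlockingIOError:
            break                    # pipe is empty, writer still has it open
        if not chunk:
            break                    # end of file: writer closed its end
        chunks.append(chunk)
    os.set_blocking(read_fd, True)
    return b"".join(chunks)


def checkpoint_and_restore_demo():
    """Extract in-flight pipe data, then re-inject it into a fresh pipe."""
    r, w = os.pipe()
    os.write(w, b"in-flight bytes")  # written by the producer, not yet read
    # Assumption: the producer is stopped at this point, so no new data arrives.
    saved = drain_pipe(r)            # this is what the checkpoint must store
    os.close(r)
    os.close(w)
    # On restart: build a new pipe and put the saved bytes back first.
    r2, w2 = os.pipe()
    os.write(w2, saved)              # assumes saved fits in the pipe buffer
    restored = os.read(r2, len(saved))
    os.close(r2)
    os.close(w2)
    return saved, restored


if __name__ == "__main__":
    saved, restored = checkpoint_and_restore_demo()
    print("saved:", saved)
    print("restored:", restored)
```

Even in this toy case the checkpoint has to carry state that belongs to
neither process, which is why this goes beyond single-process
checkpointing.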
--
Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993