On Tue, 2 Dec 2003, Ron Chen wrote:

> For those who are interested in use checkpointing, the
> place to start is to link your applications against a
> checkpointing library:

That doesn't address the challenge in the previous message:
checkpointing a pipeline.

The Scyld cluster system has built-in process checkpointing (process
migration and remote fork are implemented by checkpointing down a
socket and restarting on the remote machine) and a single cluster-wide
process space with remote signal forwarding. But even with that core
functionality, checkpointing an arbitrary process pipeline can't be
done in the general case.

> http://www.checkpointing.org/
>
> For SGE, follow the steps here:
>
> http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html

Condor implements checkpointing by using a special library that records
calls. To oversimplify: when it sees
    foofd = open("/foo"),
it remembers the path name "/foo". While this frequently works, it is
easily misled. Anonymous scratch files (open() then unlink()) and
ioctl() calls are two obvious examples.

Back to the core point: to checkpoint a pipeline, the in-pipe data has
to be throttled and drained, or extracted and stored. This goes beyond
checkpointing a single process. And a pipeline spanning machines is
even more interesting.

-- 
Donald Becker                           becker@scyld.com
Scyld Computing Corporation             http://www.scyld.com
914 Bay Ridge Road, Suite 220           Scyld Beowulf cluster system
Annapolis MD 21403                      410-990-9993
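
A minimal sketch of the two failure cases described in the message above, assuming a POSIX system and Python's standard `os` and `tempfile` modules. It is not Condor's or Scyld's code, and the function names are hypothetical. The first function shows why a remembered path name is useless for an anonymous scratch file. The second shows that data written into a pipe sits in the kernel, outside either process's memory, until something reads it out.

```python
import os
import tempfile


def anonymous_scratch_demo():
    """open() then unlink(): the descriptor keeps working, the path does not."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        fd, recorded_path = tempfile.mkstemp(dir=scratch_dir)
        os.unlink(recorded_path)  # the name is gone; the open file is not
        os.write(fd, b"intermediate results")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)  # still readable through the descriptor
        path_exists = os.path.exists(recorded_path)  # nothing left to reopen
        os.close(fd)
    return data, path_exists


def drain_pipe_demo():
    """Bytes written to a pipe stay in the kernel buffer until drained."""
    read_end, write_end = os.pipe()
    os.write(write_end, b"in flight")
    os.close(write_end)  # the writer is done, but its data is still queued
    drained = b""
    while True:
        chunk = os.read(read_end, 4096)
        if not chunk:  # an empty read means end of file: the pipe is empty
            break
        drained += chunk
    os.close(read_end)
    return drained
```

A checkpointer that saved only the recorded path from the first function would have nothing to reopen on restart. One that saved only the memory images of the two processes around the pipe would lose the bytes that the second function drains.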