Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs
Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, Paul Stodghill
Supercomputing 2004, Nov. 2004
Abstract:
The running times of many computational science applications are
much longer than the mean-time-to-failure of current
high-performance computing platforms. To run to completion,
such applications must tolerate hardware failures.
Checkpoint-and-restart (CPR) is the most commonly used
scheme for accomplishing this: the state of the computation is
saved periodically on stable storage, and when a hardware failure
is detected, the computation is restarted from the most recently
saved state. Most automatic CPR schemes in the literature can be
classified as system-level checkpointing schemes because they take
core-dump style snapshots of the computational state when all the
processes are blocked at global barriers in the program.
Unfortunately, a system that implements this style of
checkpointing is tied to a particular platform; in addition, it
cannot be used if there are no global barriers in the program.
We are exploring an alternative called
application-level, non-blocking checkpointing. In our approach,
programs are transformed by a pre-processor so that they become
self-checkpointing and self-restartable on any platform; there is
also no assumption about the existence of global barriers in the
code. In this paper, we describe our implementation of
application-level, non-blocking checkpointing. We present
experimental results on both a Windows cluster and a Compaq Alpha
cluster, which show that the overheads introduced by our approach
are small.
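
The basic checkpoint-and-restart idea described in the abstract (save state periodically on stable storage, restart from the most recent save) can be illustrated with a minimal single-process sketch. This is not the paper's implementation: it omits the pre-processor and the non-blocking coordination across MPI processes entirely. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a failure, and the choice of `pickle` for serialization are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path.

    The rename means a crash during the write never corrupts the
    previous checkpoint. (Hypothetical helper, not from the paper.)
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    The application saves its own state (loop counter and accumulator)
    rather than relying on a core-dump style snapshot. fail_at is an
    illustrative hook that simulates a hardware failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same `path` after a failure resumes from the last saved step instead of from step 0, so only the work done since the most recent checkpoint is repeated.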