Application-level Checkpointing for Shared Memory Programs
Greg Bronevetsky, Daniel Marques, Martin Schulz, Peter Szwed, Keshav Pingali
Eleventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2004), Oct. 2004
Abstract:
Trends in high-performance computing are making it necessary for
long-running applications to tolerate hardware faults. The most
commonly used approach is checkpoint and restart (CPR): the state
of the computation is saved periodically on disk, and when a
failure occurs, the computation is restarted from the last saved
state. At present, it is the responsibility of the programmer to
instrument applications for CPR.
Our group is investigating the use of compiler technology to
instrument codes to make them self-checkpointing and
self-restarting, thereby providing an automatic solution to the
problem of making long-running scientific applications resilient
to hardware faults. Our previous work focused on message-passing
programs.
In this paper, we describe such a system for shared-memory
programs running on symmetric multiprocessors. This system has two
components: (i) a pre-compiler for source-to-source modification
of applications, and (ii) a runtime system that implements a
protocol for coordinating CPR among the threads of the parallel
application. For the sake of concreteness, we focus on a
non-trivial subset of OpenMP that includes barriers and locks.
One of the advantages of this approach is that the ability to
tolerate faults becomes embedded within the application itself, so
applications become self-checkpointing and self-restarting on any
platform. We demonstrate this by showing that our transformed
benchmarks can checkpoint and restart on three different platforms
(Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show
that the overhead introduced by this approach is usually quite
small; they also suggest ways in which the current implementation
can be tuned to reduce overheads further.
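
As an illustration of the basic checkpoint-and-restart cycle mentioned in the abstract (state saved periodically on disk, computation resumed from the last saved state after a failure), here is a minimal single-threaded sketch. It is not the system described in the paper: the instrumentation is written by hand rather than inserted by a pre-compiler, there is no protocol for coordinating threads, and the function name `run`, the parameters `ckpt_every` and `fail_at`, and the pickle-based state file are all assumptions made for the example.

```python
import os
import pickle


def run(total_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every ckpt_every steps.

    fail_at (hypothetical) simulates a hardware fault at the given step.
    """
    # Restart: resume from the last saved state if a checkpoint exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")

        # One unit of work.
        state["acc"] += state["step"]
        state["step"] += 1

        # Checkpoint: write to a temporary file, then rename it into place,
        # so a crash during the write cannot corrupt the previous checkpoint.
        if state["step"] % ckpt_every == 0:
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, ckpt_path)

    return state["acc"]
```

Calling `run` a second time with the same `ckpt_path` after a simulated fault repeats only the steps taken since the last checkpoint, not the whole computation.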