Abstract
Distributed checkpointing is a key technique for providing fault tolerance in distributed systems. In today's applications, such as grid and massively parallel applications, the overhead of taking a distributed checkpoint with known approaches can often outweigh its benefits, owing to coordination among the processes and other overhead they incur. This paper presents an innovative approach to distributed checkpointing in which the checkpoints are obtained through offline analysis at the application level, so that no coordination is required during execution. After presenting the approach, we prove its safety and analyze its performance using stochastic models.
Original language | English (US) |
---|---|
Pages | 177-186 |
Number of pages | 10 |
State | Published - 2005 |
Externally published | Yes |
Event | 25th IEEE International Conference on Distributed Computing Systems, Columbus, OH, United States (Jun 6 2005 → Jun 10 2005) |
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications