Application-driven coordination-free distributed checkpointing

Adnan Agbaria, William H. Sanders

Research output: Contribution to conference, Paper

Abstract

Distributed checkpointing is an important concept in providing fault tolerance in distributed systems. In today's applications, e.g., grid and massively parallel applications, the imposed overhead of taking a distributed checkpoint using the known approaches can often outweigh its benefits due to coordination and other overhead from the processes. This paper presents an innovative approach for distributed checkpointing. In this approach, the checkpoints are obtained using offline analysis based on the application level. During execution, no coordination is required. After presenting our approach, we prove its safety and present a performance analysis of it using stochastic models.

Original language: English (US)
Pages: 177-186
Number of pages: 10
State: Published - Jan 1 2005
Event: 25th IEEE International Conference on Distributed Computing Systems - Columbus, OH, United States
Duration: Jun 6 2005 to Jun 10 2005

Other

Other: 25th IEEE International Conference on Distributed Computing Systems
Country: United States
City: Columbus, OH
Period: 6/6/05 to 6/10/05

Fingerprint

Stochastic models
Fault tolerance

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Agbaria, A., & Sanders, W. H. (2005). Application-driven coordination-free distributed checkpointing. 177-186. Paper presented at 25th IEEE International Conference on Distributed Computing Systems, Columbus, OH, United States.

@conference{42328615ab5148ca8c9c1f28436140c7,
title = "Application-driven coordination-free distributed checkpointing",
abstract = "Distributed checkpointing is an important concept in providing fault tolerance in distributed systems. In today's applications, e.g., grid and massively parallel applications, the imposed overhead of taking a distributed checkpoint using the known approaches can often outweigh its benefits due to coordination and other overhead from the processes. This paper presents an innovative approach for distributed checkpointing. In this approach, the checkpoints are obtained using offline analysis based on the application level. During execution, no coordination is required. After presenting our approach, we prove its safety and present a performance analysis of it using stochastic models.",
author = "Adnan Agbaria and Sanders, {William H.}",
year = "2005",
month = "1",
day = "1",
language = "English (US)",
pages = "177--186",
note = "25th IEEE International Conference on Distributed Computing Systems ; Conference date: 06-06-2005 Through 10-06-2005",

}

TY - CONF

T1 - Application-driven coordination-free distributed checkpointing

AU - Agbaria, Adnan

AU - Sanders, William H.

PY - 2005/1/1

Y1 - 2005/1/1

N2 - Distributed checkpointing is an important concept in providing fault tolerance in distributed systems. In today's applications, e.g., grid and massively parallel applications, the imposed overhead of taking a distributed checkpoint using the known approaches can often outweigh its benefits due to coordination and other overhead from the processes. This paper presents an innovative approach for distributed checkpointing. In this approach, the checkpoints are obtained using offline analysis based on the application level. During execution, no coordination is required. After presenting our approach, we prove its safety and present a performance analysis of it using stochastic models.

UR - http://www.scopus.com/inward/record.url?scp=27944506939&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=27944506939&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:27944506939

SP - 177

EP - 186

ER -