A case for two-level distributed recovery schemes

Nitin H. Vaidya

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate arbitrary number of failures. In this paper, we demonstrate that, it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve lower performance overhead (on average) as compared to existing recovery schemes. To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while multiple failure recovery is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that to minimize the average overhead, it is often necessary to take both 1-checkpoints and N-checkpoints. While the conclusions of this paper are intuitive, the work on design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance.

Original languageEnglish (US)
Title of host publicationProceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1995/PERFORMANCE 1995
EditorsBlaine D. Gaither
PublisherAssociation for Computing Machinery
Pages64-73
Number of pages10
ISBN (Electronic)0897916956, 9780897916950
DOIs
StatePublished - May 1 1995
Externally publishedYes
Event1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1995/PERFORMANCE 1995 - Ottawa, Canada
Duration: May 15 1995May 19 1995

Publication series

NameProceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1995/PERFORMANCE 1995

Other

Other1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1995/PERFORMANCE 1995
Country/TerritoryCanada
CityOttawa
Period5/15/955/19/95

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Computer Networks and Communications
  • Computational Theory and Mathematics

Fingerprint

Dive into the research topics of 'A case for two-level distributed recovery schemes'. Together they form a unique fingerprint.

Cite this