A case for two-level recovery schemes

Nitin H. Vaidya

Research output: Contribution to journal, Article, peer-review

Abstract

Long-running applications are often subject to failures. Failures can result in significant loss of computation. Therefore, it is necessary to use a failure recovery scheme to minimize performance overhead in the presence of failures. In this paper, we argue that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may incur a higher overhead. By minimizing overhead for the more frequently occurring failure scenarios, the two-level approach can achieve lower average performance overhead than existing recovery schemes. The paper describes two two-level recovery schemes. Performance analysis using a Markov chain shows that, in practice, a two-level scheme can perform better than its "one-level" counterpart. While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance and achieve better performance than existing recovery schemes. The paper presents an analytical approach for evaluating the performance of two-level schemes and shows that such schemes are hard to optimize analytically.

Original language: English (US)
Pages (from-to): 656-666
Number of pages: 11
Journal: IEEE Transactions on Computers
Volume: 47
Issue number: 6
State: Published - Dec 1 1998

Keywords

  • Checkpointing and rollback
  • Failure recovery
  • Markov chains
  • Performance analysis
  • Recovery overhead

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
