Rebound: Scalable checkpointing for coherent shared memory

Rishi Agarwal, Pranav Garg, Josep Torrellas

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
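
The idea of building checkpoint and rollback operations around groups of communicating processors can be illustrated with a small software model. The sketch below is not the paper's hardware mechanism: the class `DependenceTracker`, its methods, and the processor numbering are all hypothetical, and it assumes the standard consistency rule for coordinated checkpointing (a processor that checkpoints pulls in the processors whose data it consumed, and a processor that rolls back pulls in the processors that consumed its data).

```python
from collections import defaultdict


class DependenceTracker:
    """Toy model of inter-thread dependence tracking (hypothetical names)."""

    def __init__(self):
        # consumer -> processors whose data it has read
        self.producers = defaultdict(set)
        # producer -> processors that have read its data
        self.consumers = defaultdict(set)

    def record_read(self, producer, consumer):
        """Note that `consumer` read data last written by `producer`."""
        if producer != consumer:
            self.producers[consumer].add(producer)
            self.consumers[producer].add(consumer)

    @staticmethod
    def _closure(start, edges):
        """Return every processor reachable from `start` along `edges`."""
        seen, stack = {start}, [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def checkpoint_set(self, initiator):
        """Assumed rule: the initiator checkpoints with its transitive producers."""
        return self._closure(initiator, self.producers)

    def rollback_set(self, faulty):
        """Assumed rule: a faulty processor rolls back with its transitive consumers."""
        return self._closure(faulty, self.consumers)


tracker = DependenceTracker()
tracker.record_read(producer=0, consumer=1)  # P1 reads data written by P0
tracker.record_read(producer=1, consumer=2)  # P2 reads data written by P1
tracker.record_read(producer=5, consumer=6)  # an unrelated pair

print(sorted(tracker.checkpoint_set(2)))  # [0, 1, 2]
print(sorted(tracker.rollback_set(0)))    # [0, 1, 2]
print(sorted(tracker.rollback_set(5)))    # [5, 6]
```

In this model, processors 5 and 6 never take part in a checkpoint or rollback started by processors 0 to 2, which is the property that lets a local scheme avoid global operations.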

Original language: English (US)
Title of host publication: Proceeding of the 38th Annual International Symposium on Computer Architecture, ISCA'11
Pages: 153-164
Number of pages: 12
State: Published - 2011
Event: 38th Annual International Symposium on Computer Architecture, ISCA'11 - San Jose, CA, United States
Duration: Jun 4 2011 → Jun 8 2011

Publication series

Name: Proceedings - International Symposium on Computer Architecture
ISSN (Print): 1063-6897

Other

Other: 38th Annual International Symposium on Computer Architecture, ISCA'11
Country/Territory: United States
City: San Jose, CA
Period: 6/4/11 → 6/8/11

Keywords

  • Faults
  • Scalable checkpointing
  • Shared-memory multiprocessors

ASJC Scopus subject areas

  • Hardware and Architecture
