AI-Ckpt: Leveraging memory access patterns for adaptive asynchronous incremental checkpointing

Bogdan Nicolae, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, which makes a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we assume that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement when compared to state-of-the-art checkpointing approaches, achieved with an extra memory requirement of less than 5% of the total application memory.
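The ordering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flush_order`, the page identifiers, and the "time of first write in the previous iteration" map are all hypothetical, and the sketch uses only past history, whereas the paper combines current and past access trends. It assumes that pages the application wrote earliest in the previous iteration are the ones most likely to be written first again, so flushing them first reduces the chance that the application has to wait on a page that is still being checkpointed.

```python
from typing import Dict, Iterable, List


def flush_order(dirty_pages: Iterable[int],
                past_first_write: Dict[int, int]) -> List[int]:
    """Return dirty pages in the order they should be flushed.

    past_first_write maps a page id to the (hypothetical) time at which
    the page was first written during the previous checkpoint interval.
    Pages with the earliest past first-write come first; pages with no
    recorded history are flushed last. Ties are broken by page id.
    """
    never = float("inf")  # sentinel for pages without history
    return sorted(dirty_pages,
                  key=lambda page: (past_first_write.get(page, never), page))


# Hypothetical history: page 7 was written first, then 9, then 3.
history = {3: 20, 7: 5, 9: 12}
order = flush_order([7, 3, 9, 4], history)
print(order)  # [7, 9, 3, 4]
```

Page 4 has no history in this example, so it is placed at the end of the flush order.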

Original language: English (US)
Title of host publication: HPDC 2013 - Proceedings of the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery
Pages: 155-166
Number of pages: 12
ISBN (Print): 9781450319102
State: Published - 2013
Event: 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013 - New York, NY, United States
Duration: Jun 17 2013 to Jun 21 2013

Publication series

Name: HPDC 2013 - Proceedings of the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing

Other

Other: 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013
Country/Territory: United States
City: New York, NY
Period: 6/17/13 to 6/21/13

Keywords

  • access pattern adaptation
  • asynchronous checkpointing
  • checkpoint restart
  • cloud computing
  • fault tolerance
  • high performance computing
  • reliability
  • scientific computing

ASJC Scopus subject areas

  • Software
