Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales

Sheng Di, Leonardo Bautista-Gomez, Franck Cappello

Research output: Contribution to journal › Conference article › peer-review

Abstract

Future extreme-scale systems are expected to experience different types of failures affecting applications at different failure scales, from transient uncorrectable memory errors in processes to massive system outages. In this paper, we propose a multilevel checkpoint model that takes into account uncertain execution scales (different numbers of processes/cores). The contribution is threefold: (1) we provide an in-depth analysis of why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously, (2) we devise a novel method that can quickly obtain an optimized solution, the first successful attempt in multilevel checkpoint models with uncertain scales, and (3) we perform both large-scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. The experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3-88% in wall-clock length.

Original language: English (US)
Article number: 7013061
Pages (from-to): 907-918
Number of pages: 12
Journal: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2015-January
Issue number: January
State: Published - Jan 16 2014
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States
Duration: Nov 16 2014 to Nov 21 2014

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software
