Optimization of multi-level checkpoint model for large scale HPC applications

Sheng Di, Mohamed Slim Bouguerra, Leonardo Bautista-Gomez, Franck Cappello

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

HPC community projects that future extreme scale systems will be much less stable than current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the completion of large scale numerical computations. Execution failures may occur due to multiple factors with different scales, from transient uncorrectable memory errors localized in processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response to tolerate different types of failures. It stores checkpoints at different levels: e.g., local memory, remote memory, using a software RAID, local SSD, remote file system. In this paper, we respond to two open questions: 1) how to optimize the selection of checkpoint levels based on failure distributions observed in a system, 2) how to compute the optimal checkpoint intervals for each of these levels. The contribution is three-fold. (1) We build a mathematical model to fit the multi-level checkpoint/restart mechanism with large scale applications regarding various types of failures. (2) We theoretically optimize the entire execution performance for each parallel application by selecting the best checkpoint level combination and corresponding checkpoint intervals. (3) We characterize checkpoint overheads on different checkpoint levels in a real cluster environment, and evaluate our optimal solutions using both simulation with millions of cores and real environment with real-world MPI programs running on hundreds of cores. Experiments show that optimized selections of levels associated with optimal checkpoint intervals at each level outperforms other state-of-the-art solutions by 5-50 percent.

Original languageEnglish (US)
Title of host publicationProceedings - IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014
PublisherIEEE Computer Society
Pages1181-1190
Number of pages10
ISBN (Print)9780769552071
DOIs
StatePublished - 2014
Externally publishedYes
Event28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014 - Phoenix, AZ, United States
Duration: May 19 2014May 23 2014

Publication series

NameProceedings of the International Parallel and Distributed Processing Symposium, IPDPS
ISSN (Print)1530-2075
ISSN (Electronic)2332-1237

Other

Other28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014
Country/TerritoryUnited States
CityPhoenix, AZ
Period5/19/145/23/14

Keywords

  • Checkpoint/Restart model
  • exascale High Performance Computing
  • Resilience

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture
  • Software

Fingerprint

Dive into the research topics of 'Optimization of multi-level checkpoint model for large scale HPC applications'. Together they form a unique fingerprint.

Cite this