Checkpointing vs. migration for post-petascale supercomputers

Franck Cappello, Henri Casanova, Yves Robert

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.

Original language: English (US)
Title of host publication: Proceedings - 39th International Conference on Parallel Processing, ICPP 2010
Pages: 168-177
Number of pages: 10
State: Published - 2010
Externally published: Yes
Event: 39th International Conference on Parallel Processing, ICPP 2010 - San Diego, CA, United States
Duration: Sep 13 2010 - Sep 16 2010

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 39th International Conference on Parallel Processing, ICPP 2010
Country/Territory: United States
City: San Diego, CA
Period: 9/13/10 - 9/16/10

Keywords

  • Checkpointing
  • Failure prediction
  • Migration
  • Parallel jobs

ASJC Scopus subject areas

  • Software
  • General Mathematics
  • Hardware and Architecture
