FTI: High performance fault tolerance interface for hybrid systems

Leonardo Bautista-Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one Fault-Tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface FTI. We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.

Original languageEnglish (US)
Title of host publicationProceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
DOIs
StatePublished - 2011
Event2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11 - Seattle, WA, United States
Duration: Nov 12 2011Nov 18 2011

Publication series

NameProceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Other

Other2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11
Country/TerritoryUnited States
CitySeattle, WA
Period11/12/1111/18/11

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'FTI: High performance fault tolerance interface for hybrid systems'. Together they form a unique fingerprint.

Cite this