Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

Research output: Contribution to journal › Article › peer-review


The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community's interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community, with varying levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolution of large-scale systems. There is room, and even a need, for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithm-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face in addressing the pressing issue of failures in HPC systems.

Original language: English (US)
Pages (from-to): 212-226
Number of pages: 15
Journal: International Journal of High Performance Computing Applications
Issue number: 3
State: Published - 2009


  • Challenges
  • Fault tolerance
  • Knowledge
  • Opportunities
  • Petascale/exascale

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

