Abstract
The emergence of petascale systems and the promise of future exascale systems have reinvigorated community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.
Original language | English (US) |
---|---|
Pages (from-to) | 212-226 |
Number of pages | 15 |
Journal | International Journal of High Performance Computing Applications |
Volume | 23 |
Issue number | 3 |
State | Published - 2009 |
Keywords
- Challenges
- Fault tolerance
- Knowledge
- Opportunities
- Petascale/exascale
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture