Abstract
Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques that detect imminent failures in the environment and allow an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
Original language | English (US) |
---|---|
Pages (from-to) | 293-302 |
Number of pages | 10 |
Journal | Future Generation Computer Systems |
Volume | 22 |
Issue number | 3 |
State | Published - Feb 2006 |
Keywords
- Adaptive software
- Fault-tolerance
- System reliability
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications