Reliability challenges in large systems

Daniel A. Reed, Charng Da Lu, Celso L. Mendes

Research output: Contribution to journal › Article › peer-review

Abstract

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques for detecting imminent failures in the execution environment and for allowing an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.

Original language: English (US)
Pages (from-to): 293-302
Number of pages: 10
Journal: Future Generation Computer Systems
Volume: 22
Issue number: 3
DOIs
State: Published - Feb 2006

Keywords

  • Adaptive software
  • Fault-tolerance
  • System reliability

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
