Blue Waters system and component reliability

Brett Bode, David King, Celso L. Mendes, William T. Kramer, Saurabh Jha, Roger Ford, Justin Davis, Steven Dramstad

Research output: Contribution to journal › Article › peer-review

Abstract

The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) nodes and single-socket CPU, single-GPU (XK) nodes. The primary storage is provided by Cray's Sonexion/ClusterStor Lustre storage system, delivering 35 PB (raw) of storage at 1 TB/s. The statistical failure rates over time for each component, including CPU, DIMM, GPU, disk drive, power supply, blower, etc., and their impact on higher-level failure rates for individual nodes and the system as a whole are presented in detail, with particular emphasis on identifying any increases in rate that might indicate that the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented.
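As a rough illustration of the kind of analysis the abstract describes, the sketch below computes an annualized failure rate per observation period and flags a sustained rise at the end of the series, which is one simple way to look for the right side of a bathtub curve. It is not the authors' method: the function names, the three-period window, and the yearly counts are all hypothetical and are not taken from the paper.

```python
def annualized_failure_rate(failures, population, period_days):
    """Failures per component-year over one observation period."""
    component_years = population * period_days / 365.0
    return failures / component_years


def rising_tail(rates, window=3):
    """Return True if the last `window` rates are strictly increasing."""
    tail = rates[-window:]
    return len(tail) == window and all(a < b for a, b in zip(tail, tail[1:]))


# Hypothetical yearly replacement counts for a fleet of 10,000 drives.
# These numbers are made up for illustration; they are not Blue Waters data.
yearly_failures = [300, 150, 120, 125, 160, 210]
rates = [annualized_failure_rate(f, 10_000, 365) for f in yearly_failures]

for year, rate in enumerate(rates, start=1):
    print(f"year {year}: {rate:.4f} failures per component-year")
print("possible wear-out phase:", rising_tail(rates))
```

A strictly increasing tail is a deliberately crude criterion; a real study would account for population changes and statistical uncertainty before concluding that wear-out has begun.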

Original language: English (US)
Article number: e7978
Journal: Concurrency and Computation: Practice and Experience
Volume: 36
Issue number: 8
DOIs
State: Published - Apr 10 2024

Keywords

  • failure analysis
  • system management

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics
