Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

Catello Di Martino, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Fabio Baccanico, Joseph Fullop, William Kramer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level failover as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime: although hardware-related failures account for 42% of all failures, they were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% over a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
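The contrast between failure counts and repair effort in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses just the percentages quoted in the abstract, and the "repair weight" index (repair-time share divided by failure-count share) is a hypothetical helper introduced here, not a metric defined by the authors. It also assumes that the 23% and 53% repair figures are shares of the same total.

```python
# Percentages quoted in the abstract: share of all failures and share of repair time.
# Assumption: both repair percentages are measured against the same total.
shares = {
    "hardware": {"failures": 42.0, "repair": 23.0},
    "software": {"failures": 20.0, "repair": 53.0},
}


def repair_weight(category):
    """Repair-time share divided by failure-count share (illustrative index only)."""
    s = shares[category]
    return s["repair"] / s["failures"]


hw = repair_weight("hardware")
sw = repair_weight("software")

# Fraction of system-wide outages that involved Lustre (29 out of 39).
lustre_outage_share = 29 / 39

print(f"hardware repair weight: {hw:.2f}")
print(f"software repair weight: {sw:.2f}")
print(f"software vs hardware:   {sw / hw:.1f}x")
print(f"Lustre-related outages: {lustre_outage_share:.0%}")
```

Under these assumptions, a software failure carries roughly five times the repair weight of a hardware failure, and about three quarters of the system-wide outages involved Lustre.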

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Dependable Systems and Networks
Publisher: IEEE Computer Society
Pages: 610-621
Number of pages: 12
ISBN (Electronic): 9781479922338
State: Published - Sep 18 2014
Event: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014 - Atlanta, United States
Duration: Jun 23 2014 → Jun 26 2014

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Other

Other: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014
Country/Territory: United States
City: Atlanta
Period: 6/23/14 → 6/26/14

Keywords

  • Cray XE6
  • Cray XK7
  • Failure Analysis
  • Failure Reports
  • Machine Check
  • Nvidia GPU errors
  • Supercomputer

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
