Abstract

This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.

Original language: English (US)
Title of host publication: Springer Series in Reliability Engineering
Publisher: Springer
Pages: 609-655
Number of pages: 47
State: Published - 2016

Publication series

Name: Springer Series in Reliability Engineering
ISSN (Print): 1614-7839
ISSN (Electronic): 2196-999X

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality

Chapter title: Measuring the Resiliency of Extreme-Scale Computing Environments