Abstract

This paper presents LogDiver, a tool for the analysis of application-level resiliency in extreme-scale computing systems. The tool has been implemented to handle data generated by system monitoring tools in Blue Waters, the petascale machine in production at the University of Illinois' National Center for Supercomputing Applications. The tool can (i) filter, extract, and classify error data from different sources of information, such as system logs, hardware sensors, and workload logs; (ii) extract signals from the categorized errors; (iii) consolidate user application data and decode application and job exit statuses, highlighting the reason for each application or job exit; and (iv) correlate application failures with errors using a mix of empirical and analytical techniques. To the best of our knowledge, this is the first tool capable of measuring application-level resiliency in extreme-scale machines. We also demonstrate the power of the tool by showing that XK applications are more vulnerable to failures than XE applications.

Original language: English (US)
Title of host publication: FTXS 2015 - Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015
Publisher: Association for Computing Machinery
Pages: 11-18
Number of pages: 8
ISBN (Electronic): 9781450335690
State: Published - Jun 15 2015
Event: 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2015 - Portland, United States
Duration: Jun 15 2015 → …

Keywords

  • B.8.1 [Performance and Reliability]: Reliability
  • Fault-Tolerance - HPC applications; Log Analysis
  • Testing

ASJC Scopus subject areas

  • Computer Science Applications

Title

LogDiver: A tool for measuring resilience of extreme-scale systems and applications
