FaultSight: A fault analysis tool for HPC researchers

Einar Horn, Dakota Fulp, Jon Calhoun, Luke Olson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

System reliability is expected to be a significant challenge for future extreme-scale systems. Poor reliability results in a higher frequency of interruptions in high-performance computing (HPC) applications, caused by system or application crashes or by data corruption from soft errors. In response, application-level error detection and recovery schemes are devised to mitigate the impact of these interruptions. Evaluating these schemes and the reliability of an application requires the analysis of thousands of fault injection trials, resulting in a tedious and time-consuming process. Furthermore, no single data analysis tool works with all of the fault injection frameworks currently in use. In this paper, we present FaultSight, a fault injection analysis tool capable of efficiently assisting in the analysis of HPC application reliability as well as the effectiveness of resiliency schemes. FaultSight is designed to be flexible and to work with data coming from a variety of fault injection frameworks. The effectiveness of FaultSight is demonstrated by exploring the reliability of different versions of the Matrix-Matrix Multiplication kernel using two different fault injection tools. In addition, the detection and recovery schemes are highlighted for the HPCCG mini-app.

Original language: English (US)
Title of host publication: Proceedings of FTXS 2019
Subtitle of host publication: Fault Tolerance for HPC at eXtreme Scale Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 21-30
Number of pages: 10
ISBN (Electronic): 9781728160139
State: Published - Nov 2019
Event: 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2019 - Denver, United States
Duration: Nov 22 2019 → …

Publication series

Name: Proceedings of FTXS 2019: Fault Tolerance for HPC at eXtreme Scale Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2019
Country/Territory: United States
City: Denver
Period: 11/22/19 → …

Keywords

  • Fault-analysis
  • Fault-analysis-tool
  • Fault-injection
  • Fault-tolerance
  • Resiliency
  • Soft-error-analysis

ASJC Scopus subject areas

  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality
