MACORD: Online adaptive machine learning framework for silent error detection

Omer Subasi, Sheng Di, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Sriram Krishnamoorthy, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory and storage) may significantly increase reliability risks. Silent data corruptions (SDCs), or silent errors, are among the major sources of corrupted HPC execution results. Unlike fail-stop errors, SDCs are particularly harmful because they cannot be detected by hardware. To remedy this, we propose an online MAchine-learning-based silent data CORruption Detection framework (abbreviated as MACORD) for detecting SDCs in HPC applications. In our study, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms and enable the detector to automatically select the best-fit algorithms at runtime to adapt to the data dynamics. Because it uses only spatial features (i.e., neighboring data values for each data point in the current time step) in the training data, our learning framework exhibits low memory overhead (less than 1%). Experiments based on real-world scientific applications/benchmarks show that our framework can raise the detection sensitivity (i.e., recall) to as high as 99%. Meanwhile, the false positive rate is limited to 0.1% in most cases, an order-of-magnitude improvement over the latest state-of-the-art spatial technique.
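
The idea of predicting each data point from its spatial neighbors, choosing the best-fitting predictor at runtime, and flagging points that deviate strongly from the prediction can be illustrated with a minimal sketch. This is not the MACORD implementation: the three predictors below are simple hypothetical stand-ins for the machine-learning algorithms the paper evaluates, and the names `PREDICTORS`, `select_best_fit`, `detect_suspects` and the `tolerance` parameter are assumptions made only for this example.

```python
import math
from statistics import median

# Hypothetical stand-in predictors (not the paper's learning algorithms):
# each estimates a point from its left and right neighbors in the current time step.
PREDICTORS = {
    "left_neighbor": lambda left, right: left,
    "right_neighbor": lambda left, right: right,
    "neighbor_mean": lambda left, right: 0.5 * (left + right),
}

def residuals(data, predict):
    """Absolute prediction error for every interior point of a 1-D field."""
    return [abs(data[i] - predict(data[i - 1], data[i + 1]))
            for i in range(1, len(data) - 1)]

def select_best_fit(data):
    """Pick the predictor with the lowest median error on the current data."""
    return min(PREDICTORS,
               key=lambda name: median(residuals(data, PREDICTORS[name])))

def detect_suspects(data, tolerance=10.0):
    """Return the chosen predictor and the indices whose error is
    more than `tolerance` times the median error (an assumed threshold rule)."""
    name = select_best_fit(data)
    errs = residuals(data, PREDICTORS[name])
    threshold = tolerance * median(errs) + 1e-12
    suspects = [i + 1 for i, e in enumerate(errs) if e > threshold]
    return name, suspects

# Smooth synthetic field with one injected corruption at index 50.
field = [math.sin(0.1 * i) for i in range(100)]
field[50] += 5.0
best, suspects = detect_suspects(field)
```

In this sketch the immediate neighbors of the corrupted point may also be flagged, because their own features include the corrupted value. Only the current time step is held in memory, which loosely mirrors the low memory overhead the abstract attributes to purely spatial features.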

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 717-724
Number of pages: 8
ISBN (Electronic): 9781538623268
State: Published - Sep 22 2017
Externally published: Yes
Event: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 - Honolulu, United States
Duration: Sep 5 2017 to Sep 8 2017

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2017-September
ISSN (Print): 1552-5244

Other

Other: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Country/Territory: United States
City: Honolulu
Period: 9/5/17 to 9/8/17

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
