Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era

Omer Subasi, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Franck Cappello

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the executionresults of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are three fold. (1) Our design takes spatialfeatures (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show thatour detector can achieve the detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% of false positive rate for most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.

Original languageEnglish (US)
Title of host publicationProceedings - 2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages413-424
Number of pages12
ISBN (Electronic)9781509024520
DOIs
StatePublished - Jul 18 2016
Externally publishedYes
Event16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016 - Cartagena, Colombia
Duration: May 16 2016May 19 2016

Publication series

NameProceedings - 2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016

Conference

Conference16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016
Country/TerritoryColombia
CityCartagena
Period5/16/165/19/16

Keywords

  • Exascale and HPC Systems
  • Fault-tolerance
  • Silent Data Corruptions
  • Silent Errors
  • Support Vector Machines

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era'. Together they form a unique fingerprint.

Cite this