TY - JOUR
T1 - Exploring the capabilities of support vector machines in detecting silent data corruptions
AU - Subasi, Omer
AU - Di, Sheng
AU - Bautista-Gomez, Leonardo
AU - Balaprakash, Prasanna
AU - Unsal, Osman
AU - Labarta, Jesus
AU - Cristal, Adrian
AU - Krishnamoorthy, Sriram
AU - Cappello, Franck
N1 - This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. In addition, this material is based upon work supported by the National Science Foundation under Grant No. 1619253, and also by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC02-06CH11357 (DOE Catalog project) and in part by the European Union FEDER funds under contract TIN2015-65316-P.
PY - 2018/9
Y1 - 2018/9
N2 - As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a set of novel SDC detectors – by leveraging epsilon-insensitive support vector machine regression – to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% false positive rate for most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.
AB - As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a set of novel SDC detectors – by leveraging epsilon-insensitive support vector machine regression – to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% false positive rate for most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.
KW - HPC applications
KW - Silent data corruptions
KW - Support vector machines
UR - https://www.scopus.com/pages/publications/85041657607
UR - https://www.scopus.com/pages/publications/85041657607#tab=citedBy
U2 - 10.1016/j.suscom.2018.01.004
DO - 10.1016/j.suscom.2018.01.004
M3 - Article
AN - SCOPUS:85041657607
SN - 2210-5379
VL - 19
SP - 277
EP - 290
JO - Sustainable Computing: Informatics and Systems
JF - Sustainable Computing: Informatics and Systems
ER -