TY - GEN
T1 - MACORD
T2 - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
AU - Subasi, Omer
AU - Di, Sheng
AU - Balaprakash, Prasanna
AU - Unsal, Osman
AU - Labarta, Jesus
AU - Cristal, Adrian
AU - Krishnamoorthy, Sriram
AU - Cappello, Franck
N1 - Funding Information:
We thank Leonardo Bautista-Gomez for his helpful feedback and discussions. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/22
Y1 - 2017/9/22
N2 - Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory, and storage) may significantly increase risks to reliability. Silent data corruptions (SDCs), or silent errors, are among the major sources of corrupted HPC execution results. Unlike fail-stop errors, SDCs can be harmful and dangerous in that they cannot be detected by hardware. To remedy this, we propose an online MAchine-learning-based silent data CORruption Detection framework (abbreviated as MACORD) for detecting SDCs in HPC applications. In our study, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms and enable the detector to automatically select the best-fit algorithms at runtime to adapt to the data dynamics. Because it uses only spatial features (i.e., neighboring data values for each data point in the current time step) in the training data, our learning framework exhibits low memory overhead (less than 1%). Experiments based on real-world scientific applications/benchmarks show that our framework can raise the detection sensitivity (i.e., recall) up to 99%. Meanwhile, the false positive rate is limited to 0.1% in most cases, which is an order-of-magnitude improvement over the latest state-of-the-art spatial technique.
AB - Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory, and storage) may significantly increase risks to reliability. Silent data corruptions (SDCs), or silent errors, are among the major sources of corrupted HPC execution results. Unlike fail-stop errors, SDCs can be harmful and dangerous in that they cannot be detected by hardware. To remedy this, we propose an online MAchine-learning-based silent data CORruption Detection framework (abbreviated as MACORD) for detecting SDCs in HPC applications. In our study, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms and enable the detector to automatically select the best-fit algorithms at runtime to adapt to the data dynamics. Because it uses only spatial features (i.e., neighboring data values for each data point in the current time step) in the training data, our learning framework exhibits low memory overhead (less than 1%). Experiments based on real-world scientific applications/benchmarks show that our framework can raise the detection sensitivity (i.e., recall) up to 99%. Meanwhile, the false positive rate is limited to 0.1% in most cases, which is an order-of-magnitude improvement over the latest state-of-the-art spatial technique.
UR - http://www.scopus.com/inward/record.url?scp=85032615470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032615470&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2017.128
DO - 10.1109/CLUSTER.2017.128
M3 - Conference contribution
AN - SCOPUS:85032615470
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 717
EP - 724
BT - Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 September 2017 through 8 September 2017
ER -