TY - GEN
T1 - Lightweight silent data corruption detection based on runtime data analysis for HPC applications
AU - Berrocal, Eduardo
AU - Bautista-Gomez, Leonardo
AU - Di, Sheng
AU - Lan, Zhiling
AU - Cappello, Franck
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/6/15
Y1 - 2015/6/15
AB - Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range (i.e., normal value interval) surrounding the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and limit the size of the data set to monitor, taking advantage of the underlying physics of the simulation. Our results show that, using our techniques, we can detect a large number of corruptions (i.e., above 90% in some cases) with 84% memory overhead, and 13.75% extra computation time.
KW - Fault tolerance
KW - High-performance computing
KW - Resilience
KW - Silent data corruption
KW - Soft errors
KW - Time series
UR - http://www.scopus.com/inward/record.url?scp=84987719472&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84987719472&partnerID=8YFLogxK
U2 - 10.1145/2749246.2749253
DO - 10.1145/2749246.2749253
M3 - Conference contribution
AN - SCOPUS:84987719472
T3 - HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
SP - 275
EP - 278
BT - HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery
T2 - 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015
Y2 - 15 June 2015 through 19 June 2015
ER -