TY - GEN
T1 - Detecting and correcting data corruption in stencil applications through multivariate interpolation
AU - Gomez, Leonardo Arturo Bautista
AU - Cappello, Franck
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/10/26
Y1 - 2015/10/26
N2 - High-performance computing is a powerful tool that allows scientists to study complex natural phenomena. Extreme-scale supercomputers promise orders of magnitude higher performance compared with that of current systems. However, power constraints in future exascale systems might limit the level of resilience of those machines. In particular, data could get corrupted silently, that is, without the hardware detecting the corruption. This situation is clearly unacceptable: simulation results must be within the error margin specified by the user. In this paper, we exploit multivariate interpolation in order to detect and correct data corruption in stencil applications. We evaluate this technique with a turbulent fluid application, and we demonstrate that the prediction error using multivariate interpolation is on the order of 0.01. Our results show that this mechanism can detect and correct most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute. In addition, we stress test the detector by injecting more than ten corruptions per minute and observe that our strategy allows the application to produce results with an error deviation under 10% in such a stressful scenario.
AB - High-performance computing is a powerful tool that allows scientists to study complex natural phenomena. Extreme-scale supercomputers promise orders of magnitude higher performance compared with that of current systems. However, power constraints in future exascale systems might limit the level of resilience of those machines. In particular, data could get corrupted silently, that is, without the hardware detecting the corruption. This situation is clearly unacceptable: simulation results must be within the error margin specified by the user. In this paper, we exploit multivariate interpolation in order to detect and correct data corruption in stencil applications. We evaluate this technique with a turbulent fluid application, and we demonstrate that the prediction error using multivariate interpolation is on the order of 0.01. Our results show that this mechanism can detect and correct most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute. In addition, we stress test the detector by injecting more than ten corruptions per minute and observe that our strategy allows the application to produce results with an error deviation under 10% in such a stressful scenario.
KW - Computational fluid dynamics
KW - Detectors
KW - Hardware
KW - Interpolation
KW - Prediction algorithms
KW - Switches
KW - Three-dimensional displays
UR - http://www.scopus.com/inward/record.url?scp=84959316878&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959316878&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2015.108
DO - 10.1109/CLUSTER.2015.108
M3 - Conference contribution
AN - SCOPUS:84959316878
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 595
EP - 602
BT - Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Cluster Computing, CLUSTER 2015
Y2 - 8 September 2015 through 11 September 2015
ER -