TY - GEN
T1 - Detection of Silent Data Corruption in Adaptive Numerical Integration Solvers
AU - Guhur, Pierre Louis
AU - Constantinescu, Emil
AU - Ghosh, Debojyoti
AU - Peterka, Tom
AU - Cappello, Franck
N1 - Funding Information:
This material is based upon work supported in part by the National Science Foundation under Grant No. 1619253, and in part by the US Department of Energy Office of Sciences under contract DE-AC02-06CH11357. Part of this work also was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/22
Y1 - 2017/9/22
N2 - Scientific computing requires trust in results. In high-performance computing, trust is impeded by silent data corruption (SDC), that is, corruption that remains unnoticed. Numerical integration solvers are especially sensitive to SDCs because an SDC introduced in a certain step affects all the following steps. SDCs can even cause the solver to become unstable. Adaptive solvers can change the step size by comparing an estimate of the approximation error with a user-defined tolerance. If the estimate exceeds the tolerance, the step is rejected and recomputed. Adaptive solvers have an inherent resilience, because some SDCs might have no consequences on the accuracy of the results, and some SDCs might push the approximation error beyond the tolerance. Our first contribution shows that the rejection mechanism is not reliable enough to reject all SDCs that affect the results' accuracy, because the estimate is also corrupted. We therefore provide another protection mechanism: At the end of each step, a second error estimate is employed to increase the redundancy. Because of the complex dynamics, the choice of the second estimate is difficult: Two methods are explored. We evaluated them in HyPar and PETSc, on a cluster of 4,096 cores. We injected SDCs that are large enough to affect the trust or the convergence of the solvers. The new approach can detect 99% of the SDCs, reducing by more than 10 times the number of undetected SDCs. Compared with replication, a classic SDC detector, our protection mechanism reduces the memory overhead by more than 2 times and the computational overhead by more than 20 times in our experiments.
AB - Scientific computing requires trust in results. In high-performance computing, trust is impeded by silent data corruption (SDC), that is, corruption that remains unnoticed. Numerical integration solvers are especially sensitive to SDCs because an SDC introduced in a certain step affects all the following steps. SDCs can even cause the solver to become unstable. Adaptive solvers can change the step size by comparing an estimate of the approximation error with a user-defined tolerance. If the estimate exceeds the tolerance, the step is rejected and recomputed. Adaptive solvers have an inherent resilience, because some SDCs might have no consequences on the accuracy of the results, and some SDCs might push the approximation error beyond the tolerance. Our first contribution shows that the rejection mechanism is not reliable enough to reject all SDCs that affect the results' accuracy, because the estimate is also corrupted. We therefore provide another protection mechanism: At the end of each step, a second error estimate is employed to increase the redundancy. Because of the complex dynamics, the choice of the second estimate is difficult: Two methods are explored. We evaluated them in HyPar and PETSc, on a cluster of 4,096 cores. We injected SDCs that are large enough to affect the trust or the convergence of the solvers. The new approach can detect 99% of the SDCs, reducing by more than 10 times the number of undetected SDCs. Compared with replication, a classic SDC detector, our protection mechanism reduces the memory overhead by more than 2 times and the computational overhead by more than 20 times in our experiments.
KW - Fault tolerance
KW - High-performance computing
KW - Numerical integration solver
KW - Resilience
KW - Silent data corruption
UR - http://www.scopus.com/inward/record.url?scp=85032632123&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032632123&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2017.13
DO - 10.1109/CLUSTER.2017.13
M3 - Conference contribution
AN - SCOPUS:85032632123
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 592
EP - 602
BT - Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Y2 - 5 September 2017 through 8 September 2017
ER -