TY - GEN
T1 - Lightweight and accurate silent data corruption detection in ordinary differential equation solvers
AU - Guhur, Pierre Louis
AU - Zhang, Hong
AU - Peterka, Tom
AU - Constantinescu, Emil
AU - Cappello, Franck
N1 - Funding Information:
We express our gratitude to Julie Bessac for assistance with the algorithm and Gail Pieper for comments that greatly improved the manuscript. We also gratefully acknowledge the use of the services and facilities of the Decaf project at Argonne National Laboratory, supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357, program manager Lucy Nowell. We also thank the anonymous reviewers for their helpful comments.
Publisher Copyright:
© Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - Silent data corruptions (SDCs) are errors that corrupt the system or falsify results while remaining unnoticed by firmware or operating systems. In numerical integration solvers, SDCs that impact the accuracy of the solver are considered significant. Detecting SDCs in high-performance computing is necessary because results need to be trustworthy and the increase in the number and complexity of components in emerging large-scale architectures makes SDCs more likely to occur. Until recently, SDC detection methods consisted of replicating the processes of the execution or using checksums (for example, algorithm-based fault tolerance). Recently, new detection methods have been proposed relying on mathematical properties of numerical kernels or performing data analysis of the results modified by the application. None of those methods, however, provide a lightweight solution guaranteeing that all significant SDCs are detected. We propose a new method called Hot Rod as a solution to this problem. It checks and potentially corrects the data produced by numerical integration solvers. Our theoretical model shows that all significant SDCs can be detected. We present two detectors and conduct experiments on streamline integration from the WRF meteorology application. Compared with the algorithmic detection methods, the accuracy of our first detector is increased by 52% with a similar false detection rate. The second detector has a false detection rate one order of magnitude lower than these detection methods while improving the detection accuracy by 23%. The computational overhead is lower than 5% in both cases. The model has been developed for an explicit Runge-Kutta method, although it can be generalized to other solvers.
KW - Fault tolerance
KW - HPC
KW - Numerical integration solvers
KW - Resilience
KW - Runge-Kutta
KW - SDC
UR - http://www.scopus.com/inward/record.url?scp=84984782952&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84984782952&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-43659-3_47
DO - 10.1007/978-3-319-43659-3_47
M3 - Conference contribution
AN - SCOPUS:84984782952
SN - 9783319436586
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 644
EP - 656
BT - Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016, Proceedings
A2 - Dutot, Pierre-François
A2 - Trystram, Denis
PB - Springer
T2 - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Y2 - 24 August 2016 through 26 August 2016
ER -