TY - GEN
T1 - Exploring partial replication to improve lightweight silent data corruption detection for HPC applications
AU - Berrocal, Eduardo
AU - Bautista-Gomez, Leonardo
AU - Di, Sheng
AU - Lan, Zhiling
AU - Cappello, Franck
N1 - Funding Information:
Government License Section: The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Funding Information:
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357, and by the ANR RESCUE and the INRIA-Illinois-ANL-BSC-JSC-Riken Joint Laboratory on Extreme Scale Computing. The work at the Illinois Institute of Technology is supported in part by U.S. National Science Foundation grants CNS-1320125 and CCF-1422009.
Publisher Copyright:
© Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC has the potential to influence important scientific results, leading scientists to wrong conclusions. In previous work, our detector was able to detect SDC in HPC applications to a certain level by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. Accurate predictions allow us to detect corruptions when data values are far “enough” from them. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly. Our results indicate that our new approach can protect the MPI applications analyzed with 49–53% less overhead than that of full duplication with similar detection recall.
AB - Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC has the potential to influence important scientific results, leading scientists to wrong conclusions. In previous work, our detector was able to detect SDC in HPC applications to a certain level by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. Accurate predictions allow us to detect corruptions when data values are far “enough” from them. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly. Our results indicate that our new approach can protect the MPI applications analyzed with 49–53% less overhead than that of full duplication with similar detection recall.
KW - Data analysis
KW - HPC applications
KW - Partial replication
KW - Silent data corruption detection
UR - http://www.scopus.com/inward/record.url?scp=84984861455&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84984861455&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-43659-3_31
DO - 10.1007/978-3-319-43659-3_31
M3 - Conference contribution
AN - SCOPUS:84984861455
SN - 9783319436586
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 419
EP - 430
BT - Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016, Proceedings
A2 - Dutot, Pierre-François
A2 - Trystram, Denis
PB - Springer
T2 - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Y2 - 24 August 2016 through 26 August 2016
ER -