TY - JOUR
T1 - Toward General Software Level Silent Data Corruption Detection for Parallel Applications
AU - Berrocal, Eduardo
AU - Bautista-Gomez, Leonardo
AU - Di, Sheng
AU - Lan, Zhiling
AU - Cappello, Franck
N1 - Funding Information:
The work at the Illinois Institute of Technology is supported in part by U.S. National Science Foundation grants CNS-1320125 and CCF-1422009. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357, and by the ANR RESCUE and the INRIA-Illinois-ANL-BSC-JSC-Riken Joint Laboratory on Extreme Scale Computing.
Publisher Copyright:
© 2012 IEEE.
PY - 2017/12/1
Y1 - 2017/12/1
N2 - Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its 'smoothness' in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. In addition, we propose a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector (instead of just reporting overall single-bit precision and recall). In our experiments, we use four applications dealing with different explosions. Our results indicate that our new approach can protect the MPI applications analyzed with 7-70 percent less overhead (depending on the application) than that of full duplication with similar detection recall.
KW - data analysis
KW - high-performance computing
KW - parallel applications
KW - partial replication
KW - Silent data corruption detection
UR - http://www.scopus.com/inward/record.url?scp=85028923660&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028923660&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2017.2735971
DO - 10.1109/TPDS.2017.2735971
M3 - Article
AN - SCOPUS:85028923660
SN - 1045-9219
VL - 28
SP - 3642
EP - 3655
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 12
M1 - 8002625
ER -