TY - GEN
T1 - Detecting silent data corruption through data dynamic monitoring for scientific applications
AU - Bautista Gomez, Leonardo
AU - Cappello, Franck
PY - 2014
Y1 - 2014
AB - Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and we show that our technique can detect up to 50% of injected errors while incurring only negligible overhead.
KW - Bit flips
KW - Data entropy
KW - Fault tolerance
KW - Silent data corruption
KW - Soft errors
KW - Supercomputers
UR - http://www.scopus.com/inward/record.url?scp=84896891669&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84896891669&partnerID=8YFLogxK
U2 - 10.1145/2555243.2555279
DO - 10.1145/2555243.2555279
M3 - Conference contribution
AN - SCOPUS:84896891669
SN - 9781450326568
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
SP - 381
EP - 382
BT - PPoPP 2014 - Proceedings of the 2014 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
T2 - 2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014
Y2 - 15 February 2014 through 19 February 2014
ER -