TY - GEN
T1 - Exploiting spatial smoothness in HPC applications to detect silent data corruption
AU - Bautista-Gomez, Leonardo
AU - Cappello, Franck
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/11/23
Y1 - 2015/11/23
AB - Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. This situation is pushing supercomputer constructors to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the system. Such silent errors are extremely damaging because they can make applications produce wrong results. In this paper we propose a technique that leverages certain properties of HPC applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the data behavior and is algorithm-agnostic. We show that this strategy can detect up to 90% of injected errors in some regions while incurring less than 1% overhead.
KW - Detectors
KW - Entropy
KW - Error correction codes
KW - Random access memory
KW - Registers
KW - Reliability
KW - Supercomputers
UR - http://www.scopus.com/inward/record.url?scp=84961732802&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84961732802&partnerID=8YFLogxK
U2 - 10.1109/HPCC-CSS-ICESS.2015.9
DO - 10.1109/HPCC-CSS-ICESS.2015.9
M3 - Conference contribution
AN - SCOPUS:84961732802
T3 - Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
SP - 128
EP - 133
BT - Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
Y2 - 24 August 2015 through 26 August 2015
ER -