TY - GEN
T1 - Detecting silent data corruption for extreme-scale MPI applications
AU - Bautista-Gomez, Leonardo
AU - Cappello, Franck
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/9/21
Y1 - 2015/9/21
AB - Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions, while incurring less than 1% of overhead. We show that the false positive rate is less than 1% and that when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
KW - Anomaly detection
KW - Fault tolerance
KW - High-performance computing
KW - Silent data corruption
KW - Soft errors
KW - Supercomputers
UR - http://www.scopus.com/inward/record.url?scp=84983451407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84983451407&partnerID=8YFLogxK
U2 - 10.1145/2802658.2802665
DO - 10.1145/2802658.2802665
M3 - Conference contribution
AN - SCOPUS:84983451407
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015
PB - Association for Computing Machinery
T2 - 22nd European MPI Users' Group Meeting, EuroMPI 2015
Y2 - 21 September 2015 through 23 September 2015
ER -