TY - GEN
T1 - IPAS
T2 - 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2016
AU - Laguna, Ignacio
AU - Schulz, Martin
AU - Richards, David F.
AU - Calhoun, Jon
AU - Olson, Luke
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/2/29
Y1 - 2016/2/29
AB - This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes; we call these errors silent output corruption (SOC) errors. Thus, applications require duplication only on code that, when affected by a fault, yields SOC. We use machine learning to learn code instructions that must be protected to avoid SOC, and, using a compiler, we protect only those vulnerable instructions by duplication, thus significantly reducing the overhead that is introduced by instruction duplication. In our experiments with five workloads, IPAS reduces the percentage of SOC by up to 90% with a slowdown that ranges between 1.04× and 1.35×, which corresponds to as much as 47% less slowdown than state-of-the-art instruction duplication techniques.
KW - Compiler analysis
KW - High-performance computing
KW - Machine learning
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=84968854133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968854133&partnerID=8YFLogxK
U2 - 10.1145/2854038.2854059
DO - 10.1145/2854038.2854059
M3 - Conference contribution
AN - SCOPUS:84968854133
T3 - Proceedings of the 14th International Symposium on Code Generation and Optimization, CGO 2016
SP - 227
EP - 238
BT - Proceedings of the 14th International Symposium on Code Generation and Optimization, CGO 2016
PB - Association for Computing Machinery
Y2 - 12 March 2016 through 18 March 2016
ER -