TY - GEN
T1 - POSTER
T2 - 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2022
AU - Huang, Yafan
AU - Guo, Shengjian
AU - Di, Sheng
AU - Li, Guanpeng
AU - Cappello, Franck
N1 - Funding Information:
The material was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-06CH11357.
Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/4/2
Y1 - 2022/4/2
AB - With the ever-shrinking size of transistors and the increasing scale of applications, silent data corruptions (SDCs) have become a common yet serious issue in HPC applications. Selective instruction duplication (SID) is a popular fault-tolerance technique that can obtain high SDC coverage with low performance overhead, as it prioritizes the most vulnerable parts of a program for protection. However, existing studies of SID evaluate only a single program input, assuming that the error resilience of the program remains similar across inputs. As a result, SID suffers a drastic loss of SDC coverage when the protected program runs with different inputs. Hence, we propose Sentinel, an automated compiler-based framework to mitigate this loss of SDC coverage. Evaluation results show that Sentinel can effectively mitigate the loss of SDC coverage (up to 97.00%) across multiple inputs, which significantly hardens existing SID techniques.
KW - compiler
KW - error resilience
KW - fault injection
KW - high performance computing
UR - http://www.scopus.com/inward/record.url?scp=85127572448&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127572448&partnerID=8YFLogxK
U2 - 10.1145/3503221.3508414
DO - 10.1145/3503221.3508414
M3 - Conference contribution
AN - SCOPUS:85127572448
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 437
EP - 438
BT - PPoPP 2022 - Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
Y2 - 2 April 2022 through 6 April 2022
ER -