TY - GEN
T1 - Identifying the right replication level to detect and correct silent errors at scale
AU - Benoit, Anne
AU - Raghavan, Padma
AU - Cavelan, Aurelien
AU - Robert, Yves
AU - Cappello, Franck
AU - Sun, Hongyang
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/6/26
Y1 - 2017/6/26
AB - This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborates the analytical model.
UR - http://www.scopus.com/inward/record.url?scp=85025819231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85025819231&partnerID=8YFLogxK
U2 - 10.1145/3086157.3086162
DO - 10.1145/3086157.3086162
M3 - Conference contribution
AN - SCOPUS:85025819231
T3 - FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
SP - 31
EP - 38
BT - FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
PB - Association for Computing Machinery
T2 - 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Y2 - 26 June 2017 through 30 June 2017
ER -