TY - GEN
T1 - Uncoordinated checkpointing without domino effect for send-deterministic MPI applications
AU - Guermouche, Amina
AU - Ropars, Thomas
AU - Brunet, Elisabeth
AU - Snir, Marc
AU - Cappello, Franck
PY - 2011
Y1 - 2011
N2 - As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to reduce, compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC Platforms relies on coordinated check pointing which raises two major issues: a) global restart wastes energy since all processes are forced to rollback even in the case of a single failure, b) checkpoint coordination may slow down the application execution because of congestions on I/O resources. Alternative approaches based on uncoordinated check pointing and message logging require logging all messages, imposing a high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, allowing to design new fault tolerance protocols. In this paper, we propose an uncoordinated check pointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require to restart systematically all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds to combine the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to rollback compared to coordinated check pointing.
AB - As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to reduce, compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC Platforms relies on coordinated check pointing which raises two major issues: a) global restart wastes energy since all processes are forced to rollback even in the case of a single failure, b) checkpoint coordination may slow down the application execution because of congestions on I/O resources. Alternative approaches based on uncoordinated check pointing and message logging require logging all messages, imposing a high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, allowing to design new fault tolerance protocols. In this paper, we propose an uncoordinated check pointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require to restart systematically all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds to combine the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to rollback compared to coordinated check pointing.
UR - http://www.scopus.com/inward/record.url?scp=80053223509&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053223509&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2011.95
DO - 10.1109/IPDPS.2011.95
M3 - Conference contribution
AN - SCOPUS:80053223509
SN - 9780769543857
T3 - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
SP - 989
EP - 1000
BT - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
PB - IEEE Computer Society
T2 - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Y2 - 16 May 2011 through 20 May 2011
ER -