TY - GEN
T1 - On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications
AU - Ropars, Thomas
AU - Guermouche, Amina
AU - Uçar, Bora
AU - Meneses, Esteban
AU - Kalé, Laxmikant V.
AU - Cappello, Franck
PY - 2011
Y1 - 2011
N2 - Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work efficiently only if one can find clusters of processes in the applications such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
AB - Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work efficiently only if one can find clusters of processes in the applications such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
UR - http://www.scopus.com/inward/record.url?scp=80052380100&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052380100&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23400-2_53
DO - 10.1007/978-3-642-23400-2_53
M3 - Conference contribution
AN - SCOPUS:80052380100
SN - 9783642233999
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 567
EP - 578
BT - Euro-Par 2011 Parallel Processing - 17th International Conference, Proceedings
PB - Springer
T2 - 17th International Conference on Parallel Processing, Euro-Par 2011
Y2 - 29 August 2011 through 2 September 2011
ER -