TY - GEN
T1 - A distributed and replicated service for checkpoint storage
AU - Bouabache, Fatiha
AU - Herault, Thomas
AU - Fedak, Gilles
AU - Cappello, Franck
PY - 2008
Y1 - 2008
N2 - As High Performance platforms (Clusters, Grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in this administrative domain. It is thus not safe to rely on the high MTBF of specific machines to store the checkpoint images. This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more Checkpoint Servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.
KW - Fault tolerance
KW - High performance computing
KW - Replication
KW - Rollback recovery
UR - http://www.scopus.com/inward/record.url?scp=84900609376&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84900609376&partnerID=8YFLogxK
U2 - 10.1007/978-0-387-78448-9_24
DO - 10.1007/978-0-387-78448-9_24
M3 - Conference contribution
AN - SCOPUS:84900609376
SN - 9780387784472
T3 - Making Grids Work - Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments
SP - 295
EP - 306
BT - Making Grids Work - Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments
PB - Springer
T2 - 2007 Joint CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments
Y2 - 12 June 2007 through 13 June 2007
ER -