TY - GEN
T1 - Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
AU - Coti, Camille
AU - Herault, Thomas
AU - Lemarinier, Pierre
AU - Pilard, Laurence
AU - Rezmerita, Ala
AU - Rodriguez, Eric
AU - Cappello, Franck
PY - 2006
Y1 - 2006
N2 - A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and nonblocking. However, they have never been compared quantitatively and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluate their performance using the NAS parallel benchmarks.
AB - A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and nonblocking. However, they have never been compared quantitatively and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluate their performance using the NAS parallel benchmarks.
UR - http://www.scopus.com/inward/record.url?scp=34548282622&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548282622&partnerID=8YFLogxK
U2 - 10.1145/1188455.1188587
DO - 10.1145/1188455.1188587
M3 - Conference contribution
AN - SCOPUS:34548282622
SN - 0769527000
SN - 9780769527000
T3 - Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC'06
BT - Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC'06
ER -