TY - GEN
T1 - MPICH-V
T2 - 2002 IEEE/ACM Conference on Supercomputing, SC 2002
AU - Bosilca, George
AU - Bouteiller, Aurelien
AU - Cappello, Franck
AU - Djilali, Samir
AU - Fedak, Gilles
AU - Germain, Cecile
AU - Herault, Thomas
AU - Lemarinier, Pierre
AU - Lodygensky, Oleg
AU - Magniette, Frederic
AU - Neri, Vincent
AU - Selikhov, Anton
N1 - Funding Information:
The XtremWeb and MPICH-V projects are partially funded, through the CGP2P project, by the French ACI initiative on GRID of the Ministry of Research. We thank its director, Prof. Michel Cosnard, and the scientific committee members.
Publisher Copyright:
© 2002 IEEE.
PY - 2002
Y1 - 2002
AB - Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node Volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
UR - http://www.scopus.com/inward/record.url?scp=84884662651&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84884662651&partnerID=8YFLogxK
U2 - 10.1109/SC.2002.10048
DO - 10.1109/SC.2002.10048
M3 - Conference contribution
AN - SCOPUS:84884662651
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the IEEE/ACM SC 2002 Conference, SC 2002
PB - Association for Computing Machinery
Y2 - 16 November 2002 through 22 November 2002
ER -