TY - GEN
T1 - Proactive fault tolerance in MPI applications via task migration
AU - Chakravorty, Sayantan
AU - Mendes, Celso L.
AU - Kalé, Laxmikant V.
PY - 2006
Y1 - 2006
N2 - Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
UR - http://www.scopus.com/inward/record.url?scp=50649108554&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=50649108554&partnerID=8YFLogxK
U2 - 10.1007/11945918_47
DO - 10.1007/11945918_47
M3 - Conference contribution
AN - SCOPUS:50649108554
SN - 354068039X
SN - 9783540680390
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 485
EP - 496
BT - High Performance Computing - HiPC 2006 - 13th International Conference Proceedings
T2 - 13th International Conference on High Performance Computing, HiPC 2006
Y2 - 18 December 2006 through 21 December 2006
ER -