TY - GEN
T1 - Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
AU - Gamell, Marc
AU - Katz, Daniel S.
AU - Teranishi, Keita
AU - Heroux, Michael A.
AU - Van Der Wijngaart, Rob F.
AU - Mattson, Timothy G.
AU - Parashar, Manish
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/9/23
Y1 - 2016/9/23
N2 - Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) in the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system, the Titan Cray XK7 at ORNL, and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.
AB - Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) in the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system, the Titan Cray XK7 at ORNL, and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.
KW - checksum-based checkpointing
KW - fault tolerance
KW - in-memory checkpointing
KW - neighbor-based checkpointing
KW - online recovery
KW - resilience
UR - http://www.scopus.com/inward/record.url?scp=84990923726&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990923726&partnerID=8YFLogxK
U2 - 10.1109/ICPPW.2016.56
DO - 10.1109/ICPPW.2016.56
M3 - Conference contribution
AN - SCOPUS:84990923726
T3 - Proceedings of the International Conference on Parallel Processing Workshops
SP - 346
EP - 355
BT - Proceedings - 45th International Conference on Parallel Processing Workshops, ICPPW 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 45th International Conference on Parallel Processing Workshops, ICPPW 2016
Y2 - 16 August 2016 through 19 August 2016
ER -