TY - GEN
T1 - Towards High Performance Resilience Using Performance Portable Abstractions
AU - Morales, Nicolas
AU - Teranishi, Keita
AU - Nicolae, Bogdan
AU - Trott, Christian
AU - Cappello, Franck
N1 - Funding Information:
Acknowledgments. This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
Publisher Copyright:
© 2021, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.
PY - 2021
Y1 - 2021
N2 - In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To address this, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components mean higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-the-art resilience techniques based on checkpoint/restart are lagging behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper, we aim to close this gap by introducing a set of abstractions that make it easier for application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint-restart using high-performance and scalable asynchronous techniques. We illustrate the feasibility of our proposal using a prototype that combines the Kokkos runtime (HPC performance portability) with the VELOC runtime (large-scale low-overhead checkpoint-restart). Our experimental results show negligible performance overhead compared with a manually tuned implementation of checkpoint-restart while requiring minimal changes in the application code.
KW - Checkpointing
KW - Fault tolerance
KW - Performance portability
KW - Programming models
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=85115186434&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115186434&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-85665-6_28
DO - 10.1007/978-3-030-85665-6_28
M3 - Conference contribution
AN - SCOPUS:85115186434
SN - 9783030856649
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 451
EP - 465
BT - Euro-Par 2021
A2 - Sousa, Leonel
A2 - Roma, Nuno
A2 - Tomás, Pedro
PB - Springer
T2 - 27th International European Conference on Parallel and Distributed Computing, Euro-Par 2021
Y2 - 1 September 2021 through 3 September 2021
ER -