TY - GEN
T1 - Optimizing asynchronous multi-level checkpoint/restart configurations with machine learning
AU - Dey, Tonmoy
AU - Sato, Kento
AU - Nicolae, Bogdan
AU - Guo, Jian
AU - Domke, Jens
AU - Yu, Weikuan
AU - Cappello, Franck
AU - Mohror, Kathryn
N1 - Funding Information:
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-802941). This material was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This work is also supported in part by the National Science Foundation awards 1561041, 1564647, 1763547, and 1822737. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multilevel checkpointing efficiently, it is important to optimize check-point/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
AB - With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multilevel checkpointing efficiently, it is important to optimize check-point/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
KW - Machine Learning
KW - MultiLevel Checkpointing (MLC)
KW - Neural Network
UR - http://www.scopus.com/inward/record.url?scp=85091578218&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091578218&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW50202.2020.00174
DO - 10.1109/IPDPSW50202.2020.00174
M3 - Conference contribution
AN - SCOPUS:85091578218
T3 - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
SP - 1036
EP - 1043
BT - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 34th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Y2 - 18 May 2020 through 22 May 2020
ER -