TY - GEN
T1 - Towards Efficient Cache Allocation for High-Frequency Checkpointing
AU - Maurya, Avinash
AU - Nicolae, Bogdan
AU - Rafique, M. Mustafa
AU - Elsayed, Amr M.
AU - Tonellot, Thierry
AU - Cappello, Franck
N1 - This work is supported in part by the ARAMCO Services Company and the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, and Argonne National Laboratory, under contract numbers PRJ1008127, 0F-60169 (Argonne), and DE-AC02-06CH11357 (DOE). Results presented in this paper were obtained using ALCF's Theta GPU [37].
PY - 2022
Y1 - 2022
AB - While many HPC applications are known to have long runtimes, this is not always because of single large runs: in many cases, it is due to ensembles composed of many short runs (runtimes on the order of minutes). When each such run needs to checkpoint frequently (e.g., adjoint computations using a checkpoint interval on the order of milliseconds), it is important to minimize both the checkpointing overhead at each iteration and the initialization overhead. With the rising popularity of GPUs, minimizing both overheads simultaneously is challenging: while it is possible to take advantage of efficient asynchronous data transfers between GPU and host memory, this comes at the cost of the high initialization overhead needed to allocate and pin host memory. In this paper, we contribute an efficient technique to address this challenge. The key idea is an adaptive approach that delays the pinning of the host memory buffer holding the checkpoints until all of its memory pages have been touched, which greatly reduces the overhead of registering the host memory with the CUDA driver. To this end, we combine asynchronous touching of memory pages with direct writes of checkpoints to both untouched and touched memory pages, guided by performance modeling, to minimize the end-to-end checkpointing overhead. Our evaluations show a significant improvement over a variety of alternative static allocation strategies and state-of-the-art approaches.
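N1 - Illustrative sketch: the following C program (against the CUDA runtime API) is a minimal sketch of the delayed-pinning idea described in the abstract. All sizes, identifiers, and the slot layout are assumptions for illustration, not the authors' implementation, and the performance-model-driven coordination between the page-touching thread and in-flight checkpoint writes is omitted.

/* delayed_pinning.c -- sketch of delayed host-buffer pinning for
 * high-frequency GPU checkpointing. Error checks omitted for brevity.
 * Build (assumed): gcc delayed_pinning.c -lcudart -lpthread */
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE      4096UL
#define CKPT_SIZE (64UL << 20)   /* one checkpoint: 64 MiB (assumed)  */
#define SLOTS     8              /* host cache holds 8 checkpoints    */
#define BUF_SIZE  (CKPT_SIZE * SLOTS)

static char *host_buf;           /* pageable at first, pinned later   */
static atomic_int all_touched;

/* Touch every page once in the background: pre-faulted pages make the
 * eventual cudaHostRegister() call far cheaper, because the driver then
 * only locks already-resident pages instead of populating them. (A real
 * implementation also coordinates with in-flight checkpoint writes;
 * this sketch ignores that race.) */
static void *toucher(void *arg) {
    (void)arg;
    for (size_t off = 0; off < BUF_SIZE; off += PAGE)
        host_buf[off] = 0;
    atomic_store(&all_touched, 1);
    return NULL;
}

int main(void) {
    host_buf = malloc(BUF_SIZE);
    pthread_t t;
    pthread_create(&t, NULL, toucher, NULL);

    float *dev_state;
    cudaMalloc((void **)&dev_state, CKPT_SIZE);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int pinned = 0;
    for (int iter = 0; iter < 1000; iter++) {
        char *slot = host_buf + (size_t)(iter % SLOTS) * CKPT_SIZE;

        if (!pinned && atomic_load(&all_touched)) {
            /* All pages resident: pin the whole cache once, then switch
             * to truly asynchronous device-to-host transfers. */
            cudaHostRegister(host_buf, BUF_SIZE, cudaHostRegisterDefault);
            pinned = 1;
        }

        if (pinned) {
            /* Pinned path: the copy overlaps with compute on `stream`. */
            cudaMemcpyAsync(slot, dev_state, CKPT_SIZE,
                            cudaMemcpyDeviceToHost, stream);
        } else {
            /* Pageable path: a synchronous copy whose writes also touch
             * the slot's pages, so early checkpoints double as touches. */
            cudaMemcpy(slot, dev_state, CKPT_SIZE, cudaMemcpyDeviceToHost);
        }
        /* ... forward/adjoint computation for the next iteration ... */
    }

    cudaStreamSynchronize(stream);
    pthread_join(t, NULL);
    if (pinned) cudaHostUnregister(host_buf);
    cudaFree(dev_state);
    free(host_buf);
    return 0;
}

The design point, as the abstract describes it: pinning a cold multi-hundred-MiB buffer up front stalls short runs at initialization, whereas deferring registration until the pages are resident spreads that cost across early iterations, which still make progress through pageable copies.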
KW - GPU checkpointing
KW - fast initialization
KW - multi-level caching
UR - http://www.scopus.com/inward/record.url?scp=85158109101&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85158109101&partnerID=8YFLogxK
U2 - 10.1109/HiPC56025.2022.00043
DO - 10.1109/HiPC56025.2022.00043
M3 - Conference contribution
AN - SCOPUS:85158109101
T3 - Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics, HiPC 2022
SP - 262
EP - 271
BT - Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics, HiPC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th Annual IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2022
Y2 - 18 December 2022 through 21 December 2022
ER -