Towards Efficient Cache Allocation for High-Frequency Checkpointing

Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Amr M. Elsayed, Thierry Tonellot, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

While many HPC applications are known to have long runtimes, this is not always because of single large runs: in many cases, it is due to ensembles composed of many short runs (with runtimes on the order of minutes). When each such run needs to checkpoint frequently (e.g., adjoint computations using a checkpoint interval on the order of milliseconds), it is important to minimize both the checkpointing overhead at each iteration and the initialization overhead. With the rising popularity of GPUs, minimizing both overheads simultaneously is challenging: while it is possible to take advantage of efficient asynchronous data transfers between GPU and host memory, this comes at the cost of the high initialization overhead needed to allocate and pin host memory. In this paper, we contribute an efficient technique to address this challenge. The key idea is to use an adaptive approach that delays the pinning of the host memory buffer holding the checkpoints until all memory pages are touched, which greatly reduces the overhead of registering the host memory with the CUDA driver. To this end, we use a combination of asynchronous touching of memory pages and direct writes of checkpoints to both untouched and touched memory pages, guided by performance modeling, in order to minimize end-to-end checkpointing overheads. Our evaluations show a significant improvement over a variety of alternative static allocation strategies and state-of-the-art approaches.
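
The delayed-pinning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LazyPinnedCache`, its methods, the per-page state array, and the 4096-byte page size are all assumptions made for illustration, and the performance model that decides between touching and direct writes is omitted. So that the sketch runs without a GPU, the actual pinning step is only marked by a comment at the point where a CUDA program would typically call `cudaHostRegister`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>
#include <thread>

// Hypothetical sketch (not the authors' code): a host cache that is pinned
// only after every page has been touched, either by a background thread or
// as a side effect of a direct checkpoint write.
class LazyPinnedCache {
public:
    static constexpr std::size_t kPage = 4096;  // assumed page size

    explicit LazyPinnedCache(std::size_t bytes)
        : size_(bytes),
          pages_((bytes + kPage - 1) / kPage),
          buf_(static_cast<unsigned char*>(std::malloc(bytes))),
          state_(new std::atomic<int>[pages_]) {
        assert(bytes > 0);
        if (!buf_) throw std::bad_alloc();
        for (std::size_t p = 0; p < pages_; ++p) state_[p].store(kUntouched);
        toucher_ = std::thread([this] { touch_all(); });  // asynchronous touching
    }

    ~LazyPinnedCache() {
        if (toucher_.joinable()) toucher_.join();
        std::free(buf_);  // real code would unpin first if the buffer was pinned
    }

    // Copy one checkpoint into the cache (single writer thread assumed).
    // Pages not yet touched are claimed here; the memcpy itself touches them.
    bool write(std::size_t off, const void* src, std::size_t n) {
        if (n == 0 || off > size_ || n > size_ - off) return false;
        for (std::size_t p = off / kPage; p <= (off + n - 1) / kPage; ++p) {
            int expected = kUntouched;
            if (state_[p].compare_exchange_strong(expected, kTouched)) {
                touched_.fetch_add(1);
            } else {
                // The background thread owns this page; wait until it is done.
                while (state_[p].load(std::memory_order_acquire) != kTouched)
                    std::this_thread::yield();
            }
        }
        std::memcpy(buf_ + off, src, n);
        try_pin();
        return true;
    }

    // Wait for the background thread, then pin if every page is touched.
    bool finish() {
        if (toucher_.joinable()) toucher_.join();
        return try_pin();
    }

    bool pinned() const { return pinned_; }
    const unsigned char* data() const { return buf_; }

private:
    enum { kUntouched = 0, kTouching = 1, kTouched = 2 };

    void touch_all() {
        for (std::size_t p = 0; p < pages_; ++p) {
            int expected = kUntouched;
            if (!state_[p].compare_exchange_strong(expected, kTouching)) continue;
            buf_[p * kPage] = 0;  // the first write faults the page in
            state_[p].store(kTouched, std::memory_order_release);
            touched_.fetch_add(1);
        }
    }

    bool try_pin() {
        if (pinned_) return true;
        if (touched_.load() != pages_) return false;
        // A CUDA program would register the buffer here, e.g.
        // cudaHostRegister(buf_, size_, cudaHostRegisterDefault).
        pinned_ = true;
        return true;
    }

    std::size_t size_;
    std::size_t pages_;
    unsigned char* buf_;
    std::unique_ptr<std::atomic<int>[]> state_;
    std::atomic<std::size_t> touched_{0};
    bool pinned_ = false;  // accessed only by the writer thread
    std::thread toucher_;
};
```

In this sketch a caller would construct the cache, call `write` for each checkpoint while the background thread is still touching pages, and call `finish` when it wants to guarantee the buffer is pinned. The per-page state keeps the background thread and the checkpoint writer from writing to the same page concurrently.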

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics, HiPC 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 262-271
Number of pages: 10
ISBN (Electronic): 9781665494236
State: Published - 2022
Externally published: Yes
Event: 29th Annual IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2022 - Bangalore, India
Duration: Dec 18 2022 - Dec 21 2022

Publication series

Name: Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics, HiPC 2022

Conference

Conference: 29th Annual IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2022
Country/Territory: India
City: Bangalore
Period: 12/18/22 - 12/21/22

Keywords

  • GPU checkpointing
  • fast initialization
  • multi-level caching

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Control and Optimization
