GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching

Avinash Maurya, M. Mustafa Rafique, Thierry Tonellot, Hussain J. Alsalem, Franck Cappello, Bogdan Nicolae

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Checkpointing is an I/O-intensive operation increasingly used by High-Performance Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike resilience, where only the last checkpoint is needed for application restart and is rarely read except to recover from failures, this scenario requires optimizing frequent reads and writes of an entire history of checkpoints. State-of-the-art checkpointing approaches often rely on asynchronous multi-level techniques to hide I/O overheads by writing to fast local tiers (e.g., an SSD) and asynchronously flushing to slower, potentially remote tiers (e.g., a parallel file system) in the background while the application keeps running. However, such approaches have two limitations. First, although HPC infrastructures routinely rely on accelerators (e.g., GPUs), and the majority of checkpoints therefore involve GPU memory, efficient asynchronous data movement between GPU memory and host memory is lagging behind. Second, revisiting previous data often involves predictable access patterns, which are not exploited to accelerate read operations. In this paper, we address these limitations by proposing a scalable, asynchronous multi-level checkpointing approach optimized for both reading and writing an arbitrarily long history of checkpoints. Our approach treats GPU memory as a first-class citizen in the multi-level storage hierarchy to enable informed caching and prefetching of checkpoints, leveraging foreknowledge about the access order passed by the application as hints. Our evaluation using a variety of scenarios under I/O concurrency shows up to 74× faster checkpoint and restore throughput compared with state-of-the-art runtime and optimized unified virtual memory (UVM) based prefetching strategies, and at least 2× shorter I/O wait time for the application across various workloads and configurations.
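A minimal CUDA sketch may help make the two mechanisms described in the abstract concrete: an asynchronous device-to-host staging path that overlaps the flush to a slower tier with computation, and a hint-driven prefetch that loads the next checkpoint the application announces it will read. This is an illustration under assumptions, not the paper's implementation; the helper names checkpoint_async and prefetch_hint are hypothetical.

```cpp
// Sketch only: asynchronous multi-level checkpoint staging and hint-driven
// prefetching. Helper names are hypothetical, not the paper's API.
#include <cuda_runtime.h>
#include <cstdio>
#include <string>
#include <thread>

// Stage `bytes` of device data into pinned host memory on a dedicated copy
// stream, then flush to `path` (the slower tier) from a background thread
// while the compute stream keeps running.
void checkpoint_async(const void* dev_buf, size_t bytes, const char* path,
                      cudaStream_t copy_stream) {
    void* host_buf = nullptr;
    cudaMallocHost(&host_buf, bytes);       // pinned memory enables a true async D2H copy
    cudaMemcpyAsync(host_buf, dev_buf, bytes,
                    cudaMemcpyDeviceToHost, copy_stream);
    std::thread([=, p = std::string(path)] {
        cudaStreamSynchronize(copy_stream); // wait only for the staging copy
        if (FILE* f = fopen(p.c_str(), "wb")) {
            fwrite(host_buf, 1, bytes, f);  // background flush to the slower tier
            fclose(f);
        }
        cudaFreeHost(host_buf);
    }).detach();
}

// Hint-driven prefetch: the application announces the checkpoint it will read
// next, so the runtime can pull it into GPU memory ahead of the restore.
void prefetch_hint(void* dev_cache, size_t bytes, const char* next_path,
                   cudaStream_t copy_stream) {
    std::thread([=, p = std::string(next_path)] {
        void* host_buf = nullptr;
        cudaMallocHost(&host_buf, bytes);
        if (FILE* f = fopen(p.c_str(), "rb")) {
            fread(host_buf, 1, bytes, f);   // slower tier -> host cache
            fclose(f);
        }
        cudaMemcpyAsync(dev_cache, host_buf, bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream); // copy done; staging buffer can be freed
        cudaFreeHost(host_buf);
    }).detach();
}
```

At restore time, the caller would issue prefetch_hint for the next checkpoint in its known access order and synchronize copy_stream only when the data is actually needed, so the slow-tier read overlaps with ongoing computation.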

Original language: English (US)
Title of host publication: HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery
Pages: 73-85
Number of pages: 13
ISBN (Electronic): 9798400701559
State: Published - Aug 7, 2023
Externally published: Yes
Event: 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2023 - Orlando, United States
Duration: Jun 16, 2023 - Jun 23, 2023

Publication series

Name: HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2023
Country/Territory: United States
City: Orlando
Period: 6/16/23 - 6/23/23

Keywords

  • asynchronous multi-level checkpointing
  • graphics processing unit (GPU)
  • hierarchical cache management
  • high-performance computing (HPC)
  • prefetching

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Safety, Risk, Reliability and Quality
  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
