VeloC: Towards high performance adaptive asynchronous checkpointing at large scale

Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
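
The asynchronous pattern described in the abstract (write to fast local storage, then flush to external storage in the background) can be illustrated with a minimal Python sketch. This is not the VeloC API: the class `AsyncCheckpointer`, its methods, and the two-directory layout are hypothetical names chosen for illustration, and the sketch leaves out the adaptive choice of local storage device that the paper proposes.

```python
import queue
import shutil
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Toy asynchronous checkpointer (hypothetical; not the VeloC API)."""

    def __init__(self, local_dir, external_dir):
        # local_dir stands in for node-local storage, external_dir for
        # a parallel file system; both are assumptions of this sketch.
        self.local_dir = Path(local_dir)
        self.external_dir = Path(external_dir)
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self.external_dir.mkdir(parents=True, exist_ok=True)
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        """Block only for the local write, then hand off the flush."""
        local_path = self.local_dir / name
        local_path.write_bytes(data)
        self._pending.put(local_path)
        return local_path

    def _flush_loop(self):
        # Background thread: copy each local checkpoint to external storage.
        while True:
            local_path = self._pending.get()
            try:
                if local_path is None:  # sentinel queued by close()
                    return
                shutil.copy2(local_path, self.external_dir / local_path.name)
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every queued flush has completed."""
        self._pending.join()

    def close(self):
        """Stop the background thread after pending flushes are done."""
        self._pending.put(None)
        self._worker.join()
```

In this sketch the application pays only for the local write inside `checkpoint()`; a synchronous variant would instead write directly to `external_dir` and block until that write finished. Calling `wait()` before exit ensures that all checkpoints have reached external storage.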

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 911-920
Number of pages: 10
ISBN (Electronic): 9781728112466
State: Published - May 2019
Externally published: Yes
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: May 20 2019 to May 24 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country/Territory: Brazil
City: Rio de Janeiro
Period: 5/20/19 to 5/24/19

Keywords

  • Adaptive multilevel asynchronous I/O
  • Checkpoint-restart
  • Immutable data
  • Parallel I/O

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management
