Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing

Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Thierry Tonellot, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Efficient checkpointing of distributed data structures periodically at key moments during runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multi-level checkpointing is a popular technique: distributed processes can write their shard of the data independently to fast local storage tiers, then flush asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g., GPU memory) and the increasing checkpoint frequency, the processes often run out of space and need to fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied in order to reduce the checkpoint sizes. Unfortunately, this reduction is not uniform: some processes will have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study the problem of how to leverage this imbalance in order to reduce I/O overheads for multi-level checkpointing. To this end, we solve an optimization problem of how much data to send from each process that runs out of space to the processes that have spare capacity in order to minimize the amount of time spent blocking in I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multi-level checkpointing.
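The redistribution problem described in the abstract can be illustrated with a small sketch. This is not the authors' greedy algorithm, whose details are not given on this page; it is a hypothetical heuristic that assumes every process has the same fast-tier capacity, treats all peer-to-peer transfers as equally cheap, and ignores bandwidth and timing. The function name `greedy_offload` and its parameters are illustrative assumptions.

```python
import heapq


def greedy_offload(sizes, capacity):
    """Pair processes that overflow their fast tier with peers that have room.

    sizes    -- compressed checkpoint size per process (hypothetical units)
    capacity -- fast-tier capacity, assumed identical for every process
    Returns (transfers, leftover): transfers is a list of (src, dst, amount);
    leftover maps each process to the amount it must still write to the slow
    tier with a blocking write.
    """
    # Max-heaps via negated amounts; the process index breaks ties.
    overflow = [(-(s - capacity), i) for i, s in enumerate(sizes) if s > capacity]
    spare = [(-(capacity - s), i) for i, s in enumerate(sizes) if s < capacity]
    heapq.heapify(overflow)
    heapq.heapify(spare)

    transfers = []
    while overflow and spare:
        need, src = heapq.heappop(overflow)
        room, dst = heapq.heappop(spare)
        need, room = -need, -room
        amount = min(need, room)
        transfers.append((src, dst, amount))
        # Re-queue whichever side still has data to place or room to offer.
        if need > amount:
            heapq.heappush(overflow, (-(need - amount), src))
        if room > amount:
            heapq.heappush(spare, (-(room - amount), dst))

    # Anything still queued as overflow cannot be absorbed by any peer.
    leftover = {src: -need for need, src in overflow}
    return transfers, leftover


if __name__ == "__main__":
    # Four processes with unequal compressed sizes and a capacity of 10 each.
    print(greedy_offload([12, 4, 9, 15], capacity=10))
```

Matching the largest overflow to the largest spare slot keeps the number of transfers small, but it does not account for link costs; a flow-based formulation such as the one the abstract mentions would be the natural way to model those.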

Original language: English (US)
Title of host publication: Proceedings - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665458382
State: Published - 2021
Externally published: Yes
Event: 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021 - Houston, United States
Duration: Nov 3 2021 → Nov 5 2021

Publication series

Name: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
ISSN (Print): 1526-7539

Conference

Conference: 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Country/Territory: United States
City: Houston
Period: 11/3/21 → 11/5/21

Keywords

  • GPU checkpointing
  • asynchronous I/O
  • multi-level checkpointing
  • peer-to-peer collaborative caching

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Networks and Communications
  • Software
  • Modeling and Simulation
