TY - GEN
T1 - Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing
AU - Maurya, Avinash
AU - Nicolae, Bogdan
AU - Rafique, M. Mustafa
AU - Tonellot, Thierry
AU - Cappello, Franck
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Efficient checkpointing of distributed data structures periodically at key moments during runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multi-level checkpointing is a popular technique: distributed processes can write their shard of the data independently to fast local storage tiers, then flush asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g., GPU memory) and the increasing checkpoint frequency, the processes often run out of space and need to fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied in order to reduce the checkpoint sizes. Unfortunately, this reduction is not uniform: some processes will have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study the problem of how to leverage this imbalance in order to reduce I/O overheads for multi-level checkpointing. To this end, we solve an optimization problem of how much data to send from each process that runs out of space to the processes that have spare capacity in order to minimize the amount of time spent blocking in I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multi-level checkpointing.
KW - GPU checkpointing
KW - asynchronous I/O
KW - multi-level checkpointing
KW - peer-to-peer collaborative caching
UR - http://www.scopus.com/inward/record.url?scp=85123191382&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123191382&partnerID=8YFLogxK
U2 - 10.1109/MASCOTS53633.2021.9614284
DO - 10.1109/MASCOTS53633.2021.9614284
M3 - Conference contribution
AN - SCOPUS:85123191382
T3 - Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
BT - Proceedings - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
PB - IEEE Computer Society
T2 - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Y2 - 3 November 2021 through 5 November 2021
ER -