TY - GEN
T1 - DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
T2 - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
AU - Nicolae, Bogdan
AU - Li, Jiali
AU - Wozniak, Justin M.
AU - Bosilca, George
AU - Dorier, Matthieu
AU - Cappello, Franck
N1 - Funding Information:
This research was funded by Argonne National Laboratory, under Contract LDRD-1007397. It used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - In the age of big data, deep learning has emerged as a powerful tool to extract insight from data and exploit its value, both in industry and in scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, which is needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with the training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques that hide the overhead of serialization and I/O and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
AB - In the age of big data, deep learning has emerged as a powerful tool to extract insight from data and exploit its value, both in industry and in scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, which is needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with the training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques that hide the overhead of serialization and I/O and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
KW - checkpointing
KW - deep learning
KW - fine-grain asynchronous I/O
KW - multi-level data persistence
UR - http://www.scopus.com/inward/record.url?scp=85089102687&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089102687&partnerID=8YFLogxK
U2 - 10.1109/CCGrid49817.2020.00-76
DO - 10.1109/CCGrid49817.2020.00-76
M3 - Conference contribution
AN - SCOPUS:85089102687
T3 - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
SP - 172
EP - 181
BT - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
A2 - Lefevre, Laurent
A2 - Varela, Carlos A.
A2 - Pallis, George
A2 - Toosi, Adel N.
A2 - Rana, Omer
A2 - Buyya, Rajkumar
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 11 May 2020 through 14 May 2020
ER -