TY - GEN
T1 - DeepClone
T2 - 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020
AU - Nicolae, Bogdan
AU - Wozniak, Justin M.
AU - Dorier, Matthieu
AU - Cappello, Franck
N1 - Funding Information:
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357 and Argonne National Laboratory, under Contract LDRD-1007397.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/9
Y1 - 2020/9
N2 - Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. 'fork' the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest to improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches is making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.
AB - Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. 'fork' the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest to improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches is making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.
KW - data-parallel training
KW - deep learning
KW - layer-wise parallelism
KW - model cloning
KW - state replication
UR - http://www.scopus.com/inward/record.url?scp=85096205846&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096205846&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER49012.2020.00033
DO - 10.1109/CLUSTER49012.2020.00033
M3 - Conference contribution
AN - SCOPUS:85096205846
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 226
EP - 236
BT - Proceedings - 2020 IEEE International Conference on Cluster Computing, CLUSTER 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 September 2020 through 17 September 2020
ER -