DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Bogdan Nicolae, Justin M. Wozniak, Matthieu Dorier, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e., to 'fork' the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest to improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches is making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.
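The abstract's core idea, sharding a training instance's tensors and copying the shards asynchronously so the source instance stalls only briefly, can be illustrated with a minimal sketch. All names below are hypothetical and do not reflect DeepClone's actual API; this simply shows, under the assumption of a flat in-memory tensor, how per-shard copies can proceed concurrently and be reassembled into an identical clone:

```python
# Hypothetical sketch of asynchronous shard-and-reconstruct cloning.
# Names (clone_tensor_sharded, num_shards) are illustrative only.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def clone_tensor_sharded(tensor, num_shards=4):
    """Copy a flat view of `tensor` shard by shard, concurrently,
    then reassemble the shards into an identical clone."""
    flat = tensor.ravel()
    # Shard boundaries over the flattened tensor.
    bounds = np.linspace(0, flat.size, num_shards + 1, dtype=int)
    clone = np.empty_like(flat)

    def copy_shard(i):
        lo, hi = bounds[i], bounds[i + 1]
        clone[lo:hi] = flat[lo:hi]  # each shard is copied independently

    # Shards are copied in parallel; the source only needs to be
    # quiescent long enough for each shard to be snapshotted.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        list(pool.map(copy_shard, range(num_shards)))
    return clone.reshape(tensor.shape)

model_state = np.arange(12.0).reshape(3, 4)
cloned = clone_tensor_sharded(model_state)
```

In the paper's setting the shards would additionally be reconstructed on the clone's own ranks, overlapping transfer with ongoing training rather than copying within one address space.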

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE International Conference on Cluster Computing, CLUSTER 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 226-236
Number of pages: 11
ISBN (Electronic): 9781728166773
DOIs
State: Published - Sep 2020
Externally published: Yes
Event: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020 - Kobe, Japan
Duration: Sep 14, 2020 – Sep 17, 2020

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2020-September
ISSN (Print): 1552-5244

Conference

Conference: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020
Country/Territory: Japan
City: Kobe
Period: 9/14/20 – 9/17/20

Keywords

  • data-parallel training
  • deep learning
  • layer-wise parallelism
  • model cloning
  • state replication

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
