ModelKeeper: Accelerating DNN Training via Automated Training Warmup

Fan Lai, Yinwei Dai, Harsha V. Madhyastha, Mosharaf Chowdhury

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With growing deployment of machine learning (ML) models, ML developers are training or re-training increasingly more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities. We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously trained models in a shared cluster. Our key insight is that initializing a training job's model by transforming an already-trained model's weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×-4.3× faster training completion with little overhead and no reduction in model accuracy.
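
As a rough illustration of the warm-start idea described above (and not ModelKeeper's actual matching or structure-aware transformation algorithm), the following PyTorch sketch copies weights from a hypothetical trained parent model into a new child model wherever parameter names and shapes coincide, leaving unmatched layers at their fresh initialization. The models and the warm_start helper are illustrative assumptions; ModelKeeper itself matches and transforms weights across architectures that differ in width and depth.

    # Illustrative sketch only: naive name/shape-matched weight transfer,
    # not ModelKeeper's structure-aware transformation.
    import torch
    import torch.nn as nn


    def warm_start(child: nn.Module, parent_state: dict) -> int:
        """Copy parent weights into the child where name and shape agree.

        Returns the number of tensors that were transferred; all other
        child parameters keep their fresh initialization.
        """
        child_state = child.state_dict()
        transferred = 0
        for name, tensor in parent_state.items():
            if name in child_state and child_state[name].shape == tensor.shape:
                child_state[name] = tensor.clone()
                transferred += 1
        child.load_state_dict(child_state)
        return transferred


    if __name__ == "__main__":
        # Hypothetical parent and child models: the child adds one hidden layer,
        # so only the layers with identical names and shapes are warm-started.
        parent = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
        child = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64),
                              nn.ReLU(), nn.Linear(64, 10))
        n = warm_start(child, parent.state_dict())
        print(f"warm-started {n} tensors from the parent model")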

Original language: English (US)
Title of host publication: Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023
Publisher: USENIX Association
Pages: 769-785
Number of pages: 17
ISBN (Electronic): 9781939133335
State: Published - 2023
Externally published: Yes
Event: 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 - Boston, United States
Duration: Apr 17, 2023 - Apr 19, 2023

Publication series

Name: Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023

Conference

Conference: 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023
Country/Territory: United States
City: Boston
Period: 4/17/23 - 4/19/23

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Control and Systems Engineering
