Doing more with less: Training large DNN models on commodity servers for the masses

Youjie Li, Amar Phanishayee, Derek Murray, Nam Sung Kim

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only the elite who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments.
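To make the swapping overhead concrete, the following is a minimal, hypothetical sketch (not the paper's system, and not a real GPU runtime): it models virtualized GPU memory as an LRU-managed store of fixed capacity, where tensors that do not fit are swapped out to a CPU-side store and swapped back in on access. The `SwapManager` class, tensor names, and capacity are all illustrative assumptions; counting swap events shows how a working set larger than GPU capacity causes swaps on nearly every access.

```python
# Hypothetical sketch: simulates virtualized GPU memory where tensors
# beyond "GPU" capacity are swapped to "CPU" memory (plain Python, no GPU).
from collections import OrderedDict

class SwapManager:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity   # max tensors resident on "GPU"
        self.gpu = OrderedDict()           # name -> data, kept in LRU order
        self.cpu = {}                      # swapped-out tensors
        self.swaps = 0                     # total swap events (the overhead)

    def access(self, name, data=None):
        if name in self.gpu:               # hit: refresh LRU position
            self.gpu.move_to_end(name)
            return self.gpu[name]
        if name in self.cpu:               # miss: swap the tensor back in
            data = self.cpu.pop(name)
            self.swaps += 1
        if len(self.gpu) >= self.gpu_capacity:
            # evict the least-recently-used tensor to CPU memory
            old_name, old_data = self.gpu.popitem(last=False)
            self.cpu[old_name] = old_data
            self.swaps += 1
        self.gpu[name] = data
        return data

# A toy "training loop" whose working set (3 tensors) exceeds GPU capacity (2):
mgr = SwapManager(gpu_capacity=2)
for step in range(3):
    for t in ("weights", "activations", "gradients"):
        mgr.access(t, data=[0.0] * 4)
print("swap events:", mgr.swaps)   # prints: swap events: 13
```

After the first iteration warms the stores, every subsequent access triggers both a swap-in and an eviction, which is the pathological behavior the abstract argues smarter scheduling and data movement should avoid.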

Original language: English (US)
Title of host publication: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
Publisher: Association for Computing Machinery, Inc
Pages: 119-127
Number of pages: 9
ISBN (Electronic): 9781450384384
DOIs
State: Published - Jun 1 2021
Event: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021 - Virtual, Online, United States
Duration: Jun 1 2021 - Jun 3 2021

Publication series

Name: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems

Conference

Conference: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/1/21 - 6/3/21

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture

