Baechi: Fast device placement of machine learning graphs

Beomyeol Jeon, Linda Cai, Pallavi Srivastava, Jintao Jiang, Xiaolan Ke, Yitao Meng, Cong Xie, Indranil Gupta

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Machine learning graphs (or models) can be challenging or impossible to train when devices have limited memory or the models are large. Splitting a model graph across multiple devices today largely relies on learning-based approaches to generate the placement. While these approaches produce models that train fast on data (i.e., with low step times), learning-based model parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, which adopts an algorithmic approach to the placement problem for running machine learning training graphs on a small cluster of memory-constrained devices. We implemented Baechi so that it works modularly with TensorFlow. Our experimental results using GPUs show that Baechi generates placement plans 654X to 206KX faster than today's learning-based approaches, and that the placed model's step time is at most 6.2% higher than that of expert-based placements.

Original language: English (US)
Title of host publication: SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 416-430
Number of pages: 15
ISBN (Electronic): 9781450381376
State: Published - Oct 12 2020
Event: 11th ACM Symposium on Cloud Computing, SoCC 2020 - Virtual, Online, United States
Duration: Oct 19 2020 to Oct 21 2020

Publication series

Name: SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing

Conference

Conference: 11th ACM Symposium on Cloud Computing, SoCC 2020
Country: United States
City: Virtual, Online
Period: 10/19/20 to 10/21/20

Keywords

  • TensorFlow
  • constrained memory
  • distributed systems
  • machine learning systems
  • placement algorithms

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Artificial Intelligence
