TY - GEN
T1 - Baechi: Fast Device Placement of Machine Learning Graphs
T2 - 11th ACM Symposium on Cloud Computing, SoCC 2020
AU - Jeon, Beomyeol
AU - Cai, Linda
AU - Srivastava, Pallavi
AU - Jiang, Jintao
AU - Ke, Xiaolan
AU - Meng, Yitao
AU - Xie, Cong
AU - Gupta, Indranil
N1 - Funding Information:
This work was supported in part by the following grants: NSF IIS 1909577 and NSF CNS 1908888, as well as by generous gifts from Capital One, Schlumberger, and Microsoft.
Publisher Copyright:
© 2020 ACM.
PY - 2020/10/12
Y1 - 2020/10/12
AB - Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or the models are large. Splitting the model graph across multiple devices, today, largely relies on learning-based approaches to generate this placement. While it results in models that train fast on data (i.e., with low step times), learning-based model-parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, where we adopt an algorithmic approach to the placement problem for running machine learning training graphs on a small cluster of memory-constrained devices. We implemented Baechi so that it works modularly with TensorFlow. Our experimental results using GPUs show that Baechi generates placement plans in time 654X-206KX faster than today's learning-based approaches, and the placed model's step time is only up to 6.2% higher than expert-based placements.
KW - TensorFlow
KW - constrained memory
KW - distributed systems
KW - machine learning systems
KW - placement algorithms
UR - http://www.scopus.com/inward/record.url?scp=85095436267&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85095436267&partnerID=8YFLogxK
U2 - 10.1145/3419111.3421302
DO - 10.1145/3419111.3421302
M3 - Conference contribution
AN - SCOPUS:85095436267
T3 - SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing
SP - 416
EP - 430
BT - SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing
PB - Association for Computing Machinery, Inc
Y2 - 19 October 2020 through 21 October 2020
ER -