TY - JOUR
T1 - Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark
AU - Coleman, Cody
AU - Kang, Daniel
AU - Narayanan, Deepak
AU - Nardi, Luigi
AU - Zhao, Tian
AU - Zhang, Jian
AU - Bailis, Peter
AU - Olukotun, Kunle
AU - Ré, Chris
AU - Zaharia, Matei
N1 - Funding Information:
We thank Jeremy Howard, the Google Cloud TPU team (including Sourabh Bajaj, Frank Chen, Brennan Saeta, and Chris Ying), and the many other teams that submitted to DAWNBench. We thank Juan Manuel Camacho, Shoumik Palkar, Kexin Rong, Keshav Santhanam, Sahaana Suri, Pratiksha Thaker, and James Thomas for their assistance in labeling. We also thank Amazon and Google for cloud credits. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Toyota Research Institute, Keysight Technologies, Amazon Web Services, Cisco, and the NSF under grants DGE-1656518, DGE-1147470, and CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/7
Y1 - 2019/7
AB - Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that consider these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently introduced DAWNBench, a benchmark competition focused on end-to-end training time to achieve near-state-of-the-art accuracy on an unseen dataset, a combined metric called time-to-accuracy (TTA). In this work, we analyze the entries from DAWNBench, which received optimized submissions from multiple industrial groups, to investigate the behavior of TTA as a metric as well as trends in the best-performing entries. We show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods. Additionally, even though DAWNBench entries were able to train ImageNet models in under 3 minutes, we find they still underutilize hardware capabilities such as Tensor Cores. Furthermore, we find that distributed entries can spend more than half of their time on communication. We show similar findings with entries to the MLPerf v0.5 benchmark.
UR - http://www.scopus.com/inward/record.url?scp=85071332053&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071332053&partnerID=8YFLogxK
U2 - 10.1145/3352020.3352024
DO - 10.1145/3352020.3352024
M3 - Article
AN - SCOPUS:85071332053
SN - 0163-5980
VL - 53
SP - 14
EP - 25
JO - Operating Systems Review (ACM)
JF - Operating Systems Review (ACM)
IS - 1
ER -