TY - GEN
T1 - HAL: Computer System for Scalable Deep Learning
T2 - 2020 Conference on Practice and Experience in Advanced Research Computing: Catch the Wave, PEARC 2020
AU - Kindratenko, Volodymyr
AU - Mu, Dawei
AU - Zhan, Yan
AU - Maloney, John
AU - Hashemi, Sayed Hadi
AU - Rabe, Benjamin
AU - Xu, Ke
AU - Campbell, Roy
AU - Peng, Jian
AU - Gropp, William
N1 - Funding Information:
This work is supported by the National Science Foundation's Major Research Instrumentation program, grant No. 1725729, as well as the University of Illinois at Urbana-Champaign. We would like to thank Daniel Lapine from NCSA, Steve Zehner, Trish Froeschle, Thomas Prokop, Jamie Syptak, and Terry Leatherland from IBM for their expert advice and invaluable contributions to the development and operation of the system. We would also like to thank our student intern Nishant Dash for his contribution to the development of the swsuite software.
Funding Information:
In 2017, the National Center for Supercomputing Applications (NCSA) was funded by the National Science Foundation's (NSF) Major Research Instrumentation (MRI) program to develop and deploy a computational "instrument" for supporting deep learning (DL) applications at scale. The main motivation for building such a system was an apparent lack of sufficient computational resources on the University campus designated to support a growing number of researchers applying DL methodology in their work. We surveyed the campus research community and identified over 30 faculty actively applying DL who struggled to find adequate computing resources to train deep neural networks (DNNs). A typical mode of operation was to use a student-managed workstation outfitted with one or two consumer-grade NVIDIA GPUs running a sub-optimal software stack. These resources were inadequate, as even simple networks of any practical use required days or weeks to train.
Publisher Copyright:
© 2020 ACM.
PY - 2020/7/26
Y1 - 2020/7/26
N2 - We describe the design, deployment, and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users and provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.
AB - We describe the design, deployment, and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users and provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.
KW - cluster architecture
KW - deep learning
KW - high-performance computing
UR - http://www.scopus.com/inward/record.url?scp=85089267739&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089267739&partnerID=8YFLogxK
U2 - 10.1145/3311790.3396649
DO - 10.1145/3311790.3396649
M3 - Conference contribution
AN - SCOPUS:85089267739
T3 - ACM International Conference Proceeding Series
SP - 41
EP - 48
BT - PEARC 2020 - Practice and Experience in Advanced Research Computing 2020
PB - Association for Computing Machinery
Y2 - 27 July 2020 through 31 July 2020
ER -