HAL: Computer System for Scalable Deep Learning

Volodymyr Kindratenko, Dawei Mu, Yan Zhan, John Maloney, Sayed Hadi Hashemi, Benjamin Rabe, Ke Xu, Roy Campbell, Jian Peng, William Gropp

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We describe the design, deployment and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack that combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users and provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.

Original language: English (US)
Title of host publication: PEARC 2020 - Practice and Experience in Advanced Research Computing 2020
Subtitle of host publication: Catch the Wave
Publisher: Association for Computing Machinery
Pages: 41-48
Number of pages: 8
ISBN (Electronic): 9781450366892
State: Published - Jul 26 2020
Event: 2020 Conference on Practice and Experience in Advanced Research Computing: Catch the Wave, PEARC 2020 - Virtual, Online, United States
Duration: Jul 27 2020 → Jul 31 2020

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2020 Conference on Practice and Experience in Advanced Research Computing: Catch the Wave, PEARC 2020
Country: United States
City: Virtual, Online
Period: 7/27/20 → 7/31/20

Keywords

  • cluster architecture
  • deep learning
  • high-performance computing

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
