Aluminum

An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems

Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Andy Yoo, Marc Snir, Brian Van Essen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We identify communication as a major bottleneck for training deep neural networks on large-scale GPU clusters, taking over 10x as long as computation. To reduce this overhead, we discuss techniques to overlap communication and computation as much as possible. This leads to much of the communication being latency-bound instead of bandwidth-bound, and we find that using a combination of latency- and bandwidth-optimized allreduce algorithms significantly reduces communication costs. We also discuss a semantic mismatch between MPI and CUDA that increases overheads and limits asynchrony, and propose a solution that enables communication to be aware of CUDA streams. We implement these optimizations in the open-source Aluminum communication library, enabling optimized, asynchronous, GPU-aware communication. Aluminum demonstrates improved performance in benchmarks and end-to-end training of deep networks, for both strong and weak scaling.
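
The abstract's central technical point is the semantic mismatch between MPI and CUDA: MPI calls are not ordered with respect to work queued on a CUDA stream, so even nonblocking collectives force a host-side stream synchronization before communication can begin. The minimal sketch below illustrates that pattern; it is not Aluminum's actual API, and it assumes a CUDA-aware MPI implementation that accepts device pointers, one GPU per rank, and a hypothetical compute_gradients kernel standing in for backpropagation.

// Illustrative sketch only (not Aluminum's API): the MPI/CUDA semantic
// mismatch described in the abstract. Assumes a CUDA-aware MPI that accepts
// device pointers and one GPU per rank; compute_gradients is a hypothetical
// stand-in for the real gradient computation.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void compute_gradients(float* grad, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) grad[i] *= 0.5f;  // placeholder for the real gradient update
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  const int n = 1 << 20;               // one layer's gradient buffer
  float* d_grad = nullptr;
  cudaMalloc(&d_grad, n * sizeof(float));
  cudaMemset(d_grad, 0, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Produce this layer's gradients asynchronously on the stream.
  compute_gradients<<<(n + 255) / 256, 256, 0, stream>>>(d_grad, n);

  // Semantic mismatch: MPI_Iallreduce is asynchronous on the host, but it is
  // not ordered after the kernel on the CUDA stream, so the host must block
  // here until the kernel finishes before the buffer is safe to communicate.
  cudaStreamSynchronize(stream);

  MPI_Request req;
  MPI_Iallreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                 MPI_COMM_WORLD, &req);

  // ... computation for other layers could be launched here to overlap ...

  MPI_Wait(&req, MPI_STATUS_IGNORE);

  // A stream-aware library (the approach the paper describes) would instead
  // enqueue the allreduce against `stream`, removing the host-side
  // synchronization above and preserving asynchrony.

  cudaStreamDestroy(stream);
  cudaFree(d_grad);
  MPI_Finalize();
  return 0;
}

To build, compile with nvcc against an MPI installation that supports device pointers; the exact compiler wrappers and flags vary by system.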

Original language: English (US)
Title of host publication: Proceedings of MLHPC 2018
Subtitle of host publication: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-13
Number of pages: 13
ISBN (Electronic): 9781728101804
DOIs: https://doi.org/10.1109/MLHPC.2018.8638639
State: Published - Feb 8 2019
Event: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 - Dallas, United States
Duration: Nov 12 2018 → …

Publication series

Name: Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Country: United States
City: Dallas
Period: 11/12/18 → …

Keywords

  • Collective algorithms
  • Communication optimization
  • Deep learning
  • HPC
  • Machine learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications

Cite this

Dryden, N., Maruyama, N., Moon, T., Benson, T., Yoo, A., Snir, M., & Van Essen, B. (2019). Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems. In Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-13). [8638639] (Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/MLHPC.2018.8638639
