Accelerating reduction and scan using tensor core units

Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, Wen Mei Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent and promise increases in performance and/or energy efficiency, they suffer from over-specialization, as only matrix multiplication on small matrices is supported. In this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and is the first to show the benefits of this mapping in terms of program simplicity, efficiency, and performance. We implemented the reduction and scan algorithms using NVIDIA's V100 TCUs and achieved 89%–98% of peak memory copy bandwidth. Our results are orders of magnitude faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes (common in HPC and deep learning applications). Our implementation achieves this speedup while decreasing power consumption by up to 22% for reduction and 16% for scan.
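The core idea of the paper — that reduction and scan are expressible as matrix multiplications — can be illustrated in a few lines. The sketch below is a hedged NumPy analogue, not the authors' CUDA/WMMA kernels: the segment size `N = 16` is chosen to match a typical TCU fragment dimension, and the matrices of ones stand in for the constant fragments a TCU implementation would multiply against.

```python
import numpy as np

# Illustration only: reduction and inclusive scan of one segment,
# each expressed purely as a matrix multiplication (the operation a
# Tensor Core Unit natively accelerates).

N = 16  # segment size, matching a typical 16x16 TCU fragment
x = np.arange(1, N + 1, dtype=np.float32)  # segment [1, 2, ..., 16]

# Reduction: multiplying by a row of ones sums the segment.
ones_row = np.ones((1, N), dtype=np.float32)
total = (ones_row @ x.reshape(N, 1))[0, 0]

# Inclusive scan: multiplying by a lower-triangular matrix of ones
# makes row i select elements 0..i, so (L @ x)[i] is the prefix sum.
L = np.tril(np.ones((N, N), dtype=np.float32))
prefix = L @ x

print(total)       # 136.0 (= 16 * 17 / 2)
print(prefix[:4])  # [ 1.  3.  6. 10.]
```

On a TCU, both multiplications run on fixed-function matrix hardware, which is how the paper recovers near-peak memory bandwidth for operations that are otherwise memory-bound scalar loops.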

Original language: English (US)
Title of host publication: ICS 2019 - International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 46-57
Number of pages: 12
ISBN (Electronic): 9781450360791
DOI: https://doi.org/10.1145/3330345.3331057
State: Published - Jun 26 2019
Event: 33rd ACM International Conference on Supercomputing, ICS 2019, held in conjunction with the Federated Computing Research Conference, FCRC 2019 - Phoenix, United States
Duration: Jun 26 2019 → …

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 33rd ACM International Conference on Supercomputing, ICS 2019, held in conjunction with the Federated Computing Research Conference, FCRC 2019
Country: United States
City: Phoenix
Period: 6/26/19 → …

Fingerprint

Tensors
Energy efficiency
Electric power utilization
Bandwidth
Data storage equipment
Deep learning

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Dakkak, A., Li, C., Xiong, J., Gelado, I., & Hwu, W. M. (2019). Accelerating reduction and scan using tensor core units. In ICS 2019 - International Conference on Supercomputing (pp. 46-57). (Proceedings of the International Conference on Supercomputing). Association for Computing Machinery. https://doi.org/10.1145/3330345.3331057

