TY - GEN
T1 - Accelerating reduction and scan using tensor core units
AU - Dakkak, Abdul
AU - Li, Cheng
AU - Xiong, Jinjun
AU - Gelado, Isaac
AU - Hwu, Wen-mei
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/6/26
Y1 - 2019/6/26
AB - Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4×4 or 16×16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent and promise increases in performance and/or energy efficiency, they suffer from over-specialization, as only matrix multiplication on small matrices is supported. In this paper, we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and the first to show the benefits of this mapping in terms of program simplicity, efficiency, and performance. We implemented the reduction and scan algorithms using NVIDIA's V100 TCUs and achieved 89%–98% of peak memory copy bandwidth. Our results are orders of magnitude faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes (common in HPC and deep learning applications). Our implementation achieves this speedup while decreasing power consumption by up to 22% for reduction and 16% for scan.
UR - http://www.scopus.com/inward/record.url?scp=85074468758&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074468758&partnerID=8YFLogxK
U2 - 10.1145/3330345.3331057
DO - 10.1145/3330345.3331057
M3 - Conference contribution
AN - SCOPUS:85074468758
T3 - Proceedings of the International Conference on Supercomputing
SP - 46
EP - 57
BT - ICS 2019 - International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 33rd ACM International Conference on Supercomputing, ICS 2019, held in conjunction with the Federated Computing Research Conference, FCRC 2019
Y2 - 26 June 2019
ER -