TY - GEN
T1 - ExTensor: An Accelerator for Sparse Tensor Algebra
T2 - 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019
AU - Hegde, Kartik
AU - Asghari-Moghaddam, Hadi
AU - Pellauer, Michael
AU - Crago, Neal
AU - Jaleel, Aamer
AU - Solomonik, Edgar
AU - Emer, Joel
AU - Fletcher, Christopher W.
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10/12
Y1 - 2019/10/12
N2 - Generalized tensor algebra is a prime candidate for acceleration via customized ASICs. Modern tensors feature a wide range of data sparsity, with the density of non-zero elements ranging from 10^-6% to 50%. This paper proposes a novel approach to accelerate tensor kernels based on the principle of hierarchical elimination of computation in the presence of sparsity. This approach relies on rapidly finding intersections, i.e., situations where both operands of a multiplication are non-zero, enabling new data fetching mechanisms and avoiding memory latency overheads associated with sparse kernels implemented in software. We propose the ExTensor accelerator, which builds these novel ideas on handling sparsity into hardware to enable better bandwidth utilization and compute throughput. We evaluate ExTensor on several kernels relative to industry libraries (Intel MKL) and state-of-the-art tensor algebra compilers (TACO). When bandwidth normalized, we demonstrate an average speedup of 3.4x, 1.3x, 2.8x, 24.9x, and 2.7x on SpMSpM, SpMM, TTV, TTM, and SDDMM kernels, respectively, over a server-class CPU.
AB - Generalized tensor algebra is a prime candidate for acceleration via customized ASICs. Modern tensors feature a wide range of data sparsity, with the density of non-zero elements ranging from 10^-6% to 50%. This paper proposes a novel approach to accelerate tensor kernels based on the principle of hierarchical elimination of computation in the presence of sparsity. This approach relies on rapidly finding intersections, i.e., situations where both operands of a multiplication are non-zero, enabling new data fetching mechanisms and avoiding memory latency overheads associated with sparse kernels implemented in software. We propose the ExTensor accelerator, which builds these novel ideas on handling sparsity into hardware to enable better bandwidth utilization and compute throughput. We evaluate ExTensor on several kernels relative to industry libraries (Intel MKL) and state-of-the-art tensor algebra compilers (TACO). When bandwidth normalized, we demonstrate an average speedup of 3.4x, 1.3x, 2.8x, 24.9x, and 2.7x on SpMSpM, SpMM, TTV, TTM, and SDDMM kernels, respectively, over a server-class CPU.
KW - Hardware acceleration
KW - Sparse computation
KW - Tensor algebra
UR - http://www.scopus.com/inward/record.url?scp=85074449854&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074449854&partnerID=8YFLogxK
U2 - 10.1145/3352460.3358275
DO - 10.1145/3352460.3358275
M3 - Conference contribution
AN - SCOPUS:85074449854
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 319
EP - 333
BT - MICRO 2019 - 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Proceedings
PB - IEEE Computer Society
Y2 - 12 October 2019 through 16 October 2019
ER -