TY - JOUR
T1 - The next 700 accelerated layers
T2 - From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically
AU - Vasilache, Nicolas
AU - Zinenko, Oleksandr
AU - Theodoridis, Theodoros
AU - Goyal, Priya
AU - DeVito, Zachary
AU - Moses, William S.
AU - Verdoolaege, Sven
AU - Adams, Andrew
AU - Cohen, Albert
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10
Y1 - 2019/10
AB - Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The performance is comparable to, and often exceeds the performance of, highly tuned libraries.
KW - Deep learning layers
KW - GPU acceleration
KW - Polyhedral compilation
UR - http://www.scopus.com/inward/record.url?scp=85073740356&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073740356&partnerID=8YFLogxK
U2 - 10.1145/3355606
DO - 10.1145/3355606
M3 - Article
AN - SCOPUS:85073740356
SN - 1544-3566
VL - 16
JO - ACM Transactions on Architecture and Code Optimization
JF - ACM Transactions on Architecture and Code Optimization
IS - 4
M1 - A38
ER -