TY - GEN
T1 - SumMerge
T2 - 35th ACM International Conference on Supercomputing, ICS 2021
AU - Prabhakar, Rohan Baskar
AU - Kuhar, Sachit
AU - Agrawal, Rohit
AU - Hughes, Christopher J.
AU - Fletcher, Christopher W.
N1 - Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/6/3
Y1 - 2021/6/3
AB - Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed up inference is to exploit weight repetition. The key observation is that, due to DNN quantization schemes (which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight), the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factorize and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity into the inference operation and hence (up to this point) has required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques that make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic that selects a data-flow graph structure minimizing arithmetic operations per inference (given trained weight values) and an efficient online procedure that traverses each data-flow graph and computes the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate its performance relative to Intel's optimized library oneDNN and the prior-art weight-repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves speedups of 1.09×-2.05× and 1.04×-1.51× relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7× to 15.4×.
KW - Convolutional neural networks
KW - Deep neural networks
KW - Inference
KW - Weight quantization
KW - Weight repetition
UR - http://www.scopus.com/inward/record.url?scp=85107504165&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107504165&partnerID=8YFLogxK
U2 - 10.1145/3447818.3460375
DO - 10.1145/3447818.3460375
M3 - Conference contribution
AN - SCOPUS:85107504165
T3 - Proceedings of the International Conference on Supercomputing
SP - 279
EP - 290
BT - ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 14 June 2021 through 17 June 2021
ER -