SumMerge: An efficient algorithm and implementation for weight repetition-aware DNN inference

Rohan Baskar Prabhakar, Sachit Kuhar, Rohit Agrawal, Christopher J. Hughes, Christopher W. Fletcher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep Neural Network (DNN) inference efficiency is a key concern across the myriad domains now relying on Deep Learning. A recent promising direction to speed up inference is to exploit weight repetition. The key observation is that, due to DNN quantization schemes (which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight), the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factor out and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity into the inference operation and hence has (up to this point) required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques that make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight-repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves speedups of 1.09×–2.05× and 1.04×–1.51× relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7×–15.4×.
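
Illustrative sketch

As a rough illustration of the core repetition idea described in the abstract (a minimal sketch, not the paper's actual data-flow-graph algorithm or its vectorized C++ routine), the C++ snippet below computes a dot product over quantized weights by first summing the inputs that share each unique weight value and then spending only one multiplication per unique value. The function name repetition_aware_dot and the grouping structure are hypothetical, introduced here for illustration only.

    // Sketch: dot product exploiting repeated (quantized) weight values.
    // With only a few unique weights, the number of multiplications drops
    // from one per element to one per unique weight value.
    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    float repetition_aware_dot(const std::vector<float>& w,
                               const std::vector<float>& x) {
        // Offline step (done once per trained filter in practice):
        // group input indices by the weight value they share.
        std::unordered_map<float, std::vector<std::size_t>> groups;
        for (std::size_t i = 0; i < w.size(); ++i)
            groups[w[i]].push_back(i);

        // Online step (per inference): one addition per input element,
        // but only one multiplication per unique weight value.
        float result = 0.0f;
        for (const auto& [weight, indices] : groups) {
            float partial = 0.0f;
            for (std::size_t i : indices)
                partial += x[i];          // "sum" inputs sharing a weight
            result += weight * partial;   // "merge" with a single multiply
        }
        return result;
    }

Per the abstract, SumMerge goes beyond this per-filter trick: it selects a data-flow graph structure offline (via a heuristic that minimizes arithmetic operations given the trained weights) so that partial sums are also factored and memoized within and across filters, and it traverses that graph efficiently at inference time.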

Original language: English (US)
Title of host publication: ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 279-290
Number of pages: 12
ISBN (Electronic): 9781450383356
State: Published - Jun 3 2021
Event: 35th ACM International Conference on Supercomputing, ICS 2021 - Virtual, Online, United States
Duration: Jun 14 2021 – Jun 17 2021

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 35th ACM International Conference on Supercomputing, ICS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/14/21 – 6/17/21

Keywords

  • Convolutional neural networks
  • Deep neural networks
  • Inference
  • Weight quantization
  • Weight repetition

ASJC Scopus subject areas

  • General Computer Science
