TY - GEN
T1 - Bit-parallel vector composability for neural acceleration
AU - Ghodrati, Soroush
AU - Sharma, Hardik
AU - Young, Cliff
AU - Kim, Nam Sung
AU - Esmaeilzadeh, Hadi
N1 - Funding Information:
This work was in part supported by generous gifts from Google, Qualcomm, Microsoft, and Xilinx, as well as National Science Foundation (NSF) awards CNS#1703812, ECCS#1609823, and CCF#1553192, Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) award #FA9550-17-1-0274, National Institute of Health (NIH) award #R01EB028350, and Air Force Research Laboratory (AFRL) and Defense Advanced Research Project Agency (DARPA) under agreement numbers #FA8650-20-2-7009 and #HR0011-18-C-0020. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
PY - 2020/7
Y1 - 2020/7
N2 - Conventional neural accelerators rely on isolated, self-sufficient functional units that perform an atomic operation while communicating the results through an operand delivery-aggregation logic. Each unit processes all the bits of its operands atomically and produces all the bits of the results in isolation. This paper explores a different design style, where each unit is only responsible for a slice of the bit-level operations, interleaving and combining the benefits of bit-level parallelism with the abundant data-level parallelism in deep neural networks. A dynamic collection of these units cooperates at runtime to generate the bits of the results collectively. Such cooperation requires extracting a new grouping between the bits, which is only possible if the operands and operations are vectorizable. The abundance of data-level parallelism and mostly repeated execution patterns provides a unique opportunity to define and leverage this new dimension of Bit-Parallel Vector Composability. This design intersperses bit parallelism within data-level parallelism and dynamically interweaves the two together. As such, the building block of our neural accelerator is a Composable Vector Unit, a collection of Narrower-Bitwidth Vector Engines that are dynamically composed or decomposed at the bit granularity. Using six diverse CNN and LSTM deep networks, we evaluate this design style across four design points: with and without algorithmic bitwidth heterogeneity, and with and without the availability of high-bandwidth off-chip memory. Across these four design points, Bit-Parallel Vector Composability brings 1.4× to 3.5× speedup and 1.1× to 2.7× energy reduction. We also comprehensively compare our design style to Nvidia's RTX 2080 Ti GPU, which also supports INT-4 execution. The benefits range between 28.0× and 33.7× improvement in Performance-per-Watt.
AB - Conventional neural accelerators rely on isolated, self-sufficient functional units that perform an atomic operation while communicating the results through an operand delivery-aggregation logic. Each unit processes all the bits of its operands atomically and produces all the bits of the results in isolation. This paper explores a different design style, where each unit is only responsible for a slice of the bit-level operations, interleaving and combining the benefits of bit-level parallelism with the abundant data-level parallelism in deep neural networks. A dynamic collection of these units cooperates at runtime to generate the bits of the results collectively. Such cooperation requires extracting a new grouping between the bits, which is only possible if the operands and operations are vectorizable. The abundance of data-level parallelism and mostly repeated execution patterns provides a unique opportunity to define and leverage this new dimension of Bit-Parallel Vector Composability. This design intersperses bit parallelism within data-level parallelism and dynamically interweaves the two together. As such, the building block of our neural accelerator is a Composable Vector Unit, a collection of Narrower-Bitwidth Vector Engines that are dynamically composed or decomposed at the bit granularity. Using six diverse CNN and LSTM deep networks, we evaluate this design style across four design points: with and without algorithmic bitwidth heterogeneity, and with and without the availability of high-bandwidth off-chip memory. Across these four design points, Bit-Parallel Vector Composability brings 1.4× to 3.5× speedup and 1.1× to 2.7× energy reduction. We also comprehensively compare our design style to Nvidia's RTX 2080 Ti GPU, which also supports INT-4 execution. The benefits range between 28.0× and 33.7× improvement in Performance-per-Watt.
KW - Acceleration
KW - Bit-flexibility
KW - Neural networks
UR - http://www.scopus.com/inward/record.url?scp=85093921148&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093921148&partnerID=8YFLogxK
U2 - 10.1109/DAC18072.2020.9218656
DO - 10.1109/DAC18072.2020.9218656
M3 - Conference contribution
AN - SCOPUS:85093921148
T3 - Proceedings - Design Automation Conference
BT - 2020 57th ACM/IEEE Design Automation Conference, DAC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 57th ACM/IEEE Design Automation Conference, DAC 2020
Y2 - 20 July 2020 through 24 July 2020
ER -
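
As a rough illustration of the bit-slice composition described in the abstract above (this is not code from the paper), the sketch below shows the arithmetic identity that lets narrower-bitwidth engines cooperate: an 8-bit × 8-bit multiply decomposes into four 4-bit × 4-bit partial products that are recombined with shifts and adds. The 4-bit slice width and all function names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): composing an 8-bit x 8-bit multiply
# from 4-bit x 4-bit partial products, the arithmetic identity behind
# bit-parallel vector composability. The 4-bit slice width is an assumption.

def slice4(x):
    """Split an 8-bit unsigned value into (high, low) 4-bit slices."""
    return (x >> 4) & 0xF, x & 0xF

def composed_mul8(a, b):
    """Multiply two 8-bit unsigned operands using only 4-bit multiplies.

    a * b = (a_hi*b_hi << 8) + ((a_hi*b_lo + a_lo*b_hi) << 4) + a_lo*b_lo
    Each 4-bit product stands in for one narrower-bitwidth engine; the
    shift-and-add step models the composition of their partial results.
    """
    a_hi, a_lo = slice4(a)
    b_hi, b_lo = slice4(b)
    return (a_hi * b_hi << 8) + ((a_hi * b_lo + a_lo * b_hi) << 4) + a_lo * b_lo

# Check over a dot product, mirroring how data-level parallelism (the
# vector) interleaves with bit-level parallelism (the slices).
if __name__ == "__main__":
    import random
    xs = [random.randrange(256) for _ in range(16)]
    ws = [random.randrange(256) for _ in range(16)]
    assert sum(composed_mul8(x, w) for x, w in zip(xs, ws)) == \
           sum(x * w for x, w in zip(xs, ws))
    print("composed dot product matches native arithmetic")
```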