TY - JOUR
T1 - PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM
AU - Ankit, Aayush
AU - El Hajj, Izzat
AU - Chalamalasetti, Sai Rahul
AU - Agarwal, Sapan
AU - Marinella, Matthew
AU - Foltin, Martin
AU - Strachan, John Paul
AU - Milojicic, Dejan
AU - Hwu, Wen-mei
AU - Roy, Kaushik
N1 - Funding Information:
This work was supported by the Center for Brain-inspired Computing (C-BRIC), one of six centers in JUMP, a DARPA sponsored Semiconductor Research Corporation (SRC) program; and Hewlett Packard Labs. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This article describes objective technical results and analysis. Any subjective views or opinions that might be expressed in this article do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
PY - 2020/8/1
Y1 - 2020/8/1
AB - The wide adoption of deep neural networks has been accompanied by ever-increasing energy and performance demands due to the expensive nature of training them. Numerous special-purpose architectures have been proposed to accelerate training: both digital and hybrid digital-analog using resistive RAM (ReRAM) crossbars. ReRAM-based accelerators have demonstrated the effectiveness of ReRAM crossbars at performing matrix-vector multiplication operations that are prevalent in training. However, they still suffer from inefficiency due to the use of serial reads and writes for performing the weight gradient and update step. A few works have demonstrated the possibility of performing outer products in crossbars, which can be used to realize the weight gradient and update step without the use of serial reads and writes. However, these works have been limited to low precision operations which are not sufficient for typical training workloads. Moreover, they have been confined to a limited set of training algorithms for fully-connected layers only. To address these limitations, we propose a bit-slicing technique for enhancing the precision of ReRAM-based outer products, which is substantially different from bit-slicing for matrix-vector multiplication only. We incorporate this technique into a crossbar architecture with three variants catered to different training algorithms. To evaluate our design on different types of layers in neural networks (fully-connected, convolutional, etc.) and training algorithms, we develop PANTHER, an ISA-programmable training accelerator with compiler support. Our design can also be integrated into other accelerators in the literature to enhance their efficiency. Our evaluation shows that PANTHER achieves up to 8.02×, 54.21×, and 103× energy reductions as well as 7.16×, 4.02×, and 16× execution time reductions compared to digital accelerators, ReRAM-based accelerators, and GPUs, respectively.
KW - Accelerators
KW - neural networks
KW - resistive random-access memory (ReRAM)
KW - training
UR - http://www.scopus.com/inward/record.url?scp=85086727604&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086727604&partnerID=8YFLogxK
DO - 10.1109/TC.2020.2998456
M3 - Article
AN - SCOPUS:85086727604
VL - 69
SP - 1128
EP - 1142
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
SN - 0018-9340
IS - 8
M1 - 9104022
ER -