TY - GEN
T1 - An FPGA-based RNN-T inference accelerator with PIM-HBM
AU - Kang, Shin Haeng
AU - Lee, Sukhan
AU - Kim, Byeongho
AU - Kim, Hweesoo
AU - Sohn, Kyomin
AU - Kim, Nam Sung
AU - Lee, Eojin
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/2/13
Y1 - 2022/2/13
N2 - In this paper, we implement the world's first RNN-T inference accelerator on an FPGA with PIM-HBM, which multiplies the internal bandwidth of the memory. The accelerator offloads the matrix-vector multiplication (GEMV) operations of the LSTM layers in RNN-T to PIM-HBM, and PIM-HBM significantly reduces GEMV execution time by exploiting the internal bandwidth of HBM. To ensure that memory commands are issued in a pre-defined order, one of the most important constraints in exploiting PIM-HBM, we implement a direct memory access (DMA) module and change the configuration of the on-chip memory controller, utilizing the flexibility and reconfigurability of the FPGA. In addition, we design the other hardware modules needed for acceleration, such as non-linear function modules (i.e., sigmoid and hyperbolic tangent), an element-wise operation module, and a ReLU module, to run these compute-bound RNN-T operations on the FPGA. For this, we prepare FP16-quantized weights and MLPerf input datasets, and modify the PCIe device driver and the C++-based control code. In our evaluation, our accelerator with PIM-HBM reduces the execution time of RNN-T by 2.5× on average with 11.09% fewer LUTs and improves energy efficiency by up to 2.6× compared to the baseline.
AB - In this paper, we implement the world's first RNN-T inference accelerator on an FPGA with PIM-HBM, which multiplies the internal bandwidth of the memory. The accelerator offloads the matrix-vector multiplication (GEMV) operations of the LSTM layers in RNN-T to PIM-HBM, and PIM-HBM significantly reduces GEMV execution time by exploiting the internal bandwidth of HBM. To ensure that memory commands are issued in a pre-defined order, one of the most important constraints in exploiting PIM-HBM, we implement a direct memory access (DMA) module and change the configuration of the on-chip memory controller, utilizing the flexibility and reconfigurability of the FPGA. In addition, we design the other hardware modules needed for acceleration, such as non-linear function modules (i.e., sigmoid and hyperbolic tangent), an element-wise operation module, and a ReLU module, to run these compute-bound RNN-T operations on the FPGA. For this, we prepare FP16-quantized weights and MLPerf input datasets, and modify the PCIe device driver and the C++-based control code. In our evaluation, our accelerator with PIM-HBM reduces the execution time of RNN-T by 2.5× on average with 11.09% fewer LUTs and improves energy efficiency by up to 2.6× compared to the baseline.
KW - accelerating vector-matrix multiplication
KW - processing-in-memory
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85125656395&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125656395&partnerID=8YFLogxK
U2 - 10.1145/3490422.3502355
DO - 10.1145/3490422.3502355
M3 - Conference contribution
AN - SCOPUS:85125656395
T3 - FPGA 2022 - Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
SP - 146
EP - 152
BT - FPGA 2022 - Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
PB - Association for Computing Machinery, Inc
T2 - 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2022
Y2 - 27 February 2022 through 1 March 2022
ER -