TY - GEN
T1 - RACER
T2 - 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2021
AU - Truong, Minh S.Q.
AU - Chen, Eric
AU - Su, Deanyone
AU - Glass, Alexander
AU - Shen, Liting
AU - Carley, L. Richard
AU - Bain, James A.
AU - Ghose, Saugata
N1 - Funding Information:
We thank Raghav Gupta, Yuezhang Zou, and Shivani Prasad for their feedback on this work. This work was funded in part by a seed grant from the Wilton E. Scott Institute for Energy Innovation, and by the Data Storage Systems Center at Carnegie Mellon University.
Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/10/18
Y1 - 2021/10/18
N2 - To combat the high energy costs of moving data between main memory and the CPU, recent works have proposed to perform processing-using-memory (PUM), a type of processing-in-memory where operations are performed on data in situ (i.e., right at the memory cells holding the data). Several common and emerging memory technologies offer the ability to perform bitwise Boolean primitive functions by having interconnected cells interact with each other, eliminating the need to use discrete CMOS compute units for several common operations. Recent PUM architectures extend upon these Boolean primitives to perform bit-serial computation using memory. Unfortunately, several practical limitations of the underlying memory devices restrict how large emerging memory arrays can be, which hinders the ability of conventional bit-serial computation approaches to deliver high performance in addition to large energy savings. In this paper, we propose RACER, a cost-effective PUM architecture that delivers high performance and large energy savings using small arrays of resistive memories. RACER makes use of a bit-pipelining execution model, which can pipeline bit-serial w-bit computation across w small tiles. We fully design efficient control and peripheral circuitry, whose area can be amortized over small memory tiles without sacrificing memory density, and we propose an ISA abstraction for RACER to allow for easy program/compiler integration. We evaluate an implementation of RACER using NOR-capable ReRAM cells across a range of microbenchmarks extracted from data-intensive applications, and find that RACER provides 107×, 12×, and 7× the performance of a 16-core CPU, a 2304-shader-core GPU, and a state-of-the-art in-SRAM compute substrate, respectively, with energy savings of 189×, 17×, and 1.3×.
AB - To combat the high energy costs of moving data between main memory and the CPU, recent works have proposed to perform processing-using-memory (PUM), a type of processing-in-memory where operations are performed on data in situ (i.e., right at the memory cells holding the data). Several common and emerging memory technologies offer the ability to perform bitwise Boolean primitive functions by having interconnected cells interact with each other, eliminating the need to use discrete CMOS compute units for several common operations. Recent PUM architectures extend upon these Boolean primitives to perform bit-serial computation using memory. Unfortunately, several practical limitations of the underlying memory devices restrict how large emerging memory arrays can be, which hinders the ability of conventional bit-serial computation approaches to deliver high performance in addition to large energy savings. In this paper, we propose RACER, a cost-effective PUM architecture that delivers high performance and large energy savings using small arrays of resistive memories. RACER makes use of a bit-pipelining execution model, which can pipeline bit-serial w-bit computation across w small tiles. We fully design efficient control and peripheral circuitry, whose area can be amortized over small memory tiles without sacrificing memory density, and we propose an ISA abstraction for RACER to allow for easy program/compiler integration. We evaluate an implementation of RACER using NOR-capable ReRAM cells across a range of microbenchmarks extracted from data-intensive applications, and find that RACER provides 107×, 12×, and 7× the performance of a 16-core CPU, a 2304-shader-core GPU, and a state-of-the-art in-SRAM compute substrate, respectively, with energy savings of 189×, 17×, and 1.3×.
UR - http://www.scopus.com/inward/record.url?scp=85118863825&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85118863825&partnerID=8YFLogxK
U2 - 10.1145/3466752.3480071
DO - 10.1145/3466752.3480071
M3 - Conference contribution
AN - SCOPUS:85118863825
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 100
EP - 116
BT - MICRO 2021 - 54th Annual IEEE/ACM International Symposium on Microarchitecture, Proceedings
PB - IEEE Computer Society
Y2 - 18 October 2021 through 22 October 2021
ER -