TY - GEN
T1 - Application-Transparent Near-Memory Processing Architecture with Memory Channel Network
AU - Alian, Mohammad
AU - Min, Seung Won
AU - Asgharimoghaddam, Hadi
AU - Dhar, Ashutosh
AU - Wang, Dong Kai
AU - Roewer, Thomas
AU - McPadden, Adam
AU - O'Halloran, Oliver
AU - Chen, Deming
AU - Xiong, Jinjun
AU - Kim, Daehoon
AU - Hwu, Wen Mei
AU - Kim, Nam Sung
N1 - Funding Information:
This work is supported in part by grants from NSF (CNS-1557244 and CNS-1705047)
Funding Information:
This work is supported in part by grants from NSF (CNS-1557244 and CNS-1705047) and IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR). Nam Sung Kim and Daehoon Kim are the co-corresponding authors.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/12/12
Y1 - 2018/12/12
N2 - The physical memory capacity of servers is expected to increase drastically with the deployment of forthcoming non-volatile memory technologies. This is a welcome improvement for emerging data-intensive applications. For such servers to be cost-effective, however, we must cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity, without compromising application readiness. Tackling this challenge, we present the Memory Channel Network (MCN) architecture in this paper. Specifically, first, we propose an MCN DIMM, an extension of a buffered DIMM in which a small but capable processor, called the MCN processor, is integrated with a buffer device on the DIMM for near-memory processing. Second, we implement device drivers that give the host and MCN processors in a server the illusion that they are independent heterogeneous nodes connected through an Ethernet link. These allow the host and MCN processors in a server to run a given data-intensive application together based on popular distributed computing frameworks such as MPI and Spark, without any change in the host processor hardware and its application software, while offering the benefits of high-bandwidth and low-latency communication between the host and the MCN processors over memory channels. As such, MCN can serve as an application-transparent framework that can seamlessly unify near-memory processing within a server and distributed computing across such servers for data-intensive applications. Our simulation running the full software stack shows that a server with 8 MCN DIMMs offers 4.56X higher throughput and consumes 47.5% less energy than a cluster with 9 conventional nodes connected through Ethernet links, as it facilitates up to 8.17X higher aggregate DRAM bandwidth utilization. Lastly, we demonstrate the feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.
AB - The physical memory capacity of servers is expected to increase drastically with the deployment of forthcoming non-volatile memory technologies. This is a welcome improvement for emerging data-intensive applications. For such servers to be cost-effective, however, we must cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity, without compromising application readiness. Tackling this challenge, we present the Memory Channel Network (MCN) architecture in this paper. Specifically, first, we propose an MCN DIMM, an extension of a buffered DIMM in which a small but capable processor, called the MCN processor, is integrated with a buffer device on the DIMM for near-memory processing. Second, we implement device drivers that give the host and MCN processors in a server the illusion that they are independent heterogeneous nodes connected through an Ethernet link. These allow the host and MCN processors in a server to run a given data-intensive application together based on popular distributed computing frameworks such as MPI and Spark, without any change in the host processor hardware and its application software, while offering the benefits of high-bandwidth and low-latency communication between the host and the MCN processors over memory channels. As such, MCN can serve as an application-transparent framework that can seamlessly unify near-memory processing within a server and distributed computing across such servers for data-intensive applications. Our simulation running the full software stack shows that a server with 8 MCN DIMMs offers 4.56X higher throughput and consumes 47.5% less energy than a cluster with 9 conventional nodes connected through Ethernet links, as it facilitates up to 8.17X higher aggregate DRAM bandwidth utilization. Lastly, we demonstrate the feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.
KW - Application Transparent
KW - Buffer Device
KW - DRAM
KW - Distributed Systems
KW - Ethernet
KW - Memory Channel
KW - Mobile Processors
KW - Near Memory Processing
KW - Processing In Memory
KW - TCP/IP
UR - http://www.scopus.com/inward/record.url?scp=85060012121&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060012121&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2018.00070
DO - 10.1109/MICRO.2018.00070
M3 - Conference contribution
AN - SCOPUS:85060012121
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 802
EP - 814
BT - Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
PB - IEEE Computer Society
T2 - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
Y2 - 20 October 2018 through 24 October 2018
ER -