Performance improvements from DRAM technology scaling have been lagging behind the improvements from logic technology scaling for many years. As application demand for main memory continues to grow, DRAM-based main memory is increasingly becoming a larger system bottleneck in terms of both performance and energy consumption. A major reason for poor memory performance and energy efficiency is memory’s inability to perform computation. Instead, data stored within DRAM memory must be moved into the CPU before any computation can take place. This data movement is costly, as it requires a high latency and consumes significant energy to transfer the data across the pin-limited memory channel. Moreover, the data moved to the CPU is often not reused, and thus does not benefit from being cached within the CPU, which makes it difficult to amortize the overhead of data movement. Modern 3D-stacked DRAM architectures provide an opportunity to avoid unnecessary data movement between memory and the CPU. These multi-layer architectures include a logic layer, where compute logic can be integrated underneath multiple layers of DRAM cell arrays (i.e., thememory layers) within the same chip. Architects can take advantage of the logic layer to perform processing-in-memory (PIM), or near-data processing, where some of the computation is moved from the CPU to the logic layer underneath the memory layer. In a PIM architecture, the logic layer within DRAM has access to the high internal bandwidth available within 3D-stacked DRAM (which is much greater than the bandwidth available in the narrow memory channel between DRAM and the CPU). Thus, PIM architectures can effectively free up valuable bandwidth on the bandwidth-limited memory channel while at the same time reducing system energy consumption. A number of important issues arise when we add compute logic to DRAM. In particular, logic within DRAM does not have low-latency access to common CPU structures that are essential for modern application execution, such as the virtual memory mechanisms, e.g., the translation lookaside buffer (TLB) or the page table walker, and the cache coherence mechanisms, e.g., the coherence directory. To ease the widespread adoption of PIM, we ideally would like to maintain traditional virtual memory abstractions and the shared memory programming model. This requires efficient mechanisms that can provide logic in DRAM with access to virtual memory and cache coherence without having to communicate frequently with the CPU, as off-chip communication between the CPU and DRAM consumes much of the limited bandwidth that PIM aims to avoid using. To this end, we propose and evaluate two general-purpose solutions that can be used by PIM architectures to minimize unnecessary off-chip communication. The first, IMPICA, is an efficient in-memory accelerator for pointer chasing, which can handle address translation entirely within DRAM. The second, LazyPIM, provides coherence supportwithout the need to continually communicate with the CPU. We show that both of these mechanisms provide a significant benefit for a number of important memory-intensive applications, thereby both improving performance and reducing energy consumption.
|Original language||English (US)|
|Title of host publication||Beyond-CMOS Technologies for Next Generation Computer Design|
|Number of pages||62|
|State||Published - Jan 1 2018|
ASJC Scopus subject areas
- Computer Science(all)