TY - GEN
T1 - Fine-grained task migration for graph algorithms using processing in memory
AU - Aguilera, Paula
AU - Zhang, Dong Ping
AU - Kim, Nam Sung
AU - Jayasena, Nuwan
PY - 2016/7/18
Y1 - 2016/7/18
N2 - Graphs are used in a wide variety of application domains, from social science to machine learning. Graph algorithms present large numbers of irregular accesses with little data reuse to amortize the high cost of memory accesses, requiring high memory bandwidth. Processing in memory (PIM) implemented through 3D die-stacking can deliver this high memory bandwidth. In a system with multiple memory modules with PIM, the in-memory compute logic has low-latency and high-bandwidth access to its local memory, while accesses to remote memory introduce high latency and energy consumption. Ideally, in such a system, computation and data are partitioned among the PIM devices to maximize data locality. But the irregular memory access patterns present in graph applications make it difficult to guarantee that the computation in each PIM device will only access its local data. A large number of remote memory accesses can negate the benefits of using PIM. In this paper, we examine the feasibility and potential of fine-grained work migration to reduce remote data accesses in systems with multiple PIM devices. First, we propose a data-driven implementation of our study algorithms: breadth-first search (BFS), single-source shortest path (SSSP), and betweenness centrality (BC), where each PIM device has a queue holding the vertices that it needs to process. New vertices that need to be processed are enqueued at the PIM device co-located with the memory that stores those vertices. Second, we propose hardware support that takes advantage of PIM to implement highly efficient queues that improve the performance of the queueing framework by up to 16.7%. Third, we develop a timing model for the queueing framework to explore the benefits of work migration vs. remote memory accesses. Finally, our analysis using the above framework shows that naïve task migration can lead to performance degradation, and identifies trade-offs among data locality, redundant computation, and load balance among PIM devices that must be taken into account to realize the potential benefits of fine-grained task migration.
AB - Graphs are used in a wide variety of application domains, from social science to machine learning. Graph algorithms present large numbers of irregular accesses with little data reuse to amortize the high cost of memory accesses, requiring high memory bandwidth. Processing in memory (PIM) implemented through 3D die-stacking can deliver this high memory bandwidth. In a system with multiple memory modules with PIM, the in-memory compute logic has low-latency and high-bandwidth access to its local memory, while accesses to remote memory introduce high latency and energy consumption. Ideally, in such a system, computation and data are partitioned among the PIM devices to maximize data locality. But the irregular memory access patterns present in graph applications make it difficult to guarantee that the computation in each PIM device will only access its local data. A large number of remote memory accesses can negate the benefits of using PIM. In this paper, we examine the feasibility and potential of fine-grained work migration to reduce remote data accesses in systems with multiple PIM devices. First, we propose a data-driven implementation of our study algorithms: breadth-first search (BFS), single-source shortest path (SSSP), and betweenness centrality (BC), where each PIM device has a queue holding the vertices that it needs to process. New vertices that need to be processed are enqueued at the PIM device co-located with the memory that stores those vertices. Second, we propose hardware support that takes advantage of PIM to implement highly efficient queues that improve the performance of the queueing framework by up to 16.7%. Third, we develop a timing model for the queueing framework to explore the benefits of work migration vs. remote memory accesses. Finally, our analysis using the above framework shows that naïve task migration can lead to performance degradation, and identifies trade-offs among data locality, redundant computation, and load balance among PIM devices that must be taken into account to realize the potential benefits of fine-grained task migration.
KW - Graph Algorithms
KW - Processing In Memory
UR - http://www.scopus.com/inward/record.url?scp=84991650727&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84991650727&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2016.205
DO - 10.1109/IPDPSW.2016.205
M3 - Conference contribution
AN - SCOPUS:84991650727
T3 - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
SP - 489
EP - 498
BT - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Y2 - 23 May 2016 through 27 May 2016
ER -