TY - GEN
T1 - Accelerating data movement on future chip multi-processors
AU - Gu, Junli
AU - Kumar, Rakesh
AU - Lumetta, Steven S.
AU - Sun, Yihe
N1 - Funding Information:
This work was made possible with the support of the NSF/IBM Blue Waters Project, NSF CCF, NSFC, Intel, GSRC, an Arnold O. Beckman Research Award, the Information Trust Institute of the University of Illinois at Urbana-Champaign, and the Hewlett-Packard Company through its Adaptive Enterprise Grid Program.
PY - 2010
Y1 - 2010
N2 - Moving data between cores on hardware coherent architectures suffers from memory latency and causes cache misses and coherence traffic, which are obstacles to achieving high performance. In this paper, we evaluate the potential for hardware optimization of message data transfer on chip multiprocessors with a combination of NAS parallel MPI benchmarks, Intel IMB MPI benchmarks, and a few microbenchmarks on a full-system simulator based on Simics and FeS2. We show that while passive hardware driven by cores can reduce cache traffic, it provides limited performance gains. We propose a data movement manager (DMM) that uses the on-chip coherence protocols to implement zero-copy message passing between separate address spaces and to remove synchronization and copy overheads from the processors. We also discuss methods for managing data placement in caches to reduce latency. We show that such a design shows substantial promise for both cache traffic reduction and performance improvements.
AB - Moving data between cores on hardware coherent architectures suffers from memory latency and causes cache misses and coherence traffic, which are obstacles to achieving high performance. In this paper, we evaluate the potential for hardware optimization of message data transfer on chip multiprocessors with a combination of NAS parallel MPI benchmarks, Intel IMB MPI benchmarks, and a few microbenchmarks on a full-system simulator based on Simics and FeS2. We show that while passive hardware driven by cores can reduce cache traffic, it provides limited performance gains. We propose a data movement manager (DMM) that uses the on-chip coherence protocols to implement zero-copy message passing between separate address spaces and to remove synchronization and copy overheads from the processors. We also discuss methods for managing data placement in caches to reduce latency. We show that such a design shows substantial promise for both cache traffic reduction and performance improvements.
KW - cache hierarchy
KW - data movement
KW - memory hierarchy
KW - multi-core
KW - multi-core architecture design
UR - http://www.scopus.com/inward/record.url?scp=78650861119&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650861119&partnerID=8YFLogxK
U2 - 10.1145/1882453.1882457
DO - 10.1145/1882453.1882457
M3 - Conference contribution
AN - SCOPUS:78650861119
SN - 9781450300087
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 2nd International Forum on Next-Generation Multicore/Manycore Technologies, IFMT'2010 - In Conjunction with the 37th Intl. Symposium on Computer Architecture, ISCA 2010
T2 - 2nd International Forum on Next Generation Multicore/Manycore Technologies, IFMT'2010, Co-located with the 37th International Symposium on Computer Architecture, ISCA 2010
Y2 - 19 June 2010 through 19 June 2010
ER -