TY - GEN
T1 - A memory heterogeneity-aware runtime system for bandwidth-sensitive HPC applications
AU - Chandrasekar, Kavitha
AU - Ni, Xiang
AU - Kale, Laxmikant V.
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/6/30
Y1 - 2017/6/30
N2 - Today's supercomputers are moving towards deployment of many-core processors like Intel Xeon Phi Knights Landing (KNL) to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high-bandwidth, low-capacity in-package high bandwidth memory (HBM) in addition to the high-capacity but low-bandwidth DDR4. Other architectures like Nvidia's Pascal GPU also expose similar stacked DRAM. In architectures with heterogeneity in memory types within a node, efficient allocation and data movement can result in improved performance and energy savings in future systems if all the data requests are served from the high bandwidth memory. In this paper, we propose a memory heterogeneity-aware runtime system that guides data prefetch and eviction so that data can be accessed at high bandwidth for applications whose entire working set does not fit within the high bandwidth memory and whose data must be moved among different memory types. We implement a data movement mechanism, managed by the runtime system, that allows applications to run efficiently on architectures with a heterogeneous memory hierarchy with trivial code changes. We show up to 2X improvement in execution time for Stencil3D and Matrix Multiplication, which are important HPC kernels.
AB - Today's supercomputers are moving towards deployment of many-core processors like Intel Xeon Phi Knights Landing (KNL) to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high-bandwidth, low-capacity in-package high bandwidth memory (HBM) in addition to the high-capacity but low-bandwidth DDR4. Other architectures like Nvidia's Pascal GPU also expose similar stacked DRAM. In architectures with heterogeneity in memory types within a node, efficient allocation and data movement can result in improved performance and energy savings in future systems if all the data requests are served from the high bandwidth memory. In this paper, we propose a memory heterogeneity-aware runtime system that guides data prefetch and eviction so that data can be accessed at high bandwidth for applications whose entire working set does not fit within the high bandwidth memory and whose data must be moved among different memory types. We implement a data movement mechanism, managed by the runtime system, that allows applications to run efficiently on architectures with a heterogeneous memory hierarchy with trivial code changes. We show up to 2X improvement in execution time for Stencil3D and Matrix Multiplication, which are important HPC kernels.
KW - HPC
KW - Memory Heterogeneity
KW - Runtime System
KW - Scheduling
UR - http://www.scopus.com/inward/record.url?scp=85028091107&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028091107&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2017.168
DO - 10.1109/IPDPSW.2017.168
M3 - Conference contribution
AN - SCOPUS:85028091107
T3 - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
SP - 1293
EP - 1300
BT - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Y2 - 29 May 2017 through 2 June 2017
ER -