TY - GEN
T1 - Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures
AU - Kim, Hee Seok
AU - El Hajj, Izzat
AU - Stratton, John
AU - Lumetta, Steven
AU - Hwu, Wen-mei
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/3/3
Y1 - 2015/3/3
AB - With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source code has become increasingly important. OpenCL, having a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. Execution order of work-items in a program is unconstrained except at barrier synchronization events, giving some freedom to an implementation when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed for targeting multicore CPU architectures. However, scheduling work-items in prior work has been done with primary focus on correctness and vectorization. To the best of our knowledge, no existing implementations consider the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on Parboil and Rodinia benchmarks.
UR - http://www.scopus.com/inward/record.url?scp=84961314978&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84961314978&partnerID=8YFLogxK
U2 - 10.1109/CGO.2015.7054205
DO - 10.1109/CGO.2015.7054205
M3 - Conference contribution
AN - SCOPUS:84961314978
T3 - Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015
SP - 257
EP - 268
BT - Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015
Y2 - 7 February 2015 through 11 February 2015
ER -