Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures

Hee Seok Kim, Izzat El Hajj, John Stratton, Steven Sam Lumetta, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source code has become increasingly important. OpenCL, having a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. Execution order of work-items in a program is unconstrained except at barrier synchronization events, giving some freedom to an implementation when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed for targeting multicore CPU architectures. However, scheduling work-items in prior work has been done with primary focus on correctness and vectorization. To the best of our knowledge, no existing implementations consider the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on Parboil and Rodinia benchmarks.
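
The scheduling freedom described in the abstract can be illustrated with a toy model. The sketch below is not the paper's compiler or its analysis; it only assumes a hypothetical kernel in which each work-item walks one row of a row-major array inside a loop that contains no barrier, so either ordering of the work-item loop and the in-kernel loop is legal. The function names and the access pattern are assumptions made for illustration.

```python
# Illustrative only: a toy model of work-item scheduling between barriers.
# The access pattern and all names here are hypothetical, not from the paper.

def kernel_address(wi, j, n_cols):
    """Address touched by work-item `wi` in iteration `j` of its in-kernel loop.
    Assumed pattern: each work-item walks one row of a row-major array."""
    return wi * n_cols + j

def work_item_inner(n_items, n_cols):
    """Schedule A: run iteration j for every work-item, then move to j + 1."""
    return [kernel_address(wi, j, n_cols)
            for j in range(n_cols) for wi in range(n_items)]

def work_item_outer(n_items, n_cols):
    """Schedule B: run the whole in-kernel loop for one work-item, then the next."""
    return [kernel_address(wi, j, n_cols)
            for wi in range(n_items) for j in range(n_cols)]

def unit_stride_fraction(trace):
    """Fraction of consecutive accesses whose addresses differ by exactly 1."""
    steps = list(zip(trace, trace[1:]))
    return sum(1 for a, b in steps if b - a == 1) / len(steps)

if __name__ == "__main__":
    a = work_item_inner(n_items=4, n_cols=8)
    b = work_item_outer(n_items=4, n_cols=8)
    # Both schedules touch exactly the same addresses, only in a different order.
    assert sorted(a) == sorted(b)
    print("schedule A unit-stride fraction:", unit_stride_fraction(a))
    print("schedule B unit-stride fraction:", unit_stride_fraction(b))
```

For this assumed access pattern, placing the work-item loop outermost yields a fully sequential address trace, while placing it innermost yields strided accesses. A kernel indexed as `j * n_items + wi` would favor the opposite order, which is why a choice based on the addresses accessed in the loop can matter.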

Original language: English (US)
Title of host publication: Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 257-268
Number of pages: 12
ISBN (Electronic): 9781479981618
DOI: https://doi.org/10.1109/CGO.2015.7054205
State: Published - Mar 3 2015
Event: 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015 - San Francisco, United States
Duration: Feb 7 2015 to Feb 11 2015

Publication series

Name: Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015

Other

Other: 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015
Country: United States
City: San Francisco
Period: 2/7/15 to 2/11/15

Fingerprint

Computer programming
Locality
Thread
Programming Model
Program processors
Scheduling
Data Locality
Compiler
Synchronization
Data storage equipment
Heterogeneous Computing
Vectorization
Correctness
Schedule
Benchmark
kernel
Architecture

ASJC Scopus subject areas

  • Applied Mathematics
  • Control and Optimization
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Kim, H. S., El Hajj, I., Stratton, J., Lumetta, S. S., & Hwu, W-M. W. (2015). Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures. In Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015 (pp. 257-268). [7054205] (Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CGO.2015.7054205

@inproceedings{3715804b40934048a54148b776a328d1,
title = "Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures",
abstract = "With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source code has become increasingly important. OpenCL, having a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. Execution order of work-items in a program is unconstrained except at barrier synchronization events, giving some freedom to an implementation when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed for targeting multicore CPU architectures. However, scheduling work-items in prior work has been done with primary focus on correctness and vectorization. To the best of our knowledge, no existing implementations consider the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on Parboil and Rodinia benchmarks.",
author = "Kim, {Hee Seok} and {El Hajj}, Izzat and Stratton, John and Lumetta, {Steven Sam} and Hwu, {Wen-Mei W}",
year = "2015",
month = "3",
day = "3",
doi = "10.1109/CGO.2015.7054205",
language = "English (US)",
series = "Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "257--268",
booktitle = "Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015",
address = "United States",

}

TY - GEN

T1 - Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures

AU - Kim, Hee Seok

AU - El Hajj, Izzat

AU - Stratton, John

AU - Lumetta, Steven Sam

AU - Hwu, Wen-Mei W

PY - 2015/3/3

Y1 - 2015/3/3

N2 - With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source code has become increasingly important. OpenCL, having a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. Execution order of work-items in a program is unconstrained except at barrier synchronization events, giving some freedom to an implementation when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed for targeting multicore CPU architectures. However, scheduling work-items in prior work has been done with primary focus on correctness and vectorization. To the best of our knowledge, no existing implementations consider the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on Parboil and Rodinia benchmarks.

UR - http://www.scopus.com/inward/record.url?scp=84961314978&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961314978&partnerID=8YFLogxK

U2 - 10.1109/CGO.2015.7054205

DO - 10.1109/CGO.2015.7054205

M3 - Conference contribution

AN - SCOPUS:84961314978

T3 - Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015

SP - 257

EP - 268

BT - Proceedings of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015

PB - Institute of Electrical and Electronics Engineers Inc.

ER -