TY - JOUR
T1 - Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
AU - Sung, I-Jui
AU - Anssari, Nasser
AU - Stratton, John A.
AU - Hwu, Wen-mei W.
N1 - Funding Information:
Acknowledgments: This work was funded by the Universal Parallel Computing Research Center at the University of Illinois at Urbana-Champaign. The Center is sponsored by Intel Corporation and Microsoft Corporation. This work utilized the AC cluster [14] operated by the Innovative Systems Laboratory (ISL) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. The cluster was funded by NSF SCI 05-25308 and CNS 05-51665 grants along with generous donations of hardware from NVIDIA, Nallatech, and AMD. We would like to thank Chris Rodrigues, Nady Obeid, and the anonymous reviewers for their comments.
PY - 2012/2
Y1 - 2012/2
AB - We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures that use a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to 10.94X speedup over the original layout, and a 1.16X performance gain in the worst case.
KW - Data layout transformation
KW - GPU
KW - Parallel programming
UR - http://www.scopus.com/inward/record.url?scp=84856256277&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84856256277&partnerID=8YFLogxK
U2 - 10.1007/s10766-011-0182-5
DO - 10.1007/s10766-011-0182-5
M3 - Article
AN - SCOPUS:84856256277
SN - 0885-7458
VL - 40
SP - 4
EP - 24
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
IS - 1
ER -