TY - GEN
T1 - Data layout transformation exploiting memory-level parallelism in structured grid many-core applications
AU - Sung, I-Jui
AU - Stratton, John A.
AU - Hwu, Wen-mei W.
PY - 2010
Y1 - 2010
N2 - We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also present how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to 560% performance increases over the language-defined layout, and a 7% performance gain in the worst case, in which the language-defined layout and access pattern is already well-vectorizable by the underlying hardware.
KW - GPU
KW - data layout transformation
KW - parallel programming
UR - http://www.scopus.com/inward/record.url?scp=78149251414&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78149251414&partnerID=8YFLogxK
U2 - 10.1145/1854273.1854336
DO - 10.1145/1854273.1854336
M3 - Conference contribution
AN - SCOPUS:78149251414
SN - 9781450301787
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 513
EP - 522
BT - PACT'10 - Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th International Conference on Parallel Architectures and Compilation Techniques, PACT 2010
Y2 - 11 September 2010 through 15 September 2010
ER -