TY - GEN
T1 - Implementing a GPU programming model on a non-GPU accelerator architecture
AU - Kofsky, Stephen M.
AU - Johnson, Daniel R.
AU - Stratton, John A.
AU - Hwu, Wen-Mei W.
AU - Patel, Sanjay J.
AU - Lumetta, Steven S.
N1 - Funding Information:
Acknowledgments. The authors gratefully acknowledge generous donations by Advanced Micro Devices, Intel Corporation and Microsoft Corporation as well as support from the Information Trust Institute of the University of Illinois at Urbana-Champaign and the Hewlett-Packard Company through its Adaptive Enterprise Grid Program. Lumetta was supported in part by a Faculty Fellowship from the National Center for Supercomputing Applications. The content of this paper does not necessarily reflect the position or the policies of any of these organizations.
PY - 2012
Y1 - 2012
N2 - Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over naïve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.
AB - Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over naïve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.
UR - http://www.scopus.com/inward/record.url?scp=84863247559&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863247559&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-24322-6_5
DO - 10.1007/978-3-642-24322-6_5
M3 - Conference contribution
AN - SCOPUS:84863247559
SN - 9783642243219
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 40
EP - 51
BT - Computer Architecture - ISCA 2010 International Workshops, A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Revised Selected Papers
T2 - ACM IEEE International Symposium on Computer Architecture, ISCA 2010
Y2 - 19 June 2010 through 23 June 2010
ER -