TY - GEN
T1 - An adaptive performance modeling tool for GPU architectures
AU - Baghsorkhi, Sara S.
AU - Delahaye, Matthieu
AU - Patel, Sanjay J.
AU - Gropp, William D.
AU - Hwu, Wen-Mei W.
PY - 2010
Y1 - 2010
AB - This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing the search to the more promising implementations. It can also be incorporated into a tool that helps programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, based on which we estimate the execution time of the kernel. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events, such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be modeled accurately but pose challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of differently optimized kernel implementations. We also describe our approach to extracting the performance model automatically from kernel code.
KW - Analytical model
KW - GPU
KW - Parallel programming
KW - Performance estimation
UR - http://www.scopus.com/inward/record.url?scp=77749337497&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77749337497&partnerID=8YFLogxK
U2 - 10.1145/1693453.1693470
DO - 10.1145/1693453.1693470
M3 - Conference contribution
AN - SCOPUS:77749337497
SN - 9781605587080
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 105
EP - 114
BT - PPoPP'10 - Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
T2 - 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'10
Y2 - 9 January 2010 through 14 January 2010
ER -