TY - GEN
T1 - Register and thread structure optimization for GPUs
AU - Liang, Yun
AU - Cui, Zheng
AU - Rupnow, Kyle
AU - Chen, Deming
PY - 2013
Y1 - 2013
N2 - GPUs are an increasingly popular implementation platform for a variety of general-purpose applications, from mobile and embedded devices to high-performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between thread parallelism and the performance of a single thread. However, prior techniques ignore register allocation, a significant factor in single-thread performance that also indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve an average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than exhaustive search.
AB - GPUs are an increasingly popular implementation platform for a variety of general-purpose applications, from mobile and embedded devices to high-performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between thread parallelism and the performance of a single thread. However, prior techniques ignore register allocation, a significant factor in single-thread performance that also indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve an average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than exhaustive search.
UR - http://www.scopus.com/inward/record.url?scp=84877777934&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84877777934&partnerID=8YFLogxK
U2 - 10.1109/ASPDAC.2013.6509639
DO - 10.1109/ASPDAC.2013.6509639
M3 - Conference contribution
AN - SCOPUS:84877777934
SN - 9781467330299
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 461
EP - 466
BT - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
T2 - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
Y2 - 22 January 2013 through 25 January 2013
ER -