TY - GEN
T1 - Retargeting and Respecializing GPU Workloads for Performance Portability
AU - Ivanov, Ivan R.
AU - Zinenko, Oleksandr
AU - Domke, Jens
AU - Endo, Toshio
AU - Moses, William S.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
AB - In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
UR - http://www.scopus.com/inward/record.url?scp=85187214772&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85187214772&partnerID=8YFLogxK
U2 - 10.1109/CGO57630.2024.10444828
DO - 10.1109/CGO57630.2024.10444828
M3 - Conference contribution
AN - SCOPUS:85187214772
T3 - CGO 2024 - Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization
SP - 119
EP - 132
BT - CGO 2024 - Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization
A2 - Grosser, Tobias
A2 - Dubach, Christophe
A2 - Steuwer, Michel
A2 - Xue, Jingling
A2 - Ottoni, Guilherme
A2 - Pereira, Fernando Magno Quintao
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 22nd IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2024
Y2 - 2 March 2024 through 6 March 2024
ER -