TY - GEN
T1 - An efficient compiler framework for cache bypassing on GPUs
AU - Xie, Xiaolong
AU - Liang, Yun
AU - Sun, Guangyu
AU - Chen, Deming
PY - 2013
Y1 - 2013
AB - Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or the compiler can explicitly control cache access or bypass for global load instructions. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.
KW - Cache Bypassing
KW - Compiler Optimization
KW - GPU
UR - http://www.scopus.com/inward/record.url?scp=84893396474&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84893396474&partnerID=8YFLogxK
U2 - 10.1109/ICCAD.2013.6691165
DO - 10.1109/ICCAD.2013.6691165
M3 - Conference contribution
AN - SCOPUS:84893396474
SN - 9781479910717
T3 - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
SP - 516
EP - 523
BT - 2013 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2013 - Digest of Technical Papers
T2 - 2013 32nd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2013
Y2 - 18 November 2013 through 21 November 2013
ER -