TY - JOUR
T1 - Inter-kernel Reuse-aware Thread Block Scheduling
AU - Huzaifa, Muhammad
AU - Alsop, Johnathan
AU - Mahmoud, Abdulrahman
AU - Salvador, Giordano
AU - Sinclair, Matthew D.
AU - Adve, Sarita V.
N1 - This work was supported in part by a Sohaib and Sara Abbasi Computer Science Fellowship for Huzaifa, the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, the Center for Future Architectures Research (C-FAR), one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, the National Science Foundation under grants CCF 13-02641 and CCF 16-19245, and by a Google Faculty Research Award. Authors’ addresses: M. Huzaifa, A. Mahmoud, and S. V. Adve, University of Illinois at Urbana-Champaign, Department of Computer Science, 201 N. Goodwin Ave., Urbana, IL, 61801; emails: {huzaifa2, amahmou2, sadve}@illinois.edu; J. Alsop, AMD Research, 2002 156th Ave. NE Suite 300, Bellevue, WA, 98007; email: [email protected]; G. Salvador, Unaffiliated; email: [email protected]; M. D. Sinclair, University of Wisconsin-Madison, Computer Sciences Department, 1210 W. Dayton St., Madison, WI, 53706, and AMD Research, 2002 156th Ave. NE Suite 300, Bellevue, WA, 98007; email: [email protected].
PY - 2020/8
Y1 - 2020/8
N2 - As GPUs have become more programmable, their performance and energy benefits have made them increasingly popular. However, while GPU compute units continue to improve in performance, on-chip memories lag behind and data accesses are becoming increasingly expensive in performance and energy. Emerging GPU coherence protocols can mitigate this bottleneck by exploiting data reuse in GPU caches across kernel boundaries. Unfortunately, current GPU thread block schedulers are typically not designed to expose such reuse. This article proposes new hardware thread block schedulers that optimize inter-kernel reuse while using work stealing to preserve load balance. Our schedulers are simple, decentralized, and have extremely low overhead. Compared to a baseline round-robin scheduler, the best performing scheduler reduces average execution time and energy by 19% and 11%, respectively, in regular applications, and 10% and 8%, respectively, in irregular applications.
KW - GPUs
KW - caches
KW - memory systems
KW - scheduling
UR - https://www.scopus.com/pages/publications/85090412037
U2 - 10.1145/3406538
DO - 10.1145/3406538
M3 - Article
AN - SCOPUS:85090412037
SN - 1544-3566
VL - 17
JO - ACM Transactions on Architecture and Code Optimization
JF - ACM Transactions on Architecture and Code Optimization
IS - 3
M1 - 3406538
ER -