TY - GEN
T1 - KLAP
T2 - 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016
AU - Hajj, Izzat El
AU - Gomez-Luna, Juan
AU - Li, Cheng
AU - Chang, Li Wen
AU - Milojicic, Dejan
AU - Hwu, Wen Mei
PY - 2016/12/14
Y1 - 2016/12/14
N2 - Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels that use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and to aggregate kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.
AB - Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels that use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and to aggregate kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.
UR - http://www.scopus.com/inward/record.url?scp=85009382810&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85009382810&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2016.7783716
DO - 10.1109/MICRO.2016.7783716
M3 - Conference contribution
AN - SCOPUS:85009382810
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
BT - MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture
PB - IEEE Computer Society
Y2 - 15 October 2016 through 19 October 2016
ER -