KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism

Izzat El Hajj, Juan Gomez-Luna, Cheng Li, Li Wen Chang, Dejan Milojicic, Wen Mei Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels that use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and to aggregate kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.
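The aggregation idea in the abstract can be illustrated with a small host-side model. The sketch below is not the authors' compiler transformation; it only assumes that each parent thread wants to launch a child grid of some size, fuses those requests into one launch via a prefix sum, and lets each block of the fused launch recover its originating parent. The names aggregate_launches and locate_parent are hypothetical and chosen for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def aggregate_launches(child_grid_sizes):
    """Fuse per-thread child launches into one launch descriptor.

    child_grid_sizes[i] is the number of blocks parent thread i would have
    launched on its own (0 means that thread launches nothing).
    Returns (total_blocks, offsets), where offsets is the inclusive prefix
    sum of the requested grid sizes.
    """
    offsets = list(accumulate(child_grid_sizes))
    total_blocks = offsets[-1] if offsets else 0
    return total_blocks, offsets


def locate_parent(agg_block_id, offsets):
    """Map a block of the aggregated launch to (parent thread, local block id)."""
    # The first offset strictly greater than agg_block_id belongs to the parent.
    parent = bisect_right(offsets, agg_block_id)
    start = offsets[parent - 1] if parent > 0 else 0
    return parent, agg_block_id - start


if __name__ == "__main__":
    # Four parent threads (for example, part of one warp) request 2, 0, 3, 1 blocks.
    total, offsets = aggregate_launches([2, 0, 3, 1])
    print("one aggregated launch of", total, "blocks instead of 3 separate launches")
    for block in range(total):
        print(block, locate_parent(block, offsets))
```

In this model, three separate child launches become a single launch of six blocks, which is the reduction in kernel count and the increase in work per kernel that the abstract attributes to aggregation.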

Original language: English (US)
Title of host publication: MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture
Publisher: IEEE Computer Society
ISBN (Electronic): 9781509035083
DOI: https://doi.org/10.1109/MICRO.2016.7783716
State: Published - Dec 14 2016
Event: 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016 - Taipei, Taiwan, Province of China
Duration: Oct 15 2016 → Oct 19 2016

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume: 2016-December
ISSN (Print): 1072-4451

Other

Other: 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016
Country: Taiwan, Province of China
City: Taipei
Period: 10/15/16 → 10/19/16


ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Hajj, I. E., Gomez-Luna, J., Li, C., Chang, L. W., Milojicic, D., & Hwu, W. M. (2016). KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. In MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture [7783716] (Proceedings of the Annual International Symposium on Microarchitecture, MICRO; Vol. 2016-December). IEEE Computer Society. https://doi.org/10.1109/MICRO.2016.7783716

@inproceedings{8509f8f518a34da4831dfcc428b59167,
title = "KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism",
abstract = "Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels that use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and to aggregate kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.",
author = "Hajj, {Izzat El} and Juan Gomez-Luna and Cheng Li and Chang, {Li Wen} and Dejan Milojicic and Hwu, {Wen Mei}",
year = "2016",
month = "12",
day = "14",
doi = "10.1109/MICRO.2016.7783716",
language = "English (US)",
series = "Proceedings of the Annual International Symposium on Microarchitecture, MICRO",
publisher = "IEEE Computer Society",
booktitle = "MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture",

}

TY - GEN
T1 - KLAP
T2 - Kernel launch aggregation and promotion for optimizing dynamic parallelism
AU - Hajj, Izzat El
AU - Gomez-Luna, Juan
AU - Li, Cheng
AU - Chang, Li Wen
AU - Milojicic, Dejan
AU - Hwu, Wen Mei
PY - 2016/12/14
Y1 - 2016/12/14
AB - Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels that use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and to aggregate kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.
UR - http://www.scopus.com/inward/record.url?scp=85009382810&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85009382810&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2016.7783716
DO - 10.1109/MICRO.2016.7783716
M3 - Conference contribution
AN - SCOPUS:85009382810
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
BT - MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture
PB - IEEE Computer Society
ER -