A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Mhd Ghaith Olabi, Juan Gomez Luna, Onur Mutlu, Wen Mei Hwu, Izzat El Hajj

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization.

To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid.

Thresholding is sometimes applied manually by programmers in the context of dynamic parallelism. We automate it in the compiler and discuss the challenges associated with doing so. Coarsening is sometimes applied as an optimization in other contexts. We propose to apply coarsening in the context of dynamic parallelism and automate it in the compiler as well. Aggregation has been automated in the compiler by prior work. We enhance aggregation by proposing a new aggregation technique that uses multi-block granularity.

We also integrate these three optimizations into an open-source compiler framework to simplify the process of optimizing dynamic parallelism code. Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0× over applications that use dynamic parallelism, 8.7× over applications that do not use dynamic parallelism, and 3.6× over applications that use dynamic parallelism with aggregation alone, as proposed in prior work.
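To make the thresholding idea concrete, the following is a minimal CUDA sketch (with hypothetical kernel names, a placeholder workload, and a hand-picked `THRESHOLD` constant; it is not the paper's generated code): each parent thread launches a child grid dynamically only when its nested work is large enough, and otherwise processes the child elements in a serial loop.

```cuda
#define THRESHOLD 128  // hypothetical cutoff; in practice this would be tuned

// Child kernel: processes one parent's nested elements in parallel.
__global__ void child_kernel(float* data, int start, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[start + i] *= 2.0f;  // placeholder nested work
    }
}

// Parent kernel: each thread owns a variable-sized range of nested work,
// described by a CSR-style offsets array of length num_parents + 1.
__global__ void parent_kernel(float* data, const int* offsets, int num_parents) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= num_parents) return;

    int start = offsets[p];
    int n = offsets[p + 1] - start;

    if (n > THRESHOLD) {
        // Enough nested work: launch a child grid dynamically.
        child_kernel<<<(n + 255) / 256, 256>>>(data, start, n);
    } else {
        // Little nested work: serialize in the parent thread to avoid
        // launch latency and underutilized small grids.
        for (int i = 0; i < n; ++i) {
            data[start + i] *= 2.0f;
        }
    }
}
```

Device-side launches like `child_kernel<<<...>>>` require compiling with relocatable device code (`nvcc -rdc=true`) and linking the device runtime (`-lcudadevrt`). The framework described in the paper inserts such a threshold check automatically rather than relying on the programmer to write it by hand.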

Original language: English (US)
Title of host publication: CGO 2022 - Proceedings of the 2022 IEEE/ACM International Symposium on Code Generation and Optimization
Editors: Jae W. Lee, Sebastian Hack, Tatiana Shpeisman
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-13
Number of pages: 13
ISBN (Electronic): 9781665405843
DOIs
State: Published - 2022
Event: 20th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2022 - Seoul, Korea, Republic of
Duration: Apr 2 2022 - Apr 6 2022

Publication series

Name: CGO 2022 - Proceedings of the 2022 IEEE/ACM International Symposium on Code Generation and Optimization

Conference

Conference: 20th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2022
Country/Territory: Korea, Republic of
City: Seoul
Period: 4/2/22 - 4/6/22

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Control and Optimization
