TY - GEN
T1 - Improving Scalability with GPU-Aware Asynchronous Tasks
AU - Choi, Jaemin
AU - Richards, David F.
AU - Kale, Laxmikant V.
N1 - This work was performed under the auspices of the U.S. Department of Energy (DOE) by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-832823).
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. XSEDE resources include Bridges-2 at Pittsburgh Supercomputing Center (PSC) and Expanse at San Diego Supercomputer Center (SDSC), used through allocation TG-ASC050039N.
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC05-00OR22725.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. DOE Office of Science and the National Nuclear Security Administration.
PY - 2022
Y1 - 2022
N2 - Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using a proxy application that performs the Jacobi iterative method, Jacobi3D. In addition to optimizations to minimize synchronizations between the host and GPU devices and increase the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to mitigate fine-grained overheads at scale.
AB - Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using a proxy application that performs the Jacobi iterative method, Jacobi3D. In addition to optimizations to minimize synchronizations between the host and GPU devices and increase the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to mitigate fine-grained overheads at scale.
KW - GPU-aware communication
KW - asynchronous tasks
KW - computation-communication overlap
KW - over-decomposition
KW - scalability
UR - http://www.scopus.com/inward/record.url?scp=85136181623&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136181623&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW55747.2022.00097
DO - 10.1109/IPDPSW55747.2022.00097
M3 - Conference contribution
AN - SCOPUS:85136181623
T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
SP - 569
EP - 578
BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
Y2 - 30 May 2022 through 3 June 2022
ER -