TY - CONF
T1 - Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths
AU - Hutter, Edward
AU - Solomonik, Edgar
N1 - Funding Information:
The first author would like to acknowledge the Department of Energy (DOE) and Krell Institute for support via the DOE Computational Science Graduate Fellowship (grant No. DE-SC0019323). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant No. ACI-1548562. Via XSEDE, the authors made use of the TACC Stampede2 supercomputer (allocation TG-CCR180006). This research has also been supported by funding from the National Science Foundation (grant No. 2028861).
Publisher Copyright:
© 2021 IEEE.
PY - 2021/5
Y1 - 2021/5
AB - The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration's performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel's statistical profile. This strategy is effective in the presence of frequently recurring computation and communication kernels, which is characteristic of algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.
KW - Autotuning
KW - Dense linear algebra
KW - Performance profiling
UR - http://www.scopus.com/inward/record.url?scp=85113544487&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113544487&partnerID=8YFLogxK
U2 - 10.1109/IPDPS49936.2021.00014
DO - 10.1109/IPDPS49936.2021.00014
M3 - Conference contribution
AN - SCOPUS:85113544487
T3 - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
SP - 46
EP - 57
BT - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
Y2 - 17 May 2021 through 21 May 2021
ER -