TY - GEN
T1 - Communication-avoiding CholeskyQR2 for rectangular matrices
AU - Hutter, Edward
AU - Solomonik, Edgar
N1 - Funding Information:
The first author would like to acknowledge the Department of Energy (DOE) and Krell Institute for support via the DOE Computational Science Graduate Fellowship (Grant number DE-SC0019323). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Via XSEDE, the authors made use of the TACC Stampede2 supercomputer (allocation TG-CCR180006). This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. The authors would also like to acknowledge ALCF for providing HPC resources for preliminary benchmarking.
Publisher Copyright:
© 2019 IEEE
PY - 2019/5
Y1 - 2019/5
N2 - Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade-off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of Θ(P1/6) less interprocessor communication on P processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
AB - Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade-off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of Θ(P1/6) less interprocessor communication on P processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
UR - http://www.scopus.com/inward/record.url?scp=85072819853&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072819853&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2019.00020
DO - 10.1109/IPDPS.2019.00020
M3 - Conference contribution
AN - SCOPUS:85072819853
T3 - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
SP - 89
EP - 100
BT - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Y2 - 20 May 2019 through 24 May 2019
ER -