TY - GEN
T1 - Effective minimally-invasive GPU acceleration of distributed sparse matrix factorization
AU - Gupta, Anshul
AU - Gimelshein, Natalia
AU - Koric, Seid
AU - Rennich, Steven
N1 - Publisher Copyright:
© Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - Sparse matrix factorization, a critical algorithm in many science and engineering applications, has had difficulty leveraging the additional computational power afforded by the infusion of heterogeneous accelerators in HPC clusters. We present a minimally invasive approach to the GPU acceleration of a hybrid multifrontal solver, the Watson Sparse Matrix Package, which is already highly optimized for the CPU and exhibits leading performance on distributed architectures. The novel aspect of this work is to demonstrate techniques for achieving substantial GPU acceleration, up to 3.5x, of the sparse factorization with strategic but contained changes to the original, CPU-only code. Strong scaling results show that performance benefits scale to as many as 512 nodes (4096 cores) of the Blue Waters supercomputer at NCSA. The techniques presented here suggest that detailed code reorganization may not be necessary to achieve substantial acceleration from GPUs, even for complex algorithms with highly irregular compute and data access patterns, like those used for distributed sparse factorization.
AB - Sparse matrix factorization, a critical algorithm in many science and engineering applications, has had difficulty leveraging the additional computational power afforded by the infusion of heterogeneous accelerators in HPC clusters. We present a minimally invasive approach to the GPU acceleration of a hybrid multifrontal solver, the Watson Sparse Matrix Package, which is already highly optimized for the CPU and exhibits leading performance on distributed architectures. The novel aspect of this work is to demonstrate techniques for achieving substantial GPU acceleration, up to 3.5x, of the sparse factorization with strategic but contained changes to the original, CPU-only code. Strong scaling results show that performance benefits scale to as many as 512 nodes (4096 cores) of the Blue Waters supercomputer at NCSA. The techniques presented here suggest that detailed code reorganization may not be necessary to achieve substantial acceleration from GPUs, even for complex algorithms with highly irregular compute and data access patterns, like those used for distributed sparse factorization.
UR - http://www.scopus.com/inward/record.url?scp=84984804879&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84984804879&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-43659-3_49
DO - 10.1007/978-3-319-43659-3_49
M3 - Conference contribution
AN - SCOPUS:84984804879
SN - 9783319436586
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 672
EP - 683
BT - Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016, Proceedings
A2 - Dutot, Pierre-François
A2 - Trystram, Denis
PB - Springer
T2 - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Y2 - 24 August 2016 through 26 August 2016
ER -