TY - GEN
T1 - Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
AU - Solomonik, Edgar
AU - Demmel, James
PY - 2011
Y1 - 2011
AB - Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon's algorithm and be faster in practice. "3D" algorithms arrange the p processors in a 3D array, and store redundant copies of the matrices on each of p^(1/3) layers. "2D" algorithms such as Cannon's algorithm store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., ⌊p^(1/3)⌋}, to reduce the bandwidth cost of Cannon's algorithm by a factor of c^(1/2) and the latency cost by a factor of c^(3/2). We also show that these costs reach the lower bounds, modulo polylog(p) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^(1/2), the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than 2D algorithms. Each of our 2.5D algorithms performs over 2X faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.
UR - http://www.scopus.com/inward/record.url?scp=80052339109&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052339109&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23397-5_10
DO - 10.1007/978-3-642-23397-5_10
M3 - Conference contribution
AN - SCOPUS:80052339109
SN - 9783642233968
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 90
EP - 109
BT - Euro-Par 2011 Parallel Processing - 17th International Conference, Proceedings
T2 - 17th International Conference on Parallel Processing, Euro-Par 2011
Y2 - 29 August 2011 through 2 September 2011
ER -