TY - JOUR
T1 - Trade-offs between synchronization, communication, and computation in parallel linear algebra computations
AU - Solomonik, Edgar
AU - Carson, Erin
AU - Knight, Nicholas
AU - Demmel, James
N1 - Funding Information:
The first author was supported by a DOE computational science graduate fellowship (DE-FG02-97ER25308) and an ETH Zurich postdoctoral fellowship. This material is based on work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under awards DE-SC0004938, DE-SC0003959, and DE-SC0010200; by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, X-Stack program under awards DE-SC0005136, DE-SC0008700, and AC02-05CH11231; and by DARPA award HR0011-12-2-0016. Authors' addresses: E. Solomonik, Department of Computer Science, ETH Zurich, CAB, Universitätsstrasse 6, 8092 Zürich, Switzerland; email: solomonik@inf.ethz.ch; E. Carson and N. Knight, Soda Hall, Department of Computer Science, University of California at Berkeley, Berkeley CA 94720-1776; emails: {ecc2Z, knight}@cs.berkeley.edu; J. Demmel, Soda Hall, Department of Mathematics and Department of Computer Science, University of California at Berkeley, Berkeley CA 94720-1776; email: demmel@cs.berkeley.edu.
Publisher Copyright:
© 2016 ACM.
PY - 2016
Y1 - 2016
N2 - This article derives trade-offs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These trade-offs are lower bounds on the execution time of the algorithm that are independent of the number of processors but dependent on the problem size. Therefore, they provide lower bounds on the execution time of any parallel schedule of an algorithm computed by a system composed of any number of homogeneous processors, each with associated computational, communication, and synchronization costs. We employ a theoretical model that measures the amount of work and data movement as a maximum over that incurred along any execution path during the parallel computation. By considering this metric rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with nontrivial dependency structures. We also present reductions from BSP and LogGP algorithms to our execution model, extending our lower bounds to these two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, and then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Cholesky factorization, and stencil computations. We represent some of these algorithms as families of dependency graphs. We derive their communication lower bounds by studying the communication requirements of the hypergraph structures shared by these dependency graphs. In addition to these lower bounds, we introduce a new communication-efficient parallelization for stencil computation algorithms, which is motivated by results of our lower bound analysis and the properties of previously existing parallelizations of the algorithms.
AB - This article derives trade-offs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These trade-offs are lower bounds on the execution time of the algorithm that are independent of the number of processors but dependent on the problem size. Therefore, they provide lower bounds on the execution time of any parallel schedule of an algorithm computed by a system composed of any number of homogeneous processors, each with associated computational, communication, and synchronization costs. We employ a theoretical model that measures the amount of work and data movement as a maximum over that incurred along any execution path during the parallel computation. By considering this metric rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with nontrivial dependency structures. We also present reductions from BSP and LogGP algorithms to our execution model, extending our lower bounds to these two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, and then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Cholesky factorization, and stencil computations. We represent some of these algorithms as families of dependency graphs. We derive their communication lower bounds by studying the communication requirements of the hypergraph structures shared by these dependency graphs. In addition to these lower bounds, we introduce a new communication-efficient parallelization for stencil computation algorithms, which is motivated by results of our lower bound analysis and the properties of previously existing parallelizations of the algorithms.
KW - Communication lower bounds
KW - Graph expansion
KW - Numerical linear algebra
KW - Stencil computations
UR - http://www.scopus.com/inward/record.url?scp=85054842014&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054842014&partnerID=8YFLogxK
U2 - 10.1145/2897188
DO - 10.1145/2897188
M3 - Article
AN - SCOPUS:85054842014
SN - 2329-4949
VL - 3
JO - ACM Transactions on Parallel Computing
JF - ACM Transactions on Parallel Computing
IS - 1
M1 - 3
ER -