TY - JOUR
T1 - Collective algorithms for multiported torus networks
AU - Sack, Paul
AU - Gropp, William
N1 - Funding Information:
This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DEAC02-06CH11357. This work was supported in part by the U.S. Department of Energy under contract DE-FG02-08ER25835 and by the National Science Foundation under grant 0837719. Authors’ addresses: P. Sack (corresponding author) and W. Gropp, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801; email: paulsack@illinois.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ©c 2015 ACM 2329-4949/2015/01-ART12 $15.00 DOI:http://dx.doi.org/10.1145/2686882
Funding Information:
This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DEAC02-06CH11357. This work was supported in part by the U.S. Department of Energy under contract DE-FG02-08ER25835 and by the National Science Foundation under grant 0837719.
Publisher Copyright:
© 2015 ACM.
PY - 2015/1
Y1 - 2015/1
N2 - Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multiported algorithms for the scatter, gather, all-gather, and reduce-scatter operations. Our algorithms can be combined to create multiported reduce, all-reduce, and broadcast algorithms. Several of these algorithms involve a new technique where we relax the MPI message-ordering constraints to achieve high performance and restore the correct ordering using an additional stage of redundant communication. According to our models, on an n-dimensional torus, our algorithms should allow for nearly a 2n-fold improvement in communication performance compared to known, single-ported torus algorithms. In practice, we have achieved nearly 6x better performance on a 32k-node 3-dimensional torus.
AB - Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multiported algorithms for the scatter, gather, all-gather, and reduce-scatter operations. Our algorithms can be combined to create multiported reduce, all-reduce, and broadcast algorithms. Several of these algorithms involve a new technique where we relax the MPI message-ordering constraints to achieve high performance and restore the correct ordering using an additional stage of redundant communication. According to our models, on an n-dimensional torus, our algorithms should allow for nearly a 2n-fold improvement in communication performance compared to known, single-ported torus algorithms. In practice, we have achieved nearly 6x better performance on a 32k-node 3-dimensional torus.
KW - Collective algorithms
KW - Message-passing
UR - http://www.scopus.com/inward/record.url?scp=85013212617&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85013212617&partnerID=8YFLogxK
U2 - 10.1145/2686882
DO - 10.1145/2686882
M3 - Article
AN - SCOPUS:85013212617
SN - 2329-4949
VL - 1
JO - ACM Transactions on Parallel Computing
JF - ACM Transactions on Parallel Computing
IS - 2
M1 - a12
ER -