TY - GEN
T1 - Framework for scalable intra-node collective operations using shared memory
AU - Jain, Surabhi
AU - Kaleem, Rashid
AU - Balmana, Marc Gamell
AU - Langer, Akhil
AU - Durnov, Dmitry
AU - Sannikov, Alexander
AU - Garzaran, Maria
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Collective operations are used in MPI programs to express common communication patterns, collective computations, or synchronization. In many collectives, such as MPI-Allreduce, the intra-node component of the collective lies on the critical path, as the inter-node communication cannot start until the intra-node component has completed. With increasing core counts in each node, intra-node optimizations that leverage shared memory become more important. In this paper, we focus on the performance benefit of optimizing intra-node collectives using POSIX shared memory for synchronization and data sharing. We implement several collectives using basic primitives or steps as building blocks. Key components of our implementation include a dedicated intra-node collectives layer, careful layout of the data structures, and optimizations that exploit the memory hierarchy to balance parallelism against the latencies of data movement. A comparison of our implementation on top of MPICH shows significant performance speedups with respect to the original MPICH implementation, MVAPICH, and OpenMPI.
AB - Collective operations are used in MPI programs to express common communication patterns, collective computations, or synchronization. In many collectives, such as MPI-Allreduce, the intra-node component of the collective lies on the critical path, as the inter-node communication cannot start until the intra-node component has completed. With increasing core counts in each node, intra-node optimizations that leverage shared memory become more important. In this paper, we focus on the performance benefit of optimizing intra-node collectives using POSIX shared memory for synchronization and data sharing. We implement several collectives using basic primitives or steps as building blocks. Key components of our implementation include a dedicated intra-node collectives layer, careful layout of the data structures, and optimizations that exploit the memory hierarchy to balance parallelism against the latencies of data movement. A comparison of our implementation on top of MPICH shows significant performance speedups with respect to the original MPICH implementation, MVAPICH, and OpenMPI.
UR - http://www.scopus.com/inward/record.url?scp=85064136245&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064136245&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00032
DO - 10.1109/SC.2018.00032
M3 - Conference contribution
AN - SCOPUS:85064136245
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 374
EP - 385
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -