TY - CHAP
T1 - Minimizing synchronization overhead in the implementation of MPI one-sided communication
AU - Thakur, Rajeev
AU - Gropp, William D.
AU - Toonen, Brian
PY - 2004
Y1 - 2004
N2 - The one-sided communication operations in MPI are intended to provide the convenience of directly accessing remote memory and the potential for higher performance than regular point-to-point communication. Our performance measurements with three MPI implementations (IBM MPI, Sun MPI, and LAM) indicate, however, that one-sided communication can perform much worse than point-to-point communication if the associated synchronization calls are not implemented efficiently. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH-2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH-2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
AB - The one-sided communication operations in MPI are intended to provide the convenience of directly accessing remote memory and the potential for higher performance than regular point-to-point communication. Our performance measurements with three MPI implementations (IBM MPI, Sun MPI, and LAM) indicate, however, that one-sided communication can perform much worse than point-to-point communication if the associated synchronization calls are not implemented efficiently. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH-2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH-2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
UR - http://www.scopus.com/inward/record.url?scp=33646142466&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33646142466&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-30218-6_15
DO - 10.1007/978-3-540-30218-6_15
M3 - Chapter
AN - SCOPUS:33646142466
SN - 3540231633
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 57
EP - 67
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
A2 - Kranzlmuller, Dieter
A2 - Kacsuk, Peter
A2 - Dongarra, Jack
PB - Springer
ER -