TY - GEN
T1 - High performance MPI-2 one-sided communication over InfiniBand
AU - Jiang, Weihang
AU - Liu, Jiuxing
AU - Jin, Hyun Wook
AU - Panda, Dhabaleswar K.
AU - Gropp, William
AU - Thakur, Rajeev
PY - 2004
Y1 - 2004
N2 - Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations. Although this approach achieves good portability, it suffers from high communication overhead and dependency on the remote process for communication progress. To address these problems, we propose a high performance MPI-2 one-sided communication design over the InfiniBand Architecture. In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get, and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations. Our design has been implemented based on MPICH2 over InfiniBand. We present detailed design issues for this approach and perform a set of micro-benchmarks to characterize different aspects of its performance. Our performance evaluation shows that, compared with the design based on MPI send/receive, our design improves throughput by up to 77% and reduces latency and synchronization overhead by up to 19% and 13%, respectively. Under certain process skews, the new design significantly reduces the adverse performance impact, from 41% to nearly 0%. It also achieves better overlap of communication and computation.
AB - Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations. Although this approach achieves good portability, it suffers from high communication overhead and dependency on the remote process for communication progress. To address these problems, we propose a high performance MPI-2 one-sided communication design over the InfiniBand Architecture. In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get, and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations. Our design has been implemented based on MPICH2 over InfiniBand. We present detailed design issues for this approach and perform a set of micro-benchmarks to characterize different aspects of its performance. Our performance evaluation shows that, compared with the design based on MPI send/receive, our design improves throughput by up to 77% and reduces latency and synchronization overhead by up to 19% and 13%, respectively. Under certain process skews, the new design significantly reduces the adverse performance impact, from 41% to nearly 0%. It also achieves better overlap of communication and computation.
UR - http://www.scopus.com/inward/record.url?scp=4544268140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=4544268140&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:4544268140
SN - 078038430X
SN - 9780780384309
T3 - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
SP - 531
EP - 538
BT - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
T2 - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
Y2 - 19 April 2004 through 22 April 2004
ER -